Would A Better Nanopore Help With Protein Sequencing?
The other day someone asked me if a better nanopore might help with protein sequencing. Resolution is clearly one of the issues with sequencing proteins using nanopores. A recent paper suggested ~20 amino acids sit within the sensing region of CsgG (the pore used on some ONT flowcells). And as previously discussed, this would create a huge state space, likely making protein sequencing with single amino acid accuracy extremely challenging, if not impossible.
But what if we did have a better pore? So far we’ve failed to find one for DNA sequencing (or to demonstrate sequencing on a solid state pore). But if we could find one, would this amazing new pore then enable high accuracy protein sequencing?
Let’s try and run the numbers.
Let’s first define our problem. There are 20 basic amino acids. But on top of this there are “a bunch” of post-translational modifications (PTMs). How big a deal is this? Well, this paper says “virtually all proteins in eukaryotes undergo PTMs”. It goes on to describe 3 common PTMs. So let's take this as a baseline: 20*3 = 60 symbols in our expanded alphabet.
Our theoretical pore has single amino acid resolution. We’ll assume we have about 20pA of range to work with (this seems about right based on what we’ve seen in DNA sequencing and other protein sequencing experiments).
So with 60 states we have 0.333pA between states (~330 femtoamps).
Dwell time considerations will remain the same, so let’s assume a 10ms dwell (as in the paper). They use a 3KSPS sampling rate.
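The spacing arithmetic can be written out as a quick sketch. It assumes, as the post does, that the 60 states are spread evenly across the 20pA range, which a real pore would not guarantee. The variable names are just for illustration.

```python
# Assumption: states are evenly spaced across the usable current range.
n_amino_acids = 20
n_variants_per_aa = 3        # the post's rough PTM multiplier
current_range_pa = 20.0      # assumed usable range in pA

n_states = n_amino_acids * n_variants_per_aa
spacing_pa = current_range_pa / n_states
spacing_fa = spacing_pa * 1000.0   # 1 pA = 1000 fA

print(f"{n_states} states, {spacing_pa:.3f} pA ({spacing_fa:.0f} fA) apart")
```

Dropping the PTM multiplier (n_variants_per_aa = 1) gives 1pA between states, which is why ignoring PTMs makes the problem noticeably easier.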
We’ve certainly got enough electrons per sample now (1000s!). But can we practically achieve this? The raw signals from the protein sequencing paper seem to show 2 or 3pA of noise.
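On the electron count: here's a rough sketch that converts a current into electrons per sample. It treats the current as steady over one sample period at 3KSPS, and looks at both a single 0.333pA state step and the full 20pA range. The function name is hypothetical, not from the paper.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per electron

def electrons_per_sample(current_pa, sample_rate_hz):
    """Electrons carried by a steady current during one sample period."""
    charge_c = current_pa * 1e-12 / sample_rate_hz
    return charge_c / ELEMENTARY_CHARGE_C

sample_rate_hz = 3000
samples_per_dwell = 30   # 10ms dwell at 3KSPS

step = electrons_per_sample(20.0 / 60, sample_rate_hz)  # one state step
full = electrons_per_sample(20.0, sample_rate_hz)       # full range

print(f"state step: ~{step:.0f} electrons per sample")
print(f"state step: ~{step * samples_per_dwell:.0f} electrons per dwell")
print(f"full range: ~{full:.0f} electrons per sample")
```

A single state step works out to several hundred electrons per sample and tens of thousands over a full 10ms dwell, so the count itself is not the limiting factor here.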
Let’s do a quick simulation (code below)! We’ll take 30000 samples (10s at 3KSPS), then average over blocks of 30 (the dwell of one amino acid at 10ms) to create a new trace. Here’s what happens to a trace with ~2pA of p-p noise:
Of course there’s also an analytical method to work this out… but this is easier for this poor computer scientist. Anyway… it looks like we have ~0.5pA of p-p noise. Compared with the ~0.333pA between states, we’re in the right ballpark, but it’s pretty borderline.
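For reference, the analytical route is short if we assume the noise is independent and Gaussian from sample to sample (which is what the simulation assumes, and real nanopore noise may well not be): averaging N samples shrinks the standard deviation by a factor of √N. The sketch below uses the simulation's 0.5pA standard deviation and 30-sample window, checks the prediction against a seeded simulated trace, and then estimates how often an averaged level would land more than half a state spacing from its true value. That last step assumes evenly spaced states and a simple nearest-level decoder, which is my simplification rather than anything from the paper.

```python
import math
import numpy as np

std_dev_pa = 0.5          # per-sample noise, as in the simulation
window_size = 30          # samples per 10ms dwell at 3KSPS
spacing_pa = 20.0 / 60    # assumed even spacing between states

# Standard deviation of the mean of N independent samples
predicted_std = std_dev_pa / math.sqrt(window_size)

# Check against a simulated trace (seed chosen arbitrarily)
rng = np.random.default_rng(0)
samples = rng.normal(5.0, std_dev_pa, 30000)
averaged = samples.reshape(-1, window_size).mean(axis=1)
simulated_std = averaged.std()
simulated_pp = averaged.max() - averaged.min()

# Two-sided Gaussian tail: P(|error| > spacing / 2)
p_miss = math.erfc((spacing_pa / 2) / (predicted_std * math.sqrt(2)))

print(f"predicted std after averaging: {predicted_std:.3f} pA")
print(f"simulated std after averaging: {simulated_std:.3f} pA")
print(f"simulated peak-to-peak:        {simulated_pp:.3f} pA")
print(f"P(level off by > half a spacing): {p_miss:.3f}")
```

The predicted standard deviation is 0.5/√30, or about 0.09pA, so half a state spacing is a bit under two standard deviations away. Under these assumptions a per-residue miss rate of a few percent is what "borderline" looks like in numbers.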
We’ve also pushed ourselves into a corner, where we’re axiomatically assuming that machine learning won’t be able to extract much more information from this signal.[1]
Overall, the “ideal nanopore” might help with nanopore protein sequencing, but even here it seems like a fairly tough challenge to get high accuracy protein sequences. Being able to ignore PTMs would help somewhat. But practically speaking, the approach could well end up limited to protein fingerprinting (like many other “protein sequencing” approaches).
I threw this post together fairly quickly so errors are entirely possible! Thoughts? Comments, suggestions? Why not reach out on the Discord?
import numpy as np
import matplotlib.pyplot as plt
# Mean current level (pA)
mean = 5.0
# Standard deviation of the Gaussian noise (pA)
std_dev = 0.5
# Trace length and sampling
time = 10  # seconds
sample_rate = 3000  # samples per second (3KSPS)
num_samples = sample_rate * time
rate = 100  # number of AAs per second (10ms dwell each)
samples = np.random.normal(mean, std_dev, num_samples)
# Print the generated samples
print("Generated samples:", samples)
# Window size for the boxcar average: samples per amino acid dwell
window_size = sample_rate // rate
# Boxcar average: mean of each non-overlapping block of window_size samples
boxcar_average = np.mean(samples.reshape(-1, window_size), axis=1)
print("win ", window_size)
print("bar ", len(boxcar_average))
# Create a plot
plt.figure(figsize=(10, 6))
plt.plot(samples, label='Samples with Gaussian Noise', alpha=0.5)
plt.plot(np.repeat(boxcar_average, window_size), label=f'Boxcar Average (Window Size = {window_size})', color='red')
plt.xlabel('Sample Index')
plt.ylabel('Sample Value')
plt.title('Generated Samples with Boxcar Average')
plt.legend()
plt.yticks(np.arange(3, 7, 0.5))
plt.grid(True)
# Show the plot
plt.show()
[1] Since we’re already at single amino acid resolution, we can’t do anything to incorporate longer range effects, because there aren’t any.