Oxford Nanopore Simplex Overall Error Rate ~3%? (Best model - New dataset)
After my previous posts on ONT Simplex data I received some great feedback in the Discord channel. So I decided to go searching for a different dataset to see if we could get better results.
Luckily the Pangenome project now has ONT Duplex data and also provides Simplex calls for this dataset. It should be noted that they say this about simplex data:
“[simplex] *.fastq.gz (used for duplex calling, not recommended to use as inputs to analysis)”
But these are simplex calls made using the sup model, and I assume this just means “duplex is better”[1]. So let’s take a look anyway! I grabbed the first file on the list for HG002…
Aligning these again with the Dorado aligner (against the same combined reference used previously) shows a trend similar to that seen in our previous analysis (of the fast calls):
The yield graphs look like the following, with a little over half the bases over Q20:
The overall run statistics are as follows:
This puts the overall accuracy at 97.2%, or a 2.8% error rate: roughly 3x higher than the original Illumina Genome Analyzer (released in 2007). It is also a long way from the Oxford Nanopore website’s claim of Q23 (a 0.5% error rate), which would be roughly 6 times lower than the error rate observed above.
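For reference, Phred Q scores and per-base error rates convert as follows (standard Phred scaling; a quick sketch, nothing specific to this dataset):

import math

def q_to_error(q: float) -> float:
    # Phred Q score -> expected per-base error rate
    return 10 ** (-q / 10)

def error_to_q(err: float) -> float:
    # per-base error rate -> Phred Q score
    return -10 * math.log10(err)

print(q_to_error(23))      # ~0.005, i.e. the claimed 0.5% error rate
print(error_to_q(0.028))   # ~15.5, i.e. the ~2.8% error rate observed here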
This may be because Oxford generally appear to quote modal per-read accuracy, a metric that is easily biased and that no other vendor uses. Unfortunately I don’t see any vendor claim on per-base accuracy here… a metric well understood[2] for all other platforms.
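To illustrate how a modal per-read figure can flatter, here’s a toy sketch (the accuracies below are made up, not from this dataset, and equal read lengths are assumed):

from statistics import mean, mode

# Hypothetical per-read accuracies: most reads cluster at one high value,
# but a tail of poorer reads drags the per-base average down.
read_accuracies = [0.99] * 6 + [0.95, 0.90, 0.80, 0.60]

modal_acc = mode(read_accuracies)   # most common value: 0.99
mean_acc = mean(read_accuracies)    # ~0.919, given equal read lengths

print(f"modal per-read accuracy: {modal_acc:.1%}")  # 99.0%
print(f"mean per-base accuracy:  {mean_acc:.1%}")   # ~91.9%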
Filtering
The other question to consider is filtering. Both the fastqs previously used and the BAM processed here are, I suspect, the complete (unfiltered) dataset. A number of users suggested filtering. Looking at a subset of “passing” sup basecalls provided in the previous dataset[3], these seem to get to about Q19 (~1.25% error rate). This is still a long way from the claimed “Q23” (0.5% error rate), but it’s a big improvement.
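As I understand it, the pass/fail split is based on a mean per-read quality threshold. A minimal sketch of computing that mean correctly (averaging error probabilities rather than averaging Q values), assuming standard Phred+33 FASTQ encoding:

import math

def mean_read_q(qual_string: str, offset: int = 33) -> float:
    # Average the per-base error probabilities, then convert back to Phred.
    # Averaging the raw Q values instead would overstate read quality.
    errors = [10 ** (-(ord(c) - offset) / 10) for c in qual_string]
    return -10 * math.log10(sum(errors) / len(errors))

print(mean_read_q("IIII"))  # all Q40 bases -> 40.0
# A pass filter would then keep reads above some mean-Q cutoff;
# the exact threshold used for these files is an assumption on my part.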
There’s a throughput cost here. Going by the file sizes, this cost is ~10%, which isn’t too bad! What would the cost be to actually get to Q20 or Q23? That’s less clear…
This leads us to another issue with the ONT specs. Total run throughput only ever seems to be listed as “maximum theoretical”[4], with no strong statement about the actual number of bases a user can expect to receive (or the point at which runs become refundable[5]). This number would need to be stated under a specific filtering regime to make sense.
Duplex
The Duplex reads called from this dataset had a significantly lower error rate of ~0.44% (just under Q24). Again, to me this doesn’t really match the Q30 claim, but it’s pretty good![6] Of course, you’re taking a throughput hit here. The Duplex BAM was 7GB and the simplex BAM 42.2GB. This suggests that you can expect ~14% of the normal run throughput from a Duplex-called run.
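Back-of-envelope for that figure (assuming BAM size scales roughly with base count, and reading “normal run throughput” as the combined simplex plus duplex output; both assumptions are mine):

duplex_gb = 7.0
simplex_gb = 42.2

# Duplex yield as a fraction of the total called output.
print(duplex_gb / (simplex_gb + duplex_gb))  # ~0.142, i.e. ~14%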
If the Duplex rate can be pushed up, it seems like a decent option. And in my opinion, even with the throughput hit, I’d probably take duplex over simplex data at this point (depending on the application).
Final Thoughts
You should do your own analysis if you want solid figures to use in your own work[7] (and please post the results on the Discord!). Even if accurate and generalizable[8], the results here don’t mean the ONT platform lacks interesting use cases… just that its accuracy isn’t as high as other platforms’. And in my view the vendor website doesn’t really help give an accurate impression of expected performance…
Stay tuned for more error metrics (including, hopefully, some Ultima data)! I’ve also posted an Illumina run on the second Substack.
Datasets/commands used here:
wget https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE--UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/simplex/11_15_22_R1041_Duplex_HG002_10_Dorado_v0.1.1_400bps_sup.fastq.gz
/mnt/sde1/dorado/cmake-build/bin/dorado aligner -n 24 ../HG002.combined.fa.gz ./11_15_22_R1041_Duplex_HG002_10_Dorado_v0.1.1_400bps_sup.fastq.gz > 11_15_22_R1041_Duplex_HG002_10_Dorado_v0.1.1_400bps_sup.bam
/mnt/sde1/HGP002/best/target/release/best input.bam reference.fasta prefix/path
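For what it’s worth, here’s a minimal sketch of how an overall per-base error rate could be estimated from the aligned BAM using NM tags (this is my own back-of-envelope approach, not necessarily what the best tool above computes):

import pysam

edits = 0
aligned_bases = 0
with pysam.AlignmentFile("11_15_22_R1041_Duplex_HG002_10_Dorado_v0.1.1_400bps_sup.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):  # sequential read, no index required
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        edits += read.get_tag("NM")                   # mismatches plus indel bases vs the reference
        aligned_bases += read.query_alignment_length  # aligned query bases

print(f"overall error rate: {edits / aligned_bases:.2%}")

Note that the choice of denominator (aligned query bases vs alignment columns) shifts the result slightly; tools differ here.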
[1] But consider yourself warned!
[2] Well understood, by me and to my satisfaction.
[3] After I finally figured out how to extract the data from the CRAMs.
[4] I can only see “TMO” here. Of course throughput will vary with experiment type to some extent, but other vendors don’t seem to have trouble providing figures that in practice represent what the customer will achieve. If necessary, you can list throughput for a given set of experimental conditions.
[5] That is, refundable based on throughput, not on active pore count at run start or some similar metric. Perhaps this is particularly problematic for nanopore sequencing. But if so, it makes planning runs and estimating costs particularly difficult here.
[6] And maybe by tweaking the alignment and applying more filtering that could be pushed up?
[7] These are random public datasets, so all the thoughts I previously posted here apply.
[8] Which is limited by the datasets available and the approaches used.