A Viral Metagenomic Sequencing Paper
With apologies to new subscribers interested in teardowns of scientific equipment (you might want to check out my other substack). Today I’m going to be doing a brief follow-up on how sequencing is more sensitive than qPCR for applications like COVID diagnostics.
Partly in response to comments on this post, I’ve been putting together a spreadsheet with all the viral metagenomic sequencing papers I can find. If you have any that I’ve missed, I’d be most grateful if you could forward them to me (new@sgenomics.org).
I’ve also been working on a model which more precisely describes the tradeoffs between qPCR and sequencing. Comments on this are also welcome, but you can expect a more detailed post on the literature list and model soon.
Most of the papers in the literature list are, I think, of limited interest. They use too few reads (in some cases individual samples have <10K reads), or have few samples.
However, someone sent me a recent publication which I found interesting, and that’s what we’re going to look at today.
JumpCode’s Paper
This paper comes from JumpCode, who are working on a novel CRISPR-based rRNA depletion method. The depletion method isn’t as interesting to me as the dataset itself. They used a total of 72 samples, with >40M reads per sample for most of them (including every sample where qPCR Ct was <35).
Their conclusion is that they “show sensitivity of pathogen detection equivalent to RT-qPCR”. I wanted to dig into this a little more. The paper discusses data from two sites. Here I’m going to go through the data from “site A”, described in the table below:
Using their approach we see one false negative from sequencing. It should be noted that each sample appears to have had two technical replicates, and only one of these failed. It should also be noted that they only report samples with a qPCR Ct <35 in the table above. The accuracy and value of qPCR Ct values >35 is debatable, so this cutoff seems reasonable to me.
But what about this false negative? Isn’t that a concern? Doesn’t this show that sequencing is less sensitive than qPCR?
I don’t really think so. I think it rather reflects the detection threshold used:
“Thresholds for detection were determined empirically and defined as follows: genome breadth coverage >=3%, and number of uniquely aligned reads >=20 per 40M read pairs sequenced”
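To make the quoted rule concrete, here is a minimal sketch of how such a two-part threshold might be applied. This is not the paper's pipeline. The `SampleResult` fields and function names are hypothetical, and I'm assuming that "per 40M read pairs" means the viral read count is scaled linearly to a 40M read pair depth. The sample values are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds as quoted from the paper.
MIN_BREADTH = 0.03                 # genome breadth of coverage >= 3%
MIN_READS_PER_40M = 20.0           # uniquely aligned reads >= 20 per 40M read pairs
NORMALIZATION_PAIRS = 40_000_000


@dataclass
class SampleResult:
    """Hypothetical per-sample summary; field names are illustrative only."""
    unique_viral_reads: int   # reads uniquely aligned to the viral genome
    total_read_pairs: int     # read pairs sequenced for this sample
    genome_breadth: float     # fraction of the genome covered, 0.0 to 1.0


def reads_per_40m(sample: SampleResult) -> float:
    # Assumption: linear scaling of the read count to a 40M read pair depth.
    return sample.unique_viral_reads * NORMALIZATION_PAIRS / sample.total_read_pairs


def is_detected(sample: SampleResult) -> bool:
    # Both criteria must pass for a sample to be called positive.
    return (sample.genome_breadth >= MIN_BREADTH
            and reads_per_40m(sample) >= MIN_READS_PER_40M)


if __name__ == "__main__":
    # Made-up values, not taken from the paper's supplementary data.
    enough_reads_low_breadth = SampleResult(25, 40_000_000, 0.02)
    print(is_detected(enough_reads_low_breadth))
```

The point of the sketch is that a sample can clear the read count criterion and still be called negative on breadth alone.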
The paper provides Ct values and read counts in the supplementary information. We can extract these and graph Ct against read count, which shows a nice correlation (you can find this in the modeling sheet if you want to play):
There’s some reasonable fraction of reads in all samples with a Ct <35. So for some reason one of these samples didn’t meet the empirically determined threshold above.
Most likely this was one of the higher Ct samples, and I suspect it failed to meet the genome coverage requirement. Coverage seems to have been calculated from a strict subset of 40M subsampled read pairs. So it may be that a few more reads were included in the RPM calculation than were used to determine SARS-CoV-2 genome coverage.
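For anyone wanting to reproduce the Ct versus read count relationship outside a spreadsheet, a minimal sketch follows. It fits a straight line to log10(RPM) against Ct using ordinary least squares. The data here is synthetic and invented purely for illustration; it is not the paper's supplementary data. The variable names are hypothetical. If each PCR cycle corresponds to a doubling of template, the expected slope is about -log10(2), or roughly -0.30 log10 units per Ct.

```python
import math

# Synthetic illustration only: NOT values from the paper.
# log10(RPM) is generated as an ideal doubling-per-cycle line plus fixed offsets.
SYNTHETIC_CT = [18.0, 22.0, 26.0, 30.0, 34.0]
SYNTHETIC_OFFSETS = [0.3, -0.2, 0.1, -0.4, 0.2]   # made-up scatter, log10 units
SYNTHETIC_RPM = [
    10 ** (8.0 - math.log10(2) * ct + offset)
    for ct, offset in zip(SYNTHETIC_CT, SYNTHETIC_OFFSETS)
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


log_rpm = [math.log10(r) for r in SYNTHETIC_RPM]
slope, intercept = fit_line(SYNTHETIC_CT, log_rpm)

# Fold change in read fraction per Ct implied by the fitted slope.
fold_per_ct = 10 ** (-slope)

if __name__ == "__main__":
    print(f"slope = {slope:.3f} log10(RPM) per Ct")
    print(f"implied fold change per Ct = {fold_per_ct:.2f}")
```

A fitted slope far from -0.30 on real data would suggest something other than simple template dilution is driving the read counts, for example depletion efficiency or contamination.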
Now, we could relax the coverage threshold and turn this sequencing false negative into a positive. But this most likely begins to complicate the story told in the table above.
If you expand the graph to include qPCR negatives, giving them a Ct of 45 you see the following:
These are potentially qPCR false negatives, but the true source of the reads here is an open question. It could be contamination during prep, or index hopping. I suspect at least some (as we’ve seen in the literature) are qPCR false negatives.
So it seems likely that setting the threshold lower would remove the sequencing-based false negative, but it would also surface more apparent qPCR false negatives and complicate the story told in the paper.
The read count/Ct correlation shown in the JumpCode paper is pretty typical of what’s shown elsewhere in the literature:
As you can see, these correlations are generally pretty noisy. That is, if you look at a single Ct value, the sequencing fraction varies by the equivalent of ~3 Ct (3 Ct is roughly an order of magnitude).
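The "3 Ct is roughly an order of magnitude" rule of thumb can be checked with a couple of lines. This sketch assumes ideal amplification, where template doubles every cycle (an `efficiency` of 1.0). Real assays are often somewhat less efficient. The function names are my own.

```python
import math


def ct_delta_to_fold(delta_ct: float, efficiency: float = 1.0) -> float:
    """Fold difference in starting template for a given Ct difference.

    Assumes each cycle multiplies template by (1 + efficiency);
    efficiency = 1.0 is perfect doubling.
    """
    return (1.0 + efficiency) ** delta_ct


def fold_to_ct_delta(fold: float, efficiency: float = 1.0) -> float:
    """Ct difference corresponding to a given fold difference in template."""
    return math.log(fold) / math.log(1.0 + efficiency)


if __name__ == "__main__":
    print(ct_delta_to_fold(3.0))    # fold change for 3 Ct at perfect doubling
    print(fold_to_ct_delta(10.0))   # Ct difference for a 10x change
```

At perfect doubling, 3 Ct is an 8-fold difference, and a full 10-fold difference corresponds to about 3.3 Ct.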
All this makes me ponder the dataset I’d actually like to see to further investigate the limitations of sequencing for metagenomic viral diagnostics. In particular, taking index hopping out of the equation by performing single-sample runs seems like it would be valuable. A spike-in (used in some publications) may also help normalize viral fractions against the total volume of material used.
This of course would increase the cost of the study. But I think it’s possible and preferable to build instrumentation to process single metagenomic samples at point of care, and at low cost.
Overall, based on this paper, the baseline sensitivity of sequencing seems at least as good as qPCR (and I believe better if implemented correctly). But stay tuned for more comprehensive thoughts!