What length reads do you need for viral sequencing?
In a previous post I talked about the read length requirements for viral diagnostics. There I was making the case that you can use very short (25bp) reads for this application. My suggestion was that by using shorter read lengths, you can perform sequencing-based diagnostics as cheaply and effectively as qPCR.
But the other day someone asked me a slightly different question:
Aren’t SARS-CoV(-1) and SARS-CoV-2 pretty similar? Wouldn’t that cause a problem?
My instinct here is that sure, at the protein level there’s probably strong similarity. But at the nucleotide level (which is what we’re interested in when sequencing or doing primer design), not so much…
But I wasn’t satisfied with this answer, so I decided to run the numbers. Most of this was performed on a Sunday afternoon while watching TV with the kids, so there may be (hopefully minor) errors. The high level answer is that about 90% of the SARS-CoV-2 genome is unique as compared to SARS-CoV-1 using reads of 25bp or more. There are two long nucleotide-identical matches of 117 and 104bp and a bunch of smaller ones.
That’s the high level summary, but for fun I’m going to look into the repeat structure of these genomes a bit more deeply.
Way back during my PhD I looked at the repeat structure in sequences. I developed an approach to visualizing the structure called the “repeat score plot”.
Here is the “repeat score plot” for the dialogue and stage directions in the complete works of Shakespeare from the original paper:
I’m not going to explain this in detail here. But the exponential behavior you see on the left of the heat map is what you’d expect from a random sequence. The structured “hairy” things and lines represent repeated substrings. In this case, stage directions are significantly more repetitive than the dialogue, as you’d expect.
The approach reveals all sorts of interesting structure associated with genomic sequences, here showing striking differences between coding and non-coding regions:
It was a fun exercise, but not the focus of my PhD and I didn’t really take the work much further. But this seemed like a chance to dig out the code, which I last looked at in 2014. Shockingly it still seems to work. Here’s what it looks like on SARS-CoV-1, SARS-CoV-2 and the combined sequence:

For the most part, SARS-CoV-1 and 2 show random structure. But the large diagonal feature is a conserved repeat. Rather boringly, in this case these repeats are caused by the polyA tail at the end of the sequence.
Looking at part C, which combines the SARS-CoV-1 and SARS-CoV-2 sequences, you can see that the “repeats=1” line extends further along the X-axis. These are the repeats that occur between SARS-CoV-1 and SARS-CoV-2. So at “read” (sub-string) length 25 we have 949 repeats between SARS-CoV-1 and SARS-CoV-2. Many of these result from a small number of larger duplications.
You can find all these in the matches file on GitHub. I wrote a small program to print exact matches between these two sequences, and manually pulled out the results. I then trimmed the matches out of the SARS-CoV-1 and SARS-CoV-2 genomes, which removes about 10% of the genome in total for matches >25bp.
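To give a feel for this kind of calculation, here is a minimal Python sketch. It is not the program I used for the numbers above; the function names (`kmer_set`, `shared_runs`, `unique_fraction`) are made up for illustration. It assumes both genomes are plain uppercase strings, that the target is non-empty, and it only looks at the forward strand. It also merges overlapping shared 25-mers into runs, which approximates, but is not guaranteed to equal, a single contiguous exact match in the other genome.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_runs(target, other, k=25):
    """Return (start, length) runs of target covered by k-mers also found in other.

    Overlapping or touching shared windows are merged into a single run.
    """
    other_kmers = kmer_set(other, k)
    runs = []
    run_start = run_end = None
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in other_kmers:
            continue
        if run_start is not None and i <= run_end:
            # This window overlaps (or touches) the current run: extend it.
            run_end = i + k
        else:
            # Close the previous run, if any, and start a new one.
            if run_start is not None:
                runs.append((run_start, run_end - run_start))
            run_start, run_end = i, i + k
    if run_start is not None:
        runs.append((run_start, run_end - run_start))
    return runs


def unique_fraction(target, other, k=25):
    """Fraction of target positions NOT covered by any k-mer shared with other."""
    covered = sum(length for _, length in shared_runs(target, other, k))
    return 1 - covered / len(target)


# Toy example with k=4 so it can be checked by hand.
toy_target = "AAAACCCCGGGG"
toy_other = "TTCCCCGGTT"
print(shared_runs(toy_target, toy_other, k=4))
print(unique_fraction(toy_target, toy_other, k=4))
```

On real data you would pass the two genome strings with `k=25`; the longest run is then `max(shared_runs(a, b), key=lambda r: r[1])`.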
The longest exact match was this sequence:
AATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAA
As far as I can tell this is part of the 3’UTR:
There’s some research discussing the conservation of this region here. And I’m sure nucleotide similarity has been explored in depth.
For diagnostic purposes, the impact of these duplications should be minor. 1 in 10 target reads may give you non-specific information, but it’s unlikely that all target reads will.
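As a back-of-the-envelope check on that last point, here is a tiny Python sketch. It assumes each target read independently has a 10% chance of landing in a shared region (the rough figure from above), which real library prep won’t satisfy exactly. The function name `prob_all_nonspecific` is just for illustration.

```python
def prob_all_nonspecific(shared_fraction, n_reads):
    """Probability that every one of n_reads falls in a shared region.

    Assumes reads are independent and each lands in a shared region
    with probability shared_fraction.
    """
    return shared_fraction ** n_reads


# With ~10% of the genome shared, the chance that all reads are
# uninformative drops by a factor of ten with each additional read.
for n in (1, 2, 5, 10):
    print(n, prob_all_nonspecific(0.1, n))
```

Under these assumptions, even five target reads leave roughly a 1 in 100,000 chance that none of them is specific.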