Illumina's Long Read Prep

Jan 27, 2022

At JPM, Illumina announced a couple of methods to push up read length. The first is “Chemistry X”, which boasts “two times longer reads”. Taken at face value, that would mean 2x600bp reads on a MiSeq, but I suspect it may instead mean rolling out 2x300bp reads on the NovaSeq.

If I can dig anything up on “Chemistry X” I’ll put together another post. But here, I’m going to review their “long read workflow”, codenamed Infinity. Infinity, they say, will deliver contiguous data up to 10Kb in length.

Infinity appears to be a sample prep based approach, which they say can be “fully automated” and is compatible with current instruments.

Illumina are of course not the first to market a long read workflow to run on their platform. 10X originally marketed a “Linked-Read” approach, where short reads from the same longer fragment would receive the same barcode. This approach didn’t provide contiguous long reads but could provide longer range information for use in assembly and other applications. BGI also have a “long fragment read” approach on the market.

While I’m sure these approaches have been useful in some applications, I suspect the 10X approach was largely a commercial failure; 10X withdrew the product in 2020. Long range information alone doesn’t seem to be hugely compelling. So whatever Illumina come up with, I expect it to be complementary rather than revolutionary. But no doubt some customers considering using ONT or PacBio to augment their Illumina data will choose to use this prep instead.

Illumina appear to have acquired at least one (and possibly two) companies developing long read workflows: Moleculo, back in 2013, and more recently Longas Technologies.

While I can’t find an official report on the Longas acquisition, some have suggested that Illumina has acquired the company, and the Longas website appears to have been taken offline in late 2021. A search through LinkedIn suggests that the majority of Longas employees now work for Illumina.

The Longas approach appears to be a development of “Sequencing analysis by mutagenesis” (SAM), a technique originally proposed in 2004 and demonstrated in silico for NGS in 2012.

In principle the procedure is fairly simple. You take long input DNA fragments and apply mutations using mutagenic PCR. These mutations give each fragment a kind of “fingerprint” beyond its original sequence content. This mutational fingerprint can then be used to overlap short reads and group them by source fragment. These groups are then assembled into longer reads (~10Kb seems to be most commonly proposed).
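To make the grouping step concrete, here is a toy sketch in Python. It is not the Longas pipeline: it assumes the short reads have already been placed against a known reference (a real de novo workflow would not have that luxury), and the function names and the `min_shared` threshold are hypothetical choices for illustration. Two overlapping reads are treated as coming from the same fragment when they carry identical mutations inside their overlap.

```python
from itertools import combinations

def mutate(ref, muts):
    """Return a copy of ref with {position: base} substitutions applied."""
    seq = list(ref)
    for pos, base in muts.items():
        seq[pos] = base
    return "".join(seq)

def mismatches(ref, start, read):
    """Fingerprint of a read: {position: base} where it differs from ref."""
    return {start + i: b for i, b in enumerate(read) if ref[start + i] != b}

def same_fragment(a, b, min_shared=1):
    """Reads are (start, seq, fingerprint). They plausibly share a source
    fragment if their mutations agree exactly within the overlap and the
    overlap holds at least min_shared mutations (an assumed threshold)."""
    (sa, ra, ma), (sb, rb, mb) = a, b
    lo, hi = max(sa, sb), min(sa + len(ra), sb + len(rb))
    if lo >= hi:
        return False  # no overlap, so nothing to compare
    in_a = {p: x for p, x in ma.items() if lo <= p < hi}
    in_b = {p: x for p, x in mb.items() if lo <= p < hi}
    return in_a == in_b and len(in_a) >= min_shared

def group_reads(ref, aligned_reads):
    """Group (start, seq) reads by fingerprint using union-find.
    Returns a sorted list of groups, each a list of read indices."""
    reads = [(s, r, mismatches(ref, s, r)) for s, r in aligned_reads]
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(reads)), 2):
        if same_fragment(reads[i], reads[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(reads)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two copies of the same 20bp region, each with its own random-looking mutations.
ref = "ACGTACGTACGTACGTACGT"
frag_a = mutate(ref, {3: "C", 9: "T", 14: "A"})
frag_b = mutate(ref, {5: "T", 11: "C", 16: "G"})

# Three overlapping 10bp reads from each fragment.
starts = [0, 5, 10]
reads = [(s, frag_a[s:s + 10]) for s in starts] + \
        [(s, frag_b[s:s + 10]) for s in starts]

print(group_reads(ref, reads))  # [[0, 1, 2], [3, 4, 5]]
```

Note that the first and third read from each fragment do not overlap at all; they end up in the same group only because the middle read links them. That chaining is what lets short reads be strung out along a much longer fragment.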

This “mutational fingerprinting” of course has the obvious drawback of introducing error. In the 2012 paper, the introduced error rates were up to 10%. To mitigate this you either need to create multiple differently fingerprinted molecules from the same starting template, or use overlapping long input fragments.
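As a rough illustration of the first mitigation, the sketch below takes several independently mutated copies of the same template and recovers the original by a per-position majority vote. This is a simplification I am assuming for the example (equal-length, already aligned copies and substitutions only), not a description of how Longas actually builds consensus.

```python
from collections import Counter

def consensus(copies):
    """Majority vote per position across equal-length, pre-aligned copies."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must have equal length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))

template = "ACGTACGTAC"
copies = [
    "ATGTACGTAC",  # mutated at position 1
    "ACGTGCGTAC",  # mutated at position 4
    "ACGTACGTGC",  # mutated at position 8
]

print(consensus(copies) == template)  # True
```

Because each copy is mutated at different positions, any single mutation is outvoted by the other copies. The vote only fails where the same position happens to be mutated the same way in most of the copies.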

Both these mitigation approaches will have disadvantages over direct, single molecule, long read sequencing, where every read corresponds to a single molecule that was present in the sample. For the bulk of applications, however, this is likely a minor drawback.

The second issue is throughput. Paired read error rate reduction approaches (like Twinstrand) result in reduced throughput as you can’t easily control how many reads you obtain from each template. This should be less of an issue for the Longas approach as applied to de novo genome assembly, where in any case very high coverage is used. I suspect you end up running your “long read” coverage at 4 or 5x, while still needing to generate effectively 30x of sequencing data.
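To put numbers on that guess, here is the back-of-the-envelope arithmetic in Python. The 30x and 4 to 5x figures are my estimates from above; the 10Kb fragment length comes from Illumina's claim, and the 2x150bp read length is purely an assumption for the example.

```python
def per_fragment_depth(short_read_cov, long_read_cov):
    """Short-read depth spent on each long read, on average."""
    return short_read_cov / long_read_cov

def read_pairs_per_fragment(fragment_len, depth, read_len):
    """Read pairs needed to cover one fragment at the given depth."""
    return fragment_len * depth / (2 * read_len)

print(per_fragment_depth(30, 5))              # 6.0
print(per_fragment_depth(30, 4))              # 7.5
print(read_pairs_per_fragment(10_000, 6, 150))  # 200.0
```

So under these assumptions each 10Kb “long read” would be built from something like 200 read pairs, and any fragment that falls well short of that is a candidate for being thrown out.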

In 2019 Longas stated that "About 92 percent of the reads are above Q30". This is similar to the NovaSeq baseline error rate. Their approach, in generating consensus “long reads” from short ones, likely corrects some sequencing errors, while also introducing a few during mutagenic PCR. What’s less clear to me is how much data will be thrown out because it has insufficient coverage. I would expect some, perhaps modest, percentage of lost/unpaired data.
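For reference, a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q30 is one error in a thousand bases. The short sketch below does the conversion and, assuming (unrealistically) a uniform per-base error rate, estimates the errors expected in a 10Kb read. The function names are my own.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def expected_errors(read_len, q):
    """Expected errors in a read, assuming every base has quality q."""
    return read_len * phred_to_error_prob(q)

print(phred_to_error_prob(30))
print(expected_errors(10_000, 30))
```

At a flat Q30 that works out to roughly 10 errors per 10Kb read, which is why the consensus step matters.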

It’s not entirely clear why the approach is limited to 10Kb reads. However, in their 2012 paper Goldman et al. also discussed 10Kb fragment lengths as being in part limited by sample handling. And without special handling, other long read platforms are similarly limited to a few tens of Kb.

It will be interesting to see how widely deployed the Illumina/Longas approach is. Longas mentioned that "there has been most interest from research groups looking at microbial genomes, including the microbiome and environmental isolates, and cancer genomics." And of course having long range information available should help polish de novo genomes.

However, given the prior art going back as far as 2004, I suspect that if the approach does prove compelling, it will be possible for other companies to produce generic kits. So overall I don’t see this workflow as giving Illumina much of an edge over Singular and other new players.

ASeq Newsletter
