Generating CCS Reads Better Than PacBio? (Probably Not)
In previous posts I’ve used simulated subreads at various accuracy levels and run them through PacBio’s CCS tool to see how well it would cope with them.
I decided to try another approach: throw the subreads through a multiple alignment tool and generate a consensus from the result. I built my approach around Muscle, a public domain multiple alignment tool.
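To make the idea concrete, here is a minimal sketch of calling a consensus from an already-aligned set of subreads by simple majority vote per column. It assumes the aligner's output has been read into equal-length gapped strings; the function name and the `-` gap character are my own choices for illustration, and this is not necessarily how the linked code does it.

```python
from collections import Counter


def consensus_from_alignment(aligned_rows, gap="-"):
    """Majority-vote consensus over the columns of a multiple alignment.

    aligned_rows: list of equal-length strings, one per subread,
    with `gap` marking alignment gaps.
    """
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_rows):
        # Pick the most frequent symbol in this column; on a tie,
        # Counter returns the symbol that appeared first.
        symbol, _count = Counter(column).most_common(1)[0]
        # A column won by the gap symbol contributes nothing.
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


if __name__ == "__main__":
    rows = ["ACG-TT", "ACGATT", "AC--TT", "ACG-TA", "ACG-TT"]
    print(consensus_from_alignment(rows))
```

A plain vote like this treats every error type as equally likely, which matters later in the post.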
The results are shockingly good. Simulating 5% each of insertions, deletions and mismatches on 10Kb reads with 5 subreads each, Muscle generates a Q16 read (0.978 accuracy), while CCS is down at Q13 (0.953). And it does pretty well over a range of error rates:
The problems with Muscle are, however, not insignificant. It takes ~2 minutes to build a consensus for a single read, compared to less than a second with PacBio’s CCS. It also falls over with reads much longer than 10Kb, or with many subreads. So… Muscle certainly isn’t a practical tool for subread consensus generation.
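For reference, the kind of simulation described above (independent insertions, deletions and mismatches at 5% each) and the accuracy-to-Q conversion can be sketched roughly as follows. This is a simplified illustration under my own assumptions (uniform random errors, one error decision per template base, hypothetical function names), not the linked code itself.

```python
import math
import random

BASES = "ACGT"


def simulate_subread(template, ins_rate=0.05, del_rate=0.05,
                     sub_rate=0.05, rng=random):
    """Return a noisy copy of `template` with uniform random errors."""
    out = []
    for base in template:
        # Possibly insert a random base before this position.
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop the base entirely
        if r < del_rate + sub_rate:
            # mismatch: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def phred_from_accuracy(accuracy):
    """Convert an accuracy in [0, 1] to a Phred-scaled Q value."""
    if accuracy >= 1.0:
        return math.inf
    return -10 * math.log10(1 - accuracy)


if __name__ == "__main__":
    rng = random.Random(42)
    template = "".join(rng.choice(BASES) for _ in range(10_000))
    subreads = [simulate_subread(template, rng=rng) for _ in range(5)]
    print([len(s) for s in subreads])
    print(round(phred_from_accuracy(0.978), 1))
    print(round(phred_from_accuracy(0.953), 1))
```

With this conversion, 0.978 works out to roughly Q16.6 and 0.953 to roughly Q13.3, which is where the rounded Q16 and Q13 figures come from.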
But why is it doing better than CCS?
With the error profile of real subreads, Muscle generated a Q27 (0.99810) read versus ~Q19 (0.98695) with CCS, suggesting that using Muscle would improve accuracy significantly.
Unfortunately this probably isn’t the case…
I figured it would be interesting to look at some individual reads to see what was going on. Here’s one example of a 100bp read where Muscle does better than CCS; I’ve highlighted the 2 CCS errors:
From this it’s reasonably clear what’s happening. CCS is taking into account the tendency of the PacBio platform to miscall homopolymers. So two real Gs become 3 due to a single G insertion in one read, and two Ts become 3 due to a mismatch error in another.
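One quick way to check whether consensus errors like these sit in homopolymers is to list the runs of identical bases in the true sequence and in each consensus, then compare run lengths. The helper below is a hypothetical diagnostic of my own, not part of CCS or Muscle:

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for each run of identical bases
    that is at least `min_len` long."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    truth = "ACGGTTA"   # two Gs, two Ts
    called = "ACGGGTTTA"  # both runs overcalled by one base
    print(homopolymer_runs(truth))
    print(homopolymer_runs(called))
```

If the runs in a consensus are consistently one base longer or shorter than in the truth, that points at homopolymer handling rather than at random errors.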
It would therefore be interesting to see how Muscle fares with real CCS data. Unfortunately, Muscle falls over with almost all real reads (which are ~20Kb). If I can find a CCS subread dataset with ~10Kb reads I’ll give it a go… but given PacBio no longer deliver subreads to users on current systems, this seems unlikely.
The code to run the simulation and consensus calling is here; you’ll need to hack around with it to get it working on your system. Comments and suggestions are most welcome!