PacBio Sub-read Simulations
After my previous PacBio subread experiments I wanted to build out a basic simulation to help confirm my understanding of the process and see how subread accuracy impacts final CCS/HiFi accuracy.
First I wanted to gather statistics for the raw subread dataset. I aligned everything and ran the results through BEST¹:
I then used these to parameterize a very basic simulator. This wasn’t too difficult, but you need to get the SAM/BAM metadata right² or CCS will fail. I also nulled out the pulse data, as previous experiments suggest that CCS doesn’t use it anyway. My simulated reads also all have 6 subreads³.
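To give a feel for what "very basic" means here, below is a minimal sketch of that kind of simulator, not the actual code I used. It assumes independent per-base substitution, insertion and deletion errors, and alternates strands between passes the way real subreads do. The function names and the default rates are hypothetical placeholders rather than the measured values, and it leaves out the BAM metadata and pulse tags entirely.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def add_errors(template, sub_rate, ins_rate, del_rate, rng):
    """Copy `template`, applying independent per-base errors."""
    out = []
    for base in template:
        # Possibly insert one random base before the current base.
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))
        roll = rng.random()
        if roll < del_rate:
            continue  # deletion: drop this base
        if roll < del_rate + sub_rate:
            # substitution: emit any base other than the true one
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_subreads(template, n_subreads=6, sub_rate=0.01,
                      ins_rate=0.05, del_rate=0.03, seed=0):
    """Simulate the subreads of one ZMW (rates are placeholders)."""
    rng = random.Random(seed)
    subreads = []
    for i in range(n_subreads):
        # Successive passes read alternating strands of the insert.
        strand = template if i % 2 == 0 else reverse_complement(template)
        subreads.append(add_errors(strand, sub_rate, ins_rate, del_rate, rng))
    return subreads


if __name__ == "__main__":
    for read in simulate_subreads("AACCGGTTACGTAGCTAGGATCCA"):
        print(read)
```

Setting all three rates to zero is a handy sanity check: every even pass should reproduce the template exactly and every odd pass its reverse complement.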
I simulated reads using an error profile derived from the above and ran CCS over them. I didn’t break out insertions and deletions by homopolymer/non-homopolymer context (perhaps another time). But the results matched reasonably closely: