Using GPT to Classify Fastqs
Summary: It seems you can determine the sequencing platform from the quality scores alone using a GPT model.
After my previous post on classifying BAMs using CIGAR strings, I decided to try a different approach. One significant issue with using CIGARs is that the files need to be aligned. While that could still be useful, it means you need to know what's in the files before you can classify them.
Previously I'd tried using the nucleotides alone to determine the sequencing platform. That was a total failure. I assume this is because any signal coming from the platform itself is largely masked by "good", error-free bases, and here we're more interested in the errors, which might be characteristic of the platform used.
So, this time I decided to look at the quality scores only. With the small model I'm using (10M parameters, in some hacked-together Python code that is essentially nanoGPT), this appears to work well!
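For reference, a character-level nanoGPT config in that ballpark looks something like the following; the numbers here are illustrative, not necessarily the exact settings used:

# Illustrative nanoGPT-style character-level config; with a tiny vocabulary
# (quality characters plus platform tags) these settings land around 10M parameters.
n_layer = 6
n_head = 6
n_embd = 384
block_size = 512      # long enough for tag + ~250-character quality string + tag
dropout = 0.2
batch_size = 64
learning_rate = 1e-3
max_iters = 5000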
Not only can the classifier tell the difference between Illumina, Revio, Subread, and ONT, it can also distinguish between Illumina read 1 and read 2. As before, the input looks like this: a short identifier at each end (here I2 and I1 for Illumina read 2/1), with the quality string in between:
I2 DD?DDIH0FHGHHHHHIHIHIEHCHIEHHHGHIIIGHIIIIIHIIIGIIIIGIIGGH@H?HFHHHCHHIIHDCG11DEHIHC?HHIIICD0D?GFHHHIHIGHHEHHHHEHHDH<FHGIHIHIIHIEHHHHHIHCFHHHHIHI<EHHHEEEHHCGHHHIII.AHIIHIHHE.FAGAFHHDC<-BF@.BA?GGCGHD-8FE-@---@?@FCHHFH-@@?--66-6-@-FGHH-@-6--6---4,3>5+-6 I2
I1 DDD?@GGHHIIFEHCC@?1<FFGFHIIIIEEHIHFF1ECGGHH?GHIIHIIC@FGHIIIIHH@CCGHEH?HH<DEHIHIHHHHIEHHHIHHE?DGHHD1CHFHHGEHHHHHHHIIHHECC@FHHIIH?FHIIIIIIICHHIHIC?1FGEHEC?GEHHCFHHGCHIIGH@@@?@CG1<GEEECCEH0<...<CFHIIEF/<ECFHHH//CE/9FCCE7@/F/:@G/7CF/:77,BAH..9A.-86@AGB6 I1
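To give a concrete picture of the training data, here's roughly how such a file could be assembled; the file names and the non-Illumina tags below are placeholders for illustration, not the exact ones used:

# Sketch: build the training text, one labelled quality string per line,
# with the platform tag on both sides as in the examples above.
# The input file names and the RV/SR/NT tags are placeholders.
quality_files = {
    "I1": "illumina_r1.qual",   # one raw quality string per line
    "I2": "illumina_r2.qual",
    "RV": "revio.qual",
    "SR": "subread.qual",
    "NT": "ont.qual",
}

with open("train.txt", "w") as out:
    for tag, path in quality_files.items():
        with open(path) as fh:
            for line in fh:
                qual = line.strip()
                if qual:
                    out.write(f"{tag} {qual} {tag}\n")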
After training, I take a subset of test quality strings, feed each one through the model followed by a space, and ask it to predict the subsequent characters (representing the platform). We can grab these quality strings from a BAM like so:
samtools fastq /mnt/sdb1/sequeldata/IIe/NA09301.bam | awk 'BEGIN{n=0}{if(n%4==3) print $0;n++;}' | head -n 1000000 > ./test
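Whether for training or for these test prompts, everything gets encoded at the character level before it hits the model. A rough sketch of that step, in the spirit of nanoGPT's char-level data prep (my approximation, not the exact script):

# Character-level encoding of the labelled training text into integer ids,
# nanoGPT-char style; the vocabulary is just the quality characters, the
# tag characters, space and newline.
import numpy as np

data = open("train.txt").read()
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

np.array(encode(data), dtype=np.uint16).tofile("train.bin")  # memory-mapped at training time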
For a given test FASTQ file, the evaluator throws 100 of its quality strings through the model and then spits out the most popular prediction. The platform labels should probably be single tokens… the fact that they aren't often seems to cause some confusion. But it works well enough for a quick experiment.
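In other words, something along these lines; sample_fn here stands in for the trained model's sampling loop (nanoGPT's generate plus the encode/decode above), so this is a sketch of the voting logic rather than the literal evaluator:

# Prompt the model with "<quality string> " for 100 reads from the test file,
# take the first whitespace-delimited chunk of each completion as the
# predicted tag, and report the most common one.
import random
from collections import Counter

def classify_file(qual_path, sample_fn, n_reads=100, max_new_chars=4):
    quals = [line.strip() for line in open(qual_path) if line.strip()]
    votes = Counter()
    for qual in random.sample(quals, min(n_reads, len(quals))):
        completion = sample_fn(qual + " ", max_new_chars).strip()
        tag = completion.split()[0] if completion else "?"
        votes[tag] += 1
    return votes.most_common(1)[0][0], votes

# e.g. platform, votes = classify_file("./test", sample_fn)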
The model was able to determine the correct platform from the quality scores alone in all the examples I tested. It would be interesting to try distinguishing different ONT platforms, or things like Illumina 2-color versus 4-color chemistries. That, I imagine, would be more challenging.
But it feels like the basic approach works and I’m curious to see how much further this can be pushed! Perhaps more to follow!