Thuy and I have made several perl scripts to trim Illumina reads, available on the zoology cluster at:
The Perl scripts work well as long as you have paired-end reads; the current versions don’t handle single-end reads yet. The first file (…Qual20.pl) trims everything with quality below 20 from the ends of the sequence. To filter differently, change line 83 and save the result as a new version (let me know if you want help).
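To make the idea of trimming low-quality ends concrete, here is a rough sketch for a single read. It is not the code from the Perl script: the function name `trim_low_quality_ends` and its arguments are made up for illustration, and it assumes that "the ends" means both the 5' and 3' end of the read and that the quality string uses a fixed ASCII offset (64 or 33).

```shell
# Hypothetical illustration only, NOT the logic of trimIlluminaFqQual20.pl.
# Usage: trim_low_quality_ends SEQUENCE QUALITY OFFSET MIN_QUALITY
# Prints the trimmed sequence, then the matching trimmed quality string.
trim_low_quality_ends() {
    # Pass the strings through the environment so awk does not interpret
    # backslashes in the quality string as escape sequences.
    TRIM_SEQ="$1" TRIM_QUAL="$2" LC_ALL=C awk -v offset="$3" -v minq="$4" 'BEGIN {
        seq = ENVIRON["TRIM_SEQ"]
        qual = ENVIRON["TRIM_QUAL"]
        # awk has no ord(), so build a character -> ASCII code table.
        for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
        first = 1
        last = length(qual)
        # Assumption: walk inward from BOTH ends while quality is too low.
        while (last >= first && ord[substr(qual, last, 1)] - offset < minq) last--
        while (first <= last && ord[substr(qual, first, 1)] - offset < minq) first++
        print substr(seq, first, last - first + 1)
        print substr(qual, first, last - first + 1)
    }'
}
```

For example, `trim_low_quality_ends ACGTACGT BBhhhhBB 64 20` prints `GTAC` and then `hhhh`, because B is quality 2 and h is quality 40 with an offset of 64.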
Also, it turns out that many of our new runs use “Phred+33” quality scores rather than the “Phred+64” encoding usually used by Illumina, because of the BAM to fq conversion. Thank the GSC once again for only distributing BAM files rather than fq or qseq. For those files, or any others in this encoding (Sanger format quality scores use the same offset; you can recognize both because the low-quality ends are ####### rather than BBBBBB), you should use the alternate version:
Our trimming strategy for these scripts is based on some of the publicly available Perl scripts for trimming qseq files. I hadn’t seen those scripts adapted to trim fastq, so we did that ourselves, since trimming is an important step for de novo assemblies. The script could certainly be improved in various ways, so feel free to do so (if you do, rename your version and send a note around). Currently it requires an input directory containing only paired fastq files with standard Illumina quality scores. I know we need to modify it to handle single ends, but that will be an easy fix; I’ll do it eventually when I need it, or sooner if you let me know you need it right away.
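If you are not sure which quality encoding a file uses, a quick check along these lines may help you pick the right script. This is only a sketch and not part of our scripts: the function name `guess_phred_offset` is made up, and it assumes plain four-line fastq records. It relies on the fact that characters with ASCII codes 33 to 58 (`!` through `:`) only show up in Phred+33 quality strings.

```shell
# Hypothetical helper (not part of the trimming scripts): guess the quality
# encoding of a fastq file. Assumes four-line records (no wrapped sequences).
guess_phred_offset() {
    # Every 4th line is a quality string. Any character from '!' to ':'
    # (ASCII 33-58) cannot occur in Phred+64 data, so finding one means
    # the file is Phred+33. LC_ALL=C keeps the range purely ASCII.
    if LC_ALL=C awk 'NR % 4 == 0 && /[!-:]/ { found = 1; exit }
                     END { exit !found }' "$1"; then
        echo "Phred+33"
    else
        # A guess: a Phred+33 file with only very high qualities would
        # also end up here.
        echo "Phred+64"
    fi
}
```

Running `guess_phred_offset sample_1.fq` (the file name is just an example) prints one of the two labels; treat a "Phred+64" answer as likely rather than certain.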
nohup perl /Linux/Loren/Seq/trimIlluminaFqQual20.pl INPUTDIR OUTPUTDIR > logfile &
Again, the input directory must contain only paired-end Illumina reads, with one file per end: one ending in _1.fq and the other ending in _2.fq.
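Before starting a long run, it may be worth checking that the input directory really meets that requirement. The following is a minimal sketch, not something shipped with the scripts; the function name `check_paired_dir` is hypothetical, and the only thing it assumes is the _1.fq / _2.fq naming described above.

```shell
# Hypothetical pre-flight check (not part of the trimming scripts): confirm
# that every file in the input directory belongs to a _1.fq/_2.fq pair.
# Returns 0 if everything is paired, 1 otherwise.
check_paired_dir() {
    dir="$1"
    status=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # empty directory: the glob did not match
        case "$f" in
            *_1.fq) mate="${f%_1.fq}_2.fq" ;;
            *_2.fq) mate="${f%_2.fq}_1.fq" ;;
            *)
                echo "unexpected file: $f"
                status=1
                continue
                ;;
        esac
        if [ ! -e "$mate" ]; then
            echo "missing mate for: $f"
            status=1
        fi
    done
    return "$status"
}
```

You could then run something like `check_paired_dir INPUTDIR && echo "looks fine"` before launching the trimming command.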
Of course, you don’t have to nohup it or send the output to a log file, but I think both are good practice. The log file is also useful if you want to know how many adaptors were trimmed, how many reads were trimmed at all, etc. To count the number of trimmed ends:
grep -c 'TRIMMED' logfile
To get the number of reads with adaptor sequences:
grep -c 'adaptor' logfile
It seems to be working very well as far as I can see, but let me know if you find bugs. Any bugs are likely due to my modifications rather than Thuy’s programming.