Phred score detector

When using old sequence data you have to know what phred score system it is using. Almost all new data is going to use Phred33, but the old stuff could also be Phred64. There are scripts on this website that have ways of automatically detecting it http://wiki.bits.vib.be/index.php/Identify_the_Phred_scale_of_quality_scores_used_in_fastQ

I took the perl script from this website and parred down the output so that it works in pipelines. It outputs either “phred33” or “phred64”

fastq_detect_minimal.pl

An example pipeline in bash is:

coding=”$(perl $bin/fastq_detect_minimal.pl ${name}_1.fq)”

java -jar trimmmomatic-0.32.jar -$coding sample1.fq ….etc

NOTE: Phred64 should only go up to 104. There are some RNA seq samples (and probably others) that go up to 105. The original script output an error, while my version just says phred64. I hope that running it through phred conversion in trimmomatic fixes the issue but I am not sure.

SnoWhite Tips and Troubleshooting (Thuy)

Snowhite is a tool for cleaning 454 and illumina reads.  There are quite a few gotchas that will take you half a day to debug.  This wiki has a lot of good tips.

Snowhite invokes other bioinformatics programs, one of them being TagDust.  If you get a segfault error from TagDust, it may be because you are searching for  contaminant sequences larger than TagDust can handle.  TagDust can only handle maximum 1000 characters per line in the contaminant fasta file and maximum 1000 base contaminant sequence lengths.

A segfault (or segmentation fault) happens when a  program accesses the wrong piece of memory.  After TagDust hits the 1000 line character/sequence base limit, TagDust keeps trying to access memory past the 1000 memory slots it has allocated.  It may try to access non-existent memory locations or off-limits memory locations.  You need to edit the TagDust source  code so it allocates enough memory for the sequences and does not wander into bad memory locations.

  • Go into your TagDust source code directory and edit file “input.c”.
  • Go to line 68:

char line[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of characters in the longest line in your contaminant fasta file.  You probably can skip this step if you are using the NCBI UniVec.fasta files, since the default of 1000 is enough.
  • Go to line 69:

char tmp_seq[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file.  I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
  • Recompile your TagDust source code
    • Delete all the existing executables by executing  make clean in the same directory as the Makefile
    • Compile all your files again by executing make clean in the same directory as the Makefile
    • If you decided to allocate a lot of memory to your arrays, and your program requires > 2GB of memory at compile time, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors during linkage.  This occurs when the compiler is unable to allocate enough space for the program’s statically allocated objects.  Edit the Makefile so that

CC = gcc
becomes
CC = gcc -mcmodel=medium