Chris and I devised and implemented this method to identify contaminant DNA in our HA412 454 and Illumina WGS reads.
N.B. I will update this post with links to the scripts once I get them up on the lab’s bitbucket account, plus extra details if I think they are necessary on review.
- BLAST Assemblies Against NCBI NT Database
We identified the unique taxa among the hits and fetched their lineages from NCBI using heyaudy.com (bitbucket: wgetLineage.pl). We kept the hits that matched non-plant taxa. (bitbucket: blast.sh)
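As a rough illustration of the non-plant filter (not the actual scripts), here is a minimal Python sketch. It assumes each hit carries a taxon ID and that each fetched lineage is a list of names in which plants include "Viridiplantae". The field names and the function `keep_non_plant_hits` are hypothetical.

```python
def is_plant(lineage):
    # Assumption: plant lineages contain the clade name "Viridiplantae".
    return "Viridiplantae" in lineage


def keep_non_plant_hits(hits, lineages):
    """Return the hits whose taxon has a known, non-plant lineage.

    hits     -- list of dicts, each with a 'taxid' key (hypothetical layout)
    lineages -- dict mapping taxid -> list of lineage names
    Hits whose taxid has no fetched lineage are skipped (an assumption).
    """
    return [
        h for h in hits
        if h["taxid"] in lineages and not is_plant(lineages[h["taxid"]])
    ]
```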
- Assessment of Contaminant Hits
We took the contaminant hits with >95% sequence identity and used them to pin down the handful of organisms/genera that accounted for all of the contamination.
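The tallying step could look something like the sketch below. It assumes each hit is a dict with a percent-identity field (`pident`) and a genus name (`genus`); both field names and the function `contaminant_genera` are made up for illustration.

```python
from collections import Counter


def contaminant_genera(hits, min_identity=95.0):
    """Count strong contaminant hits per genus.

    Only hits with percent identity strictly above min_identity are counted,
    mirroring the >95% cutoff described above.
    """
    strong = [h for h in hits if h["pident"] > min_identity]
    return Counter(h["genus"] for h in strong)
```

Calling `contaminant_genera(hits).most_common()` then lists the genera from most to least frequent, which is a quick way to see which genomes are worth downloading for the next step.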
- Genome to Genome BLAST
We then BLASTed the relevant genomes, plus the UniVec database (a database of artificial sequences), against the assemblies of interest (bitbucket: blastwithgenomes.sh, blast.univec.sh, scaffoldstats.pl), and extracted hits based on the following criteria:
“Blacklist” sequences had >=98% sequence identity over a hit longer than 100bp (length was not used to filter UniVec hits), and no plant hit overlapping them by more than 20bp.
“Greylist” sequences needed >85% sequence identity and the same length criteria, but were allowed to have overlapping plant hits, as long as the bit score of the non-plant hit was greater than the bit score of the plant hit.
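To make the two sets of criteria concrete, here is a minimal Python sketch of how a single non-plant hit might be classified. It is one reading of the rules above, not the code we ran. The `Hit` class, the `classify` function, and the use of 1-based inclusive scaffold coordinates (with start <= end) are all assumptions, and hit length is taken as the span on the scaffold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    start: int       # 1-based inclusive start on the scaffold (assumed)
    end: int         # 1-based inclusive end on the scaffold (assumed)
    pident: float    # percent identity
    bitscore: float


def overlap(a, b):
    """Number of scaffold positions shared by two hits (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def classify(hit, plant_hits, is_univec=False):
    """Return 'black', 'grey', or None for a non-plant hit."""
    # The length filter (> 100bp) is skipped for UniVec hits.
    if not is_univec and (hit.end - hit.start + 1) <= 100:
        return None
    # Plant hits on the same scaffold that overlap this hit by more than 20bp.
    overlapping = [p for p in plant_hits if overlap(hit, p) > 20]
    if hit.pident >= 98 and not overlapping:
        return "black"
    # Greylist: any overlapping plant hit must have a lower bit score.
    if hit.pident > 85 and all(hit.bitscore > p.bitscore for p in overlapping):
        return "grey"
    return None
```

Note that under this reading a >=98% hit with an overlapping but lower-scoring plant hit falls through to the Greylist rather than being dropped.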
- Contaminant Hit Sequence Extraction
We extracted the sequences that fit the criteria for likely contamination from each round of BLAST (bitbucket: extract_bad_hit_sequence.univec.pl, extract_bad_hit_sequence.genomeblast.pl, extract_bad_hit_sequence.NTblast.pl), pruned them for duplicates, and split them according to the criteria in section 3 (Genome to Genome BLAST) to create the final contaminant Grey and Blacklists.
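A minimal sketch of the pruning and splitting step, assuming each BLAST round yields (sequence ID, label) pairs with the label being "black" or "grey". The function `merge_lists` is hypothetical, and letting a Blacklist call override a Greylist call for the same sequence is my assumption here rather than something stated above.

```python
def merge_lists(rounds):
    """Combine per-round calls into de-duplicated Black and Grey lists.

    rounds -- iterable of iterables of (seq_id, label) pairs,
              where label is 'black' or 'grey'
    Returns (blacklist, greylist) as sorted lists of sequence IDs.
    """
    black, grey = set(), set()
    for calls in rounds:
        for seq_id, label in calls:
            if label == "black":
                black.add(seq_id)
            else:
                grey.add(seq_id)
    # Assumption: a sequence blacklisted in any round is dropped from Grey.
    grey -= black
    return sorted(black), sorted(grey)
```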