Plastid genome assembly from low coverage WGSS data

This post describes the steps I took to assemble plastid genomes from low-coverage WGSS data. An overview of the approach can be found here.

Essentially, the method involves first mapping of quality-filtered reads to a reference plastid genome, and only selecting plastome-like reads from this mapping step for subsequent de-novo assembly. For the assembly step, I used the VELVET assembler, which performs well for small genomes and is quite fast.

1) Reference-based enrichment of plastome-like reads.

I used Bowtie2, Picard tools, and a H. annuus plastome reference with only one copy of the IR. Because of the high coverage I had for all samples for the plastid genome (around 30 to 60 x), I only selected reads for which both mates of a pair mapped uniquely, and satisfied the paired-end constraints. I also did not allow reads of a pair to be overlapped.

1.1) Create the Bowtie index:

 <path_to_Bowtie2>/bowtie2-build plastid_noIR.fa plastid_noIR

1.2) Alignment:

<path_to_Bowtie2>/bowtie2 –p 8 –no-unal –no-discordant –no-mixed –no-contain –x plastid_noIR -1 sample_name.1.fastq -2 sample_name.2.fastq –S sample_name_cp.sam &>map.stdout

1.3) Extract mapped reads from .sam file:

java -jar <path_to_picard_tools>/SamToFastq.jar INPUT=sample_name_cp.sam FASTQ=sample_name_cp_R1.fastq SECOND_END_FASTQ=sample_name_cp_R2.fastq

2) Prepare .fastq files for VELVET

VELVET only handles interleaved fasta and fastq files, where each pair of reads is listed one after the other. VELVET comes with the Perl script that cam make these interleaved files:

<path_to_script_directory>/shuffleSequences_fastq.pl sample_name_cp_R1.fastq sample_name_cp_R2.fastq sample_name_cp_shuffled.fastq

3) Run VELVET assembler

You can run VELVET with different k-mer sizes, check the assembly results, and determine optimal k. By checking the stats.txt file in the output directory, you can also adjust the coverage cutoff for the assembly. After a few trial runs, I ended up using a k of 21 and a coverage cutoff of 15.

3.1) Run velveth:

<path_to_VELVET>/velveth  Output_k21/ 21 –fastq –shortPaired sample_name_cp_shuffled.fastq

3.2) For the next step, velvetg uses information on the insert length for each library. I obtained this using the .sam file from step 1.2 and this script (which I obtained at the workshop I attended last year; you can also find it online). The script takes as input the .sam file and summarizes the insert length from all read pairs that mapped to the reference.

<path_to_script_directory>/get_insert_sizes_from_sam.pl sample_name_cp.sam > sample_name_cp.sizes

I then calculated the mean insert size and SD from each .sizes file, using R:

sizes <- as.numeric(readLines(“sample_name_cp.sizes”))

mean(sizes, na.rm=TRUE)

sd(sizes, na.rm=TRUE)

3.3) Run velvetg with optimized insert size information:

<path_to_VELVET>/velvetg  Output_k21/ -ins_length xxx –ins_length_sd xx –min_contig_lgth 100 –scaffolding no –cov_cutoff 15

4) Order and merge the resulting contigs

For this step, I used CodonCode aligner (although I’m sure other similar software would work just as well). I aligned the VELVET contigs to the H. annuus plastome reference, and merged the overlapping sequences.

5) Validate the assembly

I used Bowtie2, mapping the quality-filtered reads to the draft assembly:1.1) Create the Bowtie index:

<path_to_Bowtie2>/bowtie2-build plastid_draft.fa plastid_draft

1.2) Alignment:

<path_to_Bowtie2>/bowtie2 –p 8 –no-unal –x plastid_draft -1 sample_name.1.fastq -2 sample_name.2.fastq –S plastid_draft_cp.sam &>map.stdout

Finally, using a genome browser (such as Tablet) I visually inspected the alignment to check for any assembly errors (spikes or drops in coverage).