R can be used to plot, filter, select, and analyze genome resequence data (as well as genotyping-by-sequencing, or GBS, data).

First, I show how to read a VCF file, how to select subsets of samples and SNPs, convert a VCF file to a SNP table, how to carry out principal components analysis on SNP data, and how to compute divergence metrics like allele frequency differences and Fst.

I also describe more specialized tasks such as reading a genome fasta file into R, and converting (lifting over) genome coordinates from one reference genome to another.

Finally, I describe how we get from short read fastq files to VCF files of SNP data using the GATK4 best practices pipeline. This section is less about R.

Example code below uses the syntax for running software on Digital Research Alliance (a.k.a. Compute Canada) systems, but running on other systems will be similar. Information on how to log in and run programs on UBC computers is given at the end. My examples omit paths to files in other folders; modify as needed.



Read me

Load modules

Most large computer systems will require you to load program modules before you can run them. The code to do so is system-specific. Here are example load commands for most of the examples on this page when running them on Compute Canada (Digital Alliance) computers. Other computer systems will be similar. Program versions are indicated after the / and will need to be updated periodically.

# R and Bioconductor
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
# vcftools (for filtering VCF files)
module load vcftools/0.1.16
# htslib (for bgzip and tabix)
module load htslib/1.15.1
# gatk (for making and filtering VCF files)
module load gatk/4.2.4.0
# plink (for PCA)
module load plink/2.00-10252019-avx2
# beagle (for imputation)
module load beagle/5.4
# bwa (for aligning reads)
module load bwa/0.7.17
# samtools (for processing bam files)
module load samtools/1.10

VCF file

I’ll start this R tips page by assuming that you already have a clean VCF file containing only SNPs. To make a VCF file, jump to the section below on making and filtering a VCF file.

Missing genotypes

Recent versions of GATK4 code missing genotypes as 0/0, i.e., the homozygous reference genotype, instead of the missing symbol, “./.”. This is not a good idea. We include a step in our pipeline (see the joint genotyping note and the recoding steps later on this page) to recode low-quality genotype calls back to “./.”.

Install VariantAnnotation

This is the most important R package for processing VCF files and SNP data. It is part of the Bioconductor project. Instructions on how to install Bioconductor packages are here. For example, install VariantAnnotation as follows.

Make sure you have the latest version of R installed first. Then check the Bioconductor page for the latest Bioconductor version compatible with your R version. Next, install the installer if you don’t already have it. Then install the package (here, with Bioconductor version 3.17). Other Bioconductor packages can be installed the same way.

install.packages("BiocManager")
BiocManager::install(version = "3.17", update = FALSE, ask = FALSE)
BiocManager::install("VariantAnnotation", update = FALSE, ask = FALSE) 



Read a VCF file

This is typically how you’ll read a SNP data file into R.


Small VCF file

Start by reading a small VCF file or you will quickly run out of memory. You’ll need to specify which fields to read; the more fields you read, the more memory you will require. I assume that your file is gz-compressed, but readVcf() will also work on uncompressed VCF files.

R will create names for the SNPs.

The R vcf object is named vcf in this example.

library(VariantAnnotation)

# VCF file name
vcfFileName <- "small.vcf.gz"

# Choose which VCF fields to read (here, the ALT alleles and the genotypes)
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)

# Read VCF file into R
vcf <- readVcf(file = vcfFileName, param = param)

# View a few snps
rowRanges(vcf)

# Peek at the first 5 columns and rows of the SNP genotypes
geno(vcf)$GT[1:5,1:5]

# Extract the sample names
colnames(vcf)

# View the SNP names R has constructed
head(names(vcf))


Select a range of SNPs

A VCF file might be huge, but you can select a range of SNPs to read into memory.

In stickleback, an inversion polymorphism on chromosome XXI runs from about position 9914802 to position 11605355 in v5 of the reference genome. To grab the SNPs in this interval from a VCF file, try the following.

library(VariantAnnotation)

vcfFileName <- "global.chrXXI.snp.vcf.gz"
myInterval <- makeGRangesFromDataFrame(data.frame(
                seqnames = "chrXXI", start = 9914802, end = 11605355))
myInterval
param <- ScanVcfParam(fixed = "ALT", geno = c("GT"), info = NA, 
                        which = myInterval)
vcf <- readVcf(file = vcfFileName, param = param)

# Confirm that the SNPs in the file are within myInterval.
rowRanges(vcf)


Read a huge VCF file

Reading a big VCF file and many of its fields can be done in chunks to save memory. Then combine the chunks to create a single vcf object retained in memory.

library(VariantAnnotation)

# 1. Choose the fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = c("GT"), 
            info = c("DP"))

# 2. Decide on a chunk size (number of lines read at a time)
chunksize <- 500000

# 3. Open VCF file for reading in chunks
myVcfFile <- VcfFile("global.chrXXI.snp.vcf.gz", 
                yieldSize = chunksize)
open(myVcfFile)

# 4. Read VCF file in chunks and keep them in the list object vcfchunks
vcfchunks <- list()
k <- 1
while (nrow( vcfchunks[[k]] <- readVcf(myVcfFile, param = param)) ){
        # Print progress
        cat("new lines read:", dim(vcfchunks[[k]]), "\n")
        k <- k + 1
        }

# 5. Drop the final, empty chunk, then combine the rest into a single vcf object
vcfchunks[[k]] <- NULL
vcf <- vcfchunks[[1]]
for(j in 2:length(vcfchunks)){
    vcf <- rbind(vcf, vcfchunks[[j]])
  }

# 6. Close the input file
close(myVcfFile)
rm(vcfchunks)



Filter in R

R can be used to carry out additional filtering steps. The operations below do not require the genotypes, so leave them out when reading the VCF file to speed up the tasks.


Read VCF file

Commands here leave out the genotypes.

library(VariantAnnotation)

vcfFileName <- ""global.chrXXI.snp.vcf.gz""
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)
vcf <- readVcf(file = vcfFileName, param = param)


Keep only biallelic snps

If you haven’t already dropped SNPs having more than one ALT allele, you can do it in R with the following commands. Invariant sites are retained.

# Count the number of entries in the ALT field at each snp
#   (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)

# Frequency of ALT allele numbers
table(nAltAlleles)

# Keep only SNPs with a single entry in the ALT column
vcf <- vcf[nAltAlleles == 1]

# Number of SNPs remaining
nrow(vcf)


Drop SNPs at masked bases

Most reference genomes have an accompanying file that identifies interspersed repeats and low-complexity DNA sequences, created using RepeatMasker or a similar program. Below I show how to drop these sites from your VCF file.

First, download the RepeatMasker file online. For stickleback reference genome v5, the file is stickleback_v5_repeatMasker.gff.gz, available at the genome site.

Then read and save these masked repeat regions to a local file as follows.

# Read file
x <- read.table(gzfile("stickleback_v5_repeatMasker.gff.gz"), 
        sep = "\t", fill = TRUE, stringsAsFactors = FALSE)

# Keep the columns indicating start and end positions of repeats
x1 <- x[, c(1,4,5)]
names(x1) <- c("chr","start","end")

# Split into a list object by chromosome
repeatMaskedBases <- split(x1, x1$chr)
length(repeatMaskedBases)
# [1] 24

# Check names
names(repeatMaskedBases)

# Save
save(repeatMaskedBases, file = "stickleback_v5_RepeatMaskedBases.rdd") 


Then drop SNPs at masked bases from your vcf object (here named vcf).

# Load saved R object
load("stickleback_v5_RepeatMaskedBases.rdd")

# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]

if(length(repeats) > 0){
  
  # Make an object containing the start and end positions of repeats
  repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
    
  # Make a similar object with start and end positions of snps
  snpRanges <- IRanges(start = start(vcf), end = end(vcf))
    
  # Find which SNPs overlap the repeat regions
  z1 <- findOverlaps(query = repeatMaskedRanges, subject = snpRanges)
  z2 <- unique(subjectHits(z1))
    
  # Remove them
  if(length(z2) > 0) vcf <- vcf[-z2]
  }

# Number of SNPs remaining
nrow(vcf)


Remove SNPs close to indels

The following commands drop SNPs very close to indels (within 3 bases). This requires reading indel positions from an indel-only file created using GATK. My example here uses a hard-filtered indel file named “global.chrXXI.indel.vcf.gz”.

# Read the VCF file of indels - only their positions are needed
indelVcfFileName <- "global.chrXXI.indel.vcf.gz"
indelvcf <- readVcf(file = indelVcfFileName, param = 
              ScanVcfParam(fixed = NA, geno = NA, info = NA))   

# Make an object containing the start and end positions of indels
indelRanges <- IRanges(start = start(indelvcf), end = end(indelvcf))

# Make an object with ranges spanning 3 bases to either side of SNPs in the vcf object. 
snpRanges <- IRanges(start = start(vcf) - 3, end = end(vcf) + 3)

# Find the overlaps
z1 <- findOverlaps(query = indelRanges, subject = snpRanges)
z2 <- unique(subjectHits(z1))

# Delete SNPs close to indels
if(length(z2) > 0) vcf <- vcf[-z2]

# Number of SNPs remaining
nrow(vcf)


Save filtered VCF file

First, make a file that tabulates all the remaining SNPs in your vcf object – those that survived the above filtering steps.

chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt", 
    row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")

Next, make a new VCF file that keeps only the SNPs that passed the R filters. This is most easily done with vcftools in unix. The commands below use the syntax to run vcftools (and htslib, to compress and index the new VCF file) on Digital Alliance (Compute Canada) computers, but the syntax on your computer system will be similar.

# in unix
vcftools --gzvcf global.chrXXI.snp.vcf.gz \
    --positions global.chrXXI.keep.txt --recode --recode-INFO-all \
    --out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.snp.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered.vcf.gz


Other SNP filters

Your VCF file might still contain some unwanted SNPs that could cause issues in downstream analyses. For example, it might contain null alt alleles or SNPs with spanning deletions (indicated by "*"). To remove them, include the following steps in your R code above before you save the vcf object.

ALT <- CharacterList(alt(vcf))
has.null.allele <- sapply(ALT, function(x){"" %in% x})
has.span.deletion <- sapply(ALT, function(x){"*" %in% x})
vcf <- vcf[!has.span.deletion & !has.null.allele]


Filter indels

A VCF file containing only indels can also be filtered. Here is how to keep only biallelic indels and also drop positions corresponding to masked bases.

library(VariantAnnotation)

# Identify input file and choose VCF fields to read
vcfFileName <- "global.chrXXI.indel.vcf.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, info = NA)

# Read VCF file into R
vcf <- readVcf(file = vcfFileName, 
                param = param)

# Show the number of indels
nrow(vcf)

# View indels
rowRanges(vcf)

# To locate biallelic indels, count number of entries in the ALT field
# (invariants will have a single entry, though blank)
altAlleles <- CharacterList(alt(vcf))
nAltAlleles <- sapply(altAlleles, length)

# Frequency of ALT allele numbers
table(nAltAlleles)

# Keep only biallelic indels
vcf <- vcf[nAltAlleles == 1]

# Number of indels remaining
nrow(vcf)

# To locate indels at masked sites, load R object repeatMaskedBases
load("stickleback_v5_RepeatMaskedBases.rdd")

# Pull out repeat regions in the chromosome of interest (here, chrXXI)
repeats <- repeatMaskedBases[["chrXXI"]]

if(length(repeats) > 0){
  
  # Make an object containing the start and end positions of repeats
  repeatMaskedRanges <- IRanges(start = repeats$start, end = repeats$end)
    
  # Make a similar object with start and end positions of indels
  indelRanges <- IRanges(start = start(vcf), end = end(vcf))
    
  # Find which indels overlap the repeat regions
  z1 <- findOverlaps(query = repeatMaskedRanges, subject = indelRanges)
  z2 <- unique(subjectHits(z1))
    
  # Remove them
  if(length(z2) > 0) vcf <- vcf[-z2]
  }

# Number of indels remaining
nrow(vcf)

# Tabulate remaining indel positions to a text file
chrom <- as.character(seqnames(vcf))
pos <- start(vcf)
write.table(paste(chrom, pos, sep = "\t"), file = "global.chrXXI.keep.txt", 
    row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\n")

Finally, extract and write the list of saved indels to a new VCF file using vcftools in unix.

vcftools --gzvcf global.chrXXI.indel.vcf.gz \
    --positions global.chrXXI.keep.txt --recode --recode-INFO-all \
    --out global.chrXXI.keep
bgzip global.chrXXI.keep.recode.vcf
mv global.chrXXI.keep.recode.vcf.gz global.chrXXI.indel.Rfiltered.vcf.gz
tabix -p vcf global.chrXXI.indel.Rfiltered.vcf.gz



Make a table of SNPs

In R

The following commands read a VCF file of biallelic SNPs and create a SNP table. Genotypes are coded 0 = missing, 1 = 0/0, 2 = 0/1 or 1/0, and 3 = 1/1. The rows of the table will be the samples, and the columns will be the SNPs.

A “0” is assigned either if the genotype is missing or if the SNP is an invariant and no ALT allele is present in the sample (ALT allele indicated by “.”).

library(VariantAnnotation)

# Read the VCF file
vcfFileName <- "myVcfFile.snp.gz"
param <- ScanVcfParam(fixed = c("ALT"), geno = "GT", info = NA)
vcf <- readVcf(file = vcfFileName, param = param)

# Convert to a SNP table
snps <- genotypeToSnpMatrix(vcf) # rows = samples, columns = snps
z <- apply(snps$genotypes@.Data, 2, function(x) as.integer(x))
rownames(z) <- colnames(geno(vcf)$GT)
snps <- as.data.frame(z)

This method to write (and read) .csv files is familiar but slow.

# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
write.csv(snps, file = csvFileName, row.names = TRUE)

# Read the SNP table from the saved .csv file
# Use check.names = FALSE to prevent R from revising SNP and sample names
snps <- read.csv(file = csvFileName, row.names = 1, check.names = FALSE)

Use fwrite() and fread() in the data.table package for a much faster write/read.

library(data.table)

# Write to a csv file
csvFileName <- sub(".vcf.gz$", ".snptable.csv", vcfFileName)
fwrite(snps, file = csvFileName, sep = ",", 
    col.names = TRUE, row.names = TRUE) 

# Read the SNP table from the saved .csv file
snps <- as.data.frame(fread(file = csvFileName))
rownames(snps) <- snps[, 1]
snps <- snps[, -1]


Using GATK instead

GATK can also make a table of SNP genotypes from a VCF file, which we can convert to a numeric code by counting the number of ALT alleles for each genotype (0, 1 or 2; NA = missing). Note that this number coding is different from the one resulting from the above method using VariantAnnotation in R.

If a SNP is an invariant in the sample (ALT allele indicated by “.”) then all the non-missing genotypes at that position will be coded as “0” rather than as NA.

Here, the VariantsToTable output table will have the columns CHROM, POS, REF, ALT, and GT. These give the chromosome name, SNP position, the nucleotide of the REF allele, the nucleotide of the ALT allele, and then many columns of genotypes, one for each sample individual.

# in unix
gatk VariantsToTable \
     -V myVcfFile.snp.vcf.gz \
     -F CHROM -F POS -F REF -F ALT -GF GT \
     -O myTable.snp.txt

Read the output into R using fread() from the data.table package (much faster than read.csv()).

library(data.table)
library(stringr)

# Using fread() from data.table because it is faster than read.csv()
x <- fread(file = "myTable.snp.txt", sep = "\t")

# Convert missing genotypes to missing
x[x == "./."] <- NA

# Create SNP names similar to those made by VariantAnnotation
snp_name <- paste0(x$CHROM, ":", x$POS, "_", x$REF, "/", x$ALT)

# Recover the sample names
fishid <- sub(".GT$", "", names(x)[5:ncol(x)])

# Code the genotypes as numeric by counting the number of ALT alleles per genotype
snps <- apply(x, 1, function(x){
    str_count(x[5:length(x)], fixed(x["ALT"]))
    })
snps <- as.data.frame(snps)
rownames(snps) <- fishid
names(snps) <- snp_name

# Peek at the results
snps[1:10,1:4]

Miscellaneous R

Read genome into R

How to read a fasta genome file into R.

This code is much faster than using seqinr, but it converts any lower-case letters to upper case, which is not always desirable.

library(Biostrings) 

genome <- readDNAStringSet("stickleback_v5_assembly.fa.gz", "fasta")

# Chromosome names
names(genome)

# Peek at the contents
genome

# Extract the sequence from chrI positions 1000 to 2500
myseq <- subseq(genome["chrI"], start = 1000, end = 2500)
myseq

# Show only the sequence information
as.character(myseq)

# This genome file has no ambiguous letter codes
head(alphabetFrequency(genome))
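
If you need to keep the lower-case (soft-masked) letters, seqinr can preserve them. A minimal sketch, assuming the same fasta file and using forceDNAtolower = FALSE to keep the original case:

library(seqinr)

# Read the genome, preserving the upper/lower case of the original fasta
genome2 <- read.fasta(gzfile("stickleback_v5_assembly.fa.gz"), 
              seqtype = "DNA", forceDNAtolower = FALSE)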


Convert between genomes

SNP nucleotide positions in one version of a reference genome can be converted to the corresponding positions of the same SNPs in a new reference genome if a liftover file has been created.

For example, SNPs discovered using v1 of the stickleback reference genome (Jones et al 2012; Nature) can be converted to their corresponding positions in version 5 of the stickleback reference genome using the liftover file, v1_withChrUn_to_v5.chain.txt, available at Mike White’s genome browser site.

Note: this file does not include liftovers for chrM, which are unchanged. It also does not include chrY, which was not part of the v1 genome.

To demonstrate, I convert the SNPs in the array developed by Jones et al. 2012; Current Biology from v1 coordinates to v5 coordinates.

First, read the v1 to v5 liftover file.

library(rtracklayer)
library(GenomicRanges)

# Read liftover file
v1.to.v5_chain <- import.chain("v1_withChrUn_to_v5.chain.txt")

# Check chromosome names
names(v1.to.v5_chain)

Next, read the SNP positions in genome v1. I used the csv file kingsley_snps.csv, which contains information on 1491 stickleback SNPs used in the array by Jones et al. and obtained from the Supplement of that paper and from NCBI where the data are deposited.

The commands below put the SNP positions into a genomic ranges object (containing start and end positions of genomic intervals) with the SNP positions according to the v1 reference genome.

# Read the file
kingsley_snps <- read.csv("kingsley_snps.csv")
head(kingsley_snps)

# Number of snps
nrow(kingsley_snps)

# Extract the relevant columns - strand is optional and can be left out
v1snps <- data.frame(seqnames = kingsley_snps$chr, 
        start = kingsley_snps$POS, end = kingsley_snps$POS, 
        strand = kingsley_snps$strand)

# Make the genomic ranges object with start and end positions of each snp
v1Coordinates <- makeGRangesFromDataFrame(v1snps)
v1Coordinates

Finally, convert the SNP positions to genome v5 coordinates.

The next command carries out the conversion. Each row of v1Coordinates corresponds to a list element in v5Coordinates. In this example, 19 of 1491 SNPs could not be converted to the new coordinates, and 17 SNPs on chrM were ignored.

# Convert from v1 to v5 positions
v5Coordinates <- liftOver(v1Coordinates, v1.to.v5_chain)

To coerce the result to a data frame, use the following (missing values will be dropped). The variable group will correspond to the rows of the original v1 data frame.

v5snps <- as.data.frame(v5Coordinates)

nrow(v5snps)
head(v5snps)



Principal components (PCA)

There are many ways to do PCA in R. However, one needs a method to account for missing genotypes.

One solution to the problem of missing genotypes is to impute missing genotypes, but this isn’t necessary unless there are very many missing values.
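
For example, with a snps table coded as in the VariantAnnotation section above (0 = missing), a quick check of the missing fraction can help decide whether imputation is worthwhile. A minimal sketch:

# Fraction of missing genotypes in the SNP table
# (assumes the 0/1/2/3 coding from genotypeToSnpMatrix, where 0 = missing)
mean(as.matrix(snps) == 0)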


PCA with PLINK2

PCA in PLINK2 works for autosomes only but is really fast. PCA using the earlier version, PLINK v1.9, is described at Ravinet and Meier’s helpful speciation genomics web site.

PLINK obtains principal components from a variance-standardized relationship matrix (which is like a correlation matrix). It handles missing genotypes by calculating the variance-standardized relationship for each pair of individuals using just the non-missing SNPs they share. This is fine if genotypes are missing at random, but it is difficult to know this.

The calculation is sensitive to minor allele frequency, so an maf (minor allele frequency) cutoff is typically specified.
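
To make the computation concrete, here is a minimal R sketch of a variance-standardized relationship matrix and its leading eigenvectors. It assumes a snps table of ALT-allele counts (0, 1, 2; NA = missing), as in the GATK table section above, and it mean-imputes missing genotypes instead of using PLINK’s pairwise-complete approach:

# Rough sketch of a variance-standardized relationship matrix
# (snps: individuals x SNPs, ALT-allele counts 0/1/2, NA = missing)
X <- as.matrix(snps)
p <- colMeans(X, na.rm = TRUE) / 2              # ALT allele frequency per SNP
keep <- p >= 0.01 & p <= 0.99                   # minor allele frequency cutoff
Z <- sweep(X[, keep], 2, 2 * p[keep])           # center each SNP
Z <- sweep(Z, 2, sqrt(2 * p[keep] * (1 - p[keep])), FUN = "/") # standardize
Z[is.na(Z)] <- 0                                # mean-impute missing genotypes
G <- tcrossprod(Z) / ncol(Z)                    # individuals x individuals
pcs <- eigen(G)$vectors[, 1:2]                  # first two PC axes
plot(pcs[, 1], pcs[, 2], xlab = "PC1", ylab = "PC2")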

Here are the commands if running Compute Canada software. The argument --max-alleles 2 is redundant if you’ve already filtered out multi-allelic sites.

# Make BED file and other files needed for pca
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
  --max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
  --out global.chrXXI.snp.Rfiltered

# Carry out PCA
plink2 --bfile global.chrXXI.snp.Rfiltered --pca biallelic-var-wts --allow-extra-chr \
  --out global.chrXXI.snp.Rfiltered

PLINK was designed for analyses of the human genome. The argument --set-all-var-ids allows nonstandard names for SNPs; --allow-extra-chr permits nonstandard chromosome names; --king-cutoff sets a cutoff for relatedness measured by kinship (duplicate samples have kinship 0.5, whereas 0.25 is the expected value for parent-child or full-sib pairs); --maf sets a cutoff for minor allele frequency.

Plot the PCA scores in R.

# in R
pca <- read.table("global.chrXXI.snp.Rfiltered.eigenvec", 
                sep = "\t", header = TRUE, comment.char = "")
plot(PC2 ~ PC1, data = pca)

# Eigenvalues
eigenval <- read.table("global.chrXXI.snp.Rfiltered.eigenval")
eigenval

# SNP loadings
pca.loadings <- read.table("global.chrXXI.snp.Rfiltered.eigenvec.var", 
                        sep = "\t", header = TRUE, comment.char = "")


PCA using pcadapt

pcadapt is an R package that uses PCA to detect local adaptation. You may have to install it first.
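
pcadapt is on CRAN, so the usual install should work:

install.packages("pcadapt")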

In the first step, use PLINK to convert genotype data in a VCF file to the BED file format (here shown without using pruning).

# Make BED and other files
plink2 --vcf global.chrXXI.snp.Rfiltered.vcf.gz --set-all-var-ids @:# \
    --max-alleles 2 --maf 0.01 --allow-extra-chr --make-bed \
  --out global.chrXXI.snp.Rfiltered

Then run pcadapt in R.

library(pcadapt)
library(VariantAnnotation)

# Grab individual sample names from the PLINK fam file
x <- read.table("global.chrXXI.snp.Rfiltered.fam")
ID <- x[, 2]

# Optional commands to make grouping variable from ID. 
# Here I extract info from names to indicate ecotype (i.e., benthic or limnetic)
# Your command will be different if you are using other labels or population types
ecotype <- rep("limnetic", length(ID))
ecotype[grep("enthic", ID)] <- "benthic"

# Read SNP names from PLINK bim file
snp <- read.table("global.chrXXI.snp.Rfiltered.bim")
names(snp) <- c("chr", "SNP", "cM", "position", "ALT", "REF")

# PCA with 10 components saved
bed <- read.pcadapt("global.chrXXI.snp.Rfiltered.bed", type = "bed")
PCA <- pcadapt(input = bed, K = 10, min.maf = 0.01) 
names(PCA)
     # [1] "scores"          "singular.values" "loadings"        "zscores"        
     # [5] "af"              "maf"             "chi2.stat"       "stat"           
     # [9] "gif"             "pvalues"         "pass"         

# Obtain PCA loadings (NA means SNP was an invariant or otherwise dropped)
pca.loadings <- as.data.frame(PCA$loadings, row.names = snp$SNP)
names(pca.loadings) = paste0("PC", 1:ncol(PCA$loadings))

# Obtain PCA scores
pca.scores <- as.data.frame(PCA$scores, row.names = ID)
colnames(pca.scores) <- paste0("PC", 1:ncol(pca.scores))

# scree plot of eigenvalues
plot(PCA, option = "screeplot")

# plot of PC scores
plot(PCA, option = "scores", pop = ecotype)
plot(PCA, option = "scores", pop = lake)
plot(PCA, option = "scores", i = 3, j = 4, pop = lake)




Divergence metrics

This section shows methods to compare populations or species in allele frequencies and other metrics for a fixed block or window of the genome, emphasizing methods available in R.


Extract the genotypes in a window

We begin by specifying the range of bases to be included in the window and then extract the corresponding genotypes from a VCF file.

library(VariantAnnotation)

# Choose which chromosome
chrname <- "chrXXI"

# Choose the SNP vcf file
vcfFileName <- "global.chrXXI.snp.Rfiltered.vcf.gz"

# Choose which populations to compare
# Here I'm choosing two groups from the Little Campbell River
sampleNames <- scanBcfHeader(file = vcfFileName)[[1]]$Sample
keepSamples <- sampleNames[grep("campbell", sampleNames, ignore.case = TRUE)]
head(keepSamples)

# Set start and end base pairs of window (here, a 10K base window)
range <- c(2000001, 2010000)

# Read the vcf block
param <- ScanVcfParam(fixed = NA, geno = c("GT"), info = NA, 
          samples = keepSamples,
          which = GRanges(chrname, IRanges(start = range[1], end = range[2])))
vcf <- readVcf(file = vcfFileName, genome = "stickleback_v5_assembly.fa.gz", 
          param = param)

# Extract genotypes and transpose
GT <- as.data.frame(t(geno(vcf)$GT))
rm(vcf)

# Make a unique label for the populations
# Here, the first 6 letters of the names are enough
pop <- substr(rownames(GT), 1, 6)
popfreq <- table(pop)
popfreq
# Marine Stream 
#     69      9

# Identify SNPs that are invariants in this sample
is.invariant <- sapply(GT, function(x){
  x <- gsub("\\||\\/", "", x) 
  length(unique(x[x != ".."])) == 1
  })
table(is.invariant)

# Drop the invariant sites
GT <- GT[, !is.invariant]

# Replace missing genotype symbol with NA
GT[GT == "./."] <- NA

# Keep only SNPs with >= half individuals genotyped in each pop
ngeno <- aggregate(GT, by = list(pop), FUN = function(x){sum(!is.na(x))})[,-1]
test <- apply(ngeno, 2, function(x){ all(x >= 0.5 * popfreq)})

GT <- GT[, test]


Allele frequencies (biallelic snps)

This code calculates the number and frequency of alleles in each group in the window; it is for biallelic SNPs only. To begin, run the commands above to obtain pop and GT for a predetermined window of bases, dropping the invariant sites, replacing missing genotype symbols with NA, and dropping SNPs with too few genotyped individuals per population.

library(stringr)

# Calculate number of REF alleles in each genotype, by population
nref <- aggregate(GT, by = list(pop), 
          FUN = function(GT){sum(str_count(GT, pattern = "0"), na.rm = TRUE)})
rownames(nref) <- nref[,1]

# Calculate number of ALT alleles in each genotype, by population
nalt <- aggregate(GT, by = list(pop), 
          FUN = function(GT){sum(str_count(GT, pattern = "1"), na.rm = TRUE)})
rownames(nalt) <- nalt[,1]

# Calculate ALT allele frequency, by population
alleleFreq <- nalt[,-1]/(nref[,-1] + nalt[,-1])

# Mean absolute allele frequency difference in window
alleleFreqDiffMean <- mean(unlist((abs(alleleFreq[1,] - 
                        alleleFreq[2,]))), na.rm = TRUE)
alleleFreqDiffMean


CSS score between 2 groups

The Cluster Separation Score (CSS) was introduced by Jones et al 2012; Nature as a window-based metric of divergence between two populations. It measures genetic difference between all pairs of individuals as Euclidean distance in two dimensions obtained from an MDS analysis of proportion sequence divergence. The CSS score is the average distance between pairs of individuals belonging to different populations minus the average distance between pairs belonging to the same population.

The following code is a slight variation of the original method in Jones et al. To begin, run the commands above to obtain pop and GT for a predetermined window of bases, dropping the invariant sites, replacing missing genotype symbols with NA, and dropping SNPs with too few genotyped individuals per population.

CSS is a measure of divergence between groups, and it grows with the size of the window and with the number of SNPs. To make the measurement comparable between different windows, divide CSS by the total number of called bases in each window of predetermined quality (including SNPs, indels, and invariants), as sketched after the code below.

# Sort the rows of GT
GT <- GT[order(pop), ]
pop <- pop[order(pop)]
poptype <- sort(unique(pop))
if(length(poptype) != 2) stop("CSS only works for two population types")

# Count number of ALT alleles per genotype (snps become rows)
genotypes <- apply(GT, 1, str_count, pattern = "1")

# List all possible pairs of individuals
colpairs <- combn(1:ncol(genotypes), 2)

# List pops of all possible pairs of individuals
gtpairs <- combn(pop, 2, FUN = function(x){paste(x, collapse=",")})

# Identify within and between-population pairs in pairwise combinations
ii <- gtpairs == paste(poptype[1], poptype[1], sep = ",") # cases of "1,1"
jj <- gtpairs == paste(poptype[2], poptype[2], sep = ",") # cases of "2,2"
ij <- gtpairs == paste(poptype[1], poptype[2], sep = ",") # cases of "1,2"
ji <- gtpairs == paste(poptype[2], poptype[1], sep = ",") # cases of "2,1"
m <- sum(pop == poptype[1])
n <- sum(pop == poptype[2])

# Calculate a measure of proportion sequence divergence between 
#   all pairs of individuals at every snp:
# 0 = same genotype, 1 = different genotype, 0.5 = a shared allele

psd <- apply(colpairs, 2, function(x){
        psd <- abs(genotypes[ , x[1]] - genotypes[ , x[2]])/2
        })
colnames(psd) <- gtpairs

# What to do with a missing value for psd?
# Here, I replace missing psd with the mean psd of its poptype comparison:
# The mean of psd between individuals belonging to poptype 1, or
#   the mean of psd between individuals belonging to poptype 2, or
#   the mean of psd between individuals belonging to different poptypes.

# Assign the average psd's to cases of psd = NA
psdUnique <- unique(colnames(psd))
for(p in psdUnique){
    doCols <- c(grep(p, colnames(psd)))
    rowMean <- apply(psd[ , doCols], 1, mean, na.rm=TRUE)
    for(k in doCols){
        psd[is.na(psd[ , k]) , k] <- rowMean[is.na(psd[ , k])]
        }           
  }

# Pairwise differences summed across the window
psdSum <- colSums(psd) 

# Initialize distance matrix
d <- matrix(0, nrow = length(pop), ncol = length(pop))

# Place pairwise distances into distance matrix
d[lower.tri(d,diag = FALSE)] <- psdSum
d <- t(d)
d[lower.tri(d,diag = FALSE)] <- psdSum
rownames(d) <- pop
colnames(d) <- pop

# Obtain first 2 MDS axes
z <- cmdscale(d, k = 2, add = TRUE)$points
            
# Matrix of pairwise distance between individuals on MDS axes
euclid <- as.matrix(dist(z, method = "euclidean")) 
euclid.lower <- euclid[lower.tri(euclid)]

# Sum of differences within and between groups
sii <- sum(euclid.lower[ii])
sjj <- sum(euclid.lower[jj])
sij <- sum( sum(euclid.lower[ij]), sum(euclid.lower[ji]), na.rm=TRUE )

# CSS score
CSS <- ( sij/(m*n) ) - (1/(m+n))*( sii/((m-1)/2) + sjj/((n-1)/2) ) 
CSS
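
As noted above, divide by the number of called bases to compare windows. A minimal sketch, where nCalledBases is a hypothetical count you must compute elsewhere:

# Per-base CSS for the window
# (nCalledBases is a hypothetical count of called bases in the window,
#  including snps, indels, and invariants)
CSSperBase <- CSS / nCalledBases
CSSperBase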


Fst

Here’s how we calculate Fst (method of Weir and Cockerham 1984) using the hierfstat package. The advantage of this package is that it provides all the variance components needed to calculate Fst. The calculation is not fast, however. The fst_WC84() function in the assigner package is said to be much faster. Using it instead requires additional steps and packages to get the data in the right format (not shown here).

Fst is a ratio of the variance among groups divided by the total variance. To get Fst for a window, calculate it as the ratio of the sums of the variance components across SNPs (not the average of the Fst’s). Use the same approach to calculate Fst across multiple windows and the whole genome.

To begin, run the commands above to obtain pop and GT for a predetermined window of bases, dropping the invariant sites, replacing missing genotype symbols with NA, and dropping SNPs with too few genotyped individuals per population. The code below assumes biallelic SNPs only.

library(hierfstat)

# Convert genotypes to hierfstat format - code assumes biallelic SNPs only
genotypes <- GT
genotypes[genotypes == "0/0" | genotypes == "0|0"] <- 11
genotypes[genotypes == "0/1" | genotypes == "1/0"] <- 12
genotypes[genotypes == "0|1" | genotypes == "1|0"] <- 12
genotypes[genotypes == "1/1" | genotypes == "1|1"] <- 22

# Convert character data to numeric
# (note: rownames are lost in the process)
genotypes <- data.frame(lapply(genotypes, as.integer))

# Group ID also needs to be numeric
group <- as.integer(as.factor(pop))

# Weir-Cockerham Fst
z <- wc(cbind(group, genotypes))

# Contents of Fst object
names(z)
    # [1] "call"      "sigma"     "sigma.loc" "per.al"    "per.loc"   "FST"       "FIS"  

# Window Fst
z
    # $FST
    # [1] 0.8322499
    
    # $FIS
    # [1] 0.1918404

# z$sigma.loc contains the variance components per snp
head(z$sigma.loc)
                               # lsiga         lsigb      lsigw
    # chrXXI.2000027_G.T -0.0003637520  7.696227e-05 0.01315789
    # chrXXI.2000036_C.T -0.0003732924  8.085008e-05 0.01351351
    # chrXXI.2000061_G.A -0.0004972829  8.852259e-05 0.01388889
    # chrXXI.2000066_A.G  0.0021377646 -5.930962e-04 0.02666667
    # chrXXI.2000100_G.A -0.0004333496  8.680556e-05 0.01388889
    # chrXXI.2000108_G.T  0.0795661414  1.245831e-02 0.02739726

# Sum the variance components across SNPs in the window
varCompSum <- colSums(z$sigma.loc)
varCompSum
        # lsiga     lsigb     lsigw 
    # 57.698847  2.231086  9.398820 

# Window Fst
varCompSum["lsiga"] / sum(varCompSum)
    # [1] 0.8322499
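
To get Fst across multiple windows or the whole genome, sum the variance components over all SNPs before taking the ratio, as described above. A minimal sketch, assuming a hypothetical list sigmaList holding the z$sigma.loc table from each window:

# Fst across windows: ratio of summed variance components, not a mean of Fst's
# (sigmaList is a hypothetical list of z$sigma.loc tables, one per window)
allComp <- colSums(do.call(rbind, sigmaList))
allComp["lsiga"] / sum(allComp)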


VarcompA

The numerator of Fst (calculated in the previous section) is a measure of absolute divergence between populations, similar to dxy. For a window-based measure of divergence, divide varCompSum["lsiga"] by the total number of called bases in each window of predetermined quality (including SNPs, indels, and invariants), as sketched below.
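
A minimal sketch, again with a hypothetical nCalledBases count for the window:

# Per-base among-population variance component, a dxy-like measure
# (nCalledBases is a hypothetical count of called bases in the window)
varCompSum["lsiga"] / nCalledBases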



Impute genotypes

Instructions here use Beagle to impute missing genotypes using identity by descent.

To begin, make sure to save only biallelic genotypes to a VCF file in which missing genotypes are coded as “./.” and not “0/0”.


Filter snps

It would be wise to reduce the VCF file to a subset of SNPs successfully genotyped in a high fraction of individuals. The commands below keep SNPs genotyped in at least 75% of individuals.

vcftools --gzvcf global.chrXXI.snp.Rfiltered.vcf.gz \
    --max-missing 0.75 --recode --recode-INFO-all \
    --out global.chrXXI.snp.Rfiltered75
bgzip global.chrXXI.snp.Rfiltered75.recode.vcf
mv global.chrXXI.snp.Rfiltered75.recode.vcf.gz global.chrXXI.snp.Rfiltered75.vcf.gz
tabix -p vcf global.chrXXI.snp.Rfiltered75.vcf.gz


Impute

Now, impute the missing genotypes. The commands below for genotype imputation are those used on a Compute Canada (Digital Alliance) machine. Don’t forget to load the module first. To speed up the calculations I requested 16 CPUs with 2G of memory per CPU (32G in total).

The following command imputes genotypes using the existing genotypes. The output is a VCF file with imputed genotypes.

java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \ 
  gt=global.chrXXI.snp.Rfiltered75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz

Beagle will also impute genotypes using the genotype likelihoods instead. However, you’ll first need to convert any missing likelihoods. The zcat sequence of commands below replaces missing likelihoods with “0,0,0”.

zcat global.chrXXI.snp.Rfiltered75.vcf.gz | sed 's/\.,\.,\./0,0,0/g' | \
    bgzip > global.chrXXI.snp.edited75.vcf.gz
gatk IndexFeatureFile -I global.chrXXI.snp.edited75.vcf.gz
java -Xmx31g -jar ${EBROOTBEAGLE}/beagle.22Jul22.46e.jar nthreads=16 \ 
  gt=global.chrXXI.snp.edited75.vcf.gz out=global.chrXXI.snp.Rfiltered75.imputed
gatk IndexFeatureFile -I global.chrXXI.snp.Rfiltered75.imputed.vcf.gz



Make a VCF file

The rest of this page lists the steps that we use to generate a VCF file starting with fastq files and using the GATK4 best practices pipeline. R can help with this too.


Reference genome

You’ll need a reference genome to use the alignment (BWA) and GATK programs to make a VCF file.

For stickleback we use the v5 reference genome, a Paxton Lake benthic, available at Mike White’s genome browser site. The downloaded file stickleback_v5_assembly.fa.gz is a gzipped fasta file containing the v5 assembly of the stickleback genome. It includes sequence for 24 components: 20 autosomes, the X chromosome (chrXIX), the Y chromosome (chrY), the mitochondrion (chrM), plus “chrUn”, which is an artificial chromosome consisting of unassembled contigs chained together (individual contigs are separated by a large number of NNNNNNNNNs). To peek at the fasta file, use the unix command

zless stickleback_v5_assembly.fa.gz


Index genome file

The reference genome file needs to be indexed before use. Unzip the fasta file first. Here’s the command for stickleback.

gunzip stickleback_v5_assembly.fa.gz

To index on Compute Canada:

# Generate bwa index
bwa index -a bwtsw stickleback_v5_assembly.fa

# Generate .fai index files needed by GATK
samtools faidx stickleback_v5_assembly.fa

# Generate .dict file needed by GATK
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
    -R stickleback_v5_assembly.fa



Inspect short reads

R can be used to examine number and quality of short reads.

Paired short reads for an individual are in two large text files in .fastq.gz format. For paired reads, the first file (file name with “R1” or just “1”) will have the forward reads, and the second file (file name with “R2” or “2”) will contain reverse complement reads.


Read stats

The fastq.gz files are generally huge, so the R code below takes a large random sample of reads instead (the ShortRead manual says the default number of reads is 1 million, but I don’t think so).

In R:

library(ShortRead)

# choose a fastq file to inspect
myFileName <- dir(getwd(), "fishName.R1.fastq.gz")

# Take large random sample of reads
reads <- qa(myFileName) 

# Show total number of reads read
reads[["readCounts"]]   

# Show base frequencies
reads[["baseCalls"]]

# Show mean and median base quality
z <- rep(as.integer(rownames(reads[["baseQuality"]])), 
          reads[["baseQuality"]]$count)
mean(z, na.rm = TRUE)
median(z, na.rm = TRUE)

# Plot read quality by cycle
perCycle <- reads[["perCycle"]] 
ShortRead:::.plotCycleQuality(perCycle$quality)

# Create summary report
report(reads, dest = paste(myFileName, "QAreport", sep = "."), type = "html")

The report will be placed in a folder named after the fastq.gz files analyzed. Download the folder to your local machine and double click the index.html file inside the folder to view all the plots in your browser and print to a pdf.



Align reads

We use bwa to align paired reads to the reference genome. The alignments are saved in uncompressed .sam files.


bwa command

The basic bwa command is as follows (the backslash continues a single command on the next line).

On Compute Canada:

bwa mem -M stickleback_v5_assembly.fa fishName_R1.fastq.gz \
        fishName_R2.fastq.gz > fishName.sam


Go faster

We can speed up this step. A single bwa job can be divided into multiple threads that run concurrently on separate cores. The example command below divides the process among 16 cores, speeding up the whole operation many fold. Most nodes on Compute Canada machines have 32 cores, so you can go up to 32 threads.

Here’s the basic command if running on Compute Canada in interactive mode (the salloc command requests 16 cores and 1G of memory per core).

# in unix
salloc --time=3:0:0 --ntasks-per-node=16 --mem-per-cpu=1G --account=def-schluter

# Wait for the resources to be made available to you.
# Then type the following (substituting the name of your actual fish)

bwa mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
        fishName_R2.fastq.gz > fishName.sam

exit # end interactive session

On the ZCU Cluster, no advance request for resources is required. The command here is

/Linux/bin/bwa-0.7.15 mem -M -t 16 stickleback_v5_assembly.fa fishName_R1.fastq.gz \
        fishName_R2.fastq.gz > fishName.sam



GATK SNP calling

Here we take each sam file of aligned reads through the bulk of the GATK4 procedures. We use the steps recommended for “germline short variant discovery (SNPs + indels)”. See the tool documentation for the version of GATK you are using.

The output of this sequence of steps is a VCF file containing all the joint genotype calls.

Snp calling involves the following steps.

SortSam to sort the reads in the sam file.
MarkDuplicates to identify duplicate reads (drop this step if doing GBS).
AddOrReplaceReadGroups to assign reads to a single new read group.


BaseRecalibrator to make recalibration table for base quality scores (optional).
ApplyBQSR recalibrates the base quality scores (optional).


HaplotypeCaller to call SNPs for individual specimens.
GenomicsDBImport to construct database of gVCF files.
GenotypeGVCFs to genotype all samples jointly.

Base recalibration (the BaseRecalibrator and ApplyBQSR steps) is optional, but we use it in the standard stickleback pipeline. You’ll need to ensure that the following files are in your working directory.

stickleback_21genome_snp_v5.1.bed
stickleback_21genome_snp_v5.1.bed.idx

I also use this shorter list of known snps.

knownSnpsv5.1.vcf
knownSnpsv5.1.vcf.idx

These files contain lists of known SNPs in the stickleback genome. The file stickleback_21genome_snp_v5.1.bed lists the highest-confidence SNPs discovered in the study of 21 stickleback genomes sampled from marine and freshwater sites around the northern hemisphere by Jones et al 2012; Nature. The original file was provided by Felicity Jones and lifted over to the v5 genome coordinates. The second list of SNPs was discovered by Jones et al. 2012; Current Biology, and was obtained from the Supplement of that paper and from NCBI, where the data are deposited. These SNPs were lifted over to the new v5 coordinates as described in the liftover section above.


Read processing

Here’s the sequence of commands applied to a single fish whose sam file name is GrowthHormoneTransgenicLCR_A-line.GHT-A1.sam. I had to go up to 24Gb of memory to avoid crashes.

The flag -Xmx22g specifies the maximum amount of memory that can be used. Set the number equal to the amount of memory requested for the job minus a little for overhead.

Here’s what the commands would look like in an interactive session on Compute Canada.

# Request resources for interactive session
salloc --time=12:0:0 --ntasks-per-node=1 --mem-per-cpu=24G --account=def-schluter

# You'll have to wait for the resources to become available to continue
gatk --java-options "-Xms256m -Xmx22g -XX:ParallelGCThreads=8" SortSam -I FISHNAME.sam \
        -O FISHNAME.sorted.bam \
        -SO coordinate --CREATE_INDEX TRUE

gatk --java-options "-Xms256m -Xmx30g -XX:ParallelGCThreads=8" MarkDuplicates \
        -I FISHNAME.sorted.bam \
        -O FISHNAME.mkdup.bam \
        -M FISHNAME.mkdup.metrics \
        --REMOVE_DUPLICATES FALSE --ASSUME_SORT_ORDER coordinate

gatk --java-options "-Xms256m -Xmx4g" AddOrReplaceReadGroups \
    -ID GenomeBC.FISHNAME -LB FISHNAME.SB -SM FISHNAME -PL ILLUMINA \
    -PU GenomeBC -I FISHNAME.mkdup.bam -O FISHNAME.rg.bam \
    -SO coordinate --CREATE_INDEX TRUE

# Optional base quality score recalibration:

gatk --java-options "-Xms256m -Xmx4g" BaseRecalibrator \
    -R stickleback_v5_assembly.fa -I FISHNAME.rg.bam \
        --known-sites knownSnpsv5.1.vcf \
    --known-sites stickleback_21genome_snp_v5.1.bed \
    -O FISHNAME.recal.table

gatk --java-options "-Xms256m -Xmx4g" ApplyBQSR -R stickleback_v5_assembly.fa \
    -I FISHNAME.rg.bam \
    --bqsr-recal-file FISHNAME.recal.table \
    -O FISHNAME.recal.bam

exit # end interactive session


Bam quality metrics

The bam files created in the last step contain a lot of information about read pair numbers and coverage along chromosomes. Here’s how to use R to extract some of this information.

The bam files can be huge. The code below examines one chromosome at a time, using chrXXI as an example. You might need to run this code in an interactive session if the memory demands are too great.

library(Rsamtools)
library(bitops)

# select the bamfile and chromosome to use.
bamfile <- "myBamFile.bam"
chr <- "chrXXI"
baifile <- sub("bam$", "bai", bamfile) # name of bam index file

# obtain chromosome length from genome index
faifile <- read.table("stickleback_v5_assembly.fa.fai", row.names = 1)
chrlength <- faifile[chr,2]

# read fields from the bam file
x <- scanBam(bamfile, index = baifile, 
      param = ScanBamParam(what = c("qname","pos","qwidth","flag","strand","isize"),
      which = GRanges(seqnames = chr, IRanges(1, chrlength)) ))[[1]]

# Drop unaligned reads
ind <- !is.na(x[["pos"]]) 
x <- lapply(x, function(x){x[ind]})

# Calculate coverage of mapped reads
ranges1 <- IRanges(start = x$pos, width = x$qwidth,
      names=make.names(x[["qname"]], unique = TRUE)) 
cover <- as.vector(coverage(ranges1))
mean(cover) # mean coverage over all bases
median(cover) # median coverage

# Plot the frequency distribution of coverage.
hist(cover[cover <= 100], right = FALSE, breaks = 50, 
    main="Coverage", col="firebrick")

# Number of mapped reads. 
x[["mapped"]] <- bitAnd(x[["flag"]], 0x0004) != 0x0004  
x[["mappedmate"]] <- bitAnd(x[["flag"]], 0x0008) != 0x0008
x[["mapped.in.properpair"]] <- bitAnd(x[["flag"]], 0x0002) == 0x0002
sum(x[["mapped"]])   # No. mapped reads
table(x[["strand"]]) # No. mapped reads on "+" and "-" strands
sum(x[["mapped"]] & x[["mappedmate"]]) # No. reads whose mate also mapped
sum(x[["mapped.in.properpair"]]) # No. reads mapped in a proper pair

# Plot of position of mapped reads along the chromosome 
# (the 1-based leftmost position of each query on the reference sequence)

hist(x$pos/10^6, right=FALSE, col="firebrick",
    breaks = 200, xlab = "Position (million bases)")

# Alignment sequence (query) width.
median(x[["qwidth"]], na.rm = TRUE)

# Plot of "insert" or "template" length, the number of bases 
# from the leftmost mapped base to the rightmost mapped base. 
# The leftmost segment has a plus sign and the rightmost has a minus sign.
hist(x$isize[abs(x$isize) <= 700], right = FALSE, 
     col = "firebrick", breaks = 100)


Run HaplotypeCaller

This step uses GATK’s HaplotypeCaller to call SNPs for an individual, separately for each chromosome. Commands below use chrXXI as an example. Each chromosome and individual yields a compressed gVCF file. This step can require significant resources for large chromosomes and probably needs to be run in batch mode.

gatk --java-options "-Xmx4g" HaplotypeCaller \
    -R stickleback_v5_assembly.fa -ERC GVCF \
    -I FISHID.recal.bam \
    -O FISHID.chrXXI.g.vcf.gz \
    -L chrXXI \
    --pcr-indel-model NONE \
    --heterozygosity 0.01 --indel-heterozygosity 0.00125 \
    --output-mode EMIT_ALL_ACTIVE_SITES \
    -A DepthPerAlleleBySample -A FisherStrand -A InbreedingCoeff \
    -A MappingQuality -A MappingQualityRankSumTest \
    -A QualByDepth -A ReadPosRankSumTest


Make database

Next, create the database of gVCF files. The database for chrXXI will be located in a folder named chrXXIdb. If this folder already exists, delete it before running.

The task requires a sample map, a tab-delimited text file with two columns. The first column contains the sample names (fish id’s) line by line. The second column contains the gVCF file names. Here are the first few lines of an example file named chrXXI.map to analyze the chromosome chrXXI.

Marine-Pac-SaritaBay_BC.SAR-29    Marine-Pac-SaritaBay_BC.SAR-29.chrXXI.g.vcf.gz
Solitary-CranbyLake.CRL25         Solitary-CranbyLake.CRL25.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-39       Marine-Pac-MudLake_AK.ML-39.chrXXI.g.vcf.gz
Solitary-TroutLake.TRL-2          Solitary-TroutLake.TRL-2.chrXXI.g.vcf.gz
Marine-Pac-DoranPark_CA.CA02-29   Marine-Pac-DoranPark_CA.CA02-29.chrXXI.g.vcf.gz
Marine-Pac-MudLake_AK.ML-25       Marine-Pac-MudLake_AK.ML-25.chrXXI.g.vcf.gz
...

The following R commands can help to make the map file if your file names have the format FISHID.chrXXI.g.vcf.gz

# Grab all files whose names end with "chrXXI.g.vcf.gz"
z <- list.files(pattern = "chrXXI.g.vcf.gz$")

# This works if the rest of the file name is the name of the individual
fishname <- sub(".chrXXI.g.vcf.gz", "", z)

# make the map file
chrXXImap <- data.frame(fishname, z)
write.table(chrXXImap, file = "chrXXI.map", quote = FALSE, 
            col.names = FALSE, row.names = FALSE, sep = "\t")

Now construct the database with GATK.

# in unix
gatk --java-options "-Xmx4g" GenomicsDBImport \
    --genomicsdb-workspace-path chrXXIdb \
    --sample-name-map chrXXI.map \
    -L chrXXI



Joint genotyping

Finally, use GenotypeGVCFs to call SNPs and invariant sites jointly on the samples in your database. Here is code for Compute Canada to run it on chrXXI using the database chrXXIdb created in the previous step. In this example, we include invariant as well as variant sites. The output is sent to a VCF file named global.chrXXI.vcf.gz.

NOTE: GATK used to represent uncalled (missing) genotypes as “./.” in the VCF output of GenotypeGVCFs. But as of GATK 4.2.3.0, missing genotypes might appear as “0/0” instead (except in some cases with phased variants). GATK wants us to know that “You’ll still be able to determine the genotype as missing because the FORMAT DP field will be 0, though.” Later, I use a genotype quality metric to distinguish missing from non-missing genotypes.
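
As a rough check in R, the following sketch reads just the GT and DP genotype fields from the joint-genotyped VCF file produced below and tallies 0/0 calls that have zero read depth (for big files, use the chunked reading approach described earlier):

library(VariantAnnotation)

# Tally 0/0 genotype calls with read depth 0 - these are really missing
param <- ScanVcfParam(fixed = NA, geno = c("GT", "DP"), info = NA)
vcf <- readVcf(file = "global.chrXXI.vcf.gz", param = param)
suspect <- geno(vcf)$GT == "0/0" & geno(vcf)$DP == 0
table(suspect)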

# in unix
gatk --java-options "-Xmx4g" GenotypeGVCFs \
  -R stickleback_v5_assembly.fa \
  -V gendb://chrXXIdb \
  -L chrXXI \
  --include-non-variant-sites \
  --max-alternate-alleles 3 \
  --standard-min-confidence-threshold-for-calling 20 \
  -O global.chrXXI.vcf.gz



Extract variants and invariants

This section describes several tools to extract snps, indels, and invariants from a VCF file and filter them on the basis of quality. It also describes ways to extract a subset of samples from VCF files and drop ALT alleles unused in the subset.


Extract SNPs

The following command extracts only SNPs (single nucleotide polymorphisms) and saves the results in a new VCF file. Those SNPs that failed the phred-scaled confidence threshold for calling variants, as determined by the argument --standard-min-confidence-threshold-for-calling in the GenotypeGVCFs command (I used 20, the default is 30), are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the SelectVariants command below.

Note that in GATK, SNPs include single nucleotide substitutions as well as spanning deletions. The ALT allele for a spanning deletion is indicated as * in the VCF file. Most are labeled LowQual and filtered out. Not all downstream programs will accept spanning deletions.

gatk --java-options "-Xmx2g" SelectVariants \
  -V global.chrXXI.vcf.gz \
  --select-type-to-include SNP \
  --exclude-filtered \
  -O global.chrXXI.snp.vcf.gz


Extract indels

The following command extracts only indels (insertions and deletions) and saves the results in a new VCF file. Those indels that failed the phred-scaled confidence threshold for calling variants are given a LowQual value in the FILTER column and are removed by the --exclude-filtered argument in the command below.

gatk --java-options "-Xmx2g" SelectVariants \
  -V global.chrXXI.vcf.gz \
  --select-type-to-include INDEL \
  --exclude-filtered \
  -O global.chrXXI.indel.vcf.gz


Extract invariant sites

The VCF output of this command includes only invariants – neither SNPs nor indels.

Add the argument --exclude-filtered to the SelectVariants command below (as in the commands above to extract SNPs or indels) to leave out sites having LowQual in the FILTER column; these are sites that failed the minimum threshold for calling a SNP and will have ‘.’ or ‘*’ in the ALT column. Here, I did not include this quality filter; I prefer instead to filter invariant genotypes according to genotype quality at a later stage in the analysis.

gatk --java-options "-Xmx2g" SelectVariants \
  -V global.chrXXI.vcf.gz \
  --select-type-to-include NO_VARIATION \
  -O global.chrXXI.invar.vcf.gz



Plot SNP quality

Before filtering out low-quality SNPs based on annotations provided in the VCF file, it is a good idea to plot the relevant metrics to get a sense of whether the recommended filtering cutoffs make sense for your data. To do this you’ll need to read some of the contents of the VCF file into R.

A large random sample of SNPs should suffice. If you really want to read the whole VCF file and are running into memory limits (most likely if you are also reading genotype fields, such as GT and GQ), read the VCF file in chunks, as explained in the section above on reading a huge VCF file.


Use random sample of snps

The following code (obtained from VCF-tricks) samples 10,000 SNPs (modify by editing the number in ‘shuf -n 10000’). It assumes that the number of header lines (leading lines beginning with “#” in the VCF file) is less than 2000 (modify by editing the number in ‘head -n 2000’).

# Number of SNPs in VCF file (number of lines not beginning with '#')
zcat global.chrXXI.snp.filtered.vcf.gz | zgrep -v '^#' | wc -l 

# Which is the first non-header line? (prints line number and a few characters)
zgrep -v -m1 -n '^#' global.chrXXI.snp.filtered.vcf.gz | head -c 20

# Take a random sample of 10,000 snps
(zcat global.chrXXI.snp.filtered.vcf.gz | head -n 2000 | \
  zgrep '^#' ; zgrep -v '^#' global.chrXXI.snp.filtered.vcf.gz | \
  shuf -n 10000 | LC_ALL=C sort -k1,1V -k2,2n) | \
  bgzip > global.chrXXI.snp.10K.vcf.gz
tabix -p vcf global.chrXXI.snp.10K.vcf.gz


Read quality metrics

Now, choose the quality metrics of interest. For example, here I select the total read depth (DP) of each snp, the Quality by Depth (QD) score, and the Fisher’s strand bias (FS) score. Remove the ‘info =’ argument entirely to read all available INFO fields. To save memory, leave out the genotype fields (geno = NA).

# in R:
library(VariantAnnotation)

# Choose which VCF fields to read
param <- ScanVcfParam(fixed = c("ALT"), geno = NA, 
            info = c("DP", "QD", "FS"))

Read the specified fields into a vcf object in R.

# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz", 
        genome = "stickleback_v5_assembly.fa.gz", 
                param = param)

# View snps
rowRanges(vcf)

# View INFO field variables read
info(vcf)

# Histogram of DP
hist(info(vcf)$DP, right = FALSE, breaks = 40, col = "firebrick")

# Histogram of QD
hist(info(vcf)$QD, right = FALSE, breaks = 40, col = "firebrick")

# Histogram of FS
hist(info(vcf)$FS, right = FALSE, breaks = 40, col = "firebrick")


Plot genotype quality

Set geno = c("GQ") when reading the random sample of SNPs to view the distribution of genotype quality scores (GQ).

param <- ScanVcfParam(fixed = NA, geno = c("GQ"), info = NA)

# Read VCF file into R
vcf <- readVcf(file = "global.chrXXI.snp.10K.vcf.gz", 
        genome = "stickleback_v5_assembly.fa.gz", 
                param = param)

GQ <- as.vector(geno(vcf)$GQ)
hist(GQ, right = FALSE, breaks = 40, col = "firebrick")



Hard filtering

We largely follow GATK’s recommendations on thresholds for hard-filtering SNPs and indels according to quality metrics contained in the VCF file, but we confirmed and/or modified them based on plots of the data (see the previous section).


Hard-filter snps

The following VariantFiltration command hard-filters the snps. SNPs failing the filter are not removed yet. Instead, a FILTER variable in the output file is annotated as PASS if the SNP passes the filter. If a SNP fails a filter, the FILTER variable is annotated with the name of the filter it failed.

In the version of GATK used here, I got many warnings concerning MQRankSum and ReadPosRankSum, but online forums about this say that the warnings should be ignored.

If a SNP is missing a quality annotation (e.g., no MQ value is given for a given snp) then by default it passes that test (change this by setting the --missing-values-evaluate-as-failing argument to true).

The subsequent SelectVariants command removes the SNPs failing the hard-filter step.

gatk --java-options "-Xmx4g" VariantFiltration \
  -V global.chrXXI.snp.vcf.gz \
  -filter "QD < 2.0" --filter-name "QD2" \
  -filter "FS > 60.0" --filter-name "FS60" \
  -filter "SOR > 3.0" --filter-name "SOR3" \
  -filter "MQ < 40.0" --filter-name "MQ40" \
  -filter "MQRankSum < -4.0" --filter-name "MQRankSum-4" \
  -filter "ReadPosRankSum < -4.0" --filter-name "ReadPosRankSum-4" \
  -filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
  -O global.chrXXI.snp.hf.vcf.gz
gatk --java-options "-Xmx4g" SelectVariants \
  -V global.chrXXI.snp.hf.vcf.gz \
  --exclude-filtered \
  -O global.chrXXI.snp.filtered.vcf.gz


Hard-filter indels

We used the following code to hard filter indels:

gatk --java-options "-Xmx2g" VariantFiltration \
    -V global.chrXXI.indel.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 20.0" --filter-name "Q20" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "ReadPosRankSum < -4.0" \--filter-name "ReadPosRankSum-4" \
    -filter "ExcessHet > 54.69" --filter-name "ExcessHet5469" \
    -O global.chrXXI.indel.hf.vcf.gz

# Now, remove indels that failed the filters
gatk --java-options "-Xmx2g" SelectVariants \
  -V global.chrXXI.indel.hf.vcf.gz \
  --exclude-filtered \
  -O global.chrXXI.indel.filtered.vcf.gz


Recode missing genotypes

Recent versions of GATK indicate missing genotypes as “0/0”, i.e., the homozygous reference genotype instead of the missing symbol, “./.”. We need to recode to distinguish missing genotypes from true reference homozygotes.

Filtering SNPs and indels on genotype quality (GQ) converts missing and other poor-quality genotypes to “./.”. Here I recode as missing (“./.”) any genotype in the SNP VCF file whose GQ is less than 10.

After hard-filtering genotypes, some ALT alleles at a SNP might no longer occur in any genotype present in the file. The additional SelectVariants command below clears them out. Some SNPs might become invariant; such sites will still be present in the output VCF file but will be indicated by a “.” in the ALT field.

vcftools --gzvcf global.chrXXI.snp.filtered.vcf.gz \
  --minGQ 10 --recode --recode-INFO-all \
  --out global.chrXXI.snp.filtered
bgzip global.chrXXI.snp.filtered.recode.vcf
tabix -p vcf global.chrXXI.snp.filtered.recode.vcf.gz

gatk --java-options "-Xmx4g" SelectVariants \
    -V global.chrXXI.snp.filtered.recode.vcf.gz \
    --remove-unused-alternates \
    -O global.chrXXI.snp.GQfiltered.vcf.gz
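
To count the sites that became invariant (a “.” in the ALT field, which is the 5th column of a VCF file), something like the following should work.

# Count sites whose ALT field is "." after filtering
zgrep -v '^#' global.chrXXI.snp.GQfiltered.vcf.gz | awk '$5 == "."' | wc -l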

Here is the same procedure applied to indels.

vcftools --gzvcf global.chrXXI.indel.filtered.vcf.gz \
  --minGQ 10 --recode --recode-INFO-all \
  --out global.chrXXI.indel.filtered
bgzip global.chrXXI.indel.filtered.recode.vcf
tabix -p vcf global.chrXXI.indel.filtered.recode.vcf.gz

gatk --java-options "-Xmx4g" SelectVariants \
    -V global.chrXXI.indel.filtered.recode.vcf.gz \
    --remove-unused-alternates \
    -O global.chrXXI.indel.GQfiltered.vcf.gz

Here is a similar procedure applied to invariant sites. The invariant VCF file has an RGQ field indicating genotype quality rather than a GQ field. We are lenient and use a minimum read depth (DP) of 1 to call genotypes. Genotypes based on DP < 1 are converted to missing.

vcftools --gzvcf global.chrXXI.invar.vcf.gz \
    --minDP 1 --recode --recode-INFO-all \
    --out global.chrXXI.invar.filtered
mv global.chrXXI.invar.filtered.recode.vcf global.chrXXI.invar.DPfiltered.vcf
bgzip global.chrXXI.invar.DPfiltered.vcf
tabix -p vcf global.chrXXI.invar.DPfiltered.vcf.gz


Select subset of individuals

To extract results for a subset of individuals, use SelectVariants and an expression to identify the chosen sample names.

The following example code selects all individuals whose sample names include the label “Benthic”, excluding those whose names also include the label “Emily”, and additionally leaving out two specific Benthic individuals.

Some SNPs might now be invariant within the subset. Including the optional --remove-unused-alternates will remove ALT alleles in such cases and make them easier to detect. Invariant sites will be indicated by a “.” in the ALT field of the output VCF file.

gatk --java-options "-Xmx2g" SelectVariants \
 -V global.chrXXI.snp.filtered.vcf.gz \
 -se "Benthic" \
 -xl-se "Emily" \
 -xl-sn "PaxBenthic2010.PAX-B-113" \
 -xl-sn "PaxBenthic2010.PAX-B-114" \
 --remove-unused-alternates \
 -O subset.chrXXI.snp.filtered.vcf.gz


Keep only biallelic variants

Many analysts keep only biallelic SNPs (and indels). This step can be accomplished by inserting the extra line
--restrict-alleles-to BIALLELIC \
into one of the above SelectVariants commands, perhaps right after --remove-unused-alternates, as shown below.
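
For example, here is an abbreviated version of the subsetting command above with the biallelic restriction added (the output file name is just a suggestion).

gatk --java-options "-Xmx2g" SelectVariants \
 -V global.chrXXI.snp.filtered.vcf.gz \
 -se "Benthic" \
 -xl-se "Emily" \
 --remove-unused-alternates \
 --restrict-alleles-to BIALLELIC \
 -O subset.chrXXI.snp.biallelic.vcf.gz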

Alternatively, this can be done later in R, which I show in the next section.



Merge VCF files

After processing VCF files separately by chromosome, it may be necessary to merge them into a single large VCF file. GatherVcfs, shown below, requires the chromosomes to be listed in the same order as in the headers of the individual files. Use GATK’s MergeVcfs instead if the chromosomes aren’t in the same order, and SortVcf if the individual VCF files are not sorted by coordinates.
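
A MergeVcfs command takes the same -I and -O arguments as GatherVcfs below. A minimal sketch with just two input files:

gatk --java-options "-Xmx4G" MergeVcfs \
    -I subset.chrI.snp.filtered.vcf.gz \
    -I subset.chrII.snp.filtered.vcf.gz \
    -O subset.merged.snp.filtered.vcf.gz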


GatherVcfs

You can use GatherVcfs in GATK to merge VCF files. The chromosomes must be listed in the same order as in the headers of each of the individual files and cannot have overlapping positions. The samples must be the same in each of the files.

gatk --java-options "-Xmx4G" GatherVcfs \
    -I subset.chrI.snp.filtered.vcf.gz \
    -I subset.chrII.snp.filtered.vcf.gz \
    -I subset.chrIII.snp.filtered.vcf.gz \
    -I subset.chrIV.snp.filtered.vcf.gz \
    -I subset.chrIX.snp.filtered.vcf.gz \
    -I subset.chrV.snp.filtered.vcf.gz \
    -I subset.chrVI.snp.filtered.vcf.gz \
    -I subset.chrVII.snp.filtered.vcf.gz \
    -I subset.chrVIII.snp.filtered.vcf.gz \
    -I subset.chrX.snp.filtered.vcf.gz \
    -I subset.chrXI.snp.filtered.vcf.gz \
    -I subset.chrXII.snp.filtered.vcf.gz \
    -I subset.chrXIII.snp.filtered.vcf.gz \
    -I subset.chrXIV.snp.filtered.vcf.gz \
    -I subset.chrXIX.snp.filtered.vcf.gz \
    -I subset.chrXV.snp.filtered.vcf.gz \
    -I subset.chrXVI.snp.filtered.vcf.gz \
    -I subset.chrXVII.snp.filtered.vcf.gz \
    -I subset.chrXVIII.snp.filtered.vcf.gz \
    -I subset.chrXX.snp.filtered.vcf.gz \
    -I subset.chrXXI.snp.filtered.vcf.gz \
    -O subset.merged.snp.filtered.vcf.gz
gatk IndexFeatureFile -I subset.merged.snp.filtered.vcf.gz



Computer resources

Log in to a computer system using ssh in a terminal window. From off campus you might need to run myVPN and endure multi-factor authentication (not yet needed on Compute Canada).

For example, to log in to Graham at Compute Canada:

ssh -X -Y YOUR_NAME@graham.computecanada.ca

Or to log in to ARC Sockeye:

ssh -Y -X YOUR_NAME@sockeye.arc.ubc.ca # use cwl password

Compute Canada and ARC Sockeye keep track of how efficiently you use their memory and CPU resources. Wasting resources reduces your priority rating (future jobs will sit longer in the queue). So it is a good idea to request the minimum amount that you need and, in the case of interactive mode, to exit as soon as you are done so that resources don’t sit idle. See the section below on monitoring jobs to figure out what resources your successful jobs used.


Software available

Compute Canada has a lot of software pre-installed, including GATK, R, and most Bioconductor packages.

ARC Sockeye doesn’t have a lot of software installed, but you can load the Compute Canada software module and then have access to all the same software, as follows:

# Sockeye only
module load CVMFS_CC

Later, I summarize how to install packages on Sockeye.

The ZCU Cluster doesn’t have an up-to-date software list.


Run programs

If using Compute Canada, find the most recent installed version of the program listed here and load the corresponding module.

For example, to run a GATK command (the example here creates a sequence dictionary of the stickleback reference genome), load the gatk module and then run the command

gatk --java-options "-Xmx2g" CreateSequenceDictionary \
    -R stickleback_v5_assembly.fa

You can run Compute Canada software on ARC Sockeye by loading this module:

# Sockeye only
module load CVMFS_CC
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
    -R stickleback_v5_assembly.fa


Run R

If you are using a Mac, install XQuartz so that R can display plots on the screen of your local computer.

To run R on Compute Canada, load the module first.

module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R

If using the Compute Canada software module on ARC Sockeye, type

# on ARC Sockeye only
module load CVMFS_CC
module load gcc/9.3.0 r/4.1.0 r-bundle-bioconductor/3.14
R

On the ZCU Cluster, to run R and access Bioconductor packages type

# On the ZCU Cluster only
/Linux/R-4.2.0/bin/R

Type library() on the R command line to check which R packages are already available. Additional packages can be installed in your local folder. For example:

# in R
install.packages("hierfstat")


Use a scratch folder

Most work on Compute Canada and ARC Sockeye will be done in a scratch folder, which has lots of disk space but is not backed up. Inactive files eventually expire.

It can be convenient, the first time you use the system, to create a link to your scratch folder by typing the following into a command window.

ln -s /scratch/YOUR_NAME scratch  # make a symbolic link
cd scratch                        # go to your scratch folder


Interactive mode

For tasks on Compute Canada or ARC Sockeye requiring significant time and memory resources, try running your code first in interactive mode. This will help you decide what resources to request if you later run similar jobs in batch mode.

On Compute Canada, start an interactive session with the following unix command, which requests 32G of memory on a single core for 1 hour. You may need to wait a while for the interactive job to begin. If your work is done before the requested time is up, it is important to enter exit to end the interactive session so that resources don’t sit idle until your full time runs out.

salloc --time=1:0:0 --ntasks-per-node=1 --mem-per-cpu=32G --account=def-schluter

[type your commands]

exit

On ARC Sockeye you need to submit a bash script to start an interactive job. Instructions are here.



More job tips

Quick reference for a few useful tasks.


Check disk storage space

Processing genome sequence data generates lots of files, which start to eat up disk space. Remove files created along the way that you no longer need.

To check disk space usage on a Compute Canada computer on all file systems, type the following command on your unix command line.

diskusage_report

Use the following command instead on ARC Sockeye.

print_quota


Check your priority in the job queue

Especially on Compute Canada machines, your priority (affecting how quickly your queued job is run) declines as you use (or waste) resources. It takes a week or so to rebuild. One strategy when this happens is to switch to a different computer system within Compute Canada. If you have been working on Graham and your jobs are bogged down in the queue, move files to Beluga and try again. Priority is computer-specific.

Here is how to check my group’s priority when you are logged in to a Compute Canada computer system:

sshare -l -A def-schluter_cpu -a --format=account,User,EffectvUsage,LevelFS


SLURM batch jobs

Compute Canada uses SLURM to schedule batch jobs (as does ARC Sockeye, as of November 2023).

Follow the instructions here on how to create a job script. To submit the job, enter

# in unix
sbatch YOUR_SCRIPT.sh
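
For reference, here is a minimal sketch of a job script. The account name matches the salloc example in the interactive mode section; the module line and the time and memory requests are placeholders to adjust for your own job.

#!/bin/bash
#SBATCH --account=def-schluter   # allocation to charge (use your own)
#SBATCH --time=2:00:00           # wall-clock time limit (placeholder)
#SBATCH --mem-per-cpu=8G         # memory per core (placeholder)
#SBATCH --ntasks=1               # number of cores

# Load modules and run commands just as you would interactively
module load gatk/4.2.4.0
gatk --java-options "-Xmx2g" CreateSequenceDictionary \
    -R stickleback_v5_assembly.fa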

Other useful commands include:

# Check the status of all jobs submitted
squeue -u YOUR_NAME

# Cancel a running job having job id 123456
scancel 123456

# Check resources used by successfully completed job 123456
seff 123456

An ExitCode of 0 means that no errors were generated. A low CPU Efficiency means that you requested more compute time than you needed. A low Memory Efficiency means that you requested more memory than you needed.

To check resources used by all jobs submitted between two dates:

sacct --user=YOUR_NAME --starttime=yyyy-mm-dd --endtime=yyyy-mm-dd \
  --format=JobID,JobName,AllocCPUS,ExitCode,MaxRSS,Elapsed,UserCPU


Install programs locally

Use conda to install programs in your own local directory. This is useful on ARC Sockeye because so little software is pre-installed. My lab allocation code is st-dolph-1; substitute your own if you are installing programs on ARC Sockeye.

Install R in project folder:

# Load conda and create and activate conda environment
module load miniconda3
conda create -p /arc/project/st-dolph-1/miniconda3/envs/bioconductor
source activate /arc/project/st-dolph-1/miniconda3/envs/bioconductor

# Install R and bioconductor
conda install -c bioconda bioconductor-all

# To end session
conda deactivate

To install an R or Bioconductor package, such as VariantAnnotation, search for “conda VariantAnnotation” in Google and follow the instructions. For example, here’s how I installed it.

module load miniconda3
source activate /arc/project/st-dolph-1/miniconda3/envs/bioconductor
conda install -y -c bioconda bioconductor-variantannotation

Install the most recent version of GATK4

# Install
module load miniconda3
conda create -p /arc/project/st-dolph-1/miniconda3/envs/gatk4
source activate /arc/project/st-dolph-1/miniconda3/envs/gatk4
conda install -c bioconda gatk4

# To run
module load miniconda3
source activate /arc/project/st-dolph-1/miniconda3/envs/gatk4
 

© 2009-2024 Dolph Schluter