SNP summary statistics in R: ‘hierfstat’ is back and better than before! (Rose)

After being disabled and not supported for several months, ‘hierfstat’ (by Jerome Goudet) now has lots of useful (and fast) calculations of summary statistics, including expected and observed heterozygosity, Fst and Jost’s Dest.

 

In response to Greg’s suggestion, here’s some code to illustrate how I’ve used it (so far).

My data files are organised with individuals in columns and SNPs in rows (missing data are coded as NA). The first few columns contain information about the SNPs (contig, position etc.). I also read a text file containing the population assignments of the individuals in the form of a single column, ordered as they are in the data file. This is used to identify which columns belong to each populations, which are extracted to the dataframe “gen” before being converted from “GT” format to “34” format with the command above. Then you have to add a column containing the population assignments, converted to integers.

#####################################################################
## Setup
#####################################################################
# Find input file
prefix="radCOtoAnnML.grsa.full80-100perc.var2"
complete=read.table(paste(prefix,".fullgenotypes.dat",sep=""),sep="\t",header=T,na.strings = c("NA","-"),stringsAsFactors = FALSE)
# Specify other important information
info=3 # number of information columns preceding SNP genotypes
pop.names=c("D","N","I","A","P") # desired order for the populations
pop.pairs=c("DN","DI","DA","DP","NI","NA","NP","IA","IP","AP") #desired order for pairs of populations
listdir="~/rfiles/SNPanalysis/"
poplist=factor(unlist(read.table(paste(datadir,"ecotyperaw.txt",sep=""))),levels=pop.names) # a text file containing a single column indicating the population assignment of each individual in the order of the columns in the SNP file
subpoplist=as.factor(unlist(read.table(paste(datadir,"popraw.txt",sep="")))) # optional
# Calculate population columns
listcols0=lapply(levels(poplist),function(x)which(poplist==x))
listcols=lapply(levels(poplist),function(x)which(poplist==x)+info)
datcols=info+1:length(poplist)
#####################################################################
## Single population statistics
#####################################################################
# Set up frame to hold results
paramlist=c("n","Ho","Hs","Fis")
columnlist=paste(rep(paramlist,rep(length(pop.names),length(paramlist))),rep(pop.names,length(paramlist)),sep="_")
tmpmat=matrix(nrow=nrow(complete),ncol=length(columnlist), dimnames=list(rownames(complete),columnlist))
tmpframe=as.data.frame(tmpmat)
# Extract data columns
gen=complete[,datcols]
# Run for all columns, only extract single-population statistics
kmax=ceiling(nrow(gen)/5000)
for(k in 1:kmax){
n=5000
include=5000*(k-1)+(1:n)
if(5000*k > nrow(complete))include=(5000*(k-1)+1):nrow(complete)
gen=complete[include,datcols]
for(i in 1:ncol(gen)){for(j in 1:4)(gen[,i]=gsub(c("A","C","G","T")[j],j,gen[,i]))}
for(i in 1:ncol(gen))gen[,i]=as.numeric(gen[,i])
dat=as.data.frame(cbind(factor(poplist,levels=pop.names),t(gen)))
tmplist=basic.stats(dat,diploid=TRUE,digits=4)
tmpframe[include,]=cbind(tmplist$n.ind.samp,tmplist$Ho,tmplist$Hs,tmplist$Fis)
}
sumstats=tmpframe

 

3 thoughts on “SNP summary statistics in R: ‘hierfstat’ is back and better than before! (Rose)

  1. I’m glad to hear they fixed this! thanks for letting me know. Could you post an example data frame of what the data should look like? That would be awesome.

Comments are closed.