When analyzing genomic data, we first need to align reads to the genome. There are many aligners to choose from, including BWA (a middle-ground choice), stampy (very accurate) and bowtie2 (very fast). A new aligner, NextGenMap, came out recently. It claims to be both faster than other methods and better at handling divergent read data.
I’ve been doing work on a stickleback GBS dataset and we’re trying to figure out how many samples we can cram into a single lane of Illumina. I did some analyses which people may find useful. It’s unclear how applicable the recommendations are for sunflower, which seems to have more problems than fish.
Take-home message: with 25% less data you lose around 15% of the genotype calls, but up to 50% of the SNPs if you use a stringent coverage filter, because of how the lost data is distributed among loci and individuals.
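To see why a modest drop in genotype calls can cost a much larger share of SNPs, here is a toy sketch in Python. The per-SNP call counts and the "at least 8 of 10 individuals" filter are made-up assumptions for illustration, not values from the stickleback dataset.

```python
# Toy illustration only: these call counts are invented, not real data.
MIN_CALLED = 8  # hypothetical filter: keep a SNP only if >= 8 of 10 individuals are genotyped


def summarize(calls_per_snp, min_called=MIN_CALLED):
    """Return (total genotype calls, number of SNPs passing the coverage filter)."""
    total_calls = sum(calls_per_snp)
    snps_kept = sum(1 for n in calls_per_snp if n >= min_called)
    return total_calls, snps_kept


# Number of individuals with a genotype call at each of four SNPs.
full_data = [9, 9, 10, 10]     # full sequencing depth
reduced_data = [7, 7, 9, 9]    # same SNPs after removing some reads

full_calls, full_snps = summarize(full_data)
reduced_calls, reduced_snps = summarize(reduced_data)

call_loss = 1 - reduced_calls / full_calls
snp_loss = 1 - reduced_snps / full_snps

print(f"genotype calls lost: {call_loss:.0%}")
print(f"SNPs lost: {snp_loss:.0%}")
```

In this toy case about 16% of the calls disappear, but half of the SNPs fall below the filter, because the SNPs that were already close to the threshold are the ones that drop out.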
I’ve created a skeleton directory structure and included a few example folders so that everyone can get a better idea of how our data will be organized on the new server. These are not set in stone. A few people have commented on the blog, or in lab meeting, or to me in person, and I’ve taken all of your suggestions into account.
If you feel like the setup here isn’t optimal, please give some feedback. The better we do this now, the more smoothly things will run in the future!
With our new data server (Moonrise) up and ready to store our sequences, it is time to start being more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form I’ve created, which will give us all the information we need to store it in the right place. Here is the form.
This is inserting a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data which we had a very difficult time finding because the owner is no longer in the lab. With a system like the new one, that will not happen.
We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.
Here are the directory structures for the three types of data:
WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt
RNASeq -> Experiment (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt
GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut Site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics, mapping data) -> From here on things diverge
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*groups are based loosely on clades, and on how much data for each species we have (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)
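As a rough sketch, the layouts above could be encoded as small path-building helpers in Python. The root directory (/data), the function names, and the example values in the usage lines are all hypothetical; only the order of the directory levels and the list of GBS groups come from the descriptions above.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/data")  # hypothetical mount point, not the real server path

# Group names taken from the list above.
GBS_GROUPS = {
    "annuus_wild", "annuus_cult", "argophyllus", "bolanderi_exilus",
    "petiolaris_clade", "hybrids", "perennials", "tuberosus_cult",
}


def wgs_dir(species, cultivar, library, experiment):
    """WGS -> Species -> Cultivar -> Library -> Experiment"""
    return ROOT / "WGS" / species / cultivar / library / experiment


def rnaseq_dir(experiment, species=None, cultivar_or_pop=None, library=None):
    """RNASeq -> Experiment, followed by the optional levels that were given."""
    path = ROOT / "RNASeq" / experiment
    for level in (species, cultivar_or_pop, library):
        if level is None:
            break  # deeper levels are omitted once one is missing
        path = path / level
    return path


def gbs_pop_dir(enzyme, group):
    """GBS -> Enzyme -> Pop -> Group"""
    if group not in GBS_GROUPS:
        raise ValueError(f"unknown GBS group: {group}")
    return ROOT / "GBS" / enzyme / "Pop" / group


def gbs_map_dir(enzyme, experiment, species):
    """GBS -> Enzyme -> Map -> Experiment -> Species"""
    return ROOT / "GBS" / enzyme / "Map" / experiment / species


# Usage with made-up values:
print(wgs_dir("annuus", "HA412", "lib1", "exp1"))
print(gbs_pop_dir("PstI", "annuus_wild"))
```

Checking the group name up front is just one possible design choice; it catches typos before a new, slightly misspelled directory gets created.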
Generally, file names should include data not encoded in the directory structure, but important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familiar (GBS), suggestions would be appreciated.
We recently had some trouble estimating insert sizes with our Mate Pair (aka Jumping, larger insert sizes) libraries. All the libraries sequenced by Biodiversity and the Genome Sciences Centre (GSC) were shockingly bad, but the libraries sequenced by INRA were very good. For example, according to the pipeline, the GSC 10kbp insert size library had an average insert size of 236bp, but the INRA 20kbp library had an average insert size of 20630bp.
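For anyone who wants to check insert sizes on their own alignments, here is a minimal Python sketch. It is not the pipeline mentioned above. It assumes the reads are already aligned and in SAM text format, and it relies on the aligner having filled in TLEN (column 9), which may not be reliable for every mate pair orientation. The SAM records in the example are made up.

```python
from collections import Counter
from statistics import mean, median


def template_lengths(sam_lines):
    """Yield TLEN (SAM column 9) once per read pair."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        tlen = int(fields[8])
        # TLEN is positive for the leftmost read of a pair and negative for
        # its mate, so keeping only positive values counts each pair once.
        if tlen > 0:
            yield tlen


def summarize(lengths, bin_size=1000):
    """Return count, mean, median and a binned histogram of insert sizes."""
    lengths = list(lengths)
    hist = Counter((length // bin_size) * bin_size for length in lengths)
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "histogram": dict(sorted(hist.items())),
    }


def toy_record(name, tlen):
    """Build a made-up SAM record with the given TLEN (illustration only)."""
    return "\t".join([name, "99", "chr1", "100", "60", "4M", "=", "200",
                      str(tlen), "ACGT", "IIII"])


toy_sam = ["@HD\tVN:1.6"] + [
    toy_record(f"r{i}", t)
    for i, t in enumerate([250, -250, 9800, -9800, 10200, -10200, 0])
]
stats = summarize(template_lengths(toy_sam))
print(stats)
```

Looking at the median and the histogram alongside the mean is useful because a group of short-insert pairs can pull the mean far away from the intended library size.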
See the histogram for the 10kbp library:
Getting all of our data uploaded to the SRA is important. It is good to share our data publicly whenever possible, and just as importantly, it provides us with a free off-site backup of our raw data.
We keep a spreadsheet of the lab’s sequencing data here: http://bit.ly/17Z4X1P
*IMPORTANT*: The spreadsheet is curated by a few members of the lab and is not complete. Your data may not be listed here. If it is not, please do your best to add it. You can email me and Sebastien with any questions you might have.
I’ve added a column that indicates the submission status of each sample in the Sunflower Relatives and Wild Sunflowers tabs, so you can tell if your data has already been submitted.
Chris and I devised and implemented this method to identify contaminant DNA in our HA412 454 and Illumina WGS reads.
N.B. I will update this post with links to the scripts once I get them up on the lab’s bitbucket account, plus extra details if I think they are necessary on review.