Sequencing Data Curation Part 1

With our new data server (Moonrise) up and ready to store our sequences, it is time to be more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form I’ve created, which will give us all the information we need to store it in the right place. Here is the form.

This is inserting a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data which we had a very difficult time finding because the owner is no longer in the lab. With a system like the new one, that will not happen.

We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.

Here are the directory structures for the three types of data:

WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

RNASeq -> Experiment  (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt
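As a rough illustration, the WGS and RNASeq layouts above could be turned into paths with a few lines of Python. This is only a sketch: the function names, the `root` defaults, and the example values are hypothetical, and it assumes that the optional RNASeq levels are dropped from the deepest level upward (Library first, then Cultivar/Population, then Species).

```python
from pathlib import PurePosixPath


def wgs_dir(species, cultivar, library, experiment, root="WGS"):
    """Build WGS -> Species -> Cultivar -> Library -> Experiment."""
    return PurePosixPath(root, species, cultivar, library, experiment)


def rnaseq_dir(experiment, species=None, cultivar=None, library=None,
               root="RNASeq"):
    """Build RNASeq -> Experiment [-> Species -> Cultivar/Population -> Library].

    Assumption: a level may only be omitted if every deeper level is
    omitted too, so the hierarchy never has a hole in the middle.
    """
    levels = [species, cultivar, library]
    given = [level for level in levels if level is not None]
    if levels[:len(given)] != given:
        raise ValueError("optional levels must be filled from Species downward")
    return PurePosixPath(root, experiment, *given)


# Hypothetical example values, for illustration only.
print(wgs_dir("annuus", "cultivarA", "lib01", "exp01"))
print(rnaseq_dir("exp01"))
print(rnaseq_dir("exp01", species="annuus", cultivar="popA"))
```

The data file and its METADATA.txt would then both live in the directory that is returned.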

GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut Site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics, mapping data) -> From here on things diverge
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*groups are based loosely on clades, and on how much data for each species we have (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)
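The two GBS branches can be sketched the same way. Again, this is an illustration under assumptions: the function names and the example enzyme and experiment values are hypothetical, and rejecting a group that is not in the list above is my own choice rather than a rule we have agreed on.

```python
from pathlib import PurePosixPath

# Group names copied from the list above.
GBS_GROUPS = {
    "annuus_wild", "annuus_cult", "argophyllus", "bolanderi_exilus",
    "petiolaris_clade", "hybrids", "perennials", "tuberosus_cult",
}


def gbs_pop_dir(enzyme, group, root="GBS"):
    """Build GBS -> Cut Site/Enzyme -> Pop -> Group."""
    if group not in GBS_GROUPS:
        raise ValueError(f"unknown GBS group: {group!r}")
    return PurePosixPath(root, enzyme, "Pop", group)


def gbs_map_dir(enzyme, experiment, species, root="GBS"):
    """Build GBS -> Cut Site/Enzyme -> Map -> Experiment -> Species."""
    return PurePosixPath(root, enzyme, "Map", experiment, species)


# Hypothetical example values, for illustration only.
print(gbs_pop_dir("enzymeA", "petiolaris_clade"))
print(gbs_map_dir("enzymeA", "exp01", "annuus"))
```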

Generally, file names should include data not encoded in the directory structure, but important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familiar (GBS), suggestions would be appreciated.
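To make the naming idea concrete, here is one possible way to assemble such a file name in Python. The field order, the `lib` and `cov` prefixes, the underscore separator, and the function name are all assumptions made for illustration; no naming format has actually been fixed.

```python
def fastq_name(sample, library_size, depth):
    """Join a sample label, library size and depth of coverage into a .fq name.

    All three arguments are free-form strings (e.g. "500bp", "10x"); the
    layout of the name is a hypothetical convention, not an agreed one.
    """
    fields = [sample, f"lib{library_size}", f"cov{depth}"]
    for field in fields:
        # Keep names shell-friendly: no whitespace and no path separators.
        if any(ch.isspace() or ch in "/\\" for ch in field):
            raise ValueError(f"invalid characters in field: {field!r}")
    return "_".join(fields) + ".fq"


# Hypothetical example values, for illustration only.
print(fastq_name("sampleA", "500bp", "10x"))
```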

2 thoughts on “Sequencing Data Curation Part 1”

  1. Hey Evan,

    Good to see someone is taking the lead! Well done!

    Two comments:
    For RNAseq: I am not sure we want to separate things first by experiment and then by species. A lot (if not most) of our RNAseq data has already been used in more than one study, and this will only increase in the future. I suggest using the same directory structure as for WGS.

    For GBS: I am unclear whether this is for demultiplexed reads or raw reads.



    • This is for de-multiplexed data. This is the most useful form of the data for future users, so this is what we will store on our high-access data server. We can store the raw data in another designated space on silo (the low access machine).

      For the RNASeq, I only worry that we will lose information about which samples were run in the same batch, so to speak. I haven’t read up on batch effects for RNASeq, though. I’m assuming they are less important in comparison to microarray data.

      If there is a hybridization experiment, and we are separating things by species, the parents will be separated from the F1s, and in this case we would need a way to specifically identify the parents vs any other sequenced individuals from the same species/cultivar/population.

      I’m open to suggestions on how to handle this case. I know for sure of one hybridization experiment like this, and I’m sure the lab will do more in the future, so we should figure out how best to deal with it now.
