Non-Batch Submissions to the SRA

Since many of you have smaller amounts of data (a handful of samples/runs per experiment), I wanted to provide some info on submitting that data to the SRA without the help of their staff. The SRA has recently re-vamped their submission process, and it is much easier/simpler than before.

It’s important that everyone submit their own data. The SRA provides a free off-site backup, and we have already replaced corrupted files on our computers with clean copies obtained from the SRA. The time you spend doing this will ensure that you always have a clean copy of your raw data.

Here is a brief overview of the submission process, as well as the info you’ll need.

There is also the NCBI’s “quick start guide”.

Here are the pages for submitting BioProjects and BioSamples:

The process now is fairly straightforward and well-documented by the SRA. If anyone has trouble, you can ask others in the lab who have submitted their own data in the last year or so. I believe Dan Bock is one of them.

Here are some of the lab’s existing bioprojects. These can give you an idea of what kind of info to include in the abstract/project description, etc.:

Mounting the Moonrise NFS

Edit: As of February 2015, all our servers are running CentOS 7. The original instructions below are for Debian/Ubuntu linux, but here is a better set of generalized instructions for CentOS:

If you are mounting on an Ubuntu/Debian system, you can still use the instructions below. If you are mounting on a Red Hat derivative (Fedora, RHEL, CentOS, ScientificLinux, etc.), the link above should work.


I just had to re-learn how to do this today, so I thought it would be a good idea to write it up.

If any of you would like to mount the NFS on a computer in the building (Unix only, with a static IP; this means no wireless!), you can do so at your convenience using this guide.

First, install nfs-common with your package manager (Ubuntu: sudo apt-get install nfs-common)

Next, create a folder for the mount to bind to on your computer, and make sure the permissions are set to 777:

user@mycomputer: sudo mkdir -p /nameofyourNFSfolder
user@mycomputer: sudo chmod 777 /nameofyourNFSfolder

I think the whole tree needs the same permissions, so what I’ve done for all our machines (and what seems easiest) is to make a folder in the root directory, so that you don’t have to worry about the permissions in parent folders.

Next, the /etc/hosts.allow and /etc/exports files on moonrise need to be modified. Chris, Frances, and I all have the necessary permissions to do this. Just ask one of us and we can do it.

root@moonrise: nano /etc/exports

Add the following to the line beginning with /data/raid5part1
137.82.4.XXX(rw,sync) (with XXX standing in for the static IP of your machine)
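Put together, the whole /etc/exports entry looks like this (shown as one complete line for clarity; moonrise’s real entry may list several client IPs, one after another on the same line):

```
/data/raid5part1 137.82.4.XXX(rw,sync)
```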

You could also do this with machines in other buildings/off-campus as long as their IPs are static.

root@moonrise: nano /etc/hosts.allow

Your IP has to be added to the end of each line in this file.
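For reference, /etc/hosts.allow entries take the form `daemon: client_list`. The file on moonrise will look roughly like the sketch below; the daemon names are the usual NFS-related ones, and the existing IP shown is made up:

```
# /etc/hosts.allow (sketch) -- append your static IP to each line
rpcbind: 137.82.4.50 137.82.4.XXX
mountd: 137.82.4.50 137.82.4.XXX
```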

Now reload the /etc/exports file on moonrise. A full NFS restart is not required, and it will unmount the share on other machines; don’t restart unless you know for sure that no one is using the NFS on any of our computers!

root@moonrise: exportfs -a

Finally, mount the NFS on your machine:

user@mycomputer: sudo mount -v -o nolock -t nfs moonrise:/data/raid5part1 /nameofyourNFSfolder

There are various options you can use with the mount command, but the above should work for just about anyone.

If you want it to auto-mount each time you boot your computer, you can add the following lines to your /etc/fstab file:
#moonriseNFS
moonrise:/data/raid5part1 /nameofyourNFSfolder nfs auto,nolock 0 0

That’s it!

Sequencing Data Organization Update


I’ve created a skeleton directory structure and included a few example folders so that everyone can get a better idea of how our data will be organized on the new server. These are not set in stone. A few people have commented on the blog, or in lab meeting, or to me in person, and I’ve taken all of your suggestions into account.

If you feel like the setup here isn’t optimal, please give some feedback. The better we do this now, the more smoothly things will run in the future!

Sequencing Data Curation Part 1

With our new data server (Moonrise) up and ready to store our sequences, it is time to start being more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances, and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form I’ve created, which gives us all the information we need to store it in its right place. Here is the form.

This inserts a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data that was very difficult to find because its owner is no longer in the lab. With the new system, that will not happen.

We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.

Here are the directory structures for the three types of data:

WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

RNASeq -> Experiment  (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt

GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut Site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics, mapping data) -> From here on things diverge
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*groups are based loosely on clades, and on how much data for each species we have (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)
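As a concrete sketch, here is how the skeleton for one hypothetical GBS mapping dataset would be laid out. Every name below (enzyme, experiment, species, library size, coverage) is a made-up example, not a real dataset:

```shell
# Create the directory chain: GBS -> enzyme -> data type -> experiment -> species
mkdir -p GBS/PstI/Map/qtl_exp1/annuus

# Each leaf directory holds the reads plus a METADATA.txt
touch GBS/PstI/Map/qtl_exp1/annuus/METADATA.txt
touch GBS/PstI/Map/qtl_exp1/annuus/annuus_qtl_lib300bp_5x.fq

ls GBS/PstI/Map/qtl_exp1/annuus
```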

Generally, file names should include data not encoded in the directory structure, but important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familiar (GBS), suggestions would be appreciated.

Batch Sequence Read Archive Submissions

Getting all of our data uploaded to the SRA is important. It is good to share our data publicly whenever possible, and just as importantly, it provides us with a free off-site backup of our raw data.

We keep a spreadsheet of the lab’s sequencing data here:
*IMPORTANT*: The spreadsheet is curated by a few members of the lab and is not complete. Your data may not be listed here; if it is not, please do your best to add it. You can email me or Sebastien with any questions you might have.

I’ve added a column that indicates the submission status of each sample in the Sunflower Relatives and Wild Sunflowers tabs, so you can tell if your data has already been submitted.


Methods Write-up for Identifying Contamination in Sequencing Reads

Chris and I devised and implemented this method to identify contaminant DNA in our HA412 454 and Illumina WGS reads.

N.B. I will update this post with links to the scripts once I get them up on the lab’s Bitbucket account, plus extra details if I think they are necessary on review.
