Blog comments closed!? Short term fix


For some reason, caused by either WordPress or an associated plugin, comment sections are closed for all current and new posts, as far as I can tell. The ultimate solution may require a bit of work, but for the moment, if you write a post or want to comment on one, do this to turn the comments on:

  1. Log in.
  2. Go to the RLR blog dashboard (dial icon, top left of screen).
  3. Click on Posts/All Posts on left-side panel.
  4. Find the slug of the post you are interested in and click the “Quick Edit” link under the post’s title.
  5. Check the “Allow comments” box. Comments should now be turned on; go to the post and comment at will.

If anyone feels inclined to come up with a universal or permanent solution, this might be a place to start.
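If there is shell access to the blog host, a hedged sketch using WP-CLI (assuming it is installed on the server) could open comments on all existing posts and restore the default for new ones; these are standard WP-CLI commands, but they may need adapting to our install:

# Open comments on every existing post.
wp post update $(wp post list --post_type=post --format=ids) --comment_status=open

# Make new posts default to open comments.
wp option update default_comment_status open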

Better (but not perfect) HMW DNA extraction protocol

I wrote some time ago about the protocol I used to prepare HMW DNA for the new HA412 assembly. The advantage of that protocol is that it doesn’t need much starting tissue, it is quick, and it can work quite well. However, it is also quite unreliable and will sometimes fail miserably.

To prepare HMW DNA for H. anomalus I tried a different protocol, suggested by Allen Van Deynze at UC Davis. They used it on pepper to prepare HMW DNA for 10X linked reads (the same application I had in mind) and obtained fragments with an average size of ~150-200 kb. The resulting 10X assembly was quite spectacular (N50 = 3.69 Mbp for a 3.21 Gbp genome) and was recently published. Continue reading

Even-more-diluted BigDye

It turns out that you can use half the amount of BigDye recommended by NAPS/SBC for Sanger sequencing with no noticeable drop in sequence quality. The updated recipe for the working dilution:

BigDye 3.1 stock     0.5 parts
BigDye buffer        1.5 parts
Water                1 part
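As a quick sanity check of the ratio (0.5 : 1.5 : 1, i.e. 3 parts total), the snippet below computes the volumes for one ~150 µl aliquot (about 50 reactions at 3 µl each); the total volume is the only number you would change:

awk -v total=150 'BEGIN { printf "BigDye 3.1: %.0f ul\nBuffer: %.0f ul\nWater: %.0f ul\n", total*0.5/3, total*1.5/3, total*1/3 }'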

I will prepare all future working dilutions using this recipe and put them in the usual box in the common -20ºC freezer. For more details on how to prepare a sequencing reaction see this post, and for how to purify reactions see this or this post.

Minimizing job queuing delays on Compute-Canada / SLURM

If you have several thousand similar jobs to submit to a SLURM cluster such as Compute Canada, one of your goals, aside from designing each job to run as quickly as possible, will be to reduce the queuing delays: the time spent waiting for an idle node to accept your job. Scheduling delays can quickly overshadow the runtime of your job if you do not take care when requesting job resources; you may wait hours, or even days. One way to minimize waiting, of course, is to schedule fewer jobs overall (e.g. by running more tasks within a given allocation if your allocation’s time has not expired). If you can fit everything in a single job, perfect. But what if you could divide the experiment into two halves that run concurrently, obtaining close to a 200% speedup while also spending less wall-clock time waiting in the job queue?
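As a concrete illustration (a hedged sketch only; the program name, inputs, and account are placeholders), two halves of an experiment can share one allocation and run concurrently inside the same job script:

#!/bin/bash
#SBATCH --account=def-someuser   # placeholder allocation account
#SBATCH --cpus-per-task=32
#SBATCH --time=03:00:00
# Launch both halves at once, splitting the allocated cores between them.
my_analysis --threads 16 --input first_half.txt  > first_half.log  2>&1 &
my_analysis --threads 16 --input second_half.txt > second_half.log 2>&1 &
wait   # the job ends only when both halves have finished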

When working with compute tasks whose execution parameters can be moved along several different resource requirement axes (time, number of threads, memory), picking the right parameters can become difficult. Should I throw in more threads to save time? Should I use fewer threads to save RAM? Should I divide the inputs into smaller chunks to stay within a time limit? Wait times are also sometimes multiplied when jobs have to be resubmitted because of unforeseen errors (or a change of parameters).

It helps to understand how the SLURM scheduler makes its decisions when picking the next job to run on a node. The process is not exactly transparent. The good news is that it is not fully opaque either: there are hints available.

Selecting resources.

Once a job has been submitted via commands such as sbatch or srun, it enters the job queue along with the thousands of jobs submitted by other fellow researchers. You can use squeue to see how far back you are in line, but that does not tell you much; for instance, it will not always provide estimated start times.
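For what it is worth, squeue can be asked for start-time estimates, although they are often missing until the backfill scheduler has evaluated the job; two standard invocations:

squeue -u $USER            # your pending/running jobs and their current state
squeue -u $USER --start    # adds estimated start times (frequently shown as N/A)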

You most likely already know that the amount of resources you request for your job (e.g. number of nodes, number of CPUs, RAM, and wall-clock time limit) influences how soon your job becomes eligible to run. However, varying the amount of resources can have surprising effects on the time spent in the scheduling state (aka state PENDING). The kicker:

In some cases, increasing requested job resources can lower overall queuing delays.

Compute nodes in the cluster have been statically partitioned by the admins. Varying the resource constraints of your job changes the set of nodes available to run it, somewhat like a step function. Each compute node on Compute Canada (cedar and graham) is placed in one or more partitions:

  • cpubase_bycore (for jobs wanting a number of cores, but not all cores on a node)
  • cpubase_bynode (for jobs requesting all the cores on a node)
  • cpubase_interac (for interactive jobs e.g., `salloc`)
  • gpubase_… (for jobs requesting gpus)
  • etc.

If you request fewer cores than the number available on the node, your job waits for an allocation in the _bycore partitions. If you request all of the cores available on the node, it waits for an allocation in the _bynode partitions. So, depending on availability, it might help to ask for more threads and configure your jobs to process more work at once. The SLURM settings vary by cluster; the partitions above are for cedar and graham. Niagara, the new cluster, will for instance only do _bynode allocations.

On Compute Canada, you can view the raw information about all available nodes and their partitions with the sinfo command. You can also view current node availability by partition using partition-stats.
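As a starting point, something like the following gives a quick per-partition overview (the sinfo format string is just one reasonable choice):

sinfo -o "%P %l %D %c %m %a"    # partition, time limit, node count, cores/node, memory (MB), availability
partition-stats                 # Compute Canada wrapper summarizing nodes and queued jobs per partition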

The following command outputs a table that shows the current number of queued jobs in each partition. If you look at the table closely, you learn that there is only one job in the queue for any node in the _bynode partition (for jobs needing less than 3 hours). The _bycore partition, on the other hand, has a lot of jobs sitting in it patiently. If you tweak your job to make it eligible for the partition with the most availability (often the one with the strictest requirements), then you minimize your queuing time.

$ partition-stats -q 

     Number of Queued Jobs by partition Type (by node:by core) 

Node type | Max walltime 
          | 3 hr  | 12 hr   | 24 hr   | 72 hr  | 168 hr | 672 hr  |
----------|--------------------------------------------------------
Regular   | 1:263 | 429:889 | 138:444 | 91:3030| 14:127 | 122:138 |
Large Mem | 0:0   | 0:0     | 0:0     | 0:6    | 0:125  | 0:0     |
GPU       | 6:63  | 48:87   | 3:3     | 8:22   | 2:30   | 8:1     |
GPU Large | 1:-   | 0:-     | 0:-     | 0:-    | 1:-    | 0:-     |
----------|--------------------------------------------------------

“How do I request a _bynode partitioning instead of a _bycore ?”

That is not obvious and quasi-undocumented. The answer is that you do so by asking for all the CPUs available on the node, with sbatch --cpus-per-task=N. To pick the best number N, you have to dig a bit deeper and look at the inventory (this is where the sinfo command comes in handy); the next section covers this. It is also something that may change over time as the cluster gets upgrades and reconfigurations.

Also, if you request more than one node in your job, each with N CPUs (e.g. sbatch --nodes=3 --cpus-per-task=32 ...), then all of them will be _bynode.
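One way to double-check where a submitted job ended up (the exact partition names vary by cluster, and 1234567 is a placeholder job ID) is to inspect it with scontrol:

scontrol show job 1234567 | grep -o "Partition=[^ ]*"   # e.g. Partition=cpubase_bynode_b1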

Rules of Thumb

Here are some quick rules of thumb that work well for the state of the cedar cluster as of April 2018. The information in this section was gathered with sinfo, through conversations with Compute Canada support, and through experience (over the course of a few weeks). In other words, I have not attempted to systematically study job wait times over months, but I will claim that these settings have worked best so far for my use cases (backed by recommendations from the support team).

  1. Most compute nodes have 32 CPUs installed.
    If you sbatch with --cpus-per-task=32, you are likely to get your job running sooner than if you ask for, say, 16 CPUs. If your job requires a low number of CPUs, it might be worth exploring options where 32 such tasks run in parallel within one job. It is okay to ask for more, but try to use all the resources you ask for, since your account will be debited for them.
  2. Most compute nodes have 128G ram.
    If you keep your job’s memory ceiling under that, you are hitting a sweet spot and will skip a lot of the queue. The next brackets up are (48 cores, 192G), with half as many nodes as the 128G variety, and (32 cores, 256G), with even fewer.
  3. Watch out for the --exclusive=user flag.
    The flag --exclusive=user tells the scheduler that you wish your job to be colocated only with jobs run by your user. Perhaps counter-intuitively, it does not impose a _bynode restriction. If your job is already _bynode (i.e., you request enough CPUs to take a whole node), this flag is redundant. If you do not ask for all the available cores (meaning your job needs a _bycore allocation), the flag will prevent jobs from other users from running on the remaining CPUs of that node. In that case it will likely hurt your progress (unless you have many such jobs that can fill the node).
  4. The time limit you pick matters. Try to batch your work in <= 3H chunks.
    A considerable number of nodes will only execute (or favor) tasks that can complete in less than 3 hours of wall-clock time, so a considerably larger pool of nodes becomes eligible for your job if you stay under the 3H limit. The next ‘brackets’ are 12H, then 24H, 72H, 168H (1 wk), and 28d. This suggests that there is no benefit to asking for 1H vs 3H, or 22H vs 18H, although an intimate conversation with the scheduler’s code would be needed to confirm that.
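Putting these rules together, here is a hedged sketch of a job script aimed at the cedar configuration described above (April 2018); the account name and program are placeholders:

#!/bin/bash
#SBATCH --account=def-someuser      # placeholder allocation account
#SBATCH --time=02:50:00             # stays inside the <3 h bracket
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32          # all cores on a 32-core node, i.e. a _bynode allocation
#SBATCH --mem=120G                  # keeps the ceiling under the common 128G nodes
srun my_analysis --threads "$SLURM_CPUS_PER_TASK"   # placeholder command; use every core you asked for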

But, I want it now.

It should be mentioned that if only a single “last-minute” job needs to be run, salloc (which takes mostly the same arguments as sbatch) often provides the quickest turnaround: it will get you an interactive shell on a node of the requested size within a few minutes of asking. A separate partition, cpubase_interac, answers those requests. Again, it is worth looking at the available configurations. Keep salloc in your back pocket.
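For example (the account name is again a placeholder):

salloc --time=02:00:00 --cpus-per-task=4 --mem=16G --account=def-someuser   # 2 h interactive session on 4 cores / 16G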

 

Streamlined GBS protocol

We already have a GBS protocol on the lab blog, but since it contains three different variants (Kate’s, Brook’s and mine) it can be a bit messy to follow. Possibly because I am the only surviving member of the original trio of authors of the protocol, the approach I used seems to have become the standard in the lab, and Ivana was kind enough to distill it into a standalone protocol and add some additional notes and tips. Here it is!

Simplified GBS protocol 2017

Purging GBS index switching


Considering the amount of sequencing coming out of the newer Illumina machines, we’ve started to combine GBS libraries with other samples. Because of how GBS libraries are made, when they are multiplexed with whole-genome sequencing (WGS) samples there is an appreciable amount of contamination from GBS to WGS; that means you will find GBS reads in your WGS data. I’ve quantified that in the following figure, which shows the percent of barcoded reads in WGS samples.

The left side is contamination from barcodes sequenced in different lanes (i.e. ones that could not contaminate). The right side is barcodes from GBS samples sharing the same lane (i.e. ones that could contaminate). The take-home message is that between 1% and 15% of reads are GBS contamination. This is far above what we want to accept, so they should be removed.

I’ve written a script to purge barcoded reads from samples. You give it the set of possible barcodes, both forward and reverse (all current barcodes are listed here: GBS_2enzyme_barcodes). I’ve been conservative and given it all possible barcodes, but you could also trim the list to only the barcodes present in the lane. The script looks for reads that start with a barcode (or any sequence 1 bp away from a barcode, to account for sequencing error) plus the cut site. If it finds a barcoded read, it removes both directions of the read pair. It outputs some stats to STDERR at the end. Note that this was written for 2-enzyme PstI-MspI GBS, although it could be rewritten for other combinations.

An example of how you could apply this:

Make a bash script to run the perl script:

#!/bin/bash
# Usage: bash .decontaminate.sh <sample_prefix>  (expects <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz)
input=$1
perl ../purge_GBS_contamination.pl /home/gowens/bin/GBS_2enzyme_barcodes.txt ${input}_R1.fastq.gz ${input}_R2.fastq.gz ${input}.tmp
gzip ${input}.tmp_R*.fastq

Run this bash script using GNU parallel:

ls | grep .gz$ | grep R1 | sed s/_R1.fastq.gz//g | parallel -j 10 bash .decontaminate.sh 2>> decontamination_log.txt


Seeds of bliss –  גרעינים – بذر

How does chewing 10 tons of sunflower seeds bring Arabs and Israelis together?

This started in 2012 and is an ongoing project (see the Facebook page here) by an artist called Noam Edry. Her idea is to bring together people from the Middle East through the ancient habit of munching sunflower seeds and spitting out the shells. The project is called “Seeds of Bliss – גרעינים – بذر”. It uses the sunflower as an emblem of the Middle East to produce a contemporary art performance (which is a very fancy name for people sitting together to eat sunflower seeds and spit out the shells) in outdoor cafes, in a friendly and personal atmosphere, in city pairs like Aqaba and Eilat, Nablus and Haifa, Beth-Lehem and Umm El Fahm, and Jenin and Afula. They get people from neighbouring cities to travel to their city pair. The challenge is to unite people from all these different cities to munch 10 tons of sunflower seeds!

Photo: Shula Covo, all rights reserved

Check this out in the media

It is really comforting to watch the videos of how eating sunflower seeds can bring together people whose nations are in conflict.

 

Enjoy!

 

New pictures from Texas field trips

Hello,

I added two new folders from my most recent field trips to Texas (one more to go). They are not as good as Dylan’s pictures but might be useful. Remember we have a Flickr account with:

User name: sunflower_photos

Password: sun flower head
For more information about this Flickr account or how to organize the lab photos, see these posts:

New calendar page

Good News Everyone!

Let me introduce a new lab calendar.

It is accessible from the site Menu and has a simple interface to add a new event:

The plan was to replace the flat view of the Meetings section and bring more interactivity (for example, you can export this calendar as iCal), so you can easily add new events and maybe avoid additional emails.

I’ve added a few upcoming events, and you are welcome to add more if you have something to share with your colleagues!

Filtering TEs from SNP sets

I’ve often wondered whether the TEs in the sunflower genome were producing erroneous SNPs, and I have unintentionally produced a little test of this. When calling SNPs for 343 H. annuus WGS samples, I set FreeBayes to only call variants in non-TE parts of the genome (as per our TE annotations). Unfortunately, I coded those regions wrong and actually called about 50% TE regions and 50% non-TE regions. That first set was practically unfiltered, although only SNPs were kept. I then filtered it to select high-quality SNPs (AB > 0.4 & AB < 0.6, MQM > 40, AC > 2, AN > 343; genotypes filtered with DP > 2). For each of these sets, I then went back and removed SNPs from TE regions, and I’ve plotted the totals below:

There are a few things we can take from this:

  • There are fewer SNPs in TE regions than non-TE regions. Even though the original set was about 50/50, about 70% of SNPs were from the non-TE regions.
  • For high quality SNPs, this bias is increased. 80% of the filtered SNPs were from non-TE regions.

Overall, I think this suggests that the plan to only call SNPs in non-TE regions is supported by the data. It also has the advantage of reducing the genome from 3.5 Gb to 1.1 Gb, saving a bunch of computing time.
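For reference, a minimal sketch of how SNPs in TE regions can be stripped from a VCF after the fact, assuming a BED file of TE annotations; the file names are placeholders and this is not necessarily the exact command used above:

# Keep only VCF records that do not overlap any TE interval (-v inverts the match, -header keeps the VCF header).
bedtools intersect -header -v -a snps.vcf.gz -b TE_annotation.bed > snps.non_TE.vcf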

High Molecular Weight DNA

Third-generation sequencing technologies (PacBio, Oxford Nanopore) can produce reads that are several tens of kb long. That is awesome, but it means that you also need to start from intact DNA molecules of at least that size. I thought the DNA extraction method I normally use for sunflower would be good enough, but that is not the case. Continue reading

6% Sephadex G-50 BigDye 3.1 Sequencing Clean-up

This is a protocol Allan DeBono used when cleaning BigDye 3.1 sequencing reactions for sequencing at NAPS. In both of our hands it worked consistently to confirm vector inserts. An advantage of this protocol is that it takes 5-10 minutes to perform, depending on how many reactions are being run (and how fast a pipetter you are). It also uses spin columns that can be re-used after autoclaving. Continue reading

Electro-Competent EHA-105 Agrobacterium Cells for Transformation

The following protocol is for producing electro-competent cells of Agrobacterium. Currently, A. tumefaciens agropine-type strain EHA-105 (Hood et al., 1995) is used for transformation of sunflower and Arabidopsis. Harvesting new competent cells takes 4 days to complete, with the 4th day being the most labour-intensive. It is advisable to begin early in the morning on the 4th day, as the culturing, purifying, and aliquoting can take several hours.

Continue reading

BRC wifi on the 3rd floor

Many people on the 3rd floor of the BRC building have told me that their wifi (ubcsecure) keeps getting disconnected. I had the same problem, so I asked UBC IT services for help.

IT services suggested the following steps:

1) Remove the ubcsecure network from the wifi options.
2) Connect to ubcvisitor and go to autoconnect.it.ubc.ca to run the tool.
3) ubcsecure is created automatically.
4) Reconnect to ubcsecure.

Now it is working for me.
I hope it works for everyone else in the building.


BigDye 3.1

While cleaning up the freezer a few months back, we found, encrusted in a block of ice at the bottom of the common lab freezer, a box with a sizeable amount of BigDye 3.1, the reagent used to prepare samples for Sanger sequencing (basically a PCR mastermix with labelled nucleotides). As that is expensive stuff (what we have is worth $3,000-4,000), I tested it to see if it would still work. All the tests I did were rated “great sequence” when I got the results back from NAPS, so the BigDye is fine.

Thanks to a donation of buffer from NAPS, I have diluted some of the BigDye to the same working concentration NAPS uses (1 part BigDye, 1.5 parts buffer, 0.5 parts water). Follow the instructions on the NAPS website (http://naps.msl.ubc.ca/dna-sequencing/dna-sequencing-services/user-prepared/) to prepare your sample, and use 3 µl of the diluted BigDye. There are eight aliquots of about 150 µl each (50 reactions) in a box with a yellow “BigDye” label in the common -20 ºC freezer. If you plan to do only a few reactions at a time, consider making smaller aliquots for your personal use (BigDye doesn’t like repeated freeze-thaw cycles). To avoid confusion, I kept the concentrated BigDye and dilution buffer in their Applied Biosystems box on the door of the first freezer on the right in the freezer room. If we run out of the dilution, just let me know and I’ll be happy to prepare more.

How to do effective SPRI beads cleaning

SPRI bead cleaning is one of the most frequently repeated steps during library preparation, and probably the step where most of us lose a lot of precious DNA.

Losing DNA scared me so much (because it can be observed) that I hesitated a long time before trying to use beads to concentrate genomic DNA, because the usual rates of recovery are ~50%.


Fortunately, with some practice and a lot of patience, it is possible to reach 90% recovery. How do you lose as little DNA as possible? Here are some guidelines: Continue reading