How to hit the ground running with GATK on compute-canada

This article covers the steps needed to get to the point where you can run GATK commands on compute-canada. It will also provide recommendations on how to organize your files, achieve proper versioning, and request compute resources to run the command

as a precondition, you should:

have an account on compute canada
have a terminal/console application from which you can login to compute-canada, i.e. with ssh (either with password-based, or public-key based authentication)

We would suggest you also read other complementary articles on working with the terminal and command line tools, here (fixme update link).

For the purpose of this article, we’ll assume our goal is to produce an index file for a multi-GB VCF file.

There are many ways to do this. But GATK provides tool IndexFeatureFilefor this purpose. The tool will accept a VCF file as input (.vcf.gz), and produce a corresponding index file. (.vcf.gz.tbi).

There are other tools that can produce index files for gzipped VCF, namely a combination of bgzip and tabix. Most tools that produce indexes for vcfs produce files that are compatible with one another, but we’ll stick with the gatk brand for now.

We will cover the following steps to gatk properly on the compute cluster:

Login, and allocate a pseudo-terminal
Allocate/reserve a compute node
Load GATK (of the correct version)
1. from the compute-canada provided modules
2. from a “singularity” image
Launch the command. (we’ll cover the interactive way)

1 / Login and allocate a pseudo-terminal.

When you SSH into compute-canada, you are assigned a login node

jslegare@pan:~$ ssh cedar.computecanada.ca
...
[jslegare@cedar5 ~]$

In this case, I ended up on cedar5. If you don’t specify which login node you prefer, you’ll be assigned one randomly (by DNS). You can also specify the precise one you want, which comes in handy when one of them is unresponsive (or is being abused).
You would state a preference by changing the hostname. There are two login nodes (as far as I’m aware) on the cedar cluster, cedar1 and cedar5. You can login to a specific one like so:

ssh cedar1.cedar.computecanada.ca
ssh cedar5.cedar.computecanada.ca
...

Now, we’re expecting our program to run for a while, so we’ll need to run it into a pseudo-terminal. Enter TMUX.

The program tmux buys us the ability to disconnect from cedar without interrupting the programs running on the terminal. It also allows us to run multiple interactive programs from the same terminal “window” and keyboard. It’s a way to “multiplex” our terminal window. I recommend tmux, because it has somewhat intuitive controls and comes with decent configuration out of the box, but you may prefer its competitor, screen.

You can create a new tmux-session, with the command:

# choose a meaningful name
$ tmux new-session -s gatk-stuff

Once it runs, it looks and feels just like a normal terminal, except there’s an extra bar at the bottom (in green), telling you you’re “one-level-deep” into virtual reality. A few new key combos will get recognized once you’re in tmux. Unlike in nested dreams, time doesn’t slow down as you go deeper, unfortunately.

The bar at the bottom tells you a few useful things, like the time, the server name, and the name of the session, i.e. [edit], in this case. You can run commands at the prompt like you would normally on a regular terminal emulator.

At some point you’ll want to “get out” of the tmux session. You can detach from the session with CTRL-B , followed by d. (d for “detach”). That’s control+b, release, then d. Then, you can re-attachto your named session with the command:

tmux attach-session -t gatkstuff

When your session is detached, the programs that you started are still running (they are not paused either. running). You can also disconnect completely from cedar, without affecting your detached session. Just make sure you detach from the session before exiting — typing exit inside a session will terminate the session, just as it would in a normal terminal window.

If you forget the name you gave your session, you may list them (here, 3):

[jslegare@cedar1 ~]$ tmux list-session
 edit: 1 windows (created Tue Feb 19 15:38:44 2019) 377x49
 gatk-stuff: 1 windows (created Wed Feb 20 13:10:57 2019) [80x23]
 run: 1 windows (created Tue Feb 19 15:34:25 2019) [271x70]#

I find that if I don’t touch a session for a while (generally 3-4 days), it will automatically be cleaned up. This happens when either the tmux server reboots, or cron reaps old processes on compute-canada. If you login and you don’t see your old session listed, it may be what happened. It may also be that the session was created on a different login node (cedar1 vs cedar5) — sessions are local to the server on which they were created.

CTRL-Bis the prefix to all tmux combos. That’s how you interact with the terminal. There are a lot of different ones to learn (https://gist.github.com/MohamedAlaa/2961058), but as you use them, they will make it to muscle memory. This is going to sound a little circular, but if you’re using an application which needs the key sequence CTRL-B for something, and you run that application inside tmux, you would send it the key combo CTRL-B CTRL-B. Some other useful combos are:

CTRL-B, ) Go to next session
CTRL-B, ( Go to previous session
CTRL-B, [ Enable scrolling mode. Scroll up and down with arrows, pageup/pagedown. Exit mode with Esc.

You can create “tabs” inside your tmux-sessions, which the tmux manual refers to as “windows”. And you can split the windows vertically and horizontally. You can also nest sessions, which is useful when your jobs involve multiple computers.

Note: It’s possible that your mouse wheel doesn’t quite feel the same when you’re inside tmux, depending on the version running. On some versions, “wheelscroll up” is interpreted as “arrow up”, which can be confusing. There’s a way to configure that, of course.

OK. before the next step, login to cedar, and create a new tmux session (or reattach to a previous one. We’re ready to run commands.

2/ Allocate a Compute Node

You can perform short program operations directly on the login nodes, but as a rule of thumb, any process that takes more than 10 cpu minutes to complete, or that consumes more than 4G of ram, should be run on a compute node that you have reserved. See Running Jobs.

To reserve a compute node, you would use one of the slurm commands, sbatchor salloc. The former will schedule a script to run, and run it in the background, eventually. The eventually is a big word which can mean anywhere from 10 seconds to 3 days. The latter, salloc,will allow you to interact directly with your reservation when it’s ready — it performs some terminal magic for you and places your job at the foreground in your terminal.

salloc is most useful when you’re experimenting with programs, like gatk in this case, and your work will involve a lot of trial and error. E.g. when you’re to nail down precisely which commands to add to your scripts. Another advantage of salloc over sbatchis that the node reservations you make sallocare considered “interactive”, and fall in a different wait queue of compute-canada. C-C reserves many nodes just for “interactive” work, so there is usually much less of a wait time to get your job running (usually 1min). However if you have 1000 instances of a script to run, sbatch will be the preferred way.

Example command to get an interactive node:

[jslegare@cedar1 test]$ salloc \
         --time 8:00:00 \
         --mem 4000M \
         -c 2 \
         --account my-account
 salloc: Granted job allocation 17388920
 salloc: Waiting for resource configuration
 salloc: Nodes cdr703 are ready for job
 [jslegare@cdr703 test]$ ... get to work ...

Time limit for the task: --time hh:mm:ss. Keep it under 24 h for interactive nodes.
CPUs: -c 2. The number of cpus (aka cores) to give to your task. This is the number of programs that can run “in parallel” simultaneously. Programs concurrently running share the available CPUs in time slices.
Memory: Memory--mem <N>G(GiB) or --mem <N>M(MiB). Total max memory in Gigabytes or Megabytes per task. This is dependent on the nature of the work you’ll do. 4Gwould be plenty for indexing a VCF. A rule of thumb is to give 2000M per cpu.
Account: --account my-account. This is how you choose which account will be billed for the utilization. Omit this parameter to see the list of accounts you can use. Not all accounts are equal, some will offer priority in the queues.
There are many more parameters you could use. Do man sallocto get the full list.

It is recommended that you specify the memory in megabytes, not gigabytes. If you ask for, e.g. 128G, you’re essentially asking for 131072M(128*1024M). Hardware nodes available which have a nominal 128GB of ram installed (and there’s a lot of them with this hardware config) won’t be able to satisfy a request for 128G(because some of it it reserved), but would be able to satisfy a request for 128000M.
Also, if you’re trying to have the queue on your side, you want to always be asking for fewer resources than everyone else — and the least amount possible that gets the job done. That’s a golden rule on c-c.
Paraphrasing, from:Running Jobs

Once the salloc returns with satisfaction, you will get a shell prompt. Notice the hostname prefix changing on the command prompt, from cedar1(the login node), to cdr703(the compute node). The full address for compute nodes is:

<nodename>.int.cedar.computecanada.ca

This is useful to know because as long as you have a task running on a node, you can ssh into it, (by first hopping into cedar login nodes). This is true for jobs allocated both with sbatch, and salloc. You might need to have public key authentication setup for that to work however: compute nodes do not allow password login, as far as I know.

Logging in the compute node allows you to inspect the state of processes on that node while your task is running (e.g. with top), or inspect files local to the job

3 / Load GATK

We have an interactive node going, and now we’re ready to run some gatk commands. We need to modify our default paths to point to the particular gatkversion we want. I’m going to show two ways of doing it:

using the “modules” provided by compute-canada
using versions available online, in docker/singularity images.

Compute-canada, packages software in “modules” which can be added to the set of binaries you can run from the command line. In your environment you can load one version of gatk modules loaded at a time. It’s easy, but it’s not so convenient if you depend on different versions in the same script.

Method 1: using modules

To use GATK using slurm modules, you first need to:

$ module load gatk

# print which version we get
$ gatk --version

Running:
     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/gatk/4.1.0.0/gatk-package-4.1.0.0-local.jar --version
 The Genome Analysis Toolkit (GATK) v4.1.0.0
 HTSJDK Version: 2.18.2
 Picard Version: 2.18.2

$ ... gatk commands ...

At this point, the program gatkis on your PATH, and you can call it whichever way you like. The one thing to remember, is that the next time you open a session, you will have to load the module again.

Reproducibility: I highly recommend that you double-check the version of the gatk you are loading if you are depending on reproducibility. It’s hard to know when compute-canada staff may decide to change the version on you.
For instance, as a crude check in a script, you could add the following to ensure that gatk is there and its version is exactly 4.1.0.0. Six months from now, when you run your script again, you won’t have to remember which version you used:
gatk --version 2>/dev/null | grep -q "v4.1.0.0" || { echo "bad gatk version" >&2; exit 1; }

Method 2: using singularity containers

My favourite method of running software that is version-sensitive, like GATK, is to not deal with slurm modules. I select the exact version of software I want from the web, and run it inside a singularity “container”. Think of a container as an archived and versioned ecosystem of files and programs. The authors of GATK publish tagged official versions of the software already installed in a minimal environment. All the gatk tools and the correct version of its dependencies (bash, perl, R, picard, etc.) are packaged together.

Other advantages are that it allows you to run multiple versions of gatk in the same session, and that it makes the software versions you need explicit in your scripts. Refer to this page for more info later: https://docs.computecanada.ca/wiki/Singularity

The list of GATK versions available to you are listed on docker hub, in the broadinstitute’s channel. Every new version of GATK will follow quickly with a new entry on the dockerhub channel.

You download a particular container of your choice (aka “pulling” a container), and then you run it:

[jslegare@cdr333 ~]$ singularity pull docker://broadinstitute/gatk:4.1.0.0
WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull. Use Singularity Registry
WARNING: (shub://) to pull exactly equivalent images.
Docker image path: index.docker.io/broadinstitute/gatk:4.1.0.0
Cache folder set to /home/jslegare/.singularity/docker
Importing: base Singularity environment
...
Exploding layer: sha256:6ce2cf3c2f791bd8bf40e0a351f9fc98f72c5d27a96bc356aeab1d8d3feef636.tar.gz
Exploding layer: sha256:2654b1d68e7ba4d6829fe7f254c69476194b1bdb0a6b92f506b21e2c9b86b1dc.tar.gz
WARNING: Building container as an unprivileged user. If you run this container as root
WARNING: it may be missing some functionality.
Building Singularity image...
Singularity container built: ./gatk-4.1.0.0.simg

Note the warning about reproducibility above. Technically the broadinstitute could overwrite the container with the tag 4.1.0.0 on the remote repository, which means you would be downloading a different container the next time around. But that would be extremely bad form. — akin to re-releasing new code under an old version. Some tags, like “latest” are meant to change. Others, with specific version numbers, are typically immutable, by convention.
more info about the goal here:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5706697/

Singularity saved the image (gatk-4.1.0.0.simg) in the same folder. These images are typically big (1-4GB), and they’re better off placed in a common location, for reuse.

Once you have the image, you have access to all the programs inside it:

[jslegare@cdr333 ~]$ singularity exec ./gatk-4.1.0.0.simg gatk -version
 Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
 Running:
     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar -version
 The Genome Analysis Toolkit (GATK) v4.1.0.0
 HTSJDK Version: 2.18.2
 Picard Version: 2.18.25

[jslegare@cdr333 ~]$ singularity exec ./gatk-4.1.0.0.simg ls -ld /scratch/
 ls: cannot access '/scratch/': No such file or directory

Notice however that the second command fails. This is because the container doesn’t contain a /scratchdirectory. If you wish to access a folder from the host computer from inside the container environment, you must use “bind mounts”.

[jslegare@cdr333 ~]$ singularity exec -B /scratch:/scratch ./gatk-4.1.0.0.simg ls -ld /scratch/
 drwxr-xr-x 20086 root root 1007616 Feb 28 23:17 /scratch/

The syntax is: -B <src>:<dst>. It makes directory or file <src> available at location <dst>inside the container environment. You can specify multiple -Bflags as well, for instance -B /scratch:/scratch -B /project:/project. But they need to be placed after the execkeyword, because they’re an option of the exec singularity subcommand. Also, your home folder /home/<user>/is mounted by default as a convenience, without having to specify anything.

If you’re not sure where programs are in a container, and you want to kick the tires, you can typically open an interactive shell session inside it (with bash, or ash, or sh):

[jslegare@cdr333 ~]$ singularity exec ./gatk-4.1.0.0.simg /bin/bash
 bash: /root/.bashrc: Permission denied
# inside the container
 jslegare@cdr333:/home/jslegare$ ls /home
 jslegare
 jslegare@cdr333:/home/jslegare$ ls /project
 ls: cannot access '/project': No such file or directory
 jslegare@cdr333:/home/jslegare$ which gatk
 /gatk/gatk
 jslegare@cdr333:/home/jslegare$ exit
 exit
# container is done. we're outside again
[jslegare@cdr333 ~]$ ls -ld /project
 drwxr-xr-x 6216 root root 675840 Feb 28 13:17 /project

Modifications to any files inside the container get reverted when the container exits. This lets your programs start fresh each time. But if you create new files inside a container, make sure they are placed in a folder that you have “bound” and is visibly accessible outside the container, such as your scratch or somewhere under your home folder. It’s also a good practice to mount the strict minimum number of folders needed to the container — it will limit the potential damage mistakes can do.

4/ Indexing a VCF

Ok, the hard part is done. Now we can index our VCF. GATK will have to read our entire VCF file, which come typically in .vcf.gzcompressed form, and then output the corresponding .vcf.gz.tbiin the same folder.

[jslegare@cedar1 my_vcf]$ pwd
 /scratch/jslegare/my_vcf
 [jslegare@cedar1 my_vcf]$ ls -lh
 -rw-rw-r-- 1 jslegare jslegare 14G Feb 28 16:56 my_file.vcf.gz

Above I’m showing you the vcf file. There are three things to notice:

It’s huge (14GB).
There is no accompanying index, which means that any vcf tool will take its sweet time to do anything with it.
it’s in my /scratch . this matters.

Placement tip 1: When we’re reading and writing large files, it’s faster to do it on a fast filesystem. If you have the choice, I recommend running the commands somewhere under /scratch/<myusername>/. Your /home/<myusername>and /project/foo/<myusername>will do, in a pinch, but they have higher-latency, smaller capacity, and they’re being constantly misused by hundreds of other users — which means they’re busy, to the point where they’re barely usable at all sometimes. That said, if the VCF you give is already somewhere outside /scratch/, it might not be worth your while to copy it to /scratch, and then move it back. But presumably, this is not the only command you’ll run on that VCF, so a copy might pay off eventually.

Placement tip 2: When you’re on a compute node (not the login node), there’s a special location that might even be faster than scratch, depending on your program’s access patterns. It’s a location that is backed by an SSD drive, local to the node. Whatever you store there is only visible inside that node, and it gets cleaned up when your task finishes or expires. It’s placed in the special variable: $SLURM_TMPDIR. You have about 100GB or so that you can use there, but you can ask for more in the sbatch or salloc command.

Placement tip 3: This is more advanced stuff, but if the disk is too slow for you, you can also create files and folders inside `/dev/shm`. The files you place there are backed by RAM, not the disk, and their total size count towards your job’s maximum memory allocation. In my experience however, gatk workloads don’t really benefit from that edge, because of the linear data formats. So stick with the previous two placement tips for basic gatk stuff, but keep this trick in your back pocket for your own tools.

Ok, the command we need is IndexFeatureFile, and we run it like so:

[jslegare@cdr333 my_vcf]$ singularity exec -B /scratch:/scratch docker://broadinstitute/gatk:4.1.0.0 gatk IndexFeatureFile -F my_file.vcf.gz
...
01:14:04.170 INFO  ProgressMeter - Ha412HOChr04:134636312              1.3                909000         677978.7
...
01:19:35.490 INFO  ProgressMeter - Ha412HOChr17:49367284              6.9               4672000         680776.7
01:19:54.254 INFO  IndexFeatureFile - Successfully wrote index to /scratch/jslegare/my_vcf/my_file.vcf.gz.tbi
 01:19:54.255 INFO  IndexFeatureFile - Shutting down engine
 [March 1, 2019 1:19:54 AM UTC] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 7.21 minutes.
 Runtime.totalMemory()=1912078336
 Tool returned:
 /scratch/jslegare/my_vcf/my_file.vcf.gz.tbi

That took 7.21 minutes to produce that tbi from a 14GB vcf.gz, and it claims to have needed 1.9GB of virtual memory to do it. A slight improvement to the basic command would be to give gatk a temp directory under $SLURM_TMPDIR, via the --tmp-dirargument. But insider gatk knowledge will reveal that this particular GATK command doesn’t cause much tempfile action.

Note: If you’ve loaded gatk with the slurm modules, then the only thing that changes is that you need to first runmodule load gatk. The command you run is then the same as above, without all the singularity command jazz.

Note 2: You may have noticed too that I ran singularity execon the full docker URL rather than the .simgfile you get from docker pull. It’s a bit slower than giving it the .simgbecause it has to “prepare” the image before it can run it. This involves downloading a few filesystem layers from the web (if it sees them for the first time), and copying things around. This process is accelerated if you link your singularity cache (~/.singularity/docker) to a location under /scratch/<username/.

Rieseberg Lab Resources

RLR: Technical resources for Rieseberglers