SmartGit GUI Tool

NOTE: This is an old draft post from Thuy (last updated 17 Dec 2012). I’m publishing it because it seems useful and mainly complete. –Brook

What is Git?

Git is a distributed version control system. It allows multiple people to work on the same code simultaneously by keeping track of changes made to files. It visualizes differences between file versions and merges changes from different authors. It also takes snapshots of file versions, so that you can go back to any version later. Because git is distributed, you store a copy of the code repository and its change history on your own local machine. When you are ready, you can sync your files to a remote repository server, such as BitBucket or GitHub. Syncing to the remote server shares the updated code with all the other users, and they can merge the changes into their own copies if they wish. Whether or not you use a remote repository server, git always stores your entire repository change history on your local machine.
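
Below is a minimal sketch of that workflow on the command line; the repository name ourteam/project is hypothetical:

git clone git@bitbucket.org:ourteam/project.git   # copy the remote repository and its history to your machine
cd project
# ...edit some files...
git add changed_file.pl                 # stage a change
git commit -m "Describe the change"     # snapshot it in your local history
git push origin master                  # share your commits with the remote server
git pull origin master                  # merge other people's changes into your copy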

Continue reading

Estimating Insert Sizes

We recently had some trouble estimating insert sizes for our Mate Pair (aka jumping) libraries, which have larger insert sizes. All the libraries sequenced by Biodiversity and the Genome Sciences Centre (GSC) were shockingly bad, but the libraries sequenced by INRA were very good. For example, according to the pipeline, the GSC 10kbp insert size library had an average insert size of 236bp, but the INRA 20kbp library had an average insert size of 20630bp.
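
As a hedged aside, one way to compute these numbers yourself is Picard's CollectInsertSizeMetrics, which summarizes the insert size distribution of an aligned BAM; the file names below are hypothetical, and drawing the histogram PDF requires R to be installed:

java -jar CollectInsertSizeMetrics.jar \
    INPUT=library.sorted.bam \
    OUTPUT=library.insert_metrics.txt \
    HISTOGRAM_FILE=library.insert_histogram.pdf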

See the histogram for the 10kbp library:

Continue reading

C/C++ Dependency Troubleshooting

Unix C/C++ programs are very finicky about compiler and library versions. Compiling is the process of translating human-readable code into a binary executable or library that contains machine-friendly instructions. Linking is the process of telling multiple executables or libraries how to talk to each other.

gcc is the GNU C compiler typically used on unix systems, and g++ is the GNU C++ compiler. libstdc++ is the GNU standard C++ library, and glibc is the GNU standard C library. When you install GCC on your unix machine, you are installing a package of the aforementioned items. The gcc command will compile C++ source files as C++ based on their file extension, but unlike g++ it does not automatically link against the C++ standard library.
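
A small example of the difference (hello.c and hello.cpp are hypothetical files):

gcc hello.c -o hello              # C source: compiled and linked as C
g++ hello.cpp -o hello            # C++ source: compiled and linked against libstdc++ automatically
gcc hello.cpp -lstdc++ -o hello   # gcc compiles the C++ by extension, but libstdc++ must be linked explicitly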

If you attempt to run your program against an older standard library than the one it was originally linked with, your program will crash and complain. Here are some tips to get around it.
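
A couple of commands help diagnose this kind of mismatch; the binary name and library paths below are assumptions, so adjust them for your system:

ldd ./myprogram                                        # list the shared libraries the binary expects
strings /usr/lib/libstdc++.so.6 | grep GLIBCXX         # list the libstdc++ ABI versions your system provides
strings /lib/x86_64-linux-gnu/libc.so.6 | grep GLIBC_  # likewise for glibc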

Continue reading

C / C++ Troubleshooting

IN PROGRESS

Most high-performance bioinformatics programs are written in C or C++. Unfortunately, C and C++ code is some of the hardest code to debug. If you have only programmed casually in perl/python, you will not have a good time. Here are some tips to help you out, but you will most likely need someone with C/C++ programming experience and knowledge of the code to get you through it.
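
As a starting point, a crash backtrace is usually the most useful piece of information. A minimal sketch, assuming a hypothetical program myprog built from myprog.c:

gcc -g -O0 -o myprog myprog.c                            # rebuild with debug symbols and optimization off
gdb --batch -ex run -ex bt --args ./myprog input.fasta   # run under gdb and print the call stack when it crashes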

Continue reading

SmartmonTools & GSmartControl

Smartmontools is a command-line hard drive diagnostic tool that gives you clues about how long your disk has to live. You can run it manually, or you can configure it to periodically test your drives in the background and notify you about test failures via email.
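
For manual runs, a few smartctl commands cover the basics; the device name /dev/sda is an assumption, so substitute your own drive:

sudo smartctl -H /dev/sda            # overall SMART health verdict
sudo smartctl -a /dev/sda            # full attribute table and error log
sudo smartctl -t short /dev/sda      # start a short self-test in the background
sudo smartctl -l selftest /dev/sda   # review the self-test results afterwards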

GSmartControl is a GUI for Smartmontools and much easier to use.

Check out this Ubuntu SmartmonTools Tutorial on how to install and set them up.

Here are some tips that are not easily gleaned from the previous websites:

Continue reading

Perl Troubleshooting

This is a collection of fixes for various issues I’ve had with Perl.  Feel free to add any of your Perl tips here.  I will move this to a wiki page if it gets too big.

All Perl scripts fail with error message “Compilation failed in require at…*.pm did not return true at …”
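
For this error, remember that a .pm module file must end with a true value (conventionally a bare 1; on its last line). A quick, hedged way to find the failing module is to load it on its own; the module and script names below are hypothetical:

perl -MMy::Module -e 1         # load just the suspect module to expose the real error
perl -Ilib -c scripts/run.pl   # syntax-check a script against your local lib directory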

Unable to install packages on Debian with error message “Perl may be unconfigured”

Continue reading

How to Upload Files to Bitbucket (commandline)

BitBucket is an external source code repository server that hosts our shared code. Our repositories use Git as the version control program to keep track of file changes. The idea is that you make changes to the code on your local machine and then share them with everyone else by uploading to BitBucket.

The instructions below guide you step by step through uploading files to BitBucket using the command line. Git is one of the most popular version control programs, but it is not easy for beginners. If you want to do something that deviates from these steps, consult a git reference. Once you understand the basics of the git workflow, you can use a GUI program that combines multiple steps into a single click.
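
As a preview, the first upload of an existing project generally looks something like this sketch; the repository URL and branch name are assumptions:

cd my_project
git init                                                         # create a local repository
git add .                                                        # stage all the files
git commit -m "Initial commit"                                   # record them in your local history
git remote add origin git@bitbucket.org:ourteam/my_project.git   # point at the BitBucket repository
git push -u origin master                                        # upload and set the default upstream branch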

Continue reading

Picard MarkDuplicates Troubleshooting

Picard is a Java-based bioinformatics tool suite that handles SAM/BAM files. Chris introduced me to it, and it’s pretty handy. However, it is very particular about the SAM format, which can leave you searching the FAQs and forums for quite a while.

The MarkDuplicates tool looks for duplicate reads in your library and either flags them in your SAM/BAM file or deletes them, depending on your settings. It can also output a metrics file on duplication.
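
A typical invocation, as a hedged sketch using the stand-alone jar layout of that era (the file names are hypothetical):

java -Xmx4g -jar MarkDuplicates.jar \
    INPUT=library.sorted.bam \
    OUTPUT=library.dedup.bam \
    METRICS_FILE=library.dup_metrics.txt \
    REMOVE_DUPLICATES=true \
    VALIDATION_STRINGENCY=LENIENT

Setting REMOVE_DUPLICATES=true deletes duplicates instead of just flagging them, and VALIDATION_STRINGENCY=LENIENT relaxes Picard's strict SAM-format checks.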

If you are doing any analysis that takes coverage into account, you will want to remove PCR duplicates from your libraries so that they do not artificially inflate coverage numbers. The folks from Celera are adamant about this step, since it can help reduce the complexity of genome assembly.

Reads are considered duplicates if they have the same orientation and their 5′ ends map to the same reference coordinate. Picard does not care if the two ends of a read pair map to separate sequence entries (e.g. separate scaffolds, contigs, or chromosomes) in your reference fasta file. This makes it better than samtools rmdup, which is unable to handle read pairs that hit separate reference sequence entries.

Reads marked as duplicates will have the 0x400 bit set in the SAM flag (column 2). To check whether a read has that bit set, use the Picard Flag Decoder website. When duplicate reads are deleted, the highest-quality read of each duplicate set is kept intact in the SAM/BAM.
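
If you prefer the command line to the flag decoder website, samtools can count reads on that bit directly; the BAM name is hypothetical, and 1024 is the decimal form of 0x400:

samtools view -c -f 1024 library.markdup.bam   # count reads flagged as duplicates
samtools view -c -F 1024 library.markdup.bam   # count reads after excluding duplicates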

Continue reading

Rosalind: Learn Bioinformatics Online Through Problem Solving

I was introduced to Rosalind, a problem-based bioinformatics tutorial website, and I think it’s fantastic. I wish I had known about it when I was first starting out. You can solve the problems in any language you want; the website does not run your code, it only grades the dataset your solution produces. There are a large number of problems on different topics, from codon finding to protein spectrum matching. I would say the problems are geared towards the beginner, but there is enough variety that a higher-level bioinformaticist would also benefit from rounding out their knowledge.

SnoWhite Tips and Troubleshooting (Thuy)

Snowhite is a tool for cleaning 454 and Illumina reads. There are quite a few gotchas that will take you half a day to debug. This wiki has a lot of good tips.

Snowhite invokes other bioinformatics programs, one of them being TagDust. If you get a segfault error from TagDust, it may be because you are searching for contaminant sequences larger than TagDust can handle. TagDust can only handle a maximum of 1000 characters per line in the contaminant fasta file and contaminant sequences of at most 1000 bases.

A segfault (or segmentation fault) happens when a program accesses memory it should not. Once TagDust exceeds the 1000-character-per-line or 1000-base limit, it keeps writing past the 1000 memory slots it has allocated, touching non-existent or off-limits memory locations. You need to edit the TagDust source code so it allocates enough memory for your sequences and does not wander into bad memory locations.

  • Go into your TagDust source code directory and edit file “input.c”.
  • Go to line 68:

char line[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of characters in the longest line in your contaminant fasta file.  You probably can skip this step if you are using the NCBI UniVec.fasta files, since the default of 1000 is enough.
  • Go to line 69:

char tmp_seq[MAX_LINE];

  • Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file.  I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
  • Recompile your TagDust source code
    • Delete all the existing executables by executing make clean in the same directory as the Makefile
    • Compile all your files again by executing make in the same directory as the Makefile
    • If you decided to allocate a lot of memory to your arrays, and the program’s statically allocated objects add up to more than 2GB, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors during linkage. This occurs because the default code model assumes the program’s code and static data fit within 2GB of address space. Edit the Makefile so that

CC = gcc
becomes
CC = gcc -mcmodel=medium