SmartGit GUI Tool

Posted on March 6, 2015 by Thuy

NOTE: This is an old draft post from Thuy (last updated 17 Dec 2012). I’m publishing it because it seems useful and mainly complete. –Brook

What is Git?

Git is a distributed version control system. It allows multiple people to work on the same code simultaneously by keeping track of changes made to files. It shows the differences between file versions and merges changes from different authors. It also takes snapshots of file versions, so that you can go back to any version later. Because Git is distributed, you store a copy of the code repository and its change history on your own local machine. When you are ready, you can sync your files to a remote repository server, such as BitBucket or GitHub. Syncing to the remote server shares the updated code with all the other users, and they can merge the changes into their own copies if they wish. Whether or not you use a remote repository server, Git always stores your entire repository change history on your local machine.

Continue reading →

Pipeline Integrating Genome Assembly, Physical Maps, & Genetic Maps

Posted on August 16, 2013 by Thuy

Everything you need should be checked into BitBucket

BitBucket Sunflower Genome Repository

BitBucket FPC Repository

Continue reading →

Estimating Insert Sizes

Posted on August 12, 2013 by Thuy

We recently had some trouble estimating insert sizes with our Mate Pair (aka Jumping, larger insert sizes) Libraries. The insert size estimates for all the libraries sequenced by Biodiversity and the Genome Sciences Centre (GSC) were shockingly bad, but those for the libraries sequenced by INRA were very good. For example, according to the pipeline, the GSC 10kbp insert size library had an average insert size of 236bp, but the INRA 20kbp library had an average insert size of 20630bp.
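As a rough illustration of how such averages and histograms can be derived, the sketch below collects insert sizes from the TLEN field (column 9) of SAM alignment lines and bins them. It is not the pipeline referred to above; the function names and the bin width are assumptions made for this example only.

```python
from statistics import median

def insert_sizes(sam_lines):
    """Collect insert sizes from the TLEN field (SAM column 9)."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # not a complete alignment line
            continue
        tlen = int(fields[8])
        # The leftmost mate carries a positive TLEN, so keeping only
        # positive values counts each properly aligned pair once.
        if tlen > 0:
            sizes.append(tlen)
    return sizes

def histogram(sizes, bin_width=1000):
    """Count insert sizes per bin; keys are the lower edge of each bin."""
    counts = {}
    for size in sizes:
        start = (size // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Example usage (the file name is hypothetical):
#   with open("library.sam") as handle:
#       sizes = insert_sizes(handle)
#   print(median(sizes), histogram(sizes))
```

Comparing the median with the mean can be informative here, since a mean is easily skewed by a large population of short-insert pairs.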

See the histogram for the 10kbp library:

Continue reading →

C/C++ Dependency Troubleshooting

Posted on July 25, 2013 by Thuy

Unix C/C++ programs are very finicky about the compiler and library versions. Compiling is the process of translating human readable code into a binary executable or library that contains machine-friendly instructions. Linking is the process of telling multiple executables or libraries how to talk to each other.

gcc is the GNU C compiler typically used on unix systems, and g++ is the GNU C++ compiler. libstdc++ is the GNU standard C++ library, and glibc is the GNU standard C library. When you install GCC on your unix machine, you typically get the compilers and libstdc++ together, while glibc ships separately as part of the base system. The gcc command chooses between the C and C++ compilers based on the file extension of the code you pass it, but only g++ links against libstdc++ by default.

If you attempt to run your program with an older standard library than the one it was originally linked with, your program will crash and complain. Here are some tips to get around it.

Continue reading →

C / C++ Troubleshooting

Posted on July 24, 2013 by Thuy

IN PROGRESS

Most high-performance bioinformatics programs are written in C or C++. Unfortunately, C and C++ code is some of the hardest code to debug. If you have only programmed casually in Perl or Python, you will not have a good time. Here are some tips to help you out, but you will most likely need someone with C / C++ programming experience and knowledge of the code to get you through it.

Continue reading →

SmartmonTools & GSmartControl

Posted on July 23, 2013 by Thuy

Smartmontools is a command-line Hard Drive Diagnostic Tool that gives you clues on how long your disk has to live. You can run it manually, or you can configure it to periodically test your drives in the background and notify you about test failures via email.

GSmartControl is a GUI for Smartmontools and much easier to use.

Check out this Ubuntu SmartmonTools Tutorial on how to install and set them up.

Here are some tips that are not easily gleaned from the previous websites:

Continue reading →

Text Editor Indenting & Highlighting

Posted on June 27, 2013 by Thuy

Do you hate reading code that is poorly indented or lacks syntax highlighting? Here are some common text editor fixes:

Continue reading →

Perl Troubleshooting

Posted on May 29, 2013 by Thuy

This is a collection of fixes for various issues I’ve had with Perl. Feel free to add any of your Perl tips here. I will move this to a wiki page if it gets too big.

All Perl scripts fail with error message “Compilation failed in require at…*.pm did not return true at …”

Unable to install packages on Debian with error message “Perl may be unconfigured”

Continue reading →

How to Upload Files to Bitbucket (commandline)

Posted on May 10, 2013 by Thuy

BitBucket is an external source code repository server that hosts our shared code. Our repositories use Git as the version control program to keep track of file changes. The idea is you make changes to code on your local machine then share your code with everyone else by uploading to BitBucket.

The instructions below guide you step-by-step in uploading files to BitBucket using the commandline. Git is one of the most popular version control programs but it is not easy for beginners. If you want to do something that deviates from these steps, consult a git reference. Once you understand the basics of the git workflow, you can use a GUI program which can combine multiple steps in a single click.

Continue reading →

Picard MarkDuplicates Troubleshooting

Posted on February 22, 2013 by Thuy

Picard is a Java-based bioinformatics tool suite that handles SAM/BAM files. Chris introduced me to it, and it’s pretty handy. However, it is very particular about the SAM format, which can leave you searching the FAQs and forums for quite a while.

The MarkDuplicates tool looks for duplicate reads in your library and either flags them in your SAM/BAM file or deletes them, depending on your settings. It can also output duplication metrics.

If you are doing any analysis which takes coverage into account, you will want to remove PCR duplicates from your libraries so that they do not artificially inflate coverage numbers. The folks from Celera are adamant about this step, since it can help reduce complexity of genome assembly.

Reads are considered duplicates if they have the same orientation and their 5′ ends map to the same reference coordinate. Picard does not care if reads are part of a pair and the two ends of the pair map to separate sequence entries (e.g. separate scaffolds, contigs, or chromosomes) in your reference fasta file. This makes it better than samtools rmdup, which is unable to handle read pairs that hit separate reference sequence entries.
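The duplicate definition above can be sketched as a grouping step. This is only an illustration of the rule as stated, not Picard's actual implementation: the read records are plain dictionaries with hypothetical keys ("ref", "five_prime", "strand"), and the 5′ coordinate is assumed to have been computed already.

```python
from collections import defaultdict

def group_duplicates(reads):
    """Group reads that share reference entry, 5' coordinate, and orientation.

    Each read is a dict with hypothetical keys: "name", "ref",
    "five_prime", and "strand". Only groups with more than one
    read (i.e. duplicate sets) are returned.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["ref"], read["five_prime"], read["strand"])
        groups[key].append(read["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}

# Example usage with made-up reads:
reads = [
    {"name": "r1", "ref": "scaffold_1", "five_prime": 100, "strand": "+"},
    {"name": "r2", "ref": "scaffold_1", "five_prime": 100, "strand": "+"},
    {"name": "r3", "ref": "scaffold_1", "five_prime": 100, "strand": "-"},
]
duplicate_sets = group_duplicates(reads)
```

Here r1 and r2 form a duplicate set, while r3 does not join them because its orientation differs.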

Reads marked as duplicates will have the 0x400 bit set in the SAM flag (column 2). To check if your read has that bit set, use the Picard Flag Decoder website. When duplicate reads are deleted, the highest quality read is kept intact in the SAM/BAM.
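If you would rather check the bit yourself than paste flags into a website, a bitwise AND against 0x400 (decimal 1024) is enough. The helper names below are made up for this sketch.

```python
DUPLICATE_FLAG = 0x400  # decimal 1024, the duplicate bit in the SAM flag

def is_duplicate(flag):
    """Return True if the duplicate bit is set in a SAM flag value."""
    return bool(int(flag) & DUPLICATE_FLAG)

def is_duplicate_line(sam_line):
    """Check the flag (column 2) of a single SAM alignment line."""
    fields = sam_line.split("\t")
    return is_duplicate(fields[1])
```

For example, a flag of 1123 (1024 + 99) has the bit set, while a flag of 99 does not.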

Continue reading →

Rosalind: Learn Bioinformatics Online Through Problem Solving

Posted on February 8, 2013 by Thuy

I was introduced to Rosalind, a problem-based Bioinformatics Tutorial website, and I think it’s fantastic. I wish I had known about it when I was first starting out. You can solve the problems in any language you want. The website does not run your code. It only grades your solution dataset. There are a large number of problems on different topics, from codon-finding to protein spectrum matching. I would say the problems are geared towards the beginner, but there is enough variety that a higher level bioinformaticist would also benefit from rounding out their knowledge.

How to Set Up Authentication for Sendmail SMTP server

Posted on October 25, 2012 by Thuy

Sendmail is a mail server that supports many different protocols, including SMTP. The SMTP protocol handles sending and relaying emails (but not retrieving them from a mailbox, which is left to protocols such as POP3 or IMAP).

This article details how to set up Plain authentication on a sendmail server on Fedora.
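For context on what Plain authentication actually sends, the sketch below builds the credential string that a client passes with AUTH PLAIN: an optional authorization identity, the username, and the password, separated by NUL bytes and then base64-encoded. The function name and the sample credentials are made up for illustration; this is not part of the sendmail configuration itself.

```python
import base64

def auth_plain_token(username, password, authzid=""):
    """Build the base64 token a client sends with SMTP AUTH PLAIN.

    The decoded message has the form: authzid NUL username NUL password.
    """
    raw = f"{authzid}\0{username}\0{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

# Example usage with made-up credentials:
token = auth_plain_token("alice", "secret")
```

Because base64 is an encoding and not encryption, anyone who can see the connection can recover the password, which is why Plain authentication is normally combined with TLS.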

Continue reading →

SnoWhite Tips and Troubleshooting (Thuy)

Posted on May 18, 2012 by Thuy

SnoWhite is a tool for cleaning 454 and Illumina reads. There are quite a few gotchas that can take you half a day to debug. This wiki has a lot of good tips.

SnoWhite invokes other bioinformatics programs, one of them being TagDust. If you get a segfault error from TagDust, it may be because you are searching for contaminant sequences larger than TagDust can handle. TagDust can only handle a maximum of 1000 characters per line in the contaminant fasta file and contaminant sequences of at most 1000 bases.

A segfault (or segmentation fault) happens when a program accesses memory it is not allowed to access. Once a line or a sequence exceeds the 1000-character limit, TagDust keeps trying to access memory past the 1000 slots it has allocated. It may try to access non-existent memory locations or off-limits memory locations. You need to edit the TagDust source code so it allocates enough memory for the sequences and does not wander into bad memory locations.

Go into your TagDust source code directory and edit file “input.c”.

Go to line 68:

char line[MAX_LINE];

Change MAX_LINE to a number larger than the number of characters in the longest line in your contaminant fasta file. You can probably skip this step if you are using the NCBI UniVec.fasta files, since the default of 1000 is enough.

Go to line 69:

char tmp_seq[MAX_LINE];

Change MAX_LINE to a number larger than the number of bases in the longest contaminant sequence in your contaminant fasta file. I tried 1000000 with a recent NCBI UniVec.fasta file and it worked for me.
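To decide what numbers to use, it helps to measure your contaminant fasta file first. The sketch below reports the longest line and the longest sequence in a fasta file; the function name is made up for this example, and it is a standalone helper, not part of TagDust or SnoWhite.

```python
def fasta_limits(lines):
    """Return (longest line length, longest sequence length in bases)."""
    longest_line = 0
    longest_seq = 0
    current_seq = 0
    for line in lines:
        line = line.rstrip("\r\n")
        longest_line = max(longest_line, len(line))
        if line.startswith(">"):
            # A header line ends the previous sequence.
            longest_seq = max(longest_seq, current_seq)
            current_seq = 0
        else:
            current_seq += len(line.strip())
    # Account for the final sequence in the file.
    longest_seq = max(longest_seq, current_seq)
    return longest_line, longest_seq

# Example usage (the file name is whatever contaminant file you use):
#   with open("UniVec.fasta") as handle:
#       print(fasta_limits(handle))
```

Pick values comfortably above the reported numbers, leaving headroom for the newline and the C string terminator.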

Recompile your TagDust source code
- Delete all the existing executables by executing make clean in the same directory as the Makefile
- Compile all your files again by executing make in the same directory as the Makefile
- If you decided to allocate a lot of memory to your arrays, and your program’s statically allocated data exceeds 2GB, you may run into “relocation truncated to fit: R_X86_64_PC32 against symbol” errors during linkage. This occurs because the default x86-64 code model assumes that the program’s code and statically allocated objects fit within 2GB, so the linker cannot encode addresses beyond that range. Edit the Makefile so that

CC = gcc
becomes
CC = gcc -mcmodel=medium

Reference: http://www.obihai.org/2010/05/relocation-truncated-to-fit-rx866432s.html

Rieseberg Lab Resources

RLR: Technical resources for Rieseberglers

Author Archives: Thuy
