BLAST databases (Seb)

I’ve uploaded two BLAST protein databases and one nucleotide database here: /Linux/Loren/blast_database

Please look at the README file for more info, and append to it if you modify things or add databases.

ncbi_nucleotide_nt database: All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences).

uniprot: contains manually annotated and reviewed protein sequences (see http://www.ebi.ac.uk/uniprot/)

ncbi_protein_nr: Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr.

There is more information about all the databases here:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

Here are some examples of how to run BLAST from the command line on the cluster (there are probably some typos/mistakes I still need to correct; it’s a work in progress).

Examples:
### DOWNLOAD DATABASE ###
#this is to download databases from NCBI
#the database arrives as *.tar.gz archives, which need to be unpacked like this: tar -xvzf myfile.tar.gz
update_blastdb.pl nr #download the protein nr database
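
Large databases such as nr are delivered as several archives, so the tar command above has to be repeated for each one. Here is a minimal sketch of a loop that does this, assuming all archives sit in the current directory. The demo.*.tar.gz files are hypothetical placeholders, created only so the loop can be tried without a real download.

```shell
# Hypothetical sandbox so the sketch runs without a real download.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins for two downloaded archive volumes (placeholder names).
for vol in 00 01; do
    echo "placeholder volume $vol" > "demo.$vol.txt"
    tar -czf "demo.$vol.tar.gz" "demo.$vol.txt"
    rm "demo.$vol.txt"
done

# Unpack every archive in the directory, one at a time.
for archive in *.tar.gz; do
    tar -xzf "$archive"
done

# Count what came out of the archives.
extracted=$(ls demo.*.txt | wc -l | tr -d ' ')
echo "extracted $extracted files"
```

On the cluster, only the second loop is needed, run from the directory that holds the downloaded archives.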

### MAKE A REFERENCE DATABASE ###
#this is to generate your own reference database from your own fasta file (either protein or nucleotide)
makeblastdb -dbtype nucl -in genes1_fas_pathway.txt
makeblastdb -dbtype prot -in Plantcyc_Enzymes_Without_Tags_BLASTset.fasta
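
Before running makeblastdb, it can help to check the input file quickly. Here is a minimal sketch that counts the sequences in a FASTA file and lists any identifier that appears more than once. The toy file and its gene names are hypothetical stand-ins for a real file such as genes1_fas_pathway.txt.

```shell
# Hypothetical toy FASTA file standing in for a real input file.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>gene1 example
ATGGCGTACGTTAGC
>gene2 example
ATGCCCGGGTTTAAA
>gene1 duplicate identifier
ATGAAATTTGGGCCC
EOF

# Number of records: one header line (starting with ">") per sequence.
n_seqs=$(grep -c '^>' "$fasta")

# Identifiers (first word after ">") that occur more than once.
dup_ids=$(grep '^>' "$fasta" | cut -c2- | awk '{print $1}' | sort | uniq -d)

echo "sequences: $n_seqs"
echo "duplicated identifiers: ${dup_ids:-none}"
```

The sequence count can then be compared with the number makeblastdb reports when it builds the database.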

### BLASTn ###
#BLASTn (nucleotide against nucleotide) with discontiguous megablast, usually used for more distant, interspecific comparisons.
#Maximum e-value for a hit to be reported is set to 1e-10
#report a maximum of 100 target sequences
#output is in XML format (5); other options: 6 = tabular output, 0 = “web-like” pairwise output (the default)
#blast is run on 12 CPU threads simultaneously
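#blastn needs a nucleotide database (nt, not the protein nr database); check the README for the exact path
blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 100 -query sequences_fasta/seq -db /Linux/Loren/blast_database/ncbi_nucleotide_nt/nt -num_threads 12 -outfmt 5 -out blast_results/seq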

### BLASTn with tabular output ###
#tabular output (6), searching a database made with makeblastdb
blastn -task dc-megablast -evalue 1e-10 -max_target_seqs 5 -query sequences_fasta/genetic_map_seqs.nocolons.fasta -db database/ara_pathway/genes1_fas_pathway.txt -num_threads 4 -outfmt 6 -out blast_results/pathways
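
Tabular output is easy to post-process with standard tools. Here is a minimal sketch that keeps only the best hit (highest bit score) for each query, assuming the default 12-column -outfmt 6 layout. The sample lines are made up for illustration and are not real BLAST results.

```shell
# Hypothetical sample in the default -outfmt 6 layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
hits=$(mktemp)
printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t360\n'  > "$hits"
printf 'q1\ts2\t90.0\t180\t18\t0\t5\t184\t1\t180\t1e-60\t230\n'  >> "$hits"
printf 'q2\ts3\t85.0\t150\t22\t1\t1\t150\t30\t179\t1e-40\t160\n' >> "$hits"
printf 'q2\ts4\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-75\t270\n'   >> "$hits"

# For each query (column 1), keep the line with the highest bit score (column 12).
best=$(awk -F'\t' '
    !($1 in score) || $12 + 0 > score[$1] { score[$1] = $12 + 0; line[$1] = $0 }
    END { for (q in line) print line[q] }
' "$hits" | sort)

echo "$best"
```

With real results, the awk command would read the file given to -out (blast_results/pathways in the example above) instead of the temporary sample file.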

### BLASTn – short sequences ###
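#the blastn-short task is tuned for short query sequences
#run it with nohup so the job keeps running after you log out; screen output goes to the file "log"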
nohup blastn -task blastn-short -evalue 1e-5 -max_target_seqs 20 -query tta -db ../../blast/database/all_mito_sequences_e50 -num_threads 8 -outfmt 6 -out blast_physical_map_against_all_mito > log &

### BLASTx with nohup ###
#BLASTx translates your nucleotide sequences in all 6 possible reading frames and searches a protein database.
nohup blastx -evalue 1e-10 -max_target_seqs 5 -query sequences_fasta/genetic_map_seqs.nocolons.fasta -db database/ara_pathway/Plantcyc_Enzymes_Without_Tags_BLASTset.fasta -num_threads 6 -outfmt 6 -out blast_results/pathways > log &

### BLASTp with gi restriction ###
#protein against protein
#gi_viriplantae contains a list of the GIs of all Viridiplantae (green plant) sequences. This speeds things up a lot since it restricts the search to plants only.
blastp -query tta -db ~/blast/database/ncbi_nr/nr -gilist ~/blast/database/green_plants/gi_viriplantae -outfmt 4 -out tta_2
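
A GI list is a plain text file with one numeric GI per line, so a malformed line is easy to spot before starting a long search. Here is a minimal sketch of such a check. The toy file and the GI numbers in it are hypothetical and stand in for a real list such as gi_viriplantae.

```shell
# Hypothetical toy GI list; the numbers are placeholders, not real plant GIs.
gilist=$(mktemp)
printf '15218897\n15219234\nnot_a_gi\n\n30678543\n' > "$gilist"

# Lines made only of digits are counted as valid GIs.
good_lines=$(grep -c -E '^[0-9]+$' "$gilist")

# Everything else (text, blank lines) is flagged as suspicious.
bad_lines=$(grep -c -v -E '^[0-9]+$' "$gilist")

echo "valid GIs: $good_lines, suspicious lines: $bad_lines"
```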