Over the years, gene sequences have been obtained from genome sequencing projects, allelic variation studies, environmental sampling and a multitude of individual studies. Sequence data for many genomes remain available from websites dedicated to the individual genome projects. Initially, repositories were available for the major sequencing projects, and data from smaller projects were available by request. During this early period of the biological information age, it became apparent that a more centralized solution would be required.
Obtaining sequence data from the individual genome repositories yields the most recent gene predictions and functional annotations, but is time-consuming when performing comparative analysis. To reduce search time for researchers, a number of central repositories have been established, the most prominent being NCBI's Non-Redundant (NR) database, UniProt, PIR, and PDB. These central repositories allow users to search all known sequences, but they are often a version or two behind the most carefully curated genome projects, and consequently have slightly inferior gene boundary predictions and annotations. They also often suffer from redundancy.
Each of these databases can be downloaded or accessed through a web portal. A video example demonstrates how a sequence can be obtained from NCBI's NR and then used to identify homologous structures in PDB. A separate example shows how to obtain all sequences from the human genome.
Talk about how local blast databases can be assembled.
Introduction
Hello and welcome to Pragmatic Bioinformatics: a source of practical tool reviews, handy code, and general tips for budding and seasoned computational biologists alike.
Saturday, February 2, 2008
Tuesday, January 29, 2008
Datatypes: Know your Ingredients
A section dedicated to describing the most common datatypes: FASTA for sequence, PDB for structure, MSA and HMM modifications to FASTA, domain output, BLAST output, BLAST databases, annotation data, etc.
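As a taste of the simplest of these datatypes, here is a minimal Python sketch of a FASTA reader. It assumes the usual convention of a header line starting with ">" followed by one or more sequence lines; the function name `parse_fasta` and the example records are made up for illustration and are not taken from any particular library or database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines.

    Assumes each record starts with a '>' header line and that the
    sequence may be wrapped over several lines.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Hypothetical two-record example, not real database entries.
example = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another hypothetical protein
MSLNE
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For real work a maintained parser is likely a better choice; the sketch is only meant to show how little structure the format has.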
Sunday, January 27, 2008
BLAST Hacks
MSA: Multiple Sequence Alignment

A post about multiple sequence alignment construction and analysis
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
Paper (2007): Recent evolutions of multiple sequence alignment algorithms
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1963500
http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pcbi.0030123&ct=1
T-Coffee: Tree-based Consistency Objective Function For alignment Evaluation (progressive)
http://www.tcoffee.org/
MUSCLE: MUltiple Sequence Comparison by Log-Expectation
http://www.drive5.com/muscle/
ClustalW
http://www.ebi.ac.uk/Tools/clustalw2/index.html
MAFFT: MSA algorithm based on the fast Fourier transform
http://align.bmr.kyushu-u.ac.jp/mafft/software/source.html
Show how to download and install each
Show how to run each
Publish a little Perl script that compares the results of all four programs to establish consistency between them: the number of columns in agreement between the submitted MSAs
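The column-agreement idea can be sketched as follows, in Python rather than Perl. This is one possible interpretation, not a finished tool: it assumes every alignment contains the same sequence identifiers, that gaps are written as "-", and that two columns "agree" when they align exactly the same residue positions from every sequence. The names `column_signatures` and `columns_in_agreement` and the toy alignments are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Alignment = Dict[str, str]  # sequence id -> aligned (gapped) sequence
Signature = Tuple[Tuple[str, Optional[int]], ...]


def column_signatures(aln: Alignment) -> Set[Signature]:
    """Describe each column by the ungapped residue index it holds per sequence."""
    ids = sorted(aln)
    length = len(aln[ids[0]])
    if any(len(aln[i]) != length for i in ids):
        raise ValueError("all aligned sequences must have the same length")
    next_pos = {i: 0 for i in ids}  # next ungapped residue index per sequence
    signatures: Set[Signature] = set()
    for col in range(length):
        sig = []
        for i in ids:
            if aln[i][col] == "-":
                sig.append((i, None))  # gap in this sequence
            else:
                sig.append((i, next_pos[i]))
                next_pos[i] += 1
        if any(pos is not None for _, pos in sig):
            signatures.add(tuple(sig))  # ignore all-gap columns
    return signatures


def columns_in_agreement(alignments: List[Alignment]) -> int:
    """Count the columns that appear, residue for residue, in every alignment."""
    shared = column_signatures(alignments[0])
    for aln in alignments[1:]:
        shared &= column_signatures(aln)
    return len(shared)


# Toy alignments of the same two sequences (made up for illustration).
aln_a = {"s1": "AC-GT", "s2": "ACTGT"}
aln_b = {"s1": "ACG-T", "s2": "ACTGT"}
print(columns_in_agreement([aln_a, aln_b]))
```

In practice the four alignments would first be read from each program's output file and passed in as a list; requiring an exact match on the whole column is a strict criterion, and a pairwise residue score would be a gentler alternative.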