Sequence similarity searching using NCBI BLAST
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a set of programs designed to perform similarity searches against a database of sequences. Scientists frequently use such searches to gain insight into evolutionary relationships and use that to infer function and biological importance of gene products. BLAST uses an algorithm that seeks out local alignment (the alignment of some portion of two sequences) as opposed to global alignment (the alignment of two sequences over their entire length). By searching for local alignments, BLAST is able to identify regions of similarity within two sequences.
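To make the local versus global distinction concrete, here is a minimal sketch that scores the same pair of sequences both ways. It uses the classic Smith-Waterman (local) and Needleman-Wunsch (global) dynamic-programming recurrences rather than BLAST's own faster heuristic, and the scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score over the full length of a and b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# Two made-up sequences that share only the short region "WKYGY".
seq_a = "TTTTWKYGY"
seq_b = "WKYGYCCCC"
local_score = local_alignment_score(seq_a, seq_b)
global_score = global_alignment_score(seq_a, seq_b)
print(local_score, global_score)
```

The local score rewards the shared region on its own, while the global score is dragged down by the unrelated flanking residues that it is forced to align.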
Some BLAST search services include the following:
- blastp – comparing an amino acid query sequence with others stored in protein sequence databases
- blastn – comparing a nucleotide query sequence against a nucleotide sequence database
- blastx – comparing a nucleotide query sequence translated in all reading frames with other amino acid sequences stored in protein sequence databases
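The phrase "translated in all reading frames" in the blastx description can be illustrated with a short sketch. It assumes the standard genetic code and a plain A/C/G/T input; the function names are hypothetical, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# A made-up 12-nucleotide example, not a real database sequence.
frames = six_frame_translations("ATGGGTCCACGT")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translations is then compared against the protein database, which is why blastx can find a match even when the correct reading frame of the query is unknown.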
Which type of BLAST search should you use?
Since more than one codon (triplet of nucleotides) can code for a particular amino acid, considerably different nucleotide sequences can translate into the same amino acid sequence. Comparing amino acid sequences is therefore a more reliable predictor of similarity between two gene products than comparing nucleotide sequences. For this reason, this tutorial will focus on using blastp to compare a gene product’s amino acid sequence with other protein sequences.
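The effect of codon redundancy can be seen in a small sketch. The two nucleotide sequences below are made up for illustration and the helper names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons of a DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of positions that are identical in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up coding sequences that use different codons for Leu, Ser and Arg.
dna_a = "CTGTCCCGT"
dna_b = "TTAAGTAGA"

print(percent_identity(dna_a, dna_b))                        # nucleotide level
print(percent_identity(translate(dna_a), translate(dna_b)))  # protein level
```

The two sequences agree at only two of nine nucleotide positions, yet both translate to the same three residues, so a protein-level comparison detects a relationship that a nucleotide-level comparison would largely miss.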
Obtaining a FASTA Formatted Amino Acid Sequence
As a shortcut, we will use NCBI’s Entrez Gene to quickly access the amino acid sequence of a gene product. The amino acid sequence also could be obtained by searching protein sequence databases such as NCBI’s Entrez; this process, however, can be more involved and rather time-consuming since it often requires examining and sifting through several sequence records. For this exercise we will use the human hemochromatosis protein.
- Go to the Entrez Gene Web site: Entrez Gene
- Make sure the top pull-down box is set to Gene, then enter a gene name in the query box. Adding the field qualifier [Gene Name] tells Entrez Gene to search by gene symbol only; without it, the search returns every record that mentions the query term anywhere. Because a gene symbol is unique for each human gene, a qualified search should retrieve only one result per species. The gene name for the hereditary hemochromatosis gene is HFE.
For more information on options for refining your search, see Gene Help.
- In the search box at the top of the Gene home page, enter “HFE[Gene Name]” (no quotes). Click SEARCH and note the results. From here you can select one of the results or proceed to the next step.
- Next add “AND Homo sapiens[Organism]” to the search box so that it reads “HFE[Gene Name] AND Homo sapiens[Organism]”. This limits the results to human sequences only. Click SEARCH and note the results. Because only one result is returned, the system automatically opens the full record. Scroll down and read through the data associated with this gene.
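The same query string can also be assembled in a script. The sketch below only builds the search term and a request URL; it assumes NCBI's E-utilities ESearch endpoint and its db, term and retmode parameters, which are not part of this tutorial, and it does not send any request.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbol, organism=None):
    """Build an Entrez query such as 'HFE[Gene Name] AND Homo sapiens[Organism]'."""
    term = f"{symbol}[Gene Name]"
    if organism:
        term += f" AND {organism}[Organism]"
    return term


def build_esearch_url(term, db="gene"):
    """Return an ESearch URL for the given term (no network access is made)."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmode": "json"})


term = build_gene_query("HFE", "Homo sapiens")
print(term)
print(build_esearch_url(term))
```

Dropping the organism argument reproduces the first, broader search from the steps above.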
Scroll down to the section titled NCBI Reference Sequences (RefSeqs). This section contains the consensus and reviewed sequences for each gene and its products. Because many genes encode more than one protein isoform, a gene may list several protein records. The description of each variant explains how it differs from the most complete isoform (isoform 1). In the case of HFE there are 9 RefSeq sequences for possible precursors. We will be considering isoform 1 for the rest of this tutorial.
Accession numbers for RefSeq protein sequences begin with NP_, genomic DNA (chromosome) sequences with NC_, mRNA sequences with NM_, and gene records with NG_.
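As a quick illustration, the prefix rule can be written as a small lookup. Only the four prefixes listed above are covered, and the function name is hypothetical.

```python
# Prefixes and meanings as listed in this tutorial; RefSeq defines others too.
REFSEQ_PREFIXES = {
    "NP_": "protein",
    "NM_": "mRNA",
    "NC_": "DNA sequence",
    "NG_": "gene record",
}


def classify_refseq(accession):
    """Return the record type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


print(classify_refseq("NP_000401.1"))  # the HFE isoform 1 protein record
```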
- Under the sub-section labeled “mRNA and Protein(s)”, click on the link for the protein NP_000401.1. Examine the record.
To display only the sequence click the FASTA link at the top of the page just under “NCBI Reference Sequence: NP_000401.1”.
A sequence in FASTA format consists of a single line of descriptive text that begins with >, followed by sequence data.
- Highlight the entire FASTA sequence with your mouse (making sure to include the >, the complete definition line and the entire sequence) and copy it by pressing Ctrl + C [Command + C for Macs] on the keyboard or by right-clicking and selecting the copy option.
- Now that you have the amino acid sequence of the human HFE protein in FASTA format, you are ready to submit it as a BLAST query, which is covered in the next section of this tutorial.
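If you want to check a copied record before submitting it, a FASTA parser takes only a few lines. This is a minimal sketch with a hypothetical function name; the example record uses the HFE definition line with only the first 70 residues of the sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: definition line plus the first line of sequence only.
example = """>gi|4504377|ref|NP_000401.1| hereditary hemochromatosis protein isoform 1 precursor [Homo sapiens]
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEP
"""
records = parse_fasta(example)
for definition, sequence in records:
    print(definition)
    print(len(sequence), "residues")
```

A record that raises ValueError here was probably copied without its > definition line.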
Submitting a Query Sequence
- After you have copied the sequence in FASTA format, access the protein-protein BLAST service at http://www.ncbi.nlm.nih.gov/BLAST/. Go to the section titled “Basic Blast” and click on the link Protein BLAST.
In the section titled “Enter Query Sequence”, paste your sequence into the large query box by pressing Ctrl + V [Command + V for Mac] on the keyboard or by right-clicking inside the search box and selecting the paste option. The pasted sequence in the search box is shown below.
>gi|4504377|ref|NP_000401.1| hereditary hemochromatosis protein isoform 1 precursor [Homo sapiens]
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQDLGLSLFEALGYVDDQLFVFYDHESRRVEP
RTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIMENHNHSKESHTLQVILGCEMQEDNSTEGYWKYGY
DGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRARQNRAYLERDCPAQLQQLLELGRGVLDQQVPPLV
KVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDAKEFEPKDVLPNGDGTYQGWITLAVPPGEEQRY
TCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFIGILFIILRKRQGSRGAMGHYVLAERE
For more information about different search and format options, click the help tab at the top of the search page.
- Leave all search options set to their default values. Make sure the Database option is set to Non-redundant (nr). Scroll to the bottom of the page and click the BLAST button. Your search will be entered into a queue and should complete within a minute or two.
The default database setting will automatically search sequence data from many different organisms. You can limit the search to one organism by typing its taxonomic name in the Organism field in the “Choose Search Set” section of the page. The common names of many organisms can also be used (e.g., human, rat, mouse).
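The same submission can be described as a set of request parameters. The sketch below is hedged: it assumes NCBI's BLAST URL API (the Blast.cgi endpoint with CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, RID and FORMAT_TYPE parameters), none of which appears in this tutorial. It only builds the parameters and sends nothing, and the request ID shown is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: the public BLAST URL API endpoint.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(fasta, program="blastp", database="nr", organism=None):
    """Parameters for submitting a search, mirroring the web form defaults."""
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": fasta,
    }
    if organism:
        # Restrict the search set, like the Organism field on the web form.
        params["ENTREZ_QUERY"] = f"{organism}[Organism]"
    return params


def build_result_url(rid, format_type="Text"):
    """URL for retrieving the results of a finished search by request ID."""
    return BLAST_URL + "?" + urlencode(
        {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    )


params = build_submit_params(">query\nMGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLF", organism="Homo sapiens")
print(params["PROGRAM"], params["DATABASE"], params["ENTREZ_QUERY"])
print(build_result_url("EXAMPLE_RID"))  # placeholder, not a real request ID
```

For a one-off search the web form is simpler; a script like this is mainly useful when the same query has to be repeated.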
Understanding BLAST results
- Scrolling through the BLAST results, you will see that it includes a unique request ID (RID), query information, database information, a link to taxonomy reports, a graphical display showing alignments to the query sequence, descriptions of sequences producing significant alignments, and pairwise alignments between the query sequence and each BLAST hit sequence.
- The main purpose of sequence comparison is to infer homology (evolutionary relationship by descent) and from that infer shared function. To assist with this, NCBI developed the Taxonomy view of BLAST results. Clicking on Taxonomy reports just above the Graphical Display will open a new browser window that displays BLAST results in three different views: Organism Report, Lineage Report, and Taxonomy Report. Organism Report groups all hits by organism. For example, in this report the Homo sapiens hits are clustered together at the top of the list, followed by other primates, then larger mammals, then rodents, etc. You can also view distance trees, look at conserved regions and perform multiple sequence alignments with the results from this search.
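The grouping done by the Organism Report can be approximated from definition lines alone, since the organism name appears in square brackets at the end of each one. In this minimal sketch the function names are hypothetical, and every definition line except the HFE one is an invented placeholder rather than a real BLAST hit.

```python
import re
from collections import defaultdict


def organism_from_defline(defline):
    """Extract the trailing '[Organism name]' from a definition line."""
    match = re.search(r"\[([^\[\]]+)\]\s*$", defline)
    return match.group(1) if match else "unknown"


def group_by_organism(deflines):
    """Group definition lines by the organism named at their end."""
    groups = defaultdict(list)
    for defline in deflines:
        groups[organism_from_defline(defline)].append(defline)
    return dict(groups)


deflines = [
    "hereditary hemochromatosis protein isoform 1 precursor [Homo sapiens]",
    "placeholder protein A [Homo sapiens]",   # invented for illustration
    "placeholder protein B [Mus musculus]",   # invented for illustration
]
for organism, hits in group_by_organism(deflines).items():
    print(organism, len(hits))
```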
- The next section is a graphical overview of the conserved domains.
Next is a graphical overview showing the alignment of other sequences to your query. The thick red numbered bar at the top represents the query sequence, and the numbers give amino acid residue positions along it.
- All hits are represented by colored bars below the query sequence. Mousing over a hit will display its definition and score in the text box above the graphical display. Clicking on a hit will take you to the pairwise alignment between hit and query sequence.
- The bar color for a hit refers to alignment score, a mathematically derived value that reflects the degree of similarity between hit and query sequences. The higher the score, the more similar the two. The Color Key at the top of the graphical display gives the range of alignment scores assigned to each color. For example, red hits are most similar, with alignment scores greater than or equal to 200, while black hits are least similar, with alignment scores lower than 40.
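The Color Key amounts to a simple threshold lookup. In the sketch below, the red (200 or more) and black (below 40) bins come from the description above; the intermediate bins (blue for 40 to 50, green for 50 to 80, magenta for 80 to 200) are an assumption based on the classic BLAST color key and may differ from what your results page shows.

```python
def color_for_score(score):
    """Map an alignment score to its Color Key bin (intermediate bins assumed)."""
    if score < 40:
        return "black"
    if score < 50:
        return "blue"      # assumed bin
    if score < 80:
        return "green"     # assumed bin
    if score < 200:
        return "magenta"   # assumed bin
    return "red"


for score in (35, 45, 120, 250):
    print(score, color_for_score(score))
```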
Below the graphical overview is a text summary of the hits (database sequences retrieved during the BLAST search) with the highest-scoring alignments to the query sequence. These are the descriptions of the statistically significant alignments. The most significant alignments are at the top.
- Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
- E Value (Expect Value) is the number of alignments with a score at least this high that would be expected to occur by chance in a database of this size. The smaller the E Value, the more significant the alignment and the less likely it is to have occurred simply by chance.
- U G M – These links provide the user with direct access from BLAST results to related entries in other databases.
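The relationship between bit score and E Value can be sketched numerically. The functions below use the simplified Karlin-Altschul relations E = m * n * 2^(-S') and S' = (lambda * S - ln K) / ln 2, where m is the query length and n the total database length. They ignore the length corrections that BLAST applies, so the values will not match a real report exactly; the lengths in the example are arbitrary illustrative numbers.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score, given lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Simplified E value: expected number of chance hits scoring >= bits."""
    return query_length * database_length * 2.0 ** (-bits)


# Arbitrary illustrative lengths, not taken from a real search.
for bits in (30, 50, 100):
    print(bits, expect_value(bits, query_length=350, database_length=10**9))
```

Each additional bit halves the E Value, which is why the high-scoring hits at the top of the list have E Values close to zero.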
- Below the descriptions are pairwise alignments that show the aligned region of each hit sequence matched up with the corresponding region of the query sequence. Because BLAST reports local alignments, an alignment may cover only part of either sequence. With a pairwise alignment you can see how the hit sequence compares with the query sequence amino acid by amino acid. The screen shot below is the pairwise alignment for the first hit.
- The hit sequence is presented in the Sbjct: line, and the query sequence in the Query: line.
- Each letter between the Sbjct and Query lines indicates that the amino acids at that position in both sequences are identical. A + indicates a conservative substitution, that is, a mismatch that still scores positively. Each blank space means that the amino acids at that position do not match.
- X’s are inserted into the query sequence as a result of automatic filtering. A string of X’s is used to replace a sequence’s low-complexity regions that can generate artifactual hits. In nucleotide sequences, N’s replace low-complexity regions rather than X’s.
- Dashes inserted into either query or subject sequence indicate gaps introduced to compensate for insertions and deletions.
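The conventions above can be reproduced for a pair of already-aligned strings. This minimal sketch rebuilds the middle line and counts identities and gaps; it does not mark the + symbols for conservative substitutions, since that would require a substitution matrix, and the two aligned strings are invented for illustration.

```python
def alignment_summary(query, sbjct):
    """Return (midline, identities, gaps) for two aligned, equal-length strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    midline = []
    identities = gaps = 0
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gaps += 1              # dash: gap from an insertion or deletion
            midline.append(" ")
        elif q == s:
            identities += 1        # identical residues are echoed as a letter
            midline.append(q)
        else:
            midline.append(" ")    # mismatch is shown as a blank
    return "".join(midline), identities, gaps


# Invented aligned strings, not output from a real search.
query = "MGPRA-PALL"
sbjct = "MGPKARPALL"
midline, identities, gaps = alignment_summary(query, sbjct)
print("Query:", query)
print("      ", midline)
print("Sbjct:", sbjct)
print(f"Identities = {identities}/{len(query)}, Gaps = {gaps}/{len(query)}")
```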
Adapted from: Oak Ridge National Labs