Overview
The BLAST (Basic Local Alignment Search Tool) tool compares input sequences to PlantGenIE sequence databases to identify homologous sequence matches.
Basic Usage
Paste your sequence (with or without a FASTA header) into the Query Sequence input text box. Alternatively, you can retrieve a transcript sequence by entering a gene ID into the Load example text box, or you can upload a sequence file (less than 100 MB) using the upload file function. After using one of these input options, select the desired dataset from the lists of available BLAST databases. Finally, click the BLAST! button at the bottom of the page.
PlantGenIE BLAST uses the standard default NCBI BLAST options. However, users can change the following advanced options:
Option | Description |
---|---|
Scoring matrix | Substitution matrix that determines the cost of each possible residue mismatch between query and target sequence. See BLAST substitution matrices for more information. |
Filtering | Whether to remove low complexity regions from the query sequence. |
E-value cutoff | The maximum expectation value of retained alignments. |
Query genetic code | Genetic code to be used in blastx translation of the query. |
DB genetic code | Genetic code to be used in tblastn and tblastx translation of the datasets. |
Frame shift penalty | Out-of-frame gapping (blastx, tblastn only) [Integer] default = 0. |
Number of results | The maximum number of results to return. |
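
As a rough illustration of how the advanced options above could map onto a command line, here is a minimal sketch that assumes the legacy `blastall` program shipped with NCBI BLAST 2.2.x. This page does not state how PlantGenIE invokes BLAST, so the function name `build_blastall_args`, its parameter names, and the choice to pass the number of results to both `-v` and `-b` are assumptions made for illustration only. The function only builds the argument list and does not run BLAST.

```python
from typing import List


def build_blastall_args(
    program: str,
    database: str,
    query_file: str,
    *,
    max_results: int,
    matrix: str = "BLOSUM62",
    filter_low_complexity: bool = True,
    evalue: float = 10.0,
    query_genetic_code: int = 1,
    db_genetic_code: int = 1,
    frame_shift_penalty: int = 0,
) -> List[str]:
    """Build a legacy blastall argument list (hypothetical helper)."""
    return [
        "blastall",
        "-p", program,                                # BLAST program, e.g. blastx
        "-d", database,                               # database name
        "-i", query_file,                             # query file
        "-M", matrix,                                 # scoring matrix
        "-F", "T" if filter_low_complexity else "F",  # low-complexity filtering
        "-e", str(evalue),                            # E-value cutoff
        "-Q", str(query_genetic_code),                # query genetic code
        "-D", str(db_genetic_code),                   # DB genetic code
        "-w", str(frame_shift_penalty),               # frame shift penalty
        "-v", str(max_results),                       # one-line descriptions (assumed mapping)
        "-b", str(max_results),                       # alignments shown (assumed mapping)
    ]


args = build_blastall_args("blastx", "example_db", "query.fa", max_results=10)
print(" ".join(args))
```

The database name `example_db` and file name `query.fa` are placeholders, not real PlantGenIE resources.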
BLAST results
The BLAST Results page reloads automatically until the search results have been retrieved. Results are organized into a table containing Query ID, Hit ID, Average bit score (top), Average e-value (lowest), Average identity (av. similarity) and Links. Clicking a result opens the corresponding region of homology in the GBrowse tool.
Data
The BLAST tool uses public genome assemblies, early release de novo assemblies from UPSC and data from [Phytozome](http://www.phytozome.net/) and Plaza.
Implementation
PlantGenIE BLAST search is implemented using NCBI BLAST (v2.2.26) and a backend PostgreSQL Chado database. We use PHP, JavaScript, XSL, Perl, and the d3js and Drupal libraries to extend the open-source GMOD Bioinformatic Software Bench server with a graphical user interface.
The NCBI BLAST family of programs includes:
- blastp
  - Compares an amino acid query sequence against a protein sequence database
- blastn
  - Compares a nucleotide query sequence against a nucleotide sequence database
- blastx
  - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- tblastn
  - Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- tblastx
  - Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
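The program descriptions above amount to a lookup on the query type and the database type. The following minimal sketch shows that lookup; the function name `choose_blast_program`, the labels `"protein"` and `"nucleotide"`, and the `translate_both` flag are hypothetical and are not part of the PlantGenIE interface.

```python
def choose_blast_program(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Return the BLAST program matching a query/database combination.

    query_type and db_type must be "protein" or "nucleotide".
    translate_both selects tblastx (six-frame translation of both sides),
    which only makes sense for a nucleotide query and a nucleotide database.
    """
    valid = {"protein", "nucleotide"}
    if query_type not in valid or db_type not in valid:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    programs = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "blastn",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    return programs[(query_type, db_type)]


print(choose_blast_program("nucleotide", "protein"))
```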
Query sequence
The query sequence for a BLAST search should be pasted into the 'Sequence' text area. The tool accepts several types of input and automatically determines the format of the input. To allow this, certain conventions are required with regard to the input of identifiers (e.g., accessions or gi's). These are described in 3) below. Accepted input types are FASTA, bare sequence, or sequence identifiers.
1.) FASTA
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>lcl|MA_1 len=89935
TGTGTACTCTTGTGATTGTGTTTCTCTCAGTGATCCTATCTATGTTATTGTTGTCTAGTAAATTGAAAGTAACCTAATAA
TAGTAGAAACTTTAACACTACAAATGCTTACTAGGTCCAAGAAGAGAATAAGGGTGGAGACCATGGAGGCTTCGACCAAG
GAGGCTTCAACAAAGGAGGTTACCAAGGAGGCCAGAGAGGAGGATATGGAAGAGGAAGAGGAAGAGGATATGATGGAGGA
GGAAGACCACCTACCTTTAATGGTGGTGAGATAGGCCACTTGTCACGATTTTGTGCCAAGCCGCATGCACCGTGTGGGTA
TTTCCCCAACTTCGACCATGTCACCGAGGATTTCCCAAAATTATTGAAAAAATGTGAAGAAAAAAAGGGGCATTGCAACA
TGGTGACTGCTAAGTTGATGTACGAGTGGTAACCCAAGGAGGCACCCATATGAGAGTGAAACTAGAACAGGGAGAAGGTT
CAAGGAAGAATATAGAAGGAAACATTAGAAAATCATCCCAATAACCTCCTAAGTTTGACAGTGTGCATCATGATTAGTTA
Blank lines are not allowed in the middle of FASTA input.
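The FASTA layout described above (a ">" description line followed by sequence lines, with no blank lines in the middle) can be checked with a small parser. This is a minimal sketch and not the parser used by PlantGenIE; the function name `parse_fasta` is hypothetical, and tolerating blank lines at the very start or end of the input is an assumption.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (defline, sequence) pairs.

    Raises ValueError for a blank line in the middle of the input or for
    sequence data that appears before the first '>' description line.
    """
    lines = text.splitlines()
    # Assumption: blank lines at the very start or end are harmless.
    while lines and not lines[0].strip():
        lines.pop(0)
    while lines and not lines[-1].strip():
        lines.pop()

    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            raise ValueError("blank line in the middle of FASTA input")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()  # defline without the '>' symbol
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' description line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


print(parse_fasta(">seq1 demo\nACGT\nTTGA\n"))
```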
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:
Code | Meaning |
---|---|
A | adenosine |
C | cytidine |
G | guanine |
T | thymidine |
N | A/G/C/T (any) |
U | uridine |
K | G/T (keto) |
S | G/C (strong) |
Y | T/C (pyrimidine) |
M | A/C (amino) |
W | A/T (weak) |
R | G/A (purine) |
B | G/T/C |
D | G/A/T |
H | A/C/T |
V | G/C/A |
- | gap of indeterminate length |
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
Code | Meaning |
---|---|
A | alanine |
B | aspartate/asparagine |
C | cysteine |
D | aspartate |
E | glutamate |
F | phenylalanine |
G | glycine |
H | histidine |
I | isoleucine |
K | lysine |
L | leucine |
M | methionine |
N | asparagine |
P | proline |
Q | glutamine |
R | arginine |
S | serine |
T | threonine |
U | selenocysteine |
V | valine |
W | tryptophan |
Y | tyrosine |
Z | glutamate/glutamine |
X | any |
* | translation stop |
- | gap of indeterminate length |
NOTE:
¹ Degenerate nucleotide codes (those that stand for more than one base, such as K, S or Y) are treated as mismatches in nucleotide alignments. Too many degenerate codes in a nucleotide query will cause PopGenIE BLAST to reject the input. For protein queries, too many nucleotide-like codes (A, C, G, T, N) may cause a similar rejection.
² In protein queries, U is replaced by X before the search, since U is not specified in any scoring matrix.
³ BLAST will not accept "-" in the query. To represent gaps, use a string of N or X instead.
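The conventions and notes above suggest a few clean-up steps before submission: map lower-case letters to upper-case, remove digits, replace U with X in protein queries, and replace "-" with N or X. The following is a minimal sketch of those steps under stated assumptions. The function name `normalize_query` is hypothetical, digits are removed rather than replaced, and each "-" is replaced by a single unknown-residue code because the true gap length is not known.

```python
import re


def normalize_query(sequence: str, is_protein: bool) -> str:
    """Clean a raw query sequence following the conventions described above."""
    unknown = "X" if is_protein else "N"
    # Drop whitespace and line breaks, then map lower-case to upper-case.
    seq = "".join(sequence.split()).upper()
    # Digits are not valid residue codes; this sketch simply removes them.
    seq = re.sub(r"[0-9]", "", seq)
    # Replace '-' with the unknown-residue code (one code per hyphen is an assumption).
    seq = seq.replace("-", unknown)
    if is_protein:
        # U (selenocysteine) is not in the scoring matrices, so it becomes X.
        seq = seq.replace("U", "X")
    return seq


print(normalize_query("acg-t 12\nuu", is_protein=False))
print(normalize_query("mkU-l*", is_protein=True))
```

In a nucleotide query U is left unchanged, since it is a valid nucleic acid code (uridine).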
2.) Bare Sequence
This may be just lines of sequence data, without the FASTA definition line, e.g.:
GTGTACTCTTGTGATTGTGTTTCTCTCAGTGATCCTATCTATGTTATTGTTGTCTAGTAAATTGAAAGTAACCTAATAA
TAGTAGAAACTTTAACACTACAAATGCTTACTAGGTCCAAGAAGAGAATAAGGGTGGAGACCATGGAGGCTTCGACCAAG
GAGGCTTCAACAAAGGAGGTTACCAAGGAGGCCAGAGAGGAGGATATGGAAGAGGAAGAGGAAGAGGATATGATGGAGGA
GGAAGACCACCTACCTTTAATGGTGGTGAGATAGGCCACTTGTCACGATTTTGTGCCAAGCCGCATGCACCGTGTGGGTA
Blank lines are not allowed in the middle of bare sequence input.
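Telling FASTA input apart from a bare sequence can be sketched as a check on the first line followed by a check on the allowed characters. This is only an illustration of the idea, not the detection logic PlantGenIE uses. The function name `detect_input_format` is hypothetical, and the allowed alphabet is taken as the union of the nucleic acid and amino acid codes listed earlier.

```python
def detect_input_format(text: str) -> str:
    """Return "fasta" or "bare" for pasted query text, or raise ValueError."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("empty query")
    lines = stripped.splitlines()
    if any(not line.strip() for line in lines):
        raise ValueError("blank lines are not allowed in the middle of the input")
    if lines[0].startswith(">"):
        return "fasta"
    # Union of the nucleic acid and amino acid codes listed above.
    allowed = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")
    for line in lines:
        if not set(line.strip().upper()) <= allowed:
            raise ValueError("unexpected character in bare sequence")
    return "bare"


print(detect_input_format(">lcl|MA_1 len=89935\nTGTGTACTCTTG"))
print(detect_input_format("GTGTACTCTTGTGATTGTG\nTAGTAGAAACTTTAACACT"))
```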
3.) Sequence file
This function allows users to upload a text file containing queries in one of the formats outlined above. Long sequences should be uploaded through this option to avoid possible browser buffer size limits.
For more information about BLAST please see the extensive documentation provided by the NCBI (BLAST docs).