                BLAST E-Mail Server on GenomeNet


BLAST (Basic Local Alignment Search Tool) was developed by the National
Center for Biotechnology Information at the National Library of Medicine
and made available for use on the GenomeNet E-mail server.  The BLAST
program employs a heuristic search algorithm to compare a protein or
nucleic acid query sequence against a protein or nucleic acid sequence
database.  BLAST compares sequences with databases using an ungapped
alignment algorithm.  If you use BLAST as a research tool, we ask that
the following reference be cited in your paper:

        S. F. Altschul, W. Gish, W. Miller, E. W. Myers and 
        D. J. Lipman (1990) J. Mol. Biol.  215, 403-410.

The GenomeNet BLAST server allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
BLAST server at the Supercomputer Laboratory, Institute for Chemical
Research, Kyoto University.  A BLAST search is then performed against
the specified database and the results returned in a mail message.


Accessing the BLAST program

To access the program, send an electronic mail message containing the 
formatted query sequence (as described below) to the following Internet 
address:

        blast@genome.ad.jp

If you are not on the Internet, you may need to change the format of
the address.  Consult your systems manager to determine the correct
address format.


Obtaining Help

If you would like to receive instructions on using the BLAST program,
send a mail message to the address above containing the word "help" on
a single line.  This document is then returned to you in a mail message.
The BLAST manual page is appended to the end of the help text.


Programs using the BLAST algorithm

Select one of the following four programs according to the purpose
of the search.

    Designator          Purpose
    ----------          -------
    blastp              To compare a protein query sequence vs.
                        a protein sequence database.

    blastn              To compare a nucleic acid query sequence
                        vs. a nucleic acid sequence database.

    blastx              To compare a nucleic acid query sequence
                        translated in all reading frames vs. a
                        protein sequence database.

    tblastn             To compare a protein query sequence vs.
                        a nucleic acid sequence translated in
                        all reading frames.


Databases for use with BLAST

The following databases are currently available for BLAST searches:

    Designator          Database
    ----------          --------
    nr-nt               Non-redundant nucleic acid sequence database
                        constructed from genbank,embl,genbank-upd

    genbank             GenBank nucleic acid sequence database
                        (Latest quarterly release)

    embl                EMBL nucleic acid sequence database
                        (Latest quarterly release)

    nr-aa               Non-redundant protein sequence database
                        constructed from swissprot,pir,prf,genpept,
                        genpept-upd

    swissprot           SWISS-PROT protein sequence database

    pir                 PIR protein sequence database

    prf                 PRF protein sequence database


Formatting a Query

Queries consist of a mail message with search parameters identifying
the program, the database to be searched, values related to the search,
and the query sequence to be used in the search.  The mail message has
three mandatory lines, optional lines if any, and a line identifying
the query sequence as described below.  These lines are typed into the
body of the mail message in the order shown below:

 Search 
Parameter         Mandatory     Explanation
---------         ---------     -----------
PROGRAM             Yes         Indicates the program to be used (i.e.,
                                blastp, blastn, blastx, or tblastn;
                                see list above).

DATALIB             Yes         This line specifies the database to be 
                                searched (e.g., genbank, embl, pir, or
                                swissprot; see list above).

(options)           No          Each option can be specified in the form of:
                                     name=value
                                as described in the attached manual page.

BEGIN               Yes         This line must be included in the message. 
                                It should be the last parameter and followed
                                on the next line with the query sequence.
                                See below for the format of the query
                                sequence.

The remainder of the message contains the query sequence in FASTA
format.  See the sample below.


Preparing Files for Similarity Searches

Only one query sequence is allowed per mail message and your sequence
must be in FASTA format.  If your sequence file is in another format,
you can use a text editor to convert it to FASTA format.
The format includes a mandatory comment line beginning with a greater-than
sign ">" followed by the name of the sequence, a space, and an optional
note about the sequence.  The sequence data begin on the next line without
the greater-than sign.  For example:

>AGREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg 
tttccatttactttggattatacgtcattataaa
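
As a minimal sketch, the Python function below writes a sequence in the
FASTA layout described above: a comment line with ">", the name, a space
and an optional note, followed by the sequence data.  The name to_fasta
and the 60-character line width are illustrative assumptions and are not
part of the server.

```python
def to_fasta(name, sequence, note="", width=60):
    """Return one FASTA record as a string (hypothetical helper)."""
    if not name or " " in name:
        raise ValueError("the sequence name must be a single word")
    if not 0 < width < 80:
        raise ValueError("width must keep lines shorter than 80 characters")
    # Comment line: ">" + name, then a space and the optional note.
    header = ">" + name + (" " + note if note else "")
    # Remove all whitespace from the raw sequence, then wrap it.
    residues = "".join(sequence.split())
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + lines) + "\n"


record = to_fasta(
    "AGREP4",
    "ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg"
    "tttccatttactttggattatacgtcattataaa",
    note="Monkey SV40-like genomic segment promoting transcription.",
)
print(record, end="")
```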


Sending the Query Sequence

Use your local mail program to send your query sequence.  Most
mail programs allow you to import a file containing your sequence into
the mail message.  You should import your sequence file into the mail
message on the line after "BEGIN".  Please follow the format in the
following example of a BLAST request PRECISELY, but note that the
program is case-insensitive, i.e.  either upper or lower case letters
may be used.

Here is an example of a mail message sent for a BLAST search.  Note that
the first four lines are a mail header that is automatically created
when you address a mail message.  Nothing need be entered for the
Subject.  NOTE: the text that you enter into the body of the message
begins with the "PROGRAM" keyword below (do not add blank lines in the
message).  Each line of information must be less than 80 characters in
length.  Longer lines may be truncated.


From:  somebody@someaddress.somewhere.ac.jp Wed May 27 16:34:06 1992
Date:  Wed, 27 May 92 16:38:02 JST
To:    blast@genome.ad.jp
Subject:  

PROGRAM blastn
DATALIB genbank
B=20
BEGIN
>BOVPRL GenBank entry BOVPRL from gbmam file.907 nucleotides. 
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa


The example above uses the three mandatory keyword lines:
PROGRAM, DATALIB, and BEGIN, and an optional line (B=20) limiting to 20
the number of database sequences for which high-scoring segment pairs
are reported.
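
The message body can also be assembled by a short script.  The sketch
below is only an illustration: build_blast_request is a hypothetical
helper, not software supplied by GenomeNet.  It emits the keyword lines
in the required order and enforces the rules stated above (no blank
lines, every line shorter than 80 characters).

```python
def build_blast_request(program, datalib, fasta_record, options=None):
    """Return the body of a BLAST mail request (hypothetical helper)."""
    programs = {"blastp", "blastn", "blastx", "tblastn"}
    if program.lower() not in programs:
        raise ValueError("unknown program: " + program)
    lines = ["PROGRAM " + program, "DATALIB " + datalib]
    # Optional lines use the name=value form, for example B=20.
    for name, value in (options or {}).items():
        lines.append("{}={}".format(name, value))
    # BEGIN is the last parameter; the query follows on the next line.
    lines.append("BEGIN")
    lines.extend(fasta_record.strip().splitlines())
    for line in lines:
        if not line.strip():
            raise ValueError("blank lines are not allowed in the message")
        if len(line) >= 80:
            raise ValueError("line is not shorter than 80 characters")
    return "\n".join(lines) + "\n"


query = (
    ">BOVPRL GenBank entry BOVPRL from gbmam file.907 nucleotides.\n"
    "tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat\n"
    "caccaccatggacagcaaa\n"
)
body = build_blast_request("blastn", "genbank", query, {"B": 20})
print(body, end="")
```

The resulting text would then be pasted or imported into a message
addressed to the BLAST server with your local mail program.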


Handling the Results of a BLAST Search

When the results are returned, use your local mail program to view
them.  You can transfer the results of a BLAST search to a separate
disk file to free up space in your mail directory.  Consult the
documentation for your local mail program for the commands to read
and transfer mail.


Retrieving Individual Entries Found in BLAST Searches

Database entries can be retrieved by either entry name or accession
number.  To use the GenomeNet database retrieval server, send an electronic
mail message to:

        dbget@genome.ad.jp

To get started, send a mail message to the address above containing the
word "help" on a single line in the body of the mail message.


                                                   Last Update: 93/04/08
==========================================================================
BLAST Manual Page



BLAST(1)                 USER COMMANDS                   BLAST(1)



NAME
     blastp, blastn, blastx, tblastn -  rapid  sequence  database
     query programs using the BLAST algorithm

SYNOPSIS
     blastp aadb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
                         [M=subfile] [Y=#] [Z=#] [K=#] [L=#] [H=#] [V=#] [B=#]

     blastn ntdb ntquery [E=#] [S=#] [W=#] [X=#] [M=#] [N=#] [Y=#] [Z=#]
                         [K=#] [L=#] [H=#] [V=#] [B=#] [[top][bottom]]

     blastx aadb ntquery [E=#] [S=#] [W=#] [T=#] [X=#] [M=subfile]
                         [Y=#] [Z=#] [C=#] [K=#] [L=#] [V=#] [B=#]
                         [[top][bottom]]

     tblastn ntdb aaquery [E=#] [S=#] [E2=#] [S2=#] [W=#] [T=#] [X=#]
                         [M=subfile] [Y=#] [Z=#] [C=#] [K=#] [L=#]
                         [H=#] [V=#] [B=#] [[top][bottom]]

DESCRIPTION
     BLAST (Basic Local Alignment Search Tool) is  the  heuristic
     search  algorithm  employed  by the programs blastp, blastn,
     blastx, and tblastn.  The four programs  are  used  for  the
     following purposes:

     blastp
          to compare an amino acid query sequence vs.  a  protein
          sequence database;

     blastn
          to compare a nucleotide query sequence vs. a nucleotide
          sequence database;

     blastx
          to compare a nucleotide query  sequence  translated  in
          all reading frames vs. a protein sequence database;

     tblastn
          to compare a protein query sequence  vs.  a  nucleotide
          sequence database dynamically translated in all reading
          frames.

     Whenever a nucleotide query sequence or nucleotide  database
     is  involved,  both  strands  (or  all 6 reading frames) are
     searched by default.  The "top" and "bottom" options may  be
     used to restrict a search to the specified strand.  (If both
     options are specified, both strands will be searched).
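
     To make the phrase "all 6 reading frames" concrete, the
     following Python sketch lists the three frames on each
     strand of a nucleotide sequence.  It is an illustration
     only, not the code used by the programs; the function names
     are hypothetical, and translation into amino acids is left
     out.

```python
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the bottom strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq, top=True, bottom=True):
    """Return (strand, offset, subsequence) for each requested frame."""
    strands = []
    if top:
        strands.append(("top", seq))
    if bottom:
        strands.append(("bottom", reverse_complement(seq)))
    frames = []
    for label, strand in strands:
        for offset in range(3):
            # Trim so that the frame holds only complete codons.
            end = offset + 3 * ((len(strand) - offset) // 3)
            frames.append((label, offset, strand[offset:end]))
    return frames


for frame in reading_frames("ATGGCCTAA"):
    print(frame)
```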

     The unit of BLAST algorithm output is the High-scoring  Seg-
     ment  Pair (HSP), where a segment is an arbitrarily long run
     of contiguous residues.  In the programmatic implementations
     of  the  algorithm  described  here,  an  HSP  is  a pair of



Sun Release 4.1   Last change: 29 December 1991                 1






     segments, one from the query sequence and one from  a  data-
     base  sequence,  where the score of their ungapped alignment
     meets or exceeds a parametrized, positive-valued cutoff.   A
     set of zero or more HSPs is thus defined by two sequences, a
     scoring scheme, and a cutoff score.

     A Maximal-scoring Segment  Pair  (MSP)  is  defined  by  two
     sequences and a scoring scheme and is the highest-scoring of
     all segment pairs on all diagonals.  Depending on the param-
     eters  of  a  BLAST sequence comparison, there may be a non-
     zero probability of not finding one or more  HSPs  of  which
     the MSP is a member.

PARAMETERS
     Parameters are modified using  a  name=value  syntax,  e.g.,
     E=0.05 or S=100.

     E is interpreted as the expected number of  MSPs  that  will
     satisfy  the  cutoff  score under the random sequence model.
     The value of E approximates the expected number of HSPs that
     will  be  found  during  the  course  of  an entire database
     search.  The default value for E is 10,  and  the  permitted
     range for this real-valued parameter is 0 < E <= 1000.

     S is the cutoff score for  reporting  HSPs.   Higher  scores
     correspond  to  increasing  statistical  significance (lower
     probability or reduced expected  frequency  of  occurrence).
     Any  positive-scoring alignments which the programs find but
     which score below S are not reported.  Unless  S  is  expli-
     citly  set  on the command line, its default value is calcu-
     lated from the value of E.

     The values for E and S are interconvertible, a process which
     is  dependent on the following factors: the length and resi-
     due composition of the query sequence;  the  length  of  the
     database  and  a fixed, hypothetical residue composition for
     it; and the scoring scheme  employed.   The  scoring  scheme
     used  by  blastp,  blastx,  and  tblastn  is  a substitution
     matrix; the scoring scheme used  by  blastn  is  a  positive
     reward  score  for  matching residues and a negative penalty
     score for mismatched residues.

     When both of the parameters E and S  are  specified  on  the
     command  line,  the  one resulting in the highest (most res-
     trictive) cutoff score will be used.  When neither of  these
     parameters  is  specified  on  the command line, the default
     value for E is used to calculate the cutoff score.

     For a given value of E (e.g., the default value  of  10),  a
     given  query sequence, and a single scoring scheme, the cal-
     culated value of the cutoff score S will be  different  when
     searching  databases of different lengths.  To normalize the
     statistics reported when databases of different lengths  are
     searched,  the  parameter Z (see below) may be set to a con-
     stant value for all database searches.

     S takes on only integral values in the  present  implementa-
     tions  of the BLAST algorithm.  When the cutoff score is set
     implicitly via E, S is rounded to the least  integral  value
     required  to  satisfy  E.   Since the rounding procedure can
     decrease the effective value of E, the calculated value  for
     S  is used to back-calculate the effective value for E.  For
     example, if the user specifies E = 50 on the command line, a
     cutoff score that is rounded up by 0.9 units to the smallest
     satisfying integer might correspond to an expected number of
     HSPs of only 43.  In this case, the value displayed for E at
     the end of the program's report will be 43.
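
     The rounding and back-calculation just described can be
     sketched in Python.  The sketch assumes the standard
     Karlin-Altschul approximation E = K * m * n * exp(-Lambda * S),
     with m the query length and n the database length; the
     programs may apply further corrections that are not
     reproduced here.  The function names and all numeric values
     are illustrative assumptions.

```python
import math


def cutoff_from_expect(expect, k, lam, query_len, db_len):
    """Least integer score S whose expected HSP count is <= expect."""
    raw = math.log(k * query_len * db_len / expect) / lam
    return max(1, math.ceil(raw))


def expect_from_cutoff(score, k, lam, query_len, db_len):
    """Expected number of chance HSPs scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


# Illustrative values only; K and Lambda are not from any real run.
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 300, 10_000_000

s = cutoff_from_expect(50, K, LAMBDA, QUERY_LEN, DB_LEN)
effective_e = expect_from_cutoff(s, K, LAMBDA, QUERY_LEN, DB_LEN)
print(s, effective_e)
```

     Because S is rounded up, the back-calculated E is never
     larger than the value requested.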

     When at least one HSP is found involving any given  database
     sequence,  the  programs blastp and tblastn search the data-
     base sequence a second time for HSPs that  satisfy  a  lower
     cutoff  score, S2.  In essence, the second-pass search gives
     these  programs  the  opportunity   to   report   any   low-
     significance  HSPs  they  may  have  found  that might be of
     interest within the context of finding one or  more  higher-
     scoring  (perhaps  statistically significant) HSPs.  Poisson
     statistics may  indicate  that  the  lower-scoring  (higher-
     probability)  HSPs  are statistically significant when their
     frequencies of occurrence are considered.

     In a relationship similar to that between the  parameters  E
     and  S,  S2  can be set explicitly on the command line or it
     will be calculated from the setting of  E2.   Whereas  S  is
     related  to  E by the size of the database and the length of
     the query sequence, S2 is related to E2 by the lengths of  a
     pair of hypothetical protein sequences of 300 residues each.
     In other words, E2 approximates the number of HSPs one would
     expect  to  find  when  comparing  two  protein sequences of
     length 300, one having the composition of the query sequence
     and the other having the hypothetical residue composition of
     the database.  If a second-pass search is not desired,  set-
     ting  E2  to zero (0) turns this feature off.  If S2 happens
     to be equal to or greater than the primary cutoff score, the
     second-pass search is likewise not performed.

     The user should be forewarned that, with no other  knowledge
     about  a  positive-scoring  segment pair than its score, the
     chance that the BLAST algorithm will not find the  alignment
     increases  as  the score of the alignment decreases.  Conse-
     quently, the low-scoring HSPs looked for in the  second-pass
     search have a smaller chance individually of being found.

     With a fixed scoring scheme, the probability of  missing  an
     alignment  can  be  decreased  by: lowering the neighborhood
     word-score threshold, T, while keeping  the  word  size,  W,
     constant;  lowering both W and T appropriately (see Altschul
     et al., 1990); and/or raising the  word-hit-extension  drop-
     off score X (described below).

     W is the word size for  finding  initial  hits  against  the
     database sequences.  Each hit is extended in both directions
     along  the  corresponding  diagonal  of  an   imaginary   2-
     dimensional  matrix  until the segment score drops off by at
     least the quantity X.  The default value for W  is  3  amino
     acids  for  blastp,  blastx, and tblastn, and 12 nucleotides
     for blastn.  The value of W used by  blastn  should  not  be
     changed,  as  the  logic  of the program source code has not
     been validated for use with values other than  the  default.
     For  the  other programs, which perform sequence comparisons
     at the level of individual amino acids, W  should  generally
     be  restricted to values less than 5 or else the value for T
     should be specified disproportionately larger to avoid  con-
     suming  vast  quantities of memory for the neighborhood word
     list (see below).

     T is the word score threshold  for  generating  neighborhood
     words of length W from the query sequence, prior to scanning
     the database (blastp,  blastx,  and  tblastn  only).   Words
     which  have  an  aggregate  score  (through summation of the
     individual residue substitution scores) of at least  T  when
     aligned  with  words from the query sequence are included in
     the neighborhood list.  Raising the value of T increases the
     likelihood  of completely missing HSPs, but can decrease the
     search time and  memory  requirements  of  the  programs  by
     decreasing  the  size  of the neighborhood list.  One of the
     key (but not unique) features of the BLAST algorithm is  the
     user-selectable trade-off in sensitivity for speed.
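
     The effect of T on the neighborhood list can be seen in a
     small Python sketch.  It enumerates every word of length W
     over a toy three-letter alphabet and keeps those whose
     aggregate score against a query word is at least T.  The
     scoring rule (+5 for identity, -2 otherwise) and the
     alphabet are assumptions chosen for brevity; they are not
     the PAM120 matrix, and the real programs need not enumerate
     words this way.

```python
from itertools import product


def neighborhood(query, w, t, score, alphabet):
    """Map each query position to the words scoring at least t there."""
    result = {}
    for pos in range(len(query) - w + 1):
        word = query[pos:pos + w]
        hits = []
        for candidate in product(alphabet, repeat=w):
            # Aggregate score: sum of per-residue substitution scores.
            total = sum(score(a, b) for a, b in zip(word, candidate))
            if total >= t:
                hits.append("".join(candidate))
        result[pos] = hits
    return result


def toy_score(a, b):
    """Toy scoring rule (an assumption, not PAM120)."""
    return 5 if a == b else -2


words_low_t = neighborhood("ACDA", w=3, t=8, score=toy_score, alphabet="ACD")
words_high_t = neighborhood("ACDA", w=3, t=15, score=toy_score, alphabet="ACD")
print(len(words_low_t[0]), len(words_high_t[0]))
```

     Raising T from 8 to 15 in this toy case leaves only the
     exact query word in each list, which is the trade-off
     between sensitivity and list size described above.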

     A generally suitable value for T is calculated at  run-time,
     using  the  residue  composition  and  length  of  the query
     sequence and the substitution matrix employed.   The  neigh-
     borhood word-score threshold is set using an ad hoc equation
     that is a function of Lambda and H. Lambda is the number  of
     nats  of information gained per unit increase in score of an
     alignment (approximately 0.69315 times the  number  of  bits
     per unit score); H is the relative entropy of the target and
     background residue frequencies [Karlin and Altschul,  1990],
     or  the  expected  information  available per position in an
     alignment to distinguish it from chance.

     The supplied PAM120 amino acid substitution matrix,  with  a
     scale of ln(2)/2, yields a value for Lambda that is close to
     0.5 bit per unit score for query sequences of typical  resi-
     due compositions.  Occasionally it may be necessary to manu-
     ally set the neighborhood word-score threshold via the  com-
     mand line, for which 13 may be a good value to try, but this
     is highly dependent on the  substitution  matrix  and  word-
     length, W, being employed.

     X is a positive integer representing the maximum permissible
     drop-off  of  the  cumulative  segment score during word-hit
     extension.  Raising X may decrease the chance that the BLAST
     algorithm   overlooks  an  HSP,  but  it  may  significantly
     increase the search time, as well.  If computation  time  is
     of  little concern, X might be increased several points from
     its default value, but only a very marginal increase in sen-
     sitivity might be expected.

     For blastp, blastx, and tblastn, the default value of  X  is
     calculated  to be the minimum integral score representing at
     least 10 bits of information, or a reduction in the statist-
     ical  significance  of the alignment by a factor of 2 to the
     10th power (about 1,000).  For blastn, the default value  of
     X  is the minimum integral score that represents at least 20
     bits of information, or a reduction in the statistical  sig-
     nificance  of  the  alignment  by  a factor of 2 to the 20th
     power (about one million).
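
     As a worked illustration of "the minimum integral score
     representing at least 10 bits", the Python sketch below
     converts a number of bits into a score using Lambda in nats
     per unit score, as defined earlier.  The function name is
     hypothetical and the Lambda values are examples, not output
     from the programs.

```python
import math


def dropoff_score(bits, lam):
    """Least integer score worth at least `bits` bits.

    `lam` is Lambda in nats per unit score, so one unit of score
    is worth lam / ln(2) bits.
    """
    raw = bits * math.log(2) / lam
    # The small tolerance guards against floating-point noise when
    # the exact answer is an integer.
    return math.ceil(raw - 1e-9)


# Lambda for a matrix scaled at ln(2)/2: half a bit per unit score.
lam_half_bit = math.log(2) / 2
x_protein = dropoff_score(10, lam_half_bit)
x_other = dropoff_score(10, 0.3)
print(x_protein, x_other, 2 ** 10)
```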

     The command line parameters K and L can be used to set fixed
     values  for  the Karlin statistics' K and lambda parameters,
     respectively.  Users should generally  avoid  setting  these
     parameters  unless  the  full  ramifications of doing so are
     understood.  As an example of one of the less obvious effects
     of  manually  choosing  these parameters, the value of the H
     statistic reported at  the  end  of  each  program's  output
     (which  is  distinct  from the command line parameter of the
     same name) is a function of lambda; and  the  default  value
     for  the neighborhood word-score threshold parameter T is in
     turn a function of H.

SCORING SCHEMES
     With blastp, blastx, and tblastn, the M option can  be  used
     to  select  an  alternate  substitution  matrix  file.   The
     default PAM120 matrix is  recommended  for  general  protein
     similarity searches (Altschul, 1991).  While only the PAM120
     and the PAM250 matrices are provided, the pam(1) program can
     be  used  to  produce PAM matrices of any desired generation
     from 2 to 511.  For rigorous searches where  the  mutational
     distance  between  potential  homologs  is unknown, Altschul
     (1991) recommends performing three searches, one  each  with
     the PAM-40, PAM-120, and PAM-250 matrices.

     In blastn, M is the score for a single-letter  match;  N  is
     the  score  for  a  single-letter mismatch.  M and N must be
     positive and negative  integers,  respectively.   Given  the
     assumption  made  by  blastn that the 4 nucleotides A, C, G,
     and T are represented equally in the database, the  expected
     score for the query sequence must be negative.
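
     One way to check a proposed M and N is to compute the
     expected score of a single aligned pair of random bases.
     With the four nucleotides equally frequent, two random
     bases match with probability 0.25.  The Python sketch below
     assumes that this per-pair expectation is the quantity that
     must be negative; the function name and the example values
     are illustrative.

```python
def expected_pair_score(match, mismatch):
    """Expected score of aligning two random, equally likely bases."""
    if match <= 0 or mismatch >= 0:
        raise ValueError("M must be positive and N must be negative")
    # Four equally likely bases: P(match) = 4 * 0.25 * 0.25 = 0.25.
    p_match = 0.25
    return p_match * match + (1 - p_match) * mismatch


usable = expected_pair_score(5, -4)    # 0.25 * 5 + 0.75 * (-4)
unusable = expected_pair_score(7, -2)  # 0.25 * 7 + 0.75 * (-2)
print(usable, unusable)
```

     The first pair gives a negative expectation and the second
     a positive one, so only the first satisfies the requirement
     under this assumption.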



SEQUENCE LENGTH AND STATISTICAL SIGNIFICANCE
     For the purpose of calculating significance levels, Y is the
     effective  length  of the query sequence and Z is the effec-
     tive length of the database, both measured in residues.  The
     default  values  for these parameters are the actual lengths
     of the query sequence and  database,  respectively.   Larger
     values  signify  more  degrees  of  freedom for aligning the
     sequences and reduced statistical significance for an align-
     ment of any given score.

GENETIC CODES
     C is a non-negative integer that determines the genetic code
     that will be used by blastx (tblastn) to translate the query
     sequence (database sequences).   The  default  genetic  code
     (C=0)  corresponds  to  the  so-called Standard or Universal
     genetic code.  To obtain a listing  of  the  nine  available
     genetic  codes  and  their associated numerical identifiers,
     invoke either blastx or tblastn with the command line param-
     eter C=list.

     The current list  of  genetic  codes  and  their  associated
     values for parameter C are:

     0 Standard or Universal

     1 Vertebrate Mitochondrial

     2 Yeast Mitochondrial

     3 Mold Mitochondrial and Mycoplasma

     4 Invertebrate Mitochondrial

     5 Ciliate Macronuclear

     6 Protozoan Mitochondrial

     7 Plant Mitochondrial

     8 Echinoderm Mitochondrial

POISSON STATISTICS
     The occurrence of two  or  more  HSPs  involving  the  query
     sequence  and  the  same  database  sequence is modeled as a
     Poisson process.  An important result  of  applying  Poisson
     statistics  is  that an HSP with a low score and high Expect
     value (low significance) may be discovered to  be  statisti-
     cally  significant  when  appearing in the context of one or
     more additional matches of equal or higher score against the
     same database sequence.





     The Poisson P-value for any given HSP is a function  of  its
     expected  frequency  of  occurrence  and  the number of HSPs
     actually observed with scores at least as high.  The Poisson
     P-value for a group of HSP events is the probability that at
     least as many HSPs would occur by chance, each with a  score
     at  least as high as the lowest-scoring member of the group.
     HSPs which appear on opposite strands of a nucleotide  query
     or  database sequence are considered independent and distin-
     guishable events, and so are counted separately.
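
     The group P-value described above is the upper tail of a
     Poisson distribution.  The Python sketch below computes the
     probability of at least `count` events when the expected
     number is `expect`.  It is a minimal illustration: the
     function name is hypothetical, and any adjustment the
     programs make to the Expect value is not reproduced.

```python
import math


def poisson_p_value(expect, count):
    """P(at least `count` events) for a Poisson mean of `expect`."""
    if count < 1:
        return 1.0
    # Sum the probabilities of observing 0 .. count-1 events.
    below = sum(
        math.exp(-expect) * expect ** k / math.factorial(k)
        for k in range(count)
    )
    return 1.0 - below


single = poisson_p_value(0.5, 1)  # one HSP at this score or better
pair = poisson_p_value(0.5, 2)    # two such HSPs in one sequence
print(single, pair)
```

     With the same Expect value, a pair of HSPs yields a smaller
     P-value than a single HSP, which is how a low-scoring match
     can become significant in context.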

     Given the score of an HSP, when the expected length  for  an
     alignment  with  that score (see the description of H above)
     is a  significant  fraction  of  the  length  of  the  query
     sequence, the Expect value used in estimation of the Poisson
     P-value is reduced proportionately.

P-VALUES, ALIGNMENT SCORES, AND INFORMATION
     The Expect and P-values of HSPs reported by the programs are
     dependent  on numerous factors including: the scoring scheme
     employed, the residue composition of the query sequence,  an
     assumed residue composition for a typical database sequence,
     and the query and  database  lengths.   Independent  of  the
     query  and  database  lengths are the HSP scores themselves,
     which may be readily compared between different program runs
     even  if the databases searched are of different lengths, as
     long as all of the other relevant factors listed  here  were
     unchanged.

     Further isolation from the many variables  of  a  search  in
     one's  assessment of an HSP may be obtained by observing the
     information content reported (in bits) for  the  alignments.
     While the information content of an HSP may change when fun-
     damentally different scoring schemes are  used  (e.g.,  dif-
     ferent  generations  of  PAM  matrices),  the number of bits
     reported for an HSP will be independent  of  the  scales  to
     which  the  matrices  were  generated.   (In  practice, this
     statement is not quite true because the substitution  scores
     used  by  these  programs  are floating point or real values
     which have been rounded to nearest integers and thus lack  a
     high degree of precision).  When communicating the statisti-
     cal significance of an alignment, the alignment score itself
     is generally not so important as the combination of the sub-
     stitution matrix employed and the actual information content
     of the alignment.

REGULATING OUTPUT
     The output is categorized into three sections:  a  histogram
     of  word-hit  extension scores; one-line descriptions of the
     database sequences that yielded one or more  HSPs;  and  the
     high-scoring  segment pairs themselves.  Each section of the
     output can be selectively suppressed by setting the  parame-
     ters H, V, and B to 0 (zero).



     Parameter H regulates the display of a histogram of the scores
     of the highest-scoring hit extensions for each database
     sequence.  As long as H has a non-zero value, the histogram
     will be displayed (except for the blastx program, which never
     displays a histogram but retains the H parameter for
     command-line compatibility with the other programs).  The
     default value for H is 1.

     Parameter V is the maximum number of database sequences  for
     which  one-line  descriptions will be reported.  The default
     value for V  is  500.   A  warning  message  is  prominently
     displayed  at  the  end of the one-line descriptions section
     when HSPs are found in more than V  sequences.   When  V  is
     zero,  no  one-line descriptions are reported and no warning
     is given.  Negative values for V are  undefined  and  disal-
     lowed.

     As an example of how V can be used advantageously, if a high
     value for E is desired to virtually assure in all cases that
     at least one HSP will be found, selecting a small value  for
     V  will  ensure  that the output will not be too voluminous;
     only the most  statistically  significant  matches  will  be
     reported.

     Parameter B regulates the display of the  high-scoring  seg-
     ment pairs.  For positive values, B is the maximum number of
     database sequences for which high-scoring segment pairs will
     be  reported.   This  may  be  much  smaller than the actual
     number of high-scoring segment  pairs  reported,  since  any
     given database sequence may yield several HSPs.  The default
     value for B is 250.  Negative values for B are undefined and
     disallowed.

SUPPORT UTILITIES
     Databases to be searched by these  programs  must  first  be
     processed  by  the  program setdb for protein sequence data-
     bases (re: blastp and blastx) and the  program  pressdb  for
     nucleotide sequence databases (re: blastn and tblastn).

     Point accepted mutation (PAM) matrices  of  various  genera-
     tions  can  be  produced automatically with the pam program.
     The output can be saved in a file whose name is then  speci-
     fied  in  the  M=filename  option  of  a  blastp, blastx, or
     tblastn query.

BUGS
     blastn uses a large value for the wordlength, W, and does no
     neighboring  on  these  words.  Consequently, the program is
     suitable for finding nearly identical sequences rapidly.  To
     identify  weak  amino  acid  similarities encoded by nucleic
     acid, use blastx or tblastn.




     In blastp, blastx, and tblastn, ad hoc  equations  have  not
     been  implemented  yet  for  calculating appropriate default
     values for T when W has a value other than 3 or 4.

     When  nucleotide  sequence  databases  are  processed   into
     searchable  form  by  the  pressdb  program, IUPAC ambiguity
     letters are replaced by an appropriate random selection from
     the list A, C, G and T. (For example, an R would be replaced
     on the average half of the time by an A and half of the time
     by  a  G).   Similarly, blastn replaces ambiguity letters in
     the query sequence with appropriate random selections.  Only
     after  an  HSP  is found that satisfies the cutoff score are
     the original sequences with their ambiguities  intact  exam-
     ined.   With  blastn,  the alignment score will decrease and
     may consequently fall below the cutoff score if  the  random
     replacement  letter  happened  to  match.   With  blastx and
     tblastn, the outcome will depend  upon  whether  a  specific
     amino acid can be inferred despite the ambiguity.
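
     The random replacement of ambiguity letters can be
     illustrated with a short Python sketch.  It uses the
     standard IUPAC nucleotide ambiguity codes and picks each
     replacement uniformly from the compatible bases; uniform
     choice is an assumption consistent with the R example
     above, and the function name is hypothetical.

```python
import random

IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_ambiguity(seq, rng=random):
    """Replace each ambiguity letter with a random compatible base."""
    out = []
    for letter in seq.upper():
        if letter in IUPAC:
            out.append(rng.choice(IUPAC[letter]))
        else:
            out.append(letter)
    return "".join(out)


resolved = resolve_ambiguity("ACGRTN", random.Random(0))
print(resolved)
```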

     tblastn uses only one genetic code to translate  the  entire
     nucleotide   sequence   database,  although  the  particular
     genetic code employed is selectable via the parameter C.

     blastn, blastx, and  tblastn  treat  U  and  T  residues  in
     nucleotide sequences the same.

     With one exception, any letter in the query  sequence  which
     is  not a member of the relevant IUPAC amino acid or nucleo-
     tide code  is  stripped  and  does  not  contribute  to  the
     sequence  coordinate  numbers reported by the programs.  The
     exception is asterisks (*) in amino  acid  sequences,  which
     are  interpreted  as translation stops.  In protein sequence
     databases that are processed into  searchable  form  by  the
     setdb  program, non-IUPAC letters, including any punctuation
     but excluding asterisks, are  also  stripped.   The  pressdb
     program  does  not  strip  non-IUPAC  codes, but treats them
     similarly to Ns.

     blastn does not incorporate the concept  of  a  partial-  or
     half-match,  such as when a purine in one sequence is juxta-
     posed with a purine from the other.   For  two  residues  to
     match  at  all, they both must be members of the set A, C, G
     and T (or U).

     When calculating the Poisson statistics, some  HSPs  may  be
     incompatible  with each other (not all of them may be simul-
     taneously alignable without reusing some portion  of  either
     sequence) and yet they are (incorrectly) counted as indepen-
     dent events.

     The user may note  that  the  nucleotide  composition  of  a
     blastn  query sequence is irrelevant to the resulting Karlin
     parameters, Lambda and K.  This is due to the residue compo-
     sition assumed for a typical database sequence being 25% for
     each of the four nucleotides A, C, G and T. The values of the
     Karlin  parameters  are still affected by the scoring scheme
     employed.  Furthermore, the individual  who  compiles  these
     programs  is certainly not barred from setting a non-uniform
     residue composition for the  database  sequences,  in  which
     case the query composition is relevant and will be reflected
     in the Karlin parameters calculated by blastn.

SEE ALSO
     blast3(1).

REFERENCES
     Karlin, Samuel and Stephen F. Altschul (1990).  Methods  for
     assessing the statistical significance of molecular sequence
     features by using general scoring schemes, Proc. Natl. Acad.
     Sci. USA 87:2264-2268.

     Altschul, Stephen F., Warren Gish, Webb  Miller,  Eugene  W.
     Myers,  and  David  J. Lipman (1990).  Basic local alignment
     search tool, J. Mol. Biol.  215:403-410.

     Altschul,  Stephen  F.  (1991).   Amino  acid   substitution
     matrices  from an information theoretic perspective. J. Mol.
     Biol.  219:555-565.