Genome Information Research Center, Osaka Univ.

BLITZ e-mail server

HELP [BLITZ] May 12, 1993

Introduction
------------

BLITZ is an automatic electronic mail server for the MPsrch program of 
Shane Sturrock and John Collins, Biocomputing Research Unit, University
of Edinburgh, Scotland [1].  

MPsrch allows you to perform sensitive and extremely fast comparisons of your
protein sequences against the Swiss-Prot protein sequence database using the
Smith and Waterman best local similarity algorithm [2].  It runs on the MasPar 
family of massively parallel machines; the BLITZ server uses a 4096-processor
MasPar MP-1 system.   A typical search time for a query sequence of 400 amino 
acids is approximately 40 seconds to search the entire Swiss-Prot 23 release.
Additional time is required to reconstruct the alignments; the time for this 
will depend on the number of alignments requested.  MPsrch is the fastest 
implementation of the Smith and Waterman algorithm currently available on any
machine.  

This documentation describes how to use the EMBL BLITZ server to submit MPsrch
jobs.   The original MPsrch documentation is included in an appendix at the
end of this file.   

MPsrch answers the question: which sequences in the database are most similar 
(or contain the most similar regions) to my query sequence?


How to use BLITZ
----------------

Send a properly formatted electronic mail message to 

                    BLITZ@EMBL-Heidelberg.DE

containing some of the commands listed below and the answer will be 
automatically mailed to you.

If you have any problems using the BLITZ service, or any questions, please
send them to:

	NETHELP@EMBL-Heidelberg.DE


Example
-------

Here is a quick summary example (the parameters are explained below):

TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
PAM 200
INDEL 10
ALIGN 50
SEQ
      STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
      FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
      YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
      TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
      VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END



Databases available
-------------------

At present, only the latest release of the Swiss-Prot protein database is 
available.  This is updated 4 times a year.  In the future, it will be possible
to search Swiss-Prot entries between releases as a separate cumulative
database. 


The Input Format
----------------

BLITZ is an automatic server that runs without any human intervention. 
Therefore, it understands a limited set of commands.  These are listed
below.  Defaults are offered for all of the parameters used in the search
except the search sequence itself.  

Here are some general rules:

- Your mail message must contain only one command per line.
- There is only one mandatory command, SEQ, unless you just want to get this
  help file. All the search parameters are optional, and default values will be 
  used whenever they are not specified.
- You can use both uppercase and lowercase characters, or mix them.
- The order of the commands is not important, but SEQ is usually the last
  one, since everything following this line will usually be treated as a 
  sequence (see below).
- Blank lines or space characters are accepted.
- Only one search per mail message is allowed.


Here is a list of valid commands that are accepted by BLITZ:

HELP		Use this command to request this help document.

PAM n           PAM matrix where "n" is a number between 1 and 500.  This 
(protein only)  command sets the amino acid weight matrix that is used to 
                score non-identical amino acids in the search.  These matrices
                are usually known as "Dayhoff matrices" [3].  The default
                value of 120 is used if you do not specify a PAM matrix.  
                Initially, just use the default value i.e. leave this command
                out.   Using different PAM values can often have a dramatic
                effect on which sequences you find in a search.  A more 
                detailed description of why this is so is given at the end of 
                this document.  

INDEL n         Indel or gap penalty where "n" is a small integer, typically
                in the range 5 to 30.  If you omit this parameter, a default
                that depends on the chosen PAM matrix is used.  For a PAM
                setting of 120 the default indel cost is 13 and the lowest
                permitted value is 7.   If you give an illegal value for the
                indel cost, a suitable default is used instead.  The indel cost
                actually used in a search is printed in the output file so
                that you can change it if desired.  Initially, just omit this
                parameter.  Decrease it to encourage gaps; increase it to
                discourage them.

ALIGN n         Number of best alignments to see; default is 30; maximum is 100.

NAMES n		Number of scores to report (can be more than ALIGN)

TITLE s         A one-line title for the search where "s" is text. 
                The first word will be used as the "name" of the query sequence
                and the rest of the line will be used as a description.
                e.g. 
                TITLE my_sequence  test using PAM 120 of my sequence
                will use "my_sequence" as the name and
                "test using PAM 120 of my sequence" as description. 
                Do not use quotes or double quotes in the title!

SEQ             (MANDATORY)
                *Everything* following this line up to
                a) the end of the mail message, or
                b) a line starting with the word END
                   will be treated as part of the sequence.

                Don't put sequence information on the same line on
                which you put the SEQ command or the END command.

                No special format of the query sequence is required.
                You may include numbering, but make sure to remove all
                sorts of comments and unrelated information like mail
                signatures if you don't use END!
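
The rules above can be sketched in a few lines of Python.  This is only an
illustration of the input format as described in this document, not the code
the server runs; the function name parse_request and the decision to keep
only letters from the sequence lines are assumptions.

```python
def parse_request(body):
    """Split a BLITZ-style mail body into (params, sequence).

    Hypothetical helper: it follows the rules listed in this help file
    (one command per line, case-insensitive keywords, sequence between
    SEQ and END or the end of the message).
    """
    params = {}
    seq_lines = []
    in_seq = False
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are accepted and ignored
        parts = stripped.split(None, 1)
        keyword = parts[0].upper()
        if in_seq:
            if keyword == "END":
                break  # anything after END (e.g. a signature) is ignored
            seq_lines.append(stripped)
        elif keyword == "SEQ":
            in_seq = True
        elif keyword == "HELP":
            params["HELP"] = True
        elif keyword in ("PAM", "INDEL", "ALIGN", "NAMES") and len(parts) > 1:
            params[keyword] = int(parts[1])
        elif keyword == "TITLE":
            params["TITLE"] = parts[1] if len(parts) > 1 else ""
    # Assumption: numbering and spaces are dropped by keeping letters only.
    sequence = "".join(ch for ch in "".join(seq_lines) if ch.isalpha())
    return params, sequence
```

Without an END line, everything after SEQ (including a mail signature) would
end up in the sequence, which is why END is worth including.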



Examples of input files:
------------------------

This example will ask for the 50 best alignments using an indel cost of 10
and a PAM 200 weight matrix.


TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
PAM 200
INDEL 10
ALIGN 50
SEQ
       1  STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL

      51  FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE

     101  YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM

     151  TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ

     201  VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END



Alternatively, the example below will just use the defaults of PAM 120, INDEL
13 and ALIGN 30.


SEQ
 STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
 FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
 YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
 TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
 VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END
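
If you prepare many queries, a message body like the ones above can be
generated by a small script.  The Python sketch below is one possible way to
do it; build_request and its arguments are hypothetical names, and sending
the mail itself is left out.

```python
def build_request(sequence, title=None, pam=None, indel=None, align=None,
                  width=50):
    """Return a mail body in the input format described in this help file.

    Hypothetical helper. Parameters left as None are omitted so that the
    server defaults apply.
    """
    lines = []
    if title:
        if '"' in title or "'" in title or "\n" in title:
            raise ValueError("title must be one line without quotes")
        lines.append("TITLE " + title)
    for keyword, value in (("PAM", pam), ("INDEL", indel), ("ALIGN", align)):
        if value is not None:
            lines.append("%s %d" % (keyword, value))
    lines.append("SEQ")  # the sequence starts on the line after SEQ
    residues = "".join(sequence.split())
    for start in range(0, len(residues), width):
        lines.append(residues[start:start + width])
    lines.append("END")  # protects against trailing mail signatures
    return "\n".join(lines) + "\n"
```

For example, build_request("ACDE") yields just the three lines SEQ, ACDE and
END, which corresponds to a search with all defaults.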



Current Restrictions
--------------------

The maximum number of alignments that you can request is 100.
The PAM matrix value must be between 1 and 500 inclusive.
The maximum query sequence length is 10000.
Only one job per mail request is allowed.
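
These limits are easy to check before a message is sent.  The following
Python sketch assumes nothing beyond the restrictions listed above; the
function name restriction_errors and the wording of the messages are
illustrative only.

```python
MAX_ALIGN = 100            # maximum number of alignments
PAM_MIN, PAM_MAX = 1, 500  # inclusive PAM range
MAX_QUERY_LENGTH = 10000   # maximum query sequence length


def restriction_errors(sequence, pam=120, align=30):
    """Return a list of violated restrictions (empty if the job is fine).

    Hypothetical helper; the defaults mirror the documented PAM 120 and
    ALIGN 30 defaults.
    """
    errors = []
    if not sequence:
        errors.append("a sequence (SEQ) is mandatory")
    if len(sequence) > MAX_QUERY_LENGTH:
        errors.append("query longer than %d residues" % MAX_QUERY_LENGTH)
    if not PAM_MIN <= pam <= PAM_MAX:
        errors.append("PAM must be between %d and %d" % (PAM_MIN, PAM_MAX))
    if align > MAX_ALIGN:
        errors.append("at most %d alignments can be requested" % MAX_ALIGN)
    return errors
```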


The BLITZ output
-----------------

After sending your query to EMBL you will receive 2 mail messages from the
BLITZ server: 
          1) a LOG file to indicate the status of your job; 
          2) the output from the MPsrch program.   


The LOG file produced by the first example above is:

 >From:	Blitz@EMBL-Heidelberg.DE  5-JAN-1993 17:52:36.92
 >To:	Joe.Biologist@EMBL-Heidelberg.DE
 >CC:	
 >Subj:	Thanks for your call;  here's the log ...
 >
 >
 >TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
 >PAM 200
 >INDEL 10
 >ALIGN 50
 >SEQ
 >
 >       1  STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL 
 >
 >      51  FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE 
 >
 >     101  YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM 
 >
 >     151  TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ 
 >
 >     201  VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
 >END
 >
 >* An MPsrch batch job has been submitted to the BLITZ machine.
 >* The following parameters are used:
 >* Title: RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
 >* Library to be searched:  Swiss-Prot
 >* Number of alignments:    50
 >* PAM:                     200
 >* INDEL:                   10
 >* The result file will be mailed to you after completion.



The output file from the MPsrch program for the first example above is shown
below (only the first 3 results and the last one are included here).

The results from MPsrch
-----------------------

 >From:	Blitz@EMBL-Heidelberg.DE  5-JAN-1993 17:53:55.56
 >To:	Joe.Biologist@EMBL-Heidelberg.DE
 >CC:	
 >Subj:	Results for: RPC1_Lambd this is a test using the Lambda 
             CI repressor from Swiss-Prot.
 >
 >
 >Search started: Tue Jan  5 17:51:00 1993
 >
 >MPsrch:         Version 1.2 - Shane S. Sturrock & John F. Collins 1992.
 >                Biocomputing Research Unit, University of Edinburgh, UK.
 >
 >Title:          RPC1_Lambd
 >Description:    this is a test using the Lambda CI repressor from Swiss-Prot.
 >Sequence:       1 STKKKPLTQEQLEDARRLKA..........ESCSVVGKVIASQWPEETFG 236
 >
 >Parameters:     swiss-prot23;  PAM 200;  Penalty 10;  Align 50
 >
 >Predicted No. is the number of results expected by chance to have a score
 >greater than or equal to the score of the result being printed, and is
 >derived by analysis of the total score distribution which gave:
 >
 >                Mean 16.296862;  Variance 354.401516;  scale 0.045984

The mean and variance of the distribution of scores from the entire database
are calculated.   These are used later to help provide significance measures
for the strength of matches found in the search [4].

 >
 >Result #1
 >>RPC1_LAMBD P03034 REPRESSOR PROTEIN CI.

The first hit with its Swiss-Prot one line title.  In this case, this is
the protein we searched with so it is not surprising that we found it first.

 >
 >        Score: 1370;  Indels: 0;  Gaps: 0;  Predicted No.: 3.354311e-24;
 >                  Matches: 236;  Mismatches 0;  Partials 0;

This is a perfect hit with 0 gaps or indels and 236 matching residues (exact
matches).   The predicted number is an estimate of the number of sequences 
that we would expect to find by chance with a score greater than or equal to
the observed score of 1370; in this case it is extremely small (not
surprisingly).
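
The summary lines of each result have a regular shape, so they can be read
by a program.  The Python sketch below is an illustration only:
parse_summary is a hypothetical helper, and the field layout is inferred
from the sample output in this document, not from a specification of MPsrch.

```python
import re

# Field names as they appear in the sample output; the colon is optional
# because some fields are printed without one (e.g. "Mismatches 0").
_FIELD = re.compile(
    r"\b(Score|Indels|Gaps|Predicted No\.|Matches|Mismatches|Partials):?\s+"
    r"(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_summary(text):
    """Extract the numeric fields from the summary lines of one result."""
    summary = {}
    for name, value in _FIELD.findall(text):
        # Only the predicted number is a floating point value.
        summary[name] = float(value) if name == "Predicted No." else int(value)
    return summary
```

Applied to the two summary lines of Result #1, this gives a Score of 1370,
236 Matches and a Predicted No. of 3.354311e-24.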

 >
 >          ************************************************************
 >Db      1 STKKKPLTQEQLEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAY 60
 > Q      1 STKKKPLTQEQLEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAY 60
 >
 >          ************************************************************
 >Db     61 NAALLAKILKVSVEEFSPSIAREIYEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRT 120
 > Q     61 NAALLAKILKVSVEEFSPSIAREIYEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRT 120
 >
 >          ************************************************************
 >Db    121 FTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFC 180
 > Q    121 FTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFC 180
 >
 >          ********************************************************
 >Db    181 IARLGGDEFTFKKLIRDSGQVFLQPLNPQYPMIPCNESCSVVGKVIASQWPEETFG 236
 > Q    181 IARLGGDEFTFKKLIRDSGQVFLQPLNPQYPMIPCNESCSVVGKVIASQWPEETFG 236

Stars are used to mark identical residues.

 >
 >
 >Result #2
 >>RPC2_BPP22 P03035 REPRESSOR PROTEIN C2.
 >
 >        Score: 370;  Indels: 8;  Gaps: 5;  Predicted No.: 4.309500e-04;
 >                  Matches: 72;  Mismatches 72;  Partials 54;
 >
 >          ... * . * ... ..*... ...    . .  *. *   *.* *. * . .  .   . 
 >Db     13 RRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCSPDYLLKGDLSQT 72
 > Q     25 KKNELGLSQESVADKMGMGQSGVGALFNGINALNAYNAALLAKILKVSVEEFSPSIAREI 84
 >
 >             * .   .*  *.   **..* * **     .  . *   * *  **   *. .***.
 >Db     73 NVAYHS-RHEP--RG--SYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVDCSEDSFWLD 127
 > Q     85 YEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLE 144
 >
 >          *.*.*****.*   *.*.**.******     * . .*.* *. * *****. *.*. **
 >Db    128 VQGDSMTAPAG--LSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDAGRKFL 185
 > Q    145 VEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFCIARLGGD-EFTFKKLIRDSGQVFL 203
 >
 >          .*********  * .* ..* *. ..
 >Db    186 KPLNPQYPMIEINGNCKIIGVVVDAK 211
 > Q    204 QPLNPQYPMIPCNESCSVVGKVIASQ 229
 >

The second hit is also against a bacteriophage repressor protein and its
score also looks significant as the predicted number of random matches at 
this level is still very small (0.00043 hits expected randomly at this level).

Partial matches (residues that are judged to be "similar" according to the
weight matrix used; actually, those with a positive score) are marked with
dots.

Gaps are indicated by "-" characters; each gap is caused by the insertion or 
deletion of one or more amino acids from one of the sequences at a site.  
An indel is one item in a gap (just one "-" character).  Separate counts
are given for indels and gaps.
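
The difference between the two counts can be made concrete with a short
Python sketch.  The function name count_indels_and_gaps is hypothetical; it
simply counts "-" characters (indels) and runs of adjacent "-" characters
(gaps) in the two aligned rows.

```python
import re


def count_indels_and_gaps(db_row, query_row):
    """Return (indels, gaps) for one pair of aligned rows.

    An indel is a single "-" character; a gap is a run of one or more
    adjacent "-" characters in the same row.
    """
    indels = gaps = 0
    for row in (db_row, query_row):
        runs = re.findall(r"-+", row)
        gaps += len(runs)
        indels += sum(len(run) for run in runs)
    return indels, gaps
```

For the second block of Result #2 above (Db 73 to 127) this gives 5 indels
in 3 gaps.  Note that a gap that continues across a line break in the
printed alignment would be counted twice unless the rows of all blocks are
joined first.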

 >
 >Result #3
 >>UMUD_ECOLI P04153 UMUD PROTEIN.
 >
 >        Score: 112;  Indels: 14;  Gaps: 10;  Predicted No.: 8.037992e+01;
 >                  Matches: 41;  Mismatches 50;  Partials 33;
 >
 >          . . **    .*.**  ** * *..    .     .       . . ... .   *.** 
 >Db      5 KPADLREIVTFPLFSDLVQCG-FPSPAADYVEQRID-LNQLLIQHPSATYFVKASGDSMI 62
 > Q     93 MQPSLRSEYEYPVFSH-VQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLEVEGNSMT 151
 >
 >             *   ...** *..**.  .   **. ** . * *** *** .* . *  * *.*. *
 >Db     63 --DG---GISDGDLLIVDSAITASHGDIVIAAVDG-EFTVKKLQLRPTVQ--LIPMNSAY 114
 > Q    152 APTGSKPSFPDGMLILVDPEQAVEPGDFCIARLGGDEFTFKKL-IRDSGQVFLQPLNPQY 210
 >
 >           *. *. ...  * * **
 >Db    115 SPITISSEDTLDVFGVVI 132
 > Q    211 -PM-IPCNESCSVVGKVI 226
 >
 >

We can skip the next 46 hits and just look at the 50th.

 >
 >Result #50
 >>TAG1_RAT P22063 AXONAL GLYCOPROTEIN TAG-1 PRECURSOR.
 >
 >        Score: 86;  Indels: 4;  Gaps: 4;  Predicted No.: 2.809793e+02;
 >                  Matches: 16;  Mismatches 32;  Partials 28;
 >
 >          *.*     *.. **  . .  *       .. .. ....  ..*. *.   **.  .  *
 >Db     39 PIFEEQPIGLLFPE-ES-A-EDQVTLACRARASPPATYRWKMNGTDMNLEPGSRHQLMGG 95
 > Q    104 PVFSHVQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDG 163
 >
 >           *... * .. ..* . *.*
 >Db     96 NLVIMSPTKTQDAGVYQCLA 115
 > Q    164 MLILVDPEQAVEPGDF-CIA 182
 >
 >Search completed: Tue Jan  5 17:51:44 1993


At this level of match (score = 86) you would expect about 280 sequences
in the database to score as well by chance.  Therefore we have no evidence
that this is a genuine case of homology.   On biological grounds it does not
look very likely either (a rat axonal glycoprotein).



Retrieving Database Entries
---------------------------

You can easily get a copy of matching sequences from the Swiss-Prot database by 
using the EMBL File Server. You should use the accession number or entry name 
as given in Swiss-Prot entries and the BLITZ output.

Send a mail message to

                               NETSERV@EMBL-Heidelberg.DE

containing one command per line. The general syntax is:

                               GET PROT:accnumber       or
                               GET PROT:entryname

e.g.                           GET PROT:P22063          or
                               GET PROT:RPC1_LAMBD    
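
The GET lines can be produced directly from a BLITZ result file.  The
Python sketch below assumes that each hit is announced by a "Result #n"
line followed by a line holding the entry name and accession number, as in
the sample output in this document; hit_identifiers and get_commands are
hypothetical names.

```python
def hit_identifiers(output):
    """Return (entry_name, accession) pairs found in BLITZ output text."""
    hits = []
    expect_header = False
    for line in output.splitlines():
        # Drop mail quoting and the leading ">" of the hit title line.
        text = line.lstrip(" >\t")
        if text.startswith("Result #"):
            expect_header = True
        elif expect_header and text:
            entry_name, accession = text.split()[:2]
            hits.append((entry_name, accession))
            expect_header = False
    return hits


def get_commands(hits):
    """Build one 'GET PROT:' line per hit, using the accession number."""
    return "\n".join("GET PROT:" + accession for _, accession in hits) + "\n"
```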



If you are new to the EMBL File Server, send a mail message to 

          Netserv@EMBL-Heidelberg.DE

containing the line HELP to get some introductory information.
The File Server offers the latest sequence data, several other databases and
free software for molecular biology.




The Algorithm
-------------

MPsrch uses the well known Smith and Waterman [2] algorithm for searching
the database.  Your query sequence is compared against all sequences in the
database and the best results, as judged by the alignment score, are aligned 
and included in the output.   The algorithm looks for the best "local" match
as determined by the amino acid similarity matrix (the PAM value) and the
cost of inserting gaps (INDEL cost).   Only one match per database sequence 
is recorded and ranked to give the best results.  By "local", we mean that it is
possible to detect short matching regions such as binding sites, in the middle
of long sequences.   
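
To make the idea of a best local match concrete, here is a minimal serial
Python sketch of the Smith and Waterman recurrence with a fixed cost per
inserted or deleted residue.  It is not MPsrch code: the match and mismatch
scores of +5 and -4 are arbitrary stand-ins for a PAM matrix, and the
function name local_alignment_score is hypothetical.

```python
def local_alignment_score(query, target, indel, match=5, mismatch=-4):
    """Best local alignment score of two sequences (score only).

    Every cell holds the best score of an alignment ending at that pair
    of residues; a floor of 0 lets an alignment start anywhere, which is
    what makes the comparison "local".
    """
    previous = [0] * (len(target) + 1)
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            score = max(0,
                        diagonal,                  # align q with t
                        previous[j] - indel,       # indel in the target row
                        current[j - 1] - indel)    # indel in the query row
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

With these toy scores, comparing "ACGTT" with "ACTT" gives 17 at an indel
cost of 3 (four matches and one indel) but 11 at an indel cost of 10, where
an ungapped alignment scores better.  This is the sense in which a higher
INDEL cost discourages gaps.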


  The INDEL cost:
  ---------------

The INDEL cost is a penalty that is subtracted from the alignment score for 
every residue that has been inserted or deleted in the best local alignment.  
There is a lower limit on the allowed INDEL value because, if you set it too 
low, the alignment will be filled with many short gaps and will be biologically 
meaningless.  This lower limit will depend on the particular PAM setting used.   
The default INDEL cost works best in most cases.   Reducing this 
cost will encourage gaps; increasing it will discourage them.  The default and 
lower limit INDEL costs for a range of PAM values are given below:

                PAM setting    Default INDEL cost    INDEL cost lower limit
              ------------------------------------------------------------
                      1                 53                     27
                     20                 29                     15
                     40                 22                     11
                     60                 19                     10
                     80                 16                      8
                    100                 14                      7
DEFAULT value ->    120                 13                      7 
                    150                 11                      6
                    200                  9                      5
                    250                  7                      4
                    300                  6                      3
                    400                  5                      3
                    500                  4                      2
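
The table can be turned into a small lookup, sketched here in Python.  Only
the tabulated PAM settings are covered, because this document gives no
formula for the values in between.  The name effective_indel is
hypothetical, and falling back to the tabulated default for a value below
the lower limit is an assumption based on the statement that "a suitable
default will be used".

```python
# PAM setting -> (default INDEL cost, INDEL cost lower limit)
INDEL_TABLE = {
    1: (53, 27), 20: (29, 15), 40: (22, 11), 60: (19, 10), 80: (16, 8),
    100: (14, 7), 120: (13, 7), 150: (11, 6), 200: (9, 5), 250: (7, 4),
    300: (6, 3), 400: (5, 3), 500: (4, 2),
}


def effective_indel(pam, requested=None):
    """INDEL cost to expect for a tabulated PAM setting.

    Raises KeyError for PAM settings that are not in the table.
    """
    default, lower_limit = INDEL_TABLE[pam]
    if requested is None or requested < lower_limit:
        return default  # assumption: illegal values fall back to the default
    return requested
```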



  The PAM setting:
  ----------------

Dayhoff and co-workers [3] produced a series of amino acid weight matrices to 
help detect distant similarity between proteins.   These give weights to
the different possible pairs of aligned residues and are known as PAM matrices.
The weights can be positive or negative.  You are allowed to choose the 
particular PAM matrix by specifying a value between 1 and 500.   Roughly, 
low PAM values (e.g. 40 or so) will be best suited for finding short regions
of very strong similarity while high values (e.g. 250 or more) will be better 
suited for finding longer, weaker matches.   The default value of 120 is 
a compromise that works well in practice if you only decide to use one 
setting.   Frequently, the list of top scoring sequences will vary greatly
depending on the PAM value you use.  If you do not find any similarity
to a sequence with the default setting, try some other PAM values.  



Getting further help
--------------------

Enquiries about the MPsrch software or the algorithms used should be sent
to John Collins or Shane Sturrock (e-mail mpsrch_help@biocomp.ed.ac.uk).


Enquiries about the operation of the BLITZ server should be sent to 
NetHelp@EMBL-Heidelberg.DE




Literature
----------

[1] Sturrock, S.S. and Collins, J.F. (1993)  MPsrch version 1.3.  
    Biocomputing Research Unit, University of Edinburgh, UK.
      
[2] Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular
    subsequences. J. Mol. Biol. 147:195-197.

[3] Dayhoff, M., Schwartz, R.M. and Orcutt, B.C. (1978) (in) Dayhoff, M. (ed.) 
    Atlas of protein sequence and structure, vol. 5, suppl. 3, pp. 345-352.

[4] Collins, J.F. and Coulson, A.F.W. (1990)  Significance of protein sequence
    similarities.  Pages 474-486 in Doolittle, R.F. (ed.)
    Methods in Enzymology, volume 183, Academic Press.






APPENDIX I  The original documentation for MPsrch
-------------------------------------------------

This document is specific to running MPsrch interactively on a local machine.
The details are different to those used for running MPsrch via the EMBL
BLITZ server but the results and algorithm are identical.






                         MPsrch V1.3 User Guide

                    S. S. Sturrock and J. F. Collins
                       Biocomputing Research Unit
                        University of Edinburgh
                  (e-mail sss or jfc@biocomp.ed.ac.uk)

                                 6.1.93


1.  Introduction

MPsrch is an implementation of the Smith/Waterman Best Local Similarity 
algorithm for the MasPar family of parallel machines and will run on any
MasPar configuration from the minimum 1024 processor MP-1 system up to a 
16384 processor MP-2 system.  MPsrch can cope with any database size for 
the foreseeable future.

Typical search times for a 4096 processor MP-1 are about 40 seconds for a
377 residue query against Swiss-Prot Version 23 and this corresponds to 
84 million cell updates per second although peaks of 130 million plus can 
be attained with larger query sequences.
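
As a rough check of these figures, the number of cell updates in a search
is the query length multiplied by the number of residues in the database.
The Python sketch below works backwards from the quoted rate, using the
timings printed in the two example searches in section 2 (40.621008 and
343.918292 seconds).  The database size it derives is an estimate implied
by these rounded numbers, not an official figure.

```python
def cell_updates_per_second(query_length, database_residues, seconds):
    """Search rate: one cell update per (query residue, database residue)."""
    return query_length * database_residues / seconds


# Database size implied by 84 million updates/s for the 377 residue query.
implied_database_residues = 84e6 * 40.621008 / 377

# Rate for the 5037 residue query against the same database.
long_query_rate = cell_updates_per_second(
    5037, implied_database_residues, 343.918292)

print("implied database size: %.2f million residues"
      % (implied_database_residues / 1e6))
print("rate for the long query: %.1f million updates/s"
      % (long_query_rate / 1e6))
```

The second figure comes out above 130 million, in line with the peak rate
quoted above for larger query sequences.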


2.  Running Searches

MPsrch may be run in two ways, either interactively or via command line
arguments.  If any arguments are missing the program will prompt the user
for them unless running with the batch flag set in which case defaults will
be used if available.  There are no defaults for database, query name and
output file, and with the -b flag set no output is sent to the display.

2.1 Arguments available

-dbase.....Database to be searched
-query.....Filename of query to be used (FASTA format)
-pams......No. of Dayhoff PAMs in 1-500 range for comparison score table
-indel.....Indel penalty, range varies with PAMtable and stringency
-output....Filename for output
-align.....Number of alignments required in output file
-batch.....Suppresses screen output and user interaction

The first letter of each of these may be used instead of the whole argument.
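
When MPsrch is driven from a script, the argument list can be assembled as
in the Python sketch below.  Only the single-letter flags listed above are
used; mpsrch_argv is a hypothetical helper, and it only builds the list (it
does not run the program).

```python
def mpsrch_argv(query, dbase, output, pams=None, indel=None, align=None,
                batch=False):
    """Return the argument list for one MPsrch run.

    Database, query and output have no defaults, so they are required;
    the other arguments are added only when given.
    """
    argv = ["MPsrch", "-q", query, "-d", dbase, "-o", output]
    for flag, value in (("-p", pams), ("-i", indel), ("-a", align)):
        if value is not None:
            argv += [flag, str(value)]
    if batch:
        argv.append("-b")  # suppress screen output and prompts
    return argv
```

A list of this form can be handed to subprocess.run on a machine where
MPsrch is installed.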

2.2 An Example Search

The following is the screen display of a typical search using command line 
options.  The query used is an actin.

maspar> MPsrch -q actin.pep -d swiss -o foo -p 100 -i 14 -a 100
Welcome to MPsrch (Version 1.3)

(Copyright) Shane S. Sturrock & John F. Collins 
Biocomputing Research Unit, University of Edinburgh, UK

Query sequence 377 residues

Starting search

0.............................100%
 ----------------------------->

MasPar Time: 40.621008 seconds.
Mean 47.168988;  Variance 76.572116;  scale 0.616007
Aligned:   100

At this point the output file is in the user's directory.  The user's query
sequence is displayed for reference in the output header: complete if less 
than 50 residues long; with the first 20 and last 20 residues if longer.
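
The display rule for the query can be written down as a short Python
sketch.  The name header_preview is hypothetical, the ten dots are taken
from the sample headers in this document, and treating a query of exactly
50 residues like a longer one is an assumption.

```python
def header_preview(sequence):
    """Return the query as shown on the 'Sequence:' line of the header."""
    length = len(sequence)
    if length < 50:
        shown = sequence  # short queries are shown in full
    else:
        # first 20 residues, ten dots, last 20 residues
        shown = sequence[:20] + "." * 10 + sequence[-20:]
    return "1 %s %d" % (shown, length)
```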

maspar> more foo
Search started: Wed Jan  6 11:39:01 1993

MPsrch:         Version 1.3 - Shane S. Sturrock & John F. Collins 1993.
                Biocomputing Research Unit, University of Edinburgh, UK.

Title:          No title supplied
Description:    No descr supplied
Sequence:       1 mcdedettalvcdngsglvk..........witkqeydeagpsivhrkcf 377

Parameters:     swiss-prot23;  PAM 100;  Penalty 14;  Align 100 

Predicted No. is the number of results expected by chance to have a score 
greater than or equal to the score of the result being printed, and is 
derived by analysis of the total score distribution which gave:

Statistics:     Mean 47.168988;  Variance 76.572116;  scale 0.616007


RESULT    1     Score 3334;  Predicted No. 0.000000e+00;

ID   ACTS_HUMAN     STANDARD;      PRT;   377 AA.
DE   ACTIN, ALPHA SKELETAL MUSCLE.

          Matches 377;  Mismatches 0;  Partials 0;  Indels 0;  Gaps 0;

          ************************************************************
Db      1 MCDEDETTALVCDNGSGLVKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEA 60
Q       1 mcdedettalvcdngsglvkagfagddapravfpsivgrprhqgvmvgmgqkdsyvgdea 60

          ************************************************************
Db     61 QSKRGILTLKYPIEHGIITNWDDMEKIWHHTFYNELRVAPEEHPTLLTEAPLNPKANREK 120
Q      61 qskrgiltlkypiehgiitnwddmekiwhhtfynelrvapeehptllteaplnpkanrek 120

          ************************************************************
Db    121 MTQIMFETFNVPAMYVAIQAVLSLYASGRTTGIVLDSGDGVTHNVPIYEGYALPHAIMRL 180
Q     121 mtqimfetfnvpamyvaiqavlslyasgrttgivldsgdgvthnvpiyegyalphaimrl 180

          ************************************************************
Db    181 DLAGRDLTDYLMKILTERGYSFVTTAEREIVRDIKEKLCYVALDFENEMATAASSSSLEK 240
Q     181 dlagrdltdylmkiltergysfvttaereivrdikeklcyvaldfenemataasssslek 240

          ************************************************************
Db    241 SYELPDGQVITIGNERFRCPETLFQPSFIGMESAGIHETTYNSIMKCDIDIRKDLYANNV 300
Q     241 syelpdgqvitignerfrcpetlfqpsfigmesagihettynsimkcdidirkdlyannv 300

          ************************************************************
Db    301 MSGGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTFQQMWIT 360
Q     301 msggttmypgiadrmqkeitalapstmkikiiapperkysvwiggsilaslstfqqmwit 360

          *****************
Db    361 KQEYDEAGPSIVHRKCF 377
Q     361 kqeydeagpsivhrkcf 377


For brevity only the first result is shown.  The query sequence is always
the lower of the two shown in the alignments.  Note that there are separate 
values for indels and gaps; a gap is one or more adjacent indels and the 
implementation used guarantees to minimise the number of gaps while maintaining 
the correct score for the Smith Waterman algorithm.  Alignments displayed are 
produced by the same algorithm used for the searches.  Query sequences can be 
upper or lower case so it is possible for the user to highlight areas
of interest by mixing cases.  Query sequences may be up to the present maximum
of 10000 residues in length.  Search parameters provided are variable PAM 
tables and user selectable indel penalties.  The user may determine which
provide the most sensitive results in any given case.  We recommend starting
with 100 PAMs and indel penalty of 14 (note the indel is given as a
positive number).

Predicted number is usually a good guide to likely biological interest in
the alignment.  However, short motifs or patterns may sometimes be
recognised as significant in conveying functional information even though
their scores may be low and the predicted number apparently rather high.

Here is another example, this time with the largest sequence in the swiss23
database.

maspar> MPsrch -q rynr.seq -d swiss -o foo -p 100 -i 14 -a 20
Welcome to MPsrch (Version 1.3)

(Copyright) Shane S. Sturrock & John F. Collins 
Biocomputing Research Unit, University of Edinburgh, UK

Query sequence 5037 residues

Starting search

0.............................100%
 ----------------------------->

MasPar Time: 343.918292 seconds.
Mean 61.006553;  Variance 148.332142;  scale 0.411283
Aligned:    20
maspar> more foo
Search started: Wed Jan  6 11:46:32 1993

MPsrch:         Version 1.3 - Shane S. Sturrock & John F. Collins 1993.
                Biocomputing Research Unit, University of Edinburgh, UK.

Title:          RYNR_RABIT
Description:    RYANODINE RECEPTOR, SKELETAL MUSCLE.
Sequence:       1 MGDGGEGEDEVQFLRTDDEV..........CWDFFPAGDCFRKQYEDQLS 5037

Parameters:     swiss-prot23;  PAM 100;  Penalty 14;  Align 20

Predicted No. is the number of results expected by chance to have a score 
greater than or equal to the score of the result being printed, and is 
derived by analysis of the total score distribution which gave:

Statistics:     Mean 61.006553;  Var 148.332142;  scale 0.411283


RESULT    1     Score 44624;  Predicted No. 0.000000e+00;

ID   RYNR_RABIT     STANDARD;      PRT;  5037 AA.
DE   RYANODINE RECEPTOR, SKELETAL MUSCLE.

          Matches 5037;  Mismatches 0;  Partials 0;  Indels 0;  Gaps 0;

          ************************************************************
Db      1 MGDGGEGEDEVQFLRTDDEVVLQCSATVLKEQLKLCLAAEGFGNRLCFLEPTSNAQNVPP 60
Q       1 MGDGGEGEDEVQFLRTDDEVVLQCSATVLKEQLKLCLAAEGFGNRLCFLEPTSNAQNVPP 60

......................[continues]

          *********************************************************
Db   4981 EEHNLANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFPAGDCFRKQYEDQLS 5037
Q    4981 EEHNLANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFPAGDCFRKQYEDQLS 5037

......................[miss some results out]


RESULT    4     Score 490;  Predicted No. 3.586030e-52;

ID   PCD6_MOUSE     STANDARD;      PRT;   500 AA.
DE   PROTEIN PCD-6 (FRAGMENT).

          Matches 67;  Mismatches 53;  Partials 36;  Indels 3;  Gaps 2;

          * *.. .  *. .. * .  .  *.*.***.** .  *. .*  *.  **..*. *** *
Db    269 ETEQDKEHTCETLLMCIVTVLSHGLRSGGGVGDVLRKPSKEE-PLFAARVIYDLLFFFMV 327
Q    4867 EDEDEPDMKCDDMMTCYLFHMYVGVRAGGGIGDEIEDPAGDEYELY--RVVFDITFFFFV 4924

          *.*.* .* *.***.*..** . .   * . * *****.  * **     ** *  ****
Db    328 IIIVLNLIFGVIIDTFADLRSEKQKKEEILKTTCFICGLERDKFDNKTVTFEEHIKEEHN 387
Q    4925 IVILLAIIQGLIIDAFGELRDQQEQVKEDMETKCFICGIGSDYFDTTPHGFETHTLEEHN 4984

          . .*. *.. .  ** ** ** ****  * .**  * **
Db    388 MWHYLCFIVLVKVKDSTEYTGPESYVAEMIRERNLDWFP 426
Q    4985 LANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFP 5023


Experiment with various PAM and penalty settings to see which gives results 
you find plausible.  Beware of PAM settings close to the 500 limit along with 
very low penalties because this will reduce the sensitivity of the search to 
regions of good homology.  However, try it and see.  Sequences which have 
diverged to the extent that a 500 PAM table could be used, usually contain 
short strong regions of alignment, which can be more easily distinguished from 
noise using a less extreme PAM setting.

Note that '*' means an identity (match); '.' is a positive substitution
(partial); ' ' is a negative substitution (mismatch) or insertion/deletion 
(indel).


Shane S. Sturrock was supported by a grant from the Human Genome Mapping
Project.
The Biocomputing Research Unit was supported by the Darwin Trust of Edinburgh.