HELP [BLITZ] May 12, 1993
Introduction
------------
BLITZ is an automatic electronic mail server for the MPsrch program of
Shane Sturrock and John Collins, Biocomputing Research Unit, University
of Edinburgh, Scotland [1].
MPsrch allows you to perform sensitive and extremely fast comparisons of your
protein sequences against the Swiss-Prot protein sequence database using the
Smith and Waterman best local similarity algorithm [2]. It runs on the MasPar
family of massively parallel machines; the BLITZ server uses a 4096-processor
MasPar MP-1 system. Searching the entire Swiss-Prot release 23 with a query
sequence of 400 amino acids typically takes approximately 40 seconds.
Additional time is required to reconstruct the alignments; how much depends
on the number of alignments requested. At the time of writing, MPsrch is the
fastest implementation of the Smith and Waterman algorithm available on any
machine.
This documentation describes how to use the EMBL BLITZ server to submit MPsrch
jobs. The original MPsrch documentation is included in an appendix at the
end of this file.
MPsrch answers the question: which sequences in the database are most similar
to my query sequence (or contain the most similar regions)?
How to use BLITZ
----------------
Send a properly formatted electronic mail message to
BLITZ@EMBL-Heidelberg.DE
containing some of the commands listed below and the answer will be
automatically mailed to you.
If you have any problems using the BLITZ service, or any questions, please
send them to:
NETHELP@EMBL-Heidelberg.DE
Example
-------
Here is a quick summary example (the parameters are explained below):
TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
PAM 200
INDEL 10
ALIGN 50
SEQ
STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END
Databases available
-------------------
At present, only the latest release of the Swiss-Prot protein database is
available. It is updated four times a year. In the future, new Swiss-Prot
entries will be searchable between releases as a separate cumulative
database.
The Input Format
----------------
BLITZ is an automatic server that runs without any human intervention.
Therefore, it understands a limited set of commands. These are listed
below. Defaults are offered for all of the parameters used in the search
except the search sequence itself.
Here are some general rules:
- Your mail message must contain only one command per line.
- There is only one mandatory command, SEQ, unless you just want to get this
help file. All the search parameters are optional, and default values will be
used whenever they are not specified.
- You can use both uppercase and lowercase characters, or mix them.
- The order of the commands is not important, but SEQ is usually the last
one, since everything following this line will usually be treated as a
sequence (see below).
- Blank lines or space characters are accepted.
- Only one search per mail message is allowed.
Here is a list of valid commands that are accepted by BLITZ:
HELP Use this command to request this help document.
PAM n PAM matrix, where "n" is a number between 1 and 500. This
(protein only) command sets the amino acid weight matrix that is used to
score aligned pairs of amino acids in the search. These matrices
are usually known as "Dayhoff matrices" [3]. The default
value of 120 is used if you do not specify a PAM matrix.
Initially, just use the default value, i.e. leave this command
out. Using different PAM values can often have a dramatic
effect on which sequences you find in a search. A more
detailed description of why this is so is given at the end of
this document.
INDEL n Indel or gap penalty, where "n" is a small integer, typically
in the range 5 to 30. If you omit this parameter, a default
will be used that depends on the PAM matrix chosen. For a PAM
matrix setting of 120, the default indel cost is 13 and the cost
must not be set below 7. If you give an illegal value for indel cost,
a suitable default will be used. The indel cost actually used
in any search is printed in the output file so that you
can change it if desired. Initially, just omit this parameter.
Decrease this parameter to encourage gaps; increase it to
discourage them.
ALIGN n Number of best alignments to see; default is 30; maximum is 100.
NAMES n Number of scores to report (can be more than ALIGN)
TITLE s A one-line title for the search where "s" is text.
The first word will be used as the "name" of the query sequence
and the rest of the line will be used as a description.
e.g.
TITLE my_sequence test using PAM 120 of my sequence
will use "my_sequence" as the name and
"test using PAM 120 of my sequence" as description.
Do not use quotes or double quotes in the title!
SEQ (MANDATORY)
*Everything* following this line up to
a) the end of the mail message, or
b) a line starting with the word END
will be treated as part of the sequence.
Don't put sequence information on the same line on
which you put the SEQ command or the END command.
No special format of the query sequence is required.
You may include numbering, but make sure to remove all
sorts of comments and unrelated information like mail
signatures if you don't use END!
Examples of input files:
------------------------
This example will ask for the 50 best alignments using an indel cost of 10
and a PAM 200 weight matrix.
TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
PAM 200
INDEL 10
ALIGN 50
SEQ
1 STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
51 FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
101 YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
151 TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
201 VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END
Alternatively, the example below will just use the defaults of PAM 120, INDEL
13 and ALIGN 30.
SEQ
STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
END
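The input files above can also be assembled by a short script. The following
Python sketch is illustrative only: the function name build_blitz_message and
its argument names are assumptions and are not part of BLITZ; only the command
keywords (TITLE, PAM, INDEL, ALIGN, NAMES, SEQ, END) come from this document.
It builds the message text and does not send any mail.

```python
def build_blitz_message(sequence, title=None, pam=None, indel=None,
                        align=None, names=None):
    """Return the body of a BLITZ request as text (hypothetical helper)."""
    lines = []
    if title is not None:
        # The title must fit on one line and must not contain quotes.
        if "\n" in title or '"' in title or "'" in title:
            raise ValueError("title must be a single line without quotes")
        lines.append("TITLE " + title)
    # Optional parameters: one command per line, omitted when not given
    # so that the server defaults apply.
    for keyword, value in (("PAM", pam), ("INDEL", indel),
                           ("ALIGN", align), ("NAMES", names)):
        if value is not None:
            lines.append("%s %d" % (keyword, value))
    # SEQ goes last; the sequence starts on the following line.
    lines.append("SEQ")
    residues = "".join(sequence.split())
    for start in range(0, len(residues), 50):
        lines.append(residues[start:start + 50])
    # END protects against mail signatures being read as sequence.
    lines.append("END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_blitz_message("STKKKPLTQE QLEDARRLKA",
                              title="my_sequence test", pam=200,
                              indel=10, align=50))
```

The sketch always writes an END line, so text appended by a mail program after
the message body is not treated as part of the sequence.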
Current Restrictions
--------------------
The maximum number of alignments that you can request is 100.
The PAM matrix value must be between 1 and 500 inclusive.
The maximum query sequence length is 10000.
Only one job per mail request is allowed.
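These limits can be checked before a request is mailed. The sketch below is
an illustration, not part of the server: the function name check_restrictions
is hypothetical, and counting only letters as residues is an assumption
(this document does not say how BLITZ itself measures the query length).

```python
MAX_ALIGNMENTS = 100       # maximum number of alignments
PAM_MIN, PAM_MAX = 1, 500  # inclusive PAM range
MAX_QUERY_LENGTH = 10000   # maximum query sequence length


def check_restrictions(sequence, pam=120, align=30):
    """Return a list of problems; an empty list means no limit is exceeded."""
    problems = []
    # Assumption: only letters count as residues, so embedded
    # numbering and spaces are ignored.
    length = sum(1 for ch in sequence if ch.isalpha())
    if length == 0:
        problems.append("no sequence given")
    if length > MAX_QUERY_LENGTH:
        problems.append("query longer than %d residues" % MAX_QUERY_LENGTH)
    if not PAM_MIN <= pam <= PAM_MAX:
        problems.append("PAM must be between %d and %d" % (PAM_MIN, PAM_MAX))
    if align > MAX_ALIGNMENTS:
        problems.append("at most %d alignments" % MAX_ALIGNMENTS)
    return problems


if __name__ == "__main__":
    print(check_restrictions("1 STKKKPLTQE QLEDARRLKA", pam=200, align=50))
```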
The BLITZ output
----------------
After sending your query to EMBL you will receive 2 mail messages from the
BLITZ server:
1) a LOG file to indicate the status of your job;
2) the output from the MPsrch program.
The LOG file produced by the first example above is:
>From: Blitz@EMBL-Heidelberg.DE 5-JAN-1993 17:52:36.92
>To: Joe.Biologist@EMBL-Heidelberg.DE
>CC:
>Subj: Thanks for your call; here's the log ...
>
>
>TITLE RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
>PAM 200
>INDEL 10
>ALIGN 50
>SEQ
>
> 1 STKKKPLTQE QLEDARRLKA IYEKKKNELG LSQESVADKM GMGQSGVGAL
>
> 51 FNGINALNAY NAALLAKILK VSVEEFSPSI AREIYEMYEA VSMQPSLRSE
>
> 101 YEYPVFSHVQ AGMFSPELRT FTKGDAERWV STTKKASDSA FWLEVEGNSM
>
> 151 TAPTGSKPSF PDGMLILVDP EQAVEPGDFC IARLGGDEFT FKKLIRDSGQ
>
> 201 VFLQPLNPQY PMIPCNESCS VVGKVIASQW PEETFG
>END
>
>* An MPsrch batch job has been submitted to the BLITZ machine.
>* The following parameters are used:
>* Title: RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
>* Library to be searched: Swiss-Prot
>* Number of alignments: 50
>* PAM: 200
>* INDEL: 10
>* The result file will be mailed to you after completion.
The output file from the MPsrch program for the first example above is shown
below (only the first three results and the last one are shown).
The results from MPsrch
-----------------------
>From: Blitz@EMBL-Heidelberg.DE 5-JAN-1993 17:53:55.56
>To: Joe.Biologist@EMBL-Heidelberg.DE
>CC:
>Subj: Results for: RPC1_Lambd this is a test using the Lambda CI repressor from Swiss-Prot.
>
>
>Search started: Tue Jan 5 17:51:00 1993
>
>MPsrch: Version 1.2 - Shane S. Sturrock & John F. Collins 1992.
> Biocomputing Research Unit, University of Edinburgh, UK.
>
>Title: RPC1_Lambd
>Description: this is a test using the Lambda CI repressor from Swiss-Prot.
>Sequence: 1 STKKKPLTQEQLEDARRLKA..........ESCSVVGKVIASQWPEETFG 236
>
>Parameters: swiss-prot23; PAM 200; Penalty 10; Align 50
>
>Predicted No. is the number of results expected by chance to have a score
>greater than or equal to the score of the result being printed, and is
>derived by analysis of the total score distribution which gave:
>
> Mean 16.296862; Variance 354.401516; scale 0.045984
The mean and variance of the distribution of scores from the entire database
are calculated. These are used later to help provide significance measures
for the strength of matches found in the search [4].
>
>Result #1
>>RPC1_LAMBD P03034 REPRESSOR PROTEIN CI.
The first hit with its Swiss-Prot one line title. In this case, this is
the protein we searched with so it is not surprising that we found it first.
>
> Score: 1370; Indels: 0; Gaps: 0; Predicted No.: 3.354311e-24;
> Matches: 236; Mismatches 0; Partials 0;
This is a perfect hit with 0 gaps or indels and 236 matching residues (exact
matches). The predicted number is an estimate of the number of sequences
that we would expect to find by chance with a score greater than or equal to
the observed score of 1370; in this case it is extremely small (not
surprisingly).
>
> ************************************************************
>Db 1 STKKKPLTQEQLEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAY 60
> Q 1 STKKKPLTQEQLEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAY 60
>
> ************************************************************
>Db 61 NAALLAKILKVSVEEFSPSIAREIYEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRT 120
> Q 61 NAALLAKILKVSVEEFSPSIAREIYEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRT 120
>
> ************************************************************
>Db 121 FTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFC 180
> Q 121 FTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFC 180
>
> ********************************************************
>Db 181 IARLGGDEFTFKKLIRDSGQVFLQPLNPQYPMIPCNESCSVVGKVIASQWPEETFG 236
> Q 181 IARLGGDEFTFKKLIRDSGQVFLQPLNPQYPMIPCNESCSVVGKVIASQWPEETFG 236
Stars are used to mark identical residues.
>
>
>Result #2
>>RPC2_BPP22 P03035 REPRESSOR PROTEIN C2.
>
> Score: 370; Indels: 8; Gaps: 5; Predicted No.: 4.309500e-04;
> Matches: 72; Mismatches 72; Partials 54;
>
> ... * . * ... ..*... ... . . *. * *.* *. * . . . .
>Db 13 RRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCSPDYLLKGDLSQT 72
> Q 25 KKNELGLSQESVADKMGMGQSGVGALFNGINALNAYNAALLAKILKVSVEEFSPSIAREI 84
>
> * . .* *. **..* * ** . . * * * ** *. .***.
>Db 73 NVAYHS-RHEP--RG--SYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVDCSEDSFWLD 127
> Q 85 YEMYEAVSMQPSLRSEYEYPVFSHVQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLE 144
>
> *.*.*****.* *.*.**.****** * . .*.* *. * *****. *.*. **
>Db 128 VQGDSMTAPAG--LSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDAGRKFL 185
> Q 145 VEGNSMTAPTGSKPSFPDGMLILVDPEQAVEPGDFCIARLGGD-EFTFKKLIRDSGQVFL 203
>
> .********* * .* ..* *. ..
>Db 186 KPLNPQYPMIEINGNCKIIGVVVDAK 211
> Q 204 QPLNPQYPMIPCNESCSVVGKVIASQ 229
>
The second hit is also against a bacteriophage repressor protein and its
score also looks significant as the predicted number of random matches at
this level is still very small (0.00043 hits expected randomly at this level).
Partial matches (residues that are judged to be "similar" according to the
weight matrix used; actually, those with a positive score) are marked with
dots.
Gaps are indicated by "-" characters; each gap is caused by the insertion or
deletion of one or more amino acids from one of the sequences at a site.
An indel is one item in a gap (just one "-" character). Separate counts
are given for indels and gaps.
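The difference between the two counts can be made concrete with a few lines
of Python. This is only a sketch of the definitions given above, not code
from MPsrch; the function name is hypothetical. It assumes that the Db and Q
rows of an alignment have each been joined into one string, so that a gap
which continues across a line break is counted once.

```python
import re


def count_indels_and_gaps(db_aligned, query_aligned):
    """Count "-" characters (indels) and runs of "-" (gaps) in both rows."""
    indels = 0
    gaps = 0
    for row in (db_aligned, query_aligned):
        for run in re.findall(r"-+", row):
            indels += len(run)  # every "-" is one indel
            gaps += 1           # each unbroken run of "-" is one gap
    return indels, gaps


if __name__ == "__main__":
    # One gap of two indels plus one gap of one indel.
    print(count_indels_and_gaps("AC--DE-F", "ACGGDEKF"))
```

For the small example in the sketch the result is 3 indels in 2 gaps.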
>
>Result #3
>>UMUD_ECOLI P04153 UMUD PROTEIN.
>
> Score: 112; Indels: 14; Gaps: 10; Predicted No.: 8.037992e+01;
> Matches: 41; Mismatches 50; Partials 33;
>
> . . ** .*.** ** * *.. . . . . ... . *.**
>Db 5 KPADLREIVTFPLFSDLVQCG-FPSPAADYVEQRID-LNQLLIQHPSATYFVKASGDSMI 62
> Q 93 MQPSLRSEYEYPVFSH-VQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLEVEGNSMT 151
>
> * ...** *..**. . **. ** . * *** *** .* . * * *.*. *
>Db 63 --DG---GISDGDLLIVDSAITASHGDIVIAAVDG-EFTVKKLQLRPTVQ--LIPMNSAY 114
> Q 152 APTGSKPSFPDGMLILVDPEQAVEPGDFCIARLGGDEFTFKKL-IRDSGQVFLQPLNPQY 210
>
> *. *. ... * * **
>Db 115 SPITISSEDTLDVFGVVI 132
> Q 211 -PM-IPCNESCSVVGKVI 226
>
>
We can skip the next 46 hits and just look at the 50th.
>
>Result #50
>>TAG1_RAT P22063 AXONAL GLYCOPROTEIN TAG-1 PRECURSOR.
>
> Score: 86; Indels: 4; Gaps: 4; Predicted No.: 2.809793e+02;
> Matches: 16; Mismatches 32; Partials 28;
>
> *.* *.. ** . . * .. .. .... ..*. *. **. . *
>Db 39 PIFEEQPIGLLFPE-ES-A-EDQVTLACRARASPPATYRWKMNGTDMNLEPGSRHQLMGG 95
> Q 104 PVFSHVQAGMFSPELRTFTKGDAERWVSTTKKASDSAFWLEVEGNSMTAPTGSKPSFPDG 163
>
> *... * .. ..* . *.*
>Db 96 NLVIMSPTKTQDAGVYQCLA 115
> Q 164 MLILVDPEQAVEPGDF-CIA 182
>
>Search completed: Tue Jan 5 17:51:44 1993
At this level of match (score = 86) you would expect about 280 sequences
in the database to score at least as well by chance. Therefore we have no
evidence that this is a genuine case of homology. On biological grounds it
does not look very likely either (a rat axonal glycoprotein).
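When many results have to be screened, the score lines can be read by a
script. The following Python sketch assumes that score lines look exactly
like those in the example output above; the function names and the cut-off
of 1.0 for the predicted number are illustrative assumptions, not values
recommended by the MPsrch authors.

```python
import re

# Pattern for lines such as:
#   Score: 370; Indels: 8; Gaps: 5; Predicted No.: 4.309500e-04;
SCORE_LINE = re.compile(
    r"Score:\s*(\d+);\s*Indels:\s*(\d+);\s*Gaps:\s*(\d+);\s*"
    r"Predicted No\.:\s*([0-9.eE+-]+);")


def parse_score_line(line):
    """Return the numbers on a score line as a dict, or None if no match."""
    found = SCORE_LINE.search(line)
    if found is None:
        return None
    return {"score": int(found.group(1)),
            "indels": int(found.group(2)),
            "gaps": int(found.group(3)),
            "predicted_no": float(found.group(4))}


def fewer_than_expected_by_chance(predicted_no, cutoff=1.0):
    """True if fewer than `cutoff` chance hits are expected (assumed cut-off)."""
    return predicted_no < cutoff


if __name__ == "__main__":
    hit = parse_score_line(
        "> Score: 86; Indels: 4; Gaps: 4; Predicted No.: 2.809793e+02;")
    print(hit, fewer_than_expected_by_chance(hit["predicted_no"]))
```

A small predicted number is only a guide; biological plausibility still has
to be judged separately.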
Retrieving Database Entries
---------------------------
You can easily get a copy of matching sequences from the Swiss-Prot database by
using the EMBL File Server. You should use the accession number or entry name
as given in Swiss-Prot entries and the BLITZ output.
Send a mail message to
NETSERV@EMBL-Heidelberg.DE
containing one command per line. The general syntax is:
GET PROT:accnumber or
GET PROT:entryname
e.g. GET PROT:P22063 or
GET PROT:RPC1_LAMBD
If you are new to the EMBL File Server, send a mail message to
Netserv@EMBL-Heidelberg.DE
containing the line HELP to get some introductory information.
The File Server offers the latest sequence data, several other databases and
free software for molecular biology.
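The retrieval request can be generated from the hit lines of a BLITZ result.
The Python sketch below is an illustration only. It assumes that a hit line
starts with one or more ">" characters followed by the entry name and the
accession number, as in the example output above; both function names are
hypothetical.

```python
def hit_identifiers(header_line):
    """Split a hit line into (entry name, accession number)."""
    fields = header_line.lstrip("> ").split()
    if len(fields) < 2:
        raise ValueError("not a hit line: %r" % header_line)
    return fields[0], fields[1]


def build_netserv_request(identifiers):
    """Return a File Server message body with one GET command per line."""
    return "".join("GET PROT:%s\n" % ident for ident in identifiers)


if __name__ == "__main__":
    name, accession = hit_identifiers(
        ">>TAG1_RAT P22063 AXONAL GLYCOPROTEIN TAG-1 PRECURSOR.")
    print(build_netserv_request([accession, "RPC1_LAMBD"]))
```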
The Algorithm
-------------
MPsrch uses the well-known Smith and Waterman [2] algorithm for searching
the database. Your query sequence is compared against all sequences in the
database and the best results, as judged by the alignment score, are aligned
and included in the output. The algorithm looks for the best "local" match
as determined by the amino acid similarity matrix (the PAM value) and the
cost of inserting gaps (INDEL cost). Only one match per database sequence
is recorded and ranked to give the best results. By "local", we mean that it is
possible to detect short matching regions, such as binding sites, in the middle
of long sequences.
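The scoring idea can be sketched in a few lines of Python. This is a minimal
illustration of the Smith and Waterman recurrence with a fixed cost per
inserted or deleted residue; it is not the MPsrch implementation, which runs
in parallel on the MasPar. The match and mismatch weights of +5 and -4 are
stand-ins chosen for the example (an assumption); a real search takes the
weights from the selected PAM matrix.

```python
def smith_waterman_score(query, target, indel=13, match=5, mismatch=-4):
    """Return the best local alignment score of two sequences."""
    previous = [0] * (len(target) + 1)  # scores for the previous query residue
    best = 0
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            weight = match if q == t else mismatch
            cell = max(0,                        # start a new local alignment
                       previous[j - 1] + weight, # align q with t
                       previous[j] - indel,      # q against a gap
                       current[j - 1] - indel)   # t against a gap
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    # The shared region ACDE is found inside unrelated flanking residues.
    print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
    # A cheap indel lets the alignment bridge the extra X in the target.
    print(smith_waterman_score("ACDEFG", "ACDXEFG", indel=1))
```

Because a cell never drops below zero, poorly matching flanks do not reduce
the score of a strong internal region, which is what "local" means here.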
The INDEL cost:
---------------
The INDEL cost is a penalty that is subtracted from the alignment score for
every residue that has been inserted or deleted in the best local alignment.
There is a lower limit on the allowed INDEL value because, if you set it too
low, the alignment will be filled with many short gaps and will be biologically
meaningless. This lower limit depends on the particular PAM setting used.
The default INDEL cost works best in most cases. Reducing this
cost will encourage gaps; increasing it will discourage them. The default and
lower limit INDEL costs for a range of PAM values are given below:
                 PAM setting   Default INDEL cost   INDEL cost lower limit
                 ---------------------------------------------------------
                       1               53                     27
                      20               29                     15
                      40               22                     11
                      60               19                     10
                      80               16                      8
                     100               14                      7
DEFAULT value ->     120               13                      7
                     150               11                      6
                     200                9                      5
                     250                7                      4
                     300                6                      3
                     400                5                      3
                     500                4                      2
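The table can be expressed as a lookup that mimics how an INDEL request might
be resolved. The Python sketch below is hedged: it assumes that the "suitable
default" used for an illegal value is the tabulated default for that PAM
setting, and it only knows the PAM values listed in the table because this
document does not say how intermediate settings are handled. The function
name is hypothetical.

```python
# PAM setting -> (default INDEL cost, INDEL cost lower limit), from the table.
INDEL_TABLE = {
    1: (53, 27), 20: (29, 15), 40: (22, 11), 60: (19, 10), 80: (16, 8),
    100: (14, 7), 120: (13, 7), 150: (11, 6), 200: (9, 5), 250: (7, 4),
    300: (6, 3), 400: (5, 3), 500: (4, 2),
}


def effective_indel(pam, requested=None):
    """Return the INDEL cost to use for a tabulated PAM setting."""
    default, lower_limit = INDEL_TABLE[pam]  # KeyError if PAM is not tabulated
    if requested is None or requested < lower_limit:
        # Assumption: omitted or too-low values fall back to the default.
        return default
    return requested


if __name__ == "__main__":
    print(effective_indel(120))      # omitted: default for PAM 120
    print(effective_indel(200, 10))  # legal request is kept
    print(effective_indel(120, 5))   # below the lower limit of 7
```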
The PAM setting:
----------------
Dayhoff and co-workers [3] produced a series of amino acid weight matrices to
help detect distant similarity between proteins. These give weights to
the different possible pairs of aligned residues and are known as PAM matrices.
The weights can be positive or negative. You are allowed to choose the
particular PAM matrix by specifying a value between 1 and 500. Roughly,
low PAM values (e.g. 40 or so) will be best suited for finding short regions
of very strong similarity while high values (e.g. 250 or more) will be better
suited for finding longer, weaker matches. The default value of 120 is
a compromise that works well in practice if you only decide to use one
setting. Frequently, the list of top scoring sequences will vary greatly
depending on the PAM value you use. If you do not find any similarity
to a sequence with the default setting, try some other PAM values.
Getting further help
--------------------
Enquiries about the MPsrch software or the algorithms used should be sent
to John Collins or Shane Sturrock (e-mail mpsrch_help@biocomp.ed.ac.uk).
Enquiries about the operation of the BLITZ server should be sent to
NetHelp@EMBL-Heidelberg.DE
Literature
----------
[1] Sturrock, S.S. and Collins, J.F. (1993) MPsrch version 1.3.
Biocomputing Research Unit, University of Edinburgh, UK.
[2] Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular
subsequences. J. Mol. Biol. 147:195-197.
[3] Dayhoff, M., Schwartz, R.M. and Orcutt, B.C. (1978) (in) Dayhoff, M. (ed.)
Atlas of protein sequence and structure, vol. 5, suppl. 3, pp. 345-352.
[4] Collins, J.F. and Coulson, A.F.W. (1990) Significance of protein sequence
similarities. Pages 474-486 in Doolittle, R.F. (ed.)
Methods in Enzymology, volume 183, Academic Press.
APPENDIX I The original documentation for MPsrch
-------------------------------------------------
This document is specific to running MPsrch interactively on a local machine.
The details are different to those used for running MPsrch via the EMBL
BLITZ server but the results and algorithm are identical.
MPsrch V1.3 User Guide
S. S. Sturrock and J. F. Collins
Biocomputing Research Unit
University of Edinburgh
(e-mail sss or jfc@biocomp.ed.ac.uk)
6.1.93
1. Introduction
MPsrch is an implementation of the Smith/Waterman Best Local Similarity
algorithm for the MasPar family of parallel machines and will run on any
MasPar configuration from the minimum 1024 processor MP-1 system up to a
16384 processor MP-2 system. MPsrch can cope with any database size for
the foreseeable future.
A typical search on a 4096 processor MP-1 takes about 40 seconds for a
377 residue query against Swiss-Prot Version 23. This corresponds to
84 million cell updates per second, although peaks of more than 130 million
can be attained with larger query sequences.
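The relation between these figures can be checked with simple arithmetic.
The Python sketch below assumes that one cell update is the comparison of one
query residue with one database residue. The database size is not stated in
this guide, so it is inferred here from the quoted rate; the timings of
40.621008 and 343.918292 seconds are taken from the two example searches
shown later in this guide. The function names are illustrative.

```python
def implied_database_residues(query_length, seconds, updates_per_second):
    """Database size implied by a search time and an update rate."""
    return updates_per_second * seconds / query_length


def cell_updates_per_second(query_length, database_residues, seconds):
    """Update rate for a search of the whole database."""
    return query_length * database_residues / seconds


if __name__ == "__main__":
    # 377 residue query, 40.621008 s, 84 million updates per second.
    size = implied_database_residues(377, 40.621008, 84e6)
    print("implied database size: %.0f residues" % size)
    # 5037 residue query against the same database in 343.918292 s.
    rate = cell_updates_per_second(5037, size, 343.918292)
    print("rate for the long query: %.0f updates per second" % rate)
```

With these inputs the inferred database size is roughly 9 million residues,
and the long query runs at a little over 130 million updates per second,
which agrees with the peak figure quoted above.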
2. Running Searches
MPsrch may be run in two ways, either interactively or via command line
arguments. If any arguments are missing the program will prompt the user
for them unless running with the batch flag set in which case defaults will
be used if available. There are no defaults for database, query name and
output file, and with the -b flag set no output is sent to the display.
2.1 Arguments available
-dbase.....Database to be searched
-query.....Filename of query to be used (FASTA format)
-pams......No. of Dayhoff PAMs in 1-500 range for comparison score table
-indel.....Indel penalty, range varies with PAMtable and stringency
-output....Filename for output
-align.....Number of alignments required in output file
-batch.....Suppresses screen output and user interaction
The first letter of each of these may be used instead of the whole argument.
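A command line can be put together programmatically from these arguments.
The Python sketch below only builds the argument list, using the one-letter
forms described above; it does not run MPsrch, which needs a MasPar system.
The function name and its parameter names are assumptions made for the
example.

```python
import shlex


def mpsrch_command(query, dbase, output, pams=None, indel=None,
                   align=None, batch=False):
    """Return an MPsrch argument list using one-letter option names."""
    argv = ["MPsrch", "-q", query, "-d", dbase, "-o", output]
    if pams is not None:
        argv += ["-p", str(pams)]
    if indel is not None:
        argv += ["-i", str(indel)]
    if align is not None:
        argv += ["-a", str(align)]
    if batch:
        argv.append("-b")  # suppress screen output and prompting
    return argv


if __name__ == "__main__":
    print(shlex.join(mpsrch_command("actin.pep", "swiss", "foo",
                                    pams=100, indel=14, align=100)))
```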
2.2 An Example Search
The following is the screen display of a typical search using command line
options. The query used is an actin.
maspar> MPsrch -q actin.pep -d swiss -o foo -p 100 -i 14 -a 100
Welcome to MPsrch (Version 1.3)
(Copyright) Shane S. Sturrock & John F. Collins
Biocomputing Research Unit, University of Edinburgh, UK
Query sequence 377 residues
Starting search
0.............................100%
----------------------------->
MasPar Time: 40.621008 seconds.
Mean 47.168988; Variance 76.572116; scale 0.616007
Aligned: 100
At this point the output file is in the user's directory. The user's query
sequence is displayed for reference in the output header: complete if less
than 50 residues long; with the first 20 and last 20 residues if longer.
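The abbreviation rule for the header can be sketched as follows. This Python
fragment is an illustration, not MPsrch code: the function name is
hypothetical, the exact spacing of the real header line is not reproduced,
and treating a query of exactly 50 residues as "longer" is an assumption.

```python
def summarise_query(sequence):
    """Return the query as shown in the header: start, residues, end."""
    residues = "".join(sequence.split())
    if len(residues) < 50:
        shown = residues  # short queries are displayed in full
    else:
        # First 20 and last 20 residues joined by ten dots.
        shown = residues[:20] + "." * 10 + residues[-20:]
    return "1 %s %d" % (shown, len(residues))


if __name__ == "__main__":
    print(summarise_query("STKKKPLTQE QLEDARRLKA IYEKKKNELG"))
```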
maspar> more foo
Search started: Wed Jan 6 11:39:01 1993
MPsrch: Version 1.3 - Shane S. Sturrock & John F. Collins 1993.
Biocomputing Research Unit, University of Edinburgh, UK.
Title: No title supplied
Description: No descr supplied
Sequence: 1 mcdedettalvcdngsglvk..........witkqeydeagpsivhrkcf 377
Parameters: swiss-prot23; PAM 100; Penalty 14; Align 100
Predicted No. is the number of results expected by chance to have a score
greater than or equal to the score of the result being printed, and is
derived by analysis of the total score distribution which gave:
Statistics: Mean 47.168988; Variance 76.572116; scale 0.616007
RESULT 1 Score 3334; Predicted No. 0.000000e+00;
ID ACTS_HUMAN STANDARD; PRT; 377 AA.
DE ACTIN, ALPHA SKELETAL MUSCLE.
Matches 377; Mismatches 0; Partials 0; Indels 0; Gaps 0;
************************************************************
Db 1 MCDEDETTALVCDNGSGLVKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEA 60
Q 1 mcdedettalvcdngsglvkagfagddapravfpsivgrprhqgvmvgmgqkdsyvgdea 60
************************************************************
Db 61 QSKRGILTLKYPIEHGIITNWDDMEKIWHHTFYNELRVAPEEHPTLLTEAPLNPKANREK 120
Q 61 qskrgiltlkypiehgiitnwddmekiwhhtfynelrvapeehptllteaplnpkanrek 120
************************************************************
Db 121 MTQIMFETFNVPAMYVAIQAVLSLYASGRTTGIVLDSGDGVTHNVPIYEGYALPHAIMRL 180
Q 121 mtqimfetfnvpamyvaiqavlslyasgrttgivldsgdgvthnvpiyegyalphaimrl 180
************************************************************
Db 181 DLAGRDLTDYLMKILTERGYSFVTTAEREIVRDIKEKLCYVALDFENEMATAASSSSLEK 240
Q 181 dlagrdltdylmkiltergysfvttaereivrdikeklcyvaldfenemataasssslek 240
************************************************************
Db 241 SYELPDGQVITIGNERFRCPETLFQPSFIGMESAGIHETTYNSIMKCDIDIRKDLYANNV 300
Q 241 syelpdgqvitignerfrcpetlfqpsfigmesagihettynsimkcdidirkdlyannv 300
************************************************************
Db 301 MSGGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTFQQMWIT 360
Q 301 msggttmypgiadrmqkeitalapstmkikiiapperkysvwiggsilaslstfqqmwit 360
*****************
Db 361 KQEYDEAGPSIVHRKCF 377
Q 361 kqeydeagpsivhrkcf 377
For brevity only the first result is shown. The query sequence is always
the lower of the two shown in the alignments. Note that there are separate
values for indels and gaps; a gap is one or more adjacent indels, and the
implementation used guarantees to minimise the number of gaps while maintaining
the correct score for the Smith-Waterman algorithm. Alignments displayed are
produced by the same algorithm used for the searches. Query sequences can be
upper or lower case, so it is possible for the user to highlight areas
of interest by mixing cases. Query sequences may be up to the present maximum
of 10000 residues in length. The search parameters provided are variable PAM
tables and user-selectable indel penalties. The user may determine which
provide the most sensitive results in any given case. We recommend starting
with 100 PAMs and an indel penalty of 14 (note that the indel penalty is given
as a positive number).
Predicted number is usually a good guide to likely biological interest in
the alignment. However, short motifs or patterns may sometimes be
recognised as significant in conveying functional information even though
their scores may be low and the predicted number apparently rather high.
Here is another example, this time with the largest sequence in the swiss23
database.
maspar> MPsrch -q rynr.seq -d swiss -o foo -p 100 -i 14 -a 20
Welcome to MPsrch (Version 1.3)
(Copyright) Shane S. Sturrock & John F. Collins
Biocomputing Research Unit, University of Edinburgh, UK
Query sequence 5037 residues
Starting search
0.............................100%
----------------------------->
MasPar Time: 343.918292 seconds.
Mean 61.006553; Variance 148.332142; scale 0.411283
Aligned: 20
maspar> more foo
Search started: Wed Jan 6 11:46:32 1993
MPsrch: Version 1.3 - Shane S. Sturrock & John F. Collins 1993.
Biocomputing Research Unit, University of Edinburgh, UK.
Title: RYNR_RABIT
Description: RYANODINE RECEPTOR, SKELETAL MUSCLE.
Sequence: 1 MGDGGEGEDEVQFLRTDDEV..........CWDFFPAGDCFRKQYEDQLS 5037
Parameters: swiss-prot23; PAM 100; Penalty 14; Align 20
Predicted No. is the number of results expected by chance to have a score
greater than or equal to the score of the result being printed, and is
derived by analysis of the total score distribution which gave:
Statistics: Mean 61.006553; Var 148.332142; scale 0.411283
RESULT 1 Score 44624; Predicted No. 0.000000e+00;
ID RYNR_RABIT STANDARD; PRT; 5037 AA.
DE RYANODINE RECEPTOR, SKELETAL MUSCLE.
Matches 5037; Mismatches 0; Partials 0; Indels 0; Gaps 0;
************************************************************
Db 1 MGDGGEGEDEVQFLRTDDEVVLQCSATVLKEQLKLCLAAEGFGNRLCFLEPTSNAQNVPP 60
Q 1 MGDGGEGEDEVQFLRTDDEVVLQCSATVLKEQLKLCLAAEGFGNRLCFLEPTSNAQNVPP 60
......................[continues]
*********************************************************
Db 4981 EEHNLANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFPAGDCFRKQYEDQLS 5037
Q 4981 EEHNLANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFPAGDCFRKQYEDQLS 5037
......................[miss some results out]
RESULT 4 Score 490; Predicted No. 3.586030e-52;
ID PCD6_MOUSE STANDARD; PRT; 500 AA.
DE PROTEIN PCD-6 (FRAGMENT).
Matches 67; Mismatches 53; Partials 36; Indels 3; Gaps 2;
* *.. . *. .. * . . *.*.***.** . *. .* *. **..*. *** *
Db 269 ETEQDKEHTCETLLMCIVTVLSHGLRSGGGVGDVLRKPSKEE-PLFAARVIYDLLFFFMV 327
Q 4867 EDEDEPDMKCDDMMTCYLFHMYVGVRAGGGIGDEIEDPAGDEYELY--RVVFDITFFFFV 4924
*.*.* .* *.***.*..** . . * . * *****. * ** ** * ****
Db 328 IIIVLNLIFGVIIDTFADLRSEKQKKEEILKTTCFICGLERDKFDNKTVTFEEHIKEEHN 387
Q 4925 IVILLAIIQGLIIDAFGELRDQQEQVKEDMETKCFICGIGSDYFDTTPHGFETHTLEEHN 4984
. .*. *.. . ** ** ** **** * .** * **
Db 388 MWHYLCFIVLVKVKDSTEYTGPESYVAEMIRERNLDWFP 426
Q 4985 LANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFP 5023
Experiment with various PAM and penalty settings to see which gives results
you find plausible. Beware of PAM settings close to the 500 limit along with
very low penalties because this will reduce the sensitivity of the search to
regions of good homology. However, try it and see. Sequences which have
diverged to the extent that a 500 PAM table could be used usually contain
short strong regions of alignment, which can be more easily distinguished from
noise using a less extreme PAM setting.
Note that '*' means an identity (match); '.' is a positive substitution
(partial); ' ' is a negative substitution (mismatch) or insertion/deletion
(indel).
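The marker lines can be tallied to reproduce the counts printed with each
result. The Python sketch below is an illustration with hypothetical function
names. Matches and partials are counted directly from the markers. The
formula for mismatches is an inference from the definitions above (a blank
marks either a mismatch or an indel), not a documented rule.

```python
def tally_markers(marker_lines):
    """Return (matches, partials) counted from the marker lines."""
    matches = sum(line.count("*") for line in marker_lines)
    partials = sum(line.count(".") for line in marker_lines)
    return matches, partials


def mismatches(columns, matches, partials, indels):
    """Aligned columns that are neither match, partial nor indel."""
    return columns - matches - partials - indels


if __name__ == "__main__":
    # Made-up marker line: three identities and two partial matches.
    found = tally_markers(["**. *  ."])
    print(found, mismatches(8, found[0], found[1], indels=1))
```

Leading blanks in a marker line do not affect the tally, but the number of
aligned columns has to be taken from the sequence rows, not from the marker
line.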
Shane S. Sturrock was supported by a grant from the Human Genome Mapping
Project.
The Biocomputing Research Unit was supported by the Darwin Trust of Edinburgh.