BLOCKS e-mail server

        ___________               ___________               ___________ 
       |\ __________\            |___________|            /__________ /|
       | |           |           |           |           |           | |
       | | **********|           |***********|           |********** | |
       | | * BLOCKS  |           |   E-MAIL  |           |SEARCHER * | |
       | | **********|           |***********|           |********** | |
        \|___________|__v. 1.3___|___________|__9/3/93___|___________|/
                     |\ __________\         /__________ /|
                     | |   S Agus  |       |    Fred   | |
                     | |  B Alford |       | Hutchinson| |
                     | | J Henikoff|       |   Center  | |
                     | | S Henikoff|       |    1993   | |
                      \|___________|       |___________|/

As an aid to detection and verification of protein sequence homology, the 
BLOCKS e-mail searcher compares a protein or DNA sequence to the current 
database of protein blocks.  Blocks are short multiply aligned ungapped 
segments corresponding to the most highly conserved regions of proteins. 
A database of blocks has been constructed by successive application
of the automated PROTOMAT system (1) to individual entries in the PROSITE
catalog of protein groups (2) keyed to the SWISS-PROT protein sequence
databank (3).  BLOCKS version 6.2 consists of 2289 blocks derived from 619
different groups represented in PROSITE v. 10.2 keyed to SWISS-PROT 26. 
BLOCKS is updated following each significant update of PROSITE, which 
generally occurs every 6 months (January and July). An exception is the 
current version of BLOCKS, a minor update from BLOCKS v. 6.0, necessitated
by the delay of PROSITE 11.0 until the release of SWISS-PROT 27 (October).

The rationale behind searching a database of blocks is that information from
multiply aligned sequences is present in a concentrated form, reducing
background and increasing sensitivity to distant relationships.  This
information is represented in a position-specific scoring table or "profile"
(4), in which each column of the alignment is converted to a column of a table
representing the frequency of occurrence of each of the 20 amino acids.  For
searching a database of blocks, the first position of the sequence is aligned
with the first position of the first block, and a score for that amino acid is
obtained from the profile column corresponding to that position.  Scores are
summed over the width of the alignment, and then the block is aligned with the
next position.  This procedure is carried out exhaustively for all positions
of the sequence for all blocks in the database, and the best alignments
between a sequence and entries in the BLOCKS database are noted.  If a
particular block scores highly, it is possible that the sequence is related to
the group of sequences the block represents.  Typically, a group of proteins
has more than one region in common and their relationship is represented as a
series of blocks separated by unaligned regions.  If a second block for a
group also scores highly in the search, the evidence that the sequence is
related to the group is strengthened, and is further strengthened if a third
block also scores highly, and so on.
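
The sliding comparison described above can be sketched in a few lines of
Python.  This is only an illustration under simplifying assumptions: the
profile here holds plain residue frequencies, whereas the server's actual
scoring tables are described in References 1 and 4.  The names build_profile
and best_alignment, and the toy block, are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block_segments):
    """Turn aligned, ungapped segments into per-column residue frequencies."""
    width = len(block_segments[0])
    if any(len(seg) != width for seg in block_segments):
        raise ValueError("all segments in a block must have the same width")
    total = len(block_segments)
    profile = []
    for col in range(width):
        counts = Counter(seg[col] for seg in block_segments)
        profile.append({aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS})
    return profile


def best_alignment(sequence, profile):
    """Slide the block along the sequence; return (best score, 0-based offset)."""
    width = len(profile)
    best_score, best_offset = float("-inf"), None
    for offset in range(len(sequence) - width + 1):
        # Sum the column scores over the width of the block at this position.
        score = sum(profile[i].get(sequence[offset + i], 0.0)
                    for i in range(width))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


# Hypothetical toy block of three 4-residue segments (not taken from BLOCKS).
toy_block = ["GCSS", "GCST", "GASS"]
toy_profile = build_profile(toy_block)
score, offset = best_alignment("MKTVLITGCSSGFG", toy_profile)
```

In a full search this loop would be repeated for every block in the database,
and the best-scoring alignments kept for ranking.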

                          ###################
                          # A Quick Summary #
                          ###################

Internet address:		blocks@howard.fhcrc.org

Message options are

	Send current help file:		help

	Search a protein or DNA:	>My sequence
					MCKTASE....	

	Retrieve information:		get BL00044

	Problems:			henikoff@howard.fhcrc.org

                          ################
                          # Getting help #
                          ################

The current version of this file is returned when the single word HELP appears
on the subject line or in the body of an otherwise blank e-mail message to the 
following Internet address:

                       BLOCKS@HOWARD.FHCRC.ORG

You can obtain the complete BLOCKS database and PROSITE catalog from the
repository of the National Center for Biotechnology Information (NCBI) via
ftp ('ftp ncbi.nlm.nih.gov', log in as 'anonymous', give your e-mail address
as password, then 'cd repository/blocks' or 'cd repository/prosite').
PROTOMAT software and documentation for DOS and UNIX machines are also
available from the repository.  Ftp instructions are found in the Announce
file in repository/blocks.  If human help is required or if you find a bug,
please send a message to henikoff@howard.fhcrc.org.  Since we do not save any
queries sent to the server, nor any results sent out, please include these in
your message.

                     ############################
                     # Sending a query sequence #
                     ############################

The BLOCKS e-mail searcher is based on the PATMAT searching tool (5) which
accepts protein or DNA sequences in FASTA, Genepro, GenBank, EMBL, SWISS-PROT,
GCG or PIR formats.  DNA sequences are translated in all 6 frames for
searching.  When FASTA, Genepro or GCG format is used, PATMAT decides that the
sequence is protein if an alphabet character other than G, A, T, C or N (upper
or lower case) is found.  This will cause misinterpretation if IUPAC
ambiguity codes are present in a DNA sequence.  For protein, the 20 amino acid
characters are allowed, with the other 6 alphabet characters interpreted as
mismatches.  Non-alphabet characters, including numbers and symbols, are
ignored.  Here is an example of a protein query in FASTA format (in which the
first line begins with ">" and is not recognized as sequence):

To: blocks@howard.fhcrc.org
Subject:
>YCZ2_YEAST   Hypothetical 40.1 KD protein in HMR 3' region
MKAVVIEDGKAVVKEGVPIPELEEGFVLIKTLAVAGNPTDWAHIDYKVGPQGSILGCDAA
GQIVKLGPAVDPKDFSIGDYIYGFIHGSSVRFPSNGAFAEYSAISTVVAYKSPNELKFLG
EDVLPAGPVRSLEGAATIPVSLTTAGLVLTYNLGLNLKWEPSTPQRNGPILLWGGATAVG
QSLIQLANKLNGFTKIIVVASRKHEKLLKEYGADQLFDYHDIDVVEQIKHKYNNISYLVD
CVANQNTLQQVYKCAADKQDATVVELTNLTEENVKKENRRQNVTIDRTRLYSIGGHEVPF
GGITFPADPEARRAATEFVKFINPKISDGQIHHIPARVYKNGLYDVPRILEDIKIGKNSG
EKLVAVLN

The subject heading is ignored.  In the body of the message, include only a
single sequence.  The typical search requires only a few minutes if queries
are not long DNA sequences.
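
The rule for telling protein from DNA in this section can be sketched as
follows.  This is an illustration of the rule as stated here, not the PATMAT
source; the function names are hypothetical.

```python
DNA_LETTERS = set("GATCN")


def strip_fasta_title(message_body):
    """Drop title lines (those starting with '>') and join the rest."""
    lines = message_body.splitlines()
    return "".join(line for line in lines if not line.startswith(">"))


def looks_like_protein(sequence_text):
    """Return True if any letter other than G, A, T, C or N is present.

    Digits, spaces and symbols are ignored, and case does not matter.
    """
    letters = (ch.upper() for ch in sequence_text if ch.isalpha())
    return any(ch not in DNA_LETTERS for ch in letters)


body = ">demo title line\nGATTACA nnn 123\nGATC"
is_protein = looks_like_protein(strip_fasta_title(body))
```

Under this rule a DNA sequence containing an ambiguity code such as R is
classified as protein, which is the misinterpretation noted in this section.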

       ##########################################################
       # Interpreting results of a search using a protein query #
       ##########################################################

The blocks searcher queries the current version of BLOCKS, noting the 400
highest scoring blocks aligning with a protein query.  The basis for this rank
ordering is a score that is obtained by dividing the "raw" alignment score
described above by a 99.5 percentile calibration score for the block and
multiplying by 1000 (1).  The purpose of the calibration score is to allow
blocks to be compared to one another, even though they are quite heterogeneous
in width (4-60 aa), in number of sequences (2->300) and in the degree of
similarity between sequences in each block.  For further details, see
Reference 1.  A score above 1000 is expected for 0.5% of the blocks in the
search using a protein query of average size.  Since there are about 2000
blocks, a protein of average size is expected to score about 10 blocks above
1000 by chance alone.  Up to 10 blocks with score > 1000, called "anchor
blocks", are reported.  If additional "supporting blocks" for one of
the PROSITE groups represented by these top 10 blocks are also detected among
the highest scoring blocks, these are reported as well.  For the query example
above, here is part of the output that is returned:

Each numbered result consists of one or more blocks from a PROSITE group found
in the query sequence. One set of the highest-scoring blocks that are in the 
correct order and separated by distances comparable to the BLOCKS database is 
selected for analysis. If this set includes multiple blocks the probability 
that the lower scoring blocks support the highest scoring block is estimated. 
Maps of the database blocks and query sequence are shown:
AAA represents the first block roughly in proportion to its width.
  : represents the minimum distance between blocks in the database.
  . represents the maximum distance between blocks in the database.
< > indicate the sequence has been truncated to fit the page.
The query map is aligned on the highest scoring block. Multiple block hits 
that are consistent with the highest scoring block are separated by colons.
Block hits that are not consistent are mapped below. The alignment of the
query sequence with the sequence closest to it in the BLOCKS database is
shown. The distance between detected blocks is listed as (min, max): for the
database entry followed by the distance in the query. Upper case in the query
indicates at least one occurrence of the residue in that column of the block.

The following nomogram is provided to aid interpretation of scores and 
probability estimates. It summarizes results from 7,082 searches using 
shuffled protein queries. "Shuffled percentiles" are calculated from scores
obtained for the top-scoring block in each search:

         1600 *-----------|-----------|-----------|-----------|100.00 S
       A      |  \                                            |       H
       N      |    \             < 1/7000 searches            |       U
       C 1500 -       \                                       - 99.93 F
       H      *         \                                     |       F
       O      | \         * - - - - - *                       |       L
       R 1400 -   \                     \                     - 99.7  E
              |     \      < 1/1000        \                  |       D
       B      *        \          searches    \               |
       L 1300 - \         * - - -                 * - - - - - * 98    P
       O      |     \              \                          |       E
       C      |        \              *  - - -                |       R
       K 1200 -           *                    \              - 87    C
              |             \                     *           |       E
       S      |               \    < 1/100          \         |       N
       C 1100 -                 \         searches    \       - 41    T
       O      |                    \                     \    |       I
       R      |                       *                     \ |       L
       E 1000 |-----------|-----------|-----------|-----------|  1    E
              0          -1          -2          -3          -4
                                   LOG(P)

So, for example, a hit with anchor block score of 1200 (87th percentile) and 
estimated P-value of .001 is expected to occur by chance alone less than 
once in 1000 searches but more than once in 7000 searches.

For interpretation of block hits, you might find it worthwhile to obtain the
full set of blocks and PROSITE entries for a group. To do this, send a blank
message to blocks@howard.fhcrc.org with the subject heading: GET BL0???? e.g.,
'Subject: GET BL00044' for full information on this group.
=============================================================================
Query=>YCZ2_YEAST  HYPOTHETICAL 40.1 KD PROTEIN IN HMR 3'REGION., 
 Size=368 Amino Acids
Database=/data/blocks_6.0/blocks.dat, Blocks Searched=2302

1.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00059A   1     1   1310  2439         2-    42 Zinc-containing alcohol dehyd
BL00059A 371     1    825  2439         0-    40 Zinc-containing alcohol dehyd
BL00059B  15     1    984  1967        52-    77 Zinc-containing alcohol dehyd
BL00059C 105     1    891  2795        77-   134 Zinc-containing alcohol dehyd
BL00059D   2     1   1232  2388       174-   229 Zinc-containing alcohol dehyd

1310=98.5th percentile of anchor block scores for shuffled queries
P<1.4e-06 for BL00059D BL00059B in support of BL00059A
                         |-----  108 residues----|
   BL00059 AAAAAAAAA::.BBBBBB::........CCCCCCCCCCCCC:::...DDDDDDDDDDDDD
>YCZ2_YEAS AAAAAAAAA::BBBBBB::::::::::::::::::::::DDDDDDDDDDDDD
>YCZ2_YEAS A (1,35):1
ADHX_HORSE 9     AAVAWEAGKPVSIEEVEVAPPKAHEVRIKIIATAVCHTDAY
                  ||  | || |  | |         | ||  | |   ||
>YCZ2_YEAS 2     KAVVIEdGKaVVkEgVPiPELeEGfVLIKtLAVAgnpTDwa

BL00059B   A<->B (10,14):9
ADH3_ASPNI 62    PLIGGHEGAGVVVAKGELVKDEDFKI
                   | |   ||  |  |  |   || |
>YCZ2_YEAS 52    GsILGcdAAGqIVKLGPaVdpkDFsI

BL00059D   B<->D (78,122):96
 ADH_CLOBE 173   IGIGAVGLMGIAGAKLRGAGRIIGVGSRPICVEAAKFYGATDILNYKNGHIVDQVM
                  |  |||   |  |        | |          | |||     |     | |
>YCZ2_YEAS 174   gGAtAVGqSLIQlAnKlnGftkIIVvAsrKhEKLlKEYGADqlfDYhDiDvVeQIk

2.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00458C   3     1   1077  2417       278-   304 Natriuretic peptides receptor

1077=27.7th percentile of anchor block scores for shuffled queries

                         |-----  320 residues----|
   BL00458 AAA:::::::BBB:::::.CCDDD...............................EEEE
>YCZ2_YEAS <::::::::::::::::::CC

BL00458C    <->C (319,366):277
ANPC_HUMAN 355   NMFVEGFHDAILLYVLALHEVLRAGYS
                 |       |   ||    |||   |
>YCZ2_YEAS 278   NrrqnvtiDrtrLYsiggHEVpfgGiT   

The best hit in the entire database is for BL00059A, the first of 4 blocks for
the family represented in PROSITE entry PS00059, the zinc-containing alcohol
dehydrogenases.  Notice that 2 other blocks for this family, BL00059D
(ranking 2nd with a score of 1232) and BL00059B (ranking 15th with a score of
984), were also among the top 400 hits.  All 3 of these blocks are considered
"strong" (see Ref. 1 for an explanation of block strength, an empirical
measure of block specificity; blocks with strengths close to 1000 are not
sufficiently specific to be of much value in a search).  The 3 blocks align
with the query sequence in the same order as for the sequences that are
represented in the blocks, that is, A->B->D.  This is most easily seen in the
block map.  This map also shows that the distance between the 3 blocks
representing this group is not dissimilar from the distance between the
segments of the query that align with these blocks.  Therefore, the query
might be a member of this family.  Further evidence that this is the case
comes from examination of the alignment of each query segment detected with
the closest single member of the group.  For BL00059A, which aligns with an
N-terminal segment of the query, the best single sequence is an N-terminal
segment of ADHX_HORSE.  Different members of this family align with the B and
D blocks; nevertheless, the position of the alignment within each sequence is
consistent with the position within the query.  For example, the distance
between A and B varies from 10 to 14 in known members of this family and is 9
for the query. Lower scoring hits among the top 400 in the search are reported
and mapped below. In fact, a good candidate C block with rank 105 slightly 
overlaps the B block. Additional cases in which these other mapped alignments 
are useful are when there are duplicated domains or when a block alignment is 
interrupted by a small intron or (for DNA) a frameshift error, leading to 
multiple alignments of a single block within a sequence.
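
Hit lines like those in the report above can be read mechanically.  The
sketch below assumes the column layout shown in this help file (block, rank,
frame, score, strength, location, description).  The regular expression and
the names BlockHit and parse_hit_line are illustrative and are not part of
the server.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: block  rank  frame  score  strength  start- end  description
HIT_LINE = re.compile(
    r"^(BL\d{5}[A-Z]?)\s+(\d+)\s+(-?\d+)\s+(\d+)\s+(\d+)\s+(\d+)-\s*(\d+)\s+(.*)$"
)


class BlockHit(NamedTuple):
    block: str
    rank: int
    frame: int
    score: int
    strength: int
    start: int
    end: int
    description: str

    @property
    def family(self) -> str:
        """Group identifier, e.g. 'BL00059' for block 'BL00059A'."""
        return self.block[:7]


def parse_hit_line(line: str) -> Optional[BlockHit]:
    """Parse one hit line; return None for lines that are not hits."""
    match = HIT_LINE.match(line.strip())
    if match is None:
        return None
    block, rank, frame, score, strength, start, end, description = match.groups()
    return BlockHit(block, int(rank), int(frame), int(score), int(strength),
                    int(start), int(end), description.strip())


report = """\
Block    Rank Frame Score Strength      Location Description
BL00059A   1     1   1310  2439         2-    42 Zinc-containing alcohol dehyd
BL00059B  15     1    984  1967        52-    77 Zinc-containing alcohol dehyd
BL00059D   2     1   1232  2388       174-   229 Zinc-containing alcohol dehyd
"""
hits = [h for h in map(parse_hit_line, report.splitlines()) if h is not None]
# Block letters in the order they occur along the query.
order = [h.block[-1] for h in sorted(hits, key=lambda h: h.start)]
```

Sorting the hits of one family by start position gives the order of the
blocks along the query, which can then be compared with the order in the
database map.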

Intuitively, it seems unlikely that 3 high scoring blocks would align with
correct distances in between by chance alone.  But how unlikely?  First, let's
assume that the alignment with the A block occurred by chance (although its
position near the N-terminus and the quite high score of 1310 argue against
this assumption).  We can then ask, what is the probability that the B and D
blocks are also chance alignments?  Since these were detected independently of
the A block, finding a low probability of chance alignment for these 2 blocks
can confirm our intuition that there is a relationship.  These probabilities
are based on the rank of each block hit, the sizes of the query sequence and
the database, the observed distances between blocks, and the number of
blocks in a group (see Ref. 6 for further details).  In this case, the
probability that the B and D blocks have also aligned with the query by chance
is estimated to be less than about 1 in 700,000 ("P<1.4e-06 for BL00059D 
BL00059B in support of BL00059A").  Combined with the fact that the A block 
was the best hit in the database, we conclude that the query is a member of 
the zinc-containing alcohol dehydrogenase family.  Note from the nomogram 
that a combination of anchor block score = 1310 (98.5 %-ile of first hits) 
and P<1.4e-06 is better than was seen in 7082 searches using shuffled protein
queries. This is a distant member of a large family, apparently one not easily 
detected using other approaches. The query is an open reading frame from yeast 
chromosome 3 not reported to be a member of any family either in the original 
study (7) or in a subsequent more intensive analysis of ORFs from this 
chromosome (8).

The second ranking hit is typical of a chance alignment.  A score of 1077 is
unremarkable in a search using an average sized query, and this is reflected
in the fact that a score this high ranks only in the 27th percentile of first
hits for 7082 searches of shuffled protein queries.  Also, the alignment
with the closest sequence in the block, with 8 identities over a stretch of
27 amino acids, seems unremarkable given the lack of any biological
information on the yeast chromosome 3 ORF.  Even though up to 10 block hits
in a search are reported, one should be increasingly cautious about single
block alignments as ranks decrease.  Note that no P-values are available for
single block hits.  Note also that the P-value for blocks in support becomes
less meaningful as one goes down the list; P-values above about 1/100 in
support of a block that does not score at the top should be regarded with
more caution.
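
The normalized scores discussed in this section follow from the calibration
described at its start: the raw alignment score divided by the block's 99.5
percentile calibration score, times 1000.  A minimal sketch, using
hypothetical raw and calibration values (the report does not list them) and
illustrative function names:

```python
def normalized_score(raw_score, calibration_score):
    """Scale a raw alignment score so that 1000 marks the block's
    99.5 percentile calibration score."""
    if calibration_score <= 0:
        raise ValueError("calibration score must be positive")
    return 1000.0 * raw_score / calibration_score


def expected_chance_hits(n_blocks, tail_fraction=0.005):
    """Expected number of blocks scoring above 1000 by chance alone,
    given that 0.5% of blocks are expected to exceed their calibration."""
    return n_blocks * tail_fraction


# Hypothetical values: a raw score of 655 against a calibration score of 500.
example = normalized_score(655.0, 500.0)
chance = expected_chance_hits(2000)
```

With about 2000 blocks, the second function reproduces the estimate of about
10 chance hits above 1000 for a query of average size.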

          ######################################################
          # Interpreting results of a search using a DNA query #
          ######################################################

Results for a DNA query are analyzed similarly to those for a protein
query, except that here it is assumed that multiple block alignments with a
single sequence might be detected in different frames because of frameshift
errors in the sequence.  In reporting the location of an alignment, each frame
is translated as if it were a single protein, so that a 4 kb sequence consists
of 3 predicted proteins of about 1300 aa beginning at the 5' end (+ frames)
and 3 others beginning at the 3' end of the sequence (- frames).  Here are 
results from a search using the 4 kb sequence of the Pseudomonas cepacia dgd
region obtained from GenBank:

1.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00061A   6    -1   1289  1336       147-   159 Short-chain alcohol dehydroge
BL00061B   3    -3   1542  1359       211-   221 Short-chain alcohol dehydroge
BL00061C   1    -1   1997  2053       265-   316 Short-chain alcohol dehydroge

1997=100.00th percentile of anchor block scores for shuffled queries
P<2.2e-07 for BL00061B BL00061A in support of BL00061C
                         |-----  220 residues----|
   BL00061 A::::::....................................B::::.....CCCCCC
P.cepacia                         :::::::::::::::::A::::::B:::::CCCCCC

BL00061A    <->A (2,36):146      BL00061B   A<->B (52,372):51
DHES_HUMAN 3     TVVLITGCSSGIG   PHBB_ZOORA 74    GPIDVLVNNAG
                   ||||||||| |                    |||||||||||
P.cepacia  147   KTVLITGCSSGFG   P.cepacia  211   GPIDVLVNNAG

BL00061C   B<->C (37,81):43
2BHD_STREX 132   GSIVNISSAAGLMGLALTSSYGASKWGVRGLSKLAAVELGTDRIRVNSVHPG
                 | |||  |   |  | |   | |||  |       ||||     |   | ||
P.cepacia  265   GVIVNVTSSVtLKvLPLVgAYrASKAAVNAFTESMAVELePFGVRAHLVLPG

2.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00600A   2     3   1680  2346       498-   532 Aminotransferases class-III p
BL00600B   5     3   1352  1656       668-   681 Aminotransferases class-III p
BL00600C 110     3   1032  1237       707-   717 Aminotransferases class-III p
BL00600C 180     3   1007  1237       881-   891 Aminotransferases class-III p

1680=100.00th percentile of anchor block scores for shuffled queries
P< 5.9e-05 for BL00600B BL00600C in support of BL00600A
                         |-----  145 residues----|
   BL00600  AAAAAA:::::::::::::::::::::.........BB::::.....CC:::......DD
P.cepacia  A (31,84):497
ARGD_ECOLI 33    GSRIWDQQGKEYVDFAGGIAVTALGHCHPALVNAL
                 ||   |  |    ||  |     ||||||  |
P.cepacia  498   GSfVYDaDGRAiLDFTSGemSAVLGHCHPEIVsVi

BL00600B   A<->B (119,169):135    BL00600C   B<->C (25,54):25
GATA_ASPNI 286   VAAIIVEPIQSEGG   GATA_ASPNI 325   DEVQTGVGATG
                  || | ||| | ||                    || ||||| ||
P.cepacia  668   lAAFIaEPIlSsGG   P.cepacia  707   DEaQTGVGRTG

3.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00044    4    -2   1461  2175       897-   954 Bacterial regulatory proteins

1461=99.88th percentile of anchor block scores for shuffled queries

BL00044x    <->x (0,31):896
ICIA_ECOLI 4     PDYRTLQALDAVIRERGFERAAQKLCITQSAVSQRIKQLENMFGQPLLVRTVPPRPTE
                      |     || |     ||     |||| ||  | ||    ||||    | |
P.cepacia  897   sLeiDLlrsfVVIaEvRalSAAARVGRTQSALSQQmKRLEDivDQPLLpAHRPRRgaD

4.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00573B   7    -3   1237  1800       548-   560 Pyridine nucleotide-disulphid

1237=93.7th percentile of anchor block scores for shuffled queries
                         |-----  129 residues----|
   BL00573 AAAA::BBB::::::::::.CCCCCCCCCCC:::::::::::::.DD::::.EEEEEEEE
P.cepacia       B (38,246):547
APHF_SALTY 246   GGQVLDTVDIENY
                 |||  ||    |
P.cepacia  548   GGQrsDTcvV*Nr

Note that 3 of the 6 highest ranking blocks belong to the short-chain alcohol
dehydrogenase family and are correctly spaced.  For 2 of the block alignments,
striking regions of identity with the closest segment in the block are seen,
further confirming that this sequence includes a member of this family.  The
fact that for all 3 blocks the scores are close to the respective strengths
indicates that this predicted protein is about as distant from other members
of the family as is the typical member.  But these blocks are in different
frames on the - strand, indicating multiple frameshift errors in the query
sequence.  The second highest ranking block is supported by 2 other
high-ranking and correctly-spaced blocks in the +3 frame.  This is a known
member of the Class III aminotransferase family (the DgdA protein) upstream of
and convergent to the dehydrogenase homology.  Although the D block was not
detected among the top blocks, a P-value of about 6 in 100,000 in support 
of a very high-scoring block confirms the assignment of DgdA to the family.  

The third hit is an alignment with the single block representing the LysR 
family of bacterial regulatory proteins.  Since there can be no blocks in 
support, one must evaluate this alignment based on whether the score is 
reasonably high (1461 is >100 higher than the next best hit) and whether this 
makes biological sense.  In fact, this is the N-terminal portion of the DgdR 
protein, which regulates the dgdA gene.  Combined with the fact that, like
known LysR family members, DgdR is just upstream of and oppositely oriented to
a gene it regulates, and the observation that this region of LysR family
members is always N-terminal, we can conclude that the DgdR protein is a
member of the family, though very distant from even its closest relative.
So the search has detected members of 3 families oriented: 
5'<--LysR-- --aminotransferase--> <--dehydrogenase-- 3'.
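
The six-frame translation described at the start of this section can be
sketched as below.  The standard genetic code is used, stop codons are
written as '*' and codons containing ambiguous bases as 'X'.  The numbering
of the minus frames (-1 starting at the first base of the reverse complement)
is an assumption; the server's own convention may differ.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate one reading frame straight through, stops included."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame (+1..+3, -1..-3) to its translation."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for shift in range(3):
        frames[shift + 1] = translate(dna[shift:])
        frames[-(shift + 1)] = translate(reverse[shift:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Because each frame is translated as if it were a single protein, locations in
the report are amino acid positions within that frame's translation.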

          ##########################################################
          # Getting the blocks and PROSITE information for a group #
          ##########################################################

Following up a potentially interesting hit is often aided by examining the
full set of blocks for a group.  Furthermore, since each group in BLOCKS
corresponds to a group in PROSITE, the excellent PROSITE annotations give
further information and provide useful references.  To obtain this information
for a single group, send a blank message to blocks@howard.fhcrc.org with the
subject heading: 'get BL00???' (Note that the 0s are zeros).  For example,
to obtain the blocks representing the zinc-containing alcohol
dehydrogenases, the message should be:

To: blocks@howard.fhcrc.org
Subject: GET BL00059
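
Requests like the one above, and sequence queries, could be composed with
Python's standard email package.  This sketch only builds the messages;
sending them (for example with smtplib) is left out, and the helper names and
the sender address are hypothetical.

```python
from email.message import EmailMessage

SERVER_ADDRESS = "blocks@howard.fhcrc.org"


def build_get_request(group_id, sender):
    """Blank message whose subject asks for one group, e.g. 'GET BL00059'."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    msg["Subject"] = f"GET {group_id}"
    return msg  # the body is intentionally left empty


def build_query(title, sequence, sender, width=60):
    """Message whose body is a single FASTA-format sequence."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    # The subject is ignored for queries, so none is set here.
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    msg.set_content(f">{title}\n" + "\n".join(rows) + "\n")
    return msg


request = build_get_request("BL00059", "user@example.org")
query = build_query("test", "M" * 130, "user@example.org")
```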

                           ##############
                           # References #
                           ##############

1. Henikoff S, Henikoff JG: Automated assembly of protein blocks for database
searching. Nucleic Acids Res. 1991, 19:6565-6572.

2. Bairoch A: PROSITE: A dictionary of sites and patterns in proteins. Nucleic
Acids Res. 1992, 20:2013-2018.

3. Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic
Acids Res. 1992, 20:2019-2022.

4. Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: detection of
distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84:4355-4358.

5. Wallace JC, Henikoff S: PATMAT: a searching and extraction program for
sequence, pattern, and block queries and databases. CABIOS 1992, 8:249-254.

6. Henikoff S: Detection of Caenorhabditis transposon homologs in diverse
organisms. New Biol. 1992, 4:382-388.

7. Oliver SG et al.: The complete DNA sequence of yeast chromosome III. Nature
1992, 357:38-46.

8. Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E: What's
in a genome? Nature 1992, 358:287.

                         ##########################
                         # Changes in Version 1.3 #
                         ##########################

1. A new searching module, BLIMPS (BLocks IMProved Searcher) by Bill Alford, 
replaces PATMAT. While the basic searching methodology is the same, some 
minor differences will lead to slightly different scores, percentiles and 
P-values. Searches are faster so that response times should decrease. BLIMPS 
was used to calibrate BLOCKS v. 6.2.

2. With BLOCKS v. 6.2, the amino acid substitution matrix used by the PROTOMAT
system for making blocks has been changed from BLOSUM 60 to BLOSUM 62. These two 
matrices are very similar, although BLOSUM 62 was found to perform measurably 
better in tests of pairwise alignments using BLAST (see Henikoff, S. and 
Henikoff, J.G. PNAS 89:10915-10919, 1992 for a description of the BLOSUM 
series). The change was made to conform with the use of BLOSUM 62 in other 
applications. There are differences in many of the blocks as a result of this 
change.

3. Beginning with BLOCKS v. 6.2, the field "seqs=" has been added to the "BL" 
line; it is the number of sequences in the block.

4. With the change from PATMAT to BLIMPS, IUPAC ambiguity codes (non-GATC bases) 
are accepted and are translated appropriately. This should not affect the 
scheme by which the server distinguishes a DNA from a protein sequence. Still,
it is formally possible that a DNA sequence with large numbers of ambiguous 
bases will appear to be protein, a circumstance that is easily detected in 
the report (e.g., Size=??? Amino Acids). For proteins, B (D or N) and Z (E 
or Q) are accepted and scored reasonably, and X is given a score of zero.

5. A manuscript describing the basis for scores and P-values (actually a proxy
for an "expected value") is now available as a PostScript file by anonymous 
ftp (ftp sparky.fhcrc.org, cd blocks, binary, get blockman.ps).


                         ##########################
                         # Changes in Version 1.2 #
                         ##########################

1. The percentile of the score of the anchor block among anchor block scores
for 7,082 searches with shuffled queries is shown.

2. If the family is represented by a single block in the Blocks Database, no 
map is shown.

3. The anchor block score for a hit must be > 1000; previously it was >= 1000.

                         ##########################
                         # Changes in Version 1.1 #
                         ##########################

1. Previously, only the best single alignment was saved for each block with
each sequence (protein) or with each frame (DNA).  Now the best rank-ordered 
alignments in the entire search are saved, even though this might include
multiple hits on a single sequence.  This increases sensitivity somewhat, and 
also makes it possible to report multiple members of a duplication.  However,  
background also increases somewhat.

2. For all protein sequences and DNA sequences <=5000 bp, the best 400 hits 
(rather than 250 as before) are examined for multiple block hits.  For 
sequences >5000 bp, the best 1000 block hits are saved.

3. The maximum number of entries returned is now 10, rather than 5, for 
all protein sequences and all DNA sequences <10000 bp.  This maximum is
increased for sequences >10000 bp at the rate of 1 more entry for every 1000 
bp beyond 10000.

4. The P-value calculation has changed slightly, in part to accommodate the 
change described in 1. This should have little effect in practice. For a 
detailed description of how the P-value is calculated, contact 
henikoff@howard.fhcrc.org (Manuscript in preparation).  A submitted manuscript 
describing the BLOCKS e-mail server is also available upon request.
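
The size-dependent limits in items 2 and 3 above can be written out as
follows.  The function names are illustrative, and two details are
assumptions because the text does not specify them: a DNA sequence of exactly
10000 bp is treated like a shorter one, and partial thousands beyond 10000 bp
are rounded down.

```python
def hits_examined(is_dna, length_bp=0):
    """Number of top-ranked block hits kept for multiple-block analysis."""
    if is_dna and length_bp > 5000:
        return 1000
    return 400


def max_entries_returned(is_dna, length_bp=0):
    """Maximum number of numbered results in the report."""
    if not is_dna or length_bp <= 10000:
        return 10
    # One more entry for every full 1000 bp beyond 10000 (assumed floor).
    return 10 + (length_bp - 10000) // 1000


limits_4kb = (hits_examined(True, 4000), max_entries_returned(True, 4000))
limits_12kb = (hits_examined(True, 12500), max_entries_returned(True, 12500))
```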
