Genome Information Research Center, Osaka Univ.

BioSCAN e-mail server

        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        +                                                              +
        +  BBBBBB         OOOOO    SSSSSS   CCCCCC    AAA    N     N   +
        +  B     B    @  O     O  S        C         A   A   NN    N   +
        +  B     B    i  O     O  S        C        A     A  N N   N   +
        +  BBBBBBB   ii  O     O   SSSSS   C        AAAAAAA  N  N  N   +
        +  B     B   ii  O     O        S  C        A     A  N   N N   +
        +  B     B   ii  O     O        S  C        A     A  N    NN   +
        +  BBBBBB   iii   OOOOO   SSSSSS    CCCCCC  A     A  N     N   +
        +                                                              +
        +        _     _                                           ___ +
        +       | |   |_|       _       ___________ _____ _____ __/  / +
        +       | |__  _  ____ | |__   / _____/   // _  // _  //    /  +
        +       |    || ||  . ||    | /__   // / //  __// ___// /  /   +
        +       |__|_||_||__  ||__|_|/_____// __//____//____//____/    +
        +               |_____|            /_/  "Computer!"            +
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

                 UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
                      Department of Computer Science
                          CB #3175 Sitterson Hall
                         Chapel Hill, NC 27599-3175
                            Tel: (919) 962-1740
                            Fax: (919) 962-1799


 o AUTOMATIC E-mail Server
        --  bioscan@cs.unc.edu

 o ANONYMOUS FTP Server
        --  Host: ftp.cs.unc.edu (presently 152.2.128.159)
        --  Path: /pub/bioscan

 o OTHER Information
        --  bioscan-info@cs.unc.edu


==============================================================================
++++++++++++   UNC-Chapel Hill BioSCAN E-Mail Server Instructions ++++++++++++
==============================================================================
   ------------
1. Introduction
   ------------

BioSCAN (Biological Sequence Comparative Analysis Node) is a massively
parallel computer system designed for biological sequence similarity
analysis.  This system performs rapid, rigorous searches of DNA, RNA,
and protein sequence databases and is compatible with popular software
packages such as BLAST and FASTA.  BioSCAN uses a rigorous implementation
of linear similarity methods used in the BLAST systems.  It is also well
suited to multiscore applications using multiple PAM and BLOSUM scoring
matrices.

The server allows users to send a specially formatted mail message
containing a DNA, RNA, or protein query sequence to the BioSCAN server
at UNC-Chapel Hill.  A search is then performed against the specified
database, either as a whole or on a single predefined division, using the
special-purpose hardware accelerator, and the results are returned in a
mail message.

BioSCAN finds entry segments that are similar to query segments.  Given
a score matrix, a query sequence, a database of entry sequences, and an
expectation, BioSCAN finds all entry sequences that have gapfree similarity
to the query sequence.  Specifically, BioSCAN reports a result for an entry
sequence only if the least likely set of locally best gapfree alignments of
query segments and entry segments is not more likely than the expectation.

The query sequence and the entry sequences are strings of letters from an
alphabet that explicitly excludes gaps.  A segment is a contiguous substring
of letters from a sequence.  For example, "CAROL" and "LIN" are segments of
the query sequence "CAROLINA" but "CARLA" and "RONA" are not.

        CAROLINA
        CAROL
            LIN
        CAR L  A
          RO  NA
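
The segment definition above can be checked mechanically.  The following
is a minimal sketch in Python; the function name is_segment is made up for
this illustration and is not part of any BioSCAN software.

```python
def is_segment(candidate: str, sequence: str) -> bool:
    """Return True if candidate is a non-empty contiguous substring of sequence."""
    return len(candidate) > 0 and candidate in sequence


# The four words discussed above, tested against the query "CAROLINA".
for word in ("CAROL", "LIN", "CARLA", "RONA"):
    print(word, is_segment(word, "CAROLINA"))
```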

A gapfree alignment is the one-to-one pairwise mapping of each letter of one
segment with a letter of another segment.  It lacks insertions or deletions.
For example, the unique gapfree alignment of the query segment "CAROLIN"
and the entry segment "MARYANN" is:

Query   CAROLIN
Entry   MARYANN

Here the aligned pairs of letters are arranged in vertical columns.  It is
common to include a comment line between the aligned segments.  For example,
the comment line below echoes identities:

Query   CAROLIN
         AR   N
Entry   MARYANN
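
A comment line of this kind can be derived directly from the two aligned
segments.  The sketch below is illustrative only; identity_line is a
hypothetical helper, not a BioSCAN routine.

```python
def identity_line(query: str, entry: str) -> str:
    """Echo letters that are identical in both segments; blank out the rest."""
    if len(query) != len(entry):
        raise ValueError("a gapfree alignment pairs segments of equal length")
    return "".join(q if q == e else " " for q, e in zip(query, entry))


print("Query  ", "CAROLIN")
print("       ", identity_line("CAROLIN", "MARYANN"))
print("Entry  ", "MARYANN")
```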

The score matrix assigns an integer to each aligned pair of letters.  A
gapfree alignment score is the sum of the scores of the aligned letters.  An
alignment is locally best if making it longer or shorter at either end does
not produce an alignment with a higher alignment score [Sellers, 1980, 1984].
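
As a rough illustration of scoring, the sketch below sums pair scores over
a gapfree alignment and tests one simple reading of "locally best":  no
alignment obtained by moving only the left end, or only the right end,
scores higher.  The toy scores (+5 for an identity, -4 for a mismatch) and
all function names are assumptions made for this example; they are not the
PAM matrices or the procedure used by the BioSCAN hardware.

```python
def pair_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Toy score table (assumed values): identities score +5, all else -4."""
    return match if a == b else mismatch


def alignment_score(query: str, entry: str) -> int:
    """Sum of the scores of the aligned pairs of letters."""
    if len(query) != len(entry):
        raise ValueError("a gapfree alignment pairs segments of equal length")
    return sum(pair_score(q, e) for q, e in zip(query, entry))


def is_locally_best(query: str, entry: str, start: int, end: int) -> bool:
    """query and entry are paired letter for letter; [start, end) selects
    the alignment under test.  Try every other left end, then every other
    right end, and reject if any of them scores higher."""
    base = alignment_score(query[start:end], entry[start:end])
    for s in range(0, end):
        if s != start and alignment_score(query[s:end], entry[s:end]) > base:
            return False
    for e in range(start + 1, len(query) + 1):
        if e != end and alignment_score(query[start:e], entry[start:e]) > base:
            return False
    return True


print(alignment_score("CAROLIN", "MARYANN"))
print(is_locally_best("CAROLIN", "MARYANN", 1, 3))   # the "AR"/"AR" pair
print(is_locally_best("CAROLIN", "MARYANN", 0, 7))   # the full alignment
```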

The expectation is the number of gapfree alignments expected by chance when
searching the database with the query sequence.

If results of a BioSCAN database search are reported or published, please
note that the computation was done at UNC-Chapel Hill using the BioSCAN 
network server.  We ask that the following reference be cited:

	Singh, R.K., Tell, S.G., White, C.T., Hoffman, D., Chi, V.L., & 
 	Erickson, B.W. (1993) "A scalable systolic multiprocessor system for 
	analysis of biological sequences".  Research on Integrated Systems:  
	Proceedings of the 1993 Symposium, MIT Press, Cambridge, MA, 168-182.


   -----------------------------------
2. Accessing the BioSCAN E-Mail Server
   -----------------------------------

This section is based on the help document from NCBI for their BLAST E-mail
server.  To access the BioSCAN server, send an electronic mail message
containing a properly formatted request (as described below) to the
following Internet address:

        		bioscan@cs.unc.edu

If your site is not on the Internet, you may have to reformat the address
slightly.

   --------------
3. Obtaining Help
   --------------

To receive the current set of instructions on using the BioSCAN E-mail server,
send a help message to the regular BioSCAN E-mail server address:

        		bioscan@cs.unc.edu

Put the word "HELP" on a line by itself in the body of the mail message. 

For further information on the BioSCAN project and other related matters, 
send a mail message with your question to project members at the address:  
bioscan-info@cs.unc.edu.  Also, if you experience a delay in response to your
submission, you may wish to send your e-mail address and a telephone number 
to bioscan-info@cs.unc.edu with a description of your submission.


   ----------------------------------
4. Example of a BioSCAN E-Mail Search
   ----------------------------------

The format for BioSCAN queries is based on that accepted by the BLAST e-mail
service.  The first four lines in the example below comprise a mail message
header that is automatically created by a mail program and bundled with the
message when it is sent.  Nothing needs to be entered for the Subject of a
BioSCAN E-mail request; the Subject is ignored by the E-mail server.

The actual search request begins with the mandatory directive 'PROGRAM' in
the first column followed by the value 'bscan' (the name of the program).
The program 'bscan' handles the requests for both amino- and nucleic acid
sequences.  The next line contains the mandatory search directive 'DATALIB'
indicating which database is to be searched.  Available databases include
'sw', 'gb', 'pir'; for a complete list see section 8.  The third and fourth
lines contain the optional MATRIX and EXPECT directives with their desired
values.  The fifth line contains the mandatory 'BEGIN' directive, followed
by the query sequence in FASTA/Pearson format.  Each line of information
must be less than 80 characters in length.

     From:  yourname@yourmachine.your.domain.name  Tue Aug 15 21:36:38 1993
     Date:  15 Aug 1993 21:29:02-EDT
     To:    bioscan@cs.unc.edu
     Subject:  

     PROGRAM bscan
     DATALIB sw
     MATRIX PAM120
     EXPECT 0.01
     BEGIN
     >ENV_BAEVM  ENV POLYPROTEIN (CONTAINS: OUTER MEMBRANE PROTEIN GP70 A)
         MGFTTKIIFL YNLVLVYAGF DDPRKAIELV QKRYGRPCDC SGGQVSEPPS DRVSQVTCSG
         KTAYLMPDQR WKCKSIPKDT SPSGPLQECP CNSYQSSVHS SCYTSYQQCR SGNKTYYTAT
         LLKTQTGGTS DVQVLGSTNK LIQSPCNGIK GQSICWSTTA PIHVSDGGGP LDTTRIKSVQ
         RKLEEIHKAL YPELQYHPLA IPKVRDNLMV DAQTLNILNA TYNLLLMSNT SLVDDCWLCL
         KLGPPTPLAI PNFLLSYVTR SSDNISCLII PPLLVQPMQF SNSSCLFSPS YNSTEEIDLG
         HVAFSNCTSI TNVTGPICAV NGSVFLCGNN MAYTYLPTNW TGLCVLATLL PDIDIIPGDE
         PVPIPAIDHF IYRPKRAIQF IPLLAGLGIT AAFTTGATGL GVSVTQYTKL SNQLISDVQI
         LSSTIQDLQD QVDSLAEVVL QNRRGLDLLT AEQGGICLAL QEKCCFYVNK SGIVRDKIKT
         LQEELERRRK DLASNPLWTG LQGLLPYLLP FLGPLLTLLL LLTIGPCIFN RLTAFINDKL
         NIIHAMVLTQ QYQVLRTDEE AQD


   --------------------------
5. Formatting a Search Request
   --------------------------

A search request consists of a mail message with a set of search parameters
identifying the program (e.g., 'bscan' for proteins as well as nucleic
acids), the database to be searched, values related to the search parameters,
and the query sequence to be used in the search.  

Components of the mail message must be provided in this order:  two mandatory
directives (PROGRAM and DATALIB); any optional parameters or directives
described below; another mandatory directive (BEGIN); and finally the query
sequence on the remaining lines.  Each directive must be specified on a
separate line.
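
The ordering rule above can be captured in a small helper that assembles
the body of a request.  This is a sketch only:  build_request and
build_message are hypothetical names, and the code merely composes a
message with the Python standard library; it does not send anything.

```python
from email.message import EmailMessage


def build_request(datalib, fasta, program="bscan", matrix=None, expect=None):
    """Assemble the request body: PROGRAM, DATALIB, optional directives,
    BEGIN, then the query sequence, one directive per line."""
    lines = [f"PROGRAM {program}", f"DATALIB {datalib}"]
    if matrix is not None:
        lines.append(f"MATRIX {matrix}")
    if expect is not None:
        lines.append(f"EXPECT {expect}")
    lines.append("BEGIN")
    lines.extend(fasta.splitlines())
    lines.append("")  # blank line after the sequence, as section 7 recommends
    return "\n".join(lines) + "\n"


def build_message(sender, body):
    """Wrap the body in a mail message; the Subject is left out because
    the server ignores it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "bioscan@cs.unc.edu"
    msg.set_content(body)
    return msg


body = build_request("sw", ">QUERY1 example\nMGFTTKIIFL", "bscan", "PAM120", 0.01)
print(body)
```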

   -----------------
6. Server Directives
   -----------------

Below is a table of UNC BioSCAN E-mail server directives, some of which are
required to be present in every search request, and others which can be used to
alter the default behavior of the server.  

Attribute definitions:

  Mandatory = the directive must appear in every search request.  (Currently
           there are only 3 mandatory directives:  PROGRAM, DATALIB, and BEGIN)
  Numerical = the directive uses a numerical type of argument (e.g.,
           "EXPECT 100").
  Text = the directive uses a textual type of argument (e.g., "DATALIB gb").
  Boolean = the directive uses a Boolean type of argument ("yes", "no",
           "true", "false", "1", or "0")
  ArgumentRequired = the directive requires an argument.

 Server 
Directive  Attributes  Explanation
--------   ----------  -----------
PROGRAM    Mandatory    The PROGRAM directive is used to specify the
           Text 	particular BioSCAN program to execute.  Presently
			the only program available is 'bscan'; additional
			programs may be added in the future.  All message
			lines following the PROGRAM directive are checked
			for validity by the server.  Any erroneous
			directives appearing prior to the PROGRAM directive
			line will be ignored.  For this reason, it is
			advised that the PROGRAM directive be the very first
			line in every request, so that the complete search
			request will be validity-checked.

DATALIB    Mandatory    The DATALIB directive is used to indicate which database
           Text         should be searched (see the list of databases
           Argument-    in Section 8).  Only one database can be searched per
            Required    mail message.  Databases can only be searched in their
                        entirety; the only subsets available to be searched are
			the predefined ones listed in section 8.

EXPECT     Numerical    Statistical expectation from which the cutoff score
           Argument-    for reporting high-scoring segment pairs (HSPs) is
            Required    calculated; the default value is 10, such that 10
                        matches satisfying the cutoff score will be expected
                        to occur by chance alone.  Fractional values are
                        acceptable.
      
MATRIX     Text         Specify an alternate substitution scoring matrix.
           Argument-    The default matrix for peptide sequences is PAM120;
            Required    the default for nucleotide sequences is DPAM47I.

                        Alternate matrices that may be specified include
			PAM40, PAM120 and PAM250.  

BEGIN      Mandatory    This mandatory directive is not paired with any value.
                        It must appear after all other parameters and
                        immediately before the query sequence.
                                                            

The following directives are accepted by the BioSCAN E-Mail server for
compatibility with the NCBI BLAST server, but are presently ignored:

	HISTOGRAM
	DESCRIPTIONS
	ALIGNMENTS
	CUTOFF
	STRAND
	FILTER
	PATH
	SPLIT

No other directives or parameters besides those described above are
selectable through the E-mail server.  Only those directives marked
Mandatory in the Attributes column are required to be present in a BioSCAN
E-mail request.
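
Before mailing a request, it may help to check it locally against the
rules in this section.  The sketch below is a hypothetical client-side
check written for this document; it does not reproduce the server's own
validation, and it assumes directive names are written in upper case as in
the examples.

```python
MANDATORY = ("PROGRAM", "DATALIB", "BEGIN")
OPTIONAL = ("EXPECT", "MATRIX")
IGNORED = ("HISTOGRAM", "DESCRIPTIONS", "ALIGNMENTS", "CUTOFF",
           "STRAND", "FILTER", "PATH", "SPLIT")
# Directives marked Argument-Required in the table above.
NEEDS_ARGUMENT = ("DATALIB", "EXPECT", "MATRIX")


def check_request(body):
    """Return a list of problems found in a request body (empty if none)."""
    problems = []
    seen = []
    for line in body.splitlines():
        words = line.split()
        if not words:
            continue
        name, args = words[0], words[1:]
        if name not in MANDATORY + OPTIONAL + IGNORED:
            problems.append(f"unknown directive: {name}")
            continue
        seen.append(name)
        if name == "BEGIN":
            break  # everything after BEGIN is the query sequence
        if name in NEEDS_ARGUMENT and not args:
            problems.append(f"{name} requires an argument")
    for name in MANDATORY:
        if name not in seen:
            problems.append(f"missing mandatory directive: {name}")
    if seen[:2] != ["PROGRAM", "DATALIB"]:
        problems.append("PROGRAM and DATALIB must come first, in that order")
    return problems


print(check_request("PROGRAM bscan\nDATALIB sw\nEXPECT 0.01\nBEGIN\n>Q\nMGF\n"))
```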

    -------------------------
7.  The Query Sequence Format
    -------------------------

The query sequence begins on the line immediately following the BEGIN
directive and must be in FASTA/Pearson format.  A sequence in FASTA/Pearson
format starts with a mandatory comment line with a greater-than sign '>' in
the first column, followed by an arbitrary name for the sequence, a space,
and an optional, arbitrary description.  The sequence actually begins on the
second line, and continues on subsequent lines as necessary.  For the
BioSCAN E-mail server, all lines should be less than 80 characters in length.

     >MNKSV40 Monkey DNA fragment of unknown function, acquired by Simian virus
     ggttaaaatggtgatttttatgctttgtgtattttaccacttttttttttttaaggcaga
     ttcctttcaatcatctgagtgagcccagtgcgatctgaagggtccctacaggtggaagag
     gcagtggccaggatcgcggt

Sequences are expected to be represented using the standard IUB/IUPAC amino
acid and nucleic acid codes, with the exception that lower-case letters are
acceptable and are mapped into upper-case.  Before submitting a request, any
numbers in the query sequence should either be removed or replaced by
appropriate letter codes (e.g., N for unknown nucleic acid or X for unknown
amino acid).  Only one query sequence is allowed per mail message.  

Mail programs typically allow the user to import a file containing a sequence
into the mail message.  Assuming the sequence is already in FASTA/Pearson
format, the sequence file should be imported into the mail message on the
line after the 'BEGIN' directive.  Please follow the format in the example in
Section 4.  A blank line after the last line of the query sequence is
recommended since some mailers automatically append a signature block and the
server needs to identify the last line of the sequence.
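
The preparation steps above (strip numbers, keep lines short, end with a
blank line) can be sketched as follows.  The function to_fasta and its
defaults are assumptions made for this illustration:  it removes digits
rather than replacing them with N or X, wraps the sequence at 60 letters
per line, and cuts the comment line to 79 characters.

```python
def to_fasta(name, description, sequence, width=60):
    """Return a FASTA/Pearson record ready to paste after the BEGIN line."""
    # Drop whitespace and position numbers; letters are left untouched.
    letters = "".join(ch for ch in sequence
                      if not ch.isspace() and not ch.isdigit())
    header = f">{name} {description}".rstrip()[:79]
    chunks = [letters[i:i + width] for i in range(0, len(letters), width)]
    # The trailing empty line separates the sequence from any signature block.
    return "\n".join([header] + chunks) + "\n\n"


print(to_fasta("MNKSV40", "Monkey DNA fragment",
               "1 ggttaaaatg gtgattttta 21 tgctttgtgt"))
```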


    -------------------
8.  Available Databases
    -------------------

The following databases are available for BioSCAN searching on the E-Mail
server.  The names shown are for use in DATALIB directives, to select the
database to search.  Predefined subsets of certain databases are also
available.

Peptide Sequence Databases:

	Name		Description
	===========	=============
  
	sw		the last major release of the SWISS-PROT protein
			sequence database (no updates)

	pir		the last major release of the PIR(R) protein
			sequence database

Nucleotide Sequence Databases:

	Name		Description
	===========	=============

	gb		All divisions of the last major release of the
			GenBank(R) nucleotide sequence database (does not
			include updates)

Each division of the GenBank nucleotide sequence database can be accessed
with the following names:

	gbbct		Bacterial Sequences
	gbest		Expressed Sequence Tag Sequences
	gbinv		Invertebrate Sequences
	gbmam		Other Mammalian Sequences
	gbpat		Patent Sequences
	gbphg		Phage Sequences
	gbpln		Plant Sequences
	gbpri		Primate Sequences
	gbrna		Structural RNA Sequences
	gbrod		Rodent Sequences
	gbsyn		Synthetic Sequences
	gbuna		Unannotated Sequences
	gbvrl		Viral Sequences
	gbvrt		Other Vertebrate Sequences
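
Since a peptide query should be paired with a peptide database and a
nucleotide query with a nucleotide database, a client may want to look up
a DATALIB name before sending a request.  The table below simply restates
the names listed in this section; the function database_kind is a
hypothetical helper.

```python
PEPTIDE = {"sw": "SWISS-PROT", "pir": "PIR"}

GENBANK_DIVISIONS = {
    "gbbct": "Bacterial", "gbest": "Expressed Sequence Tag",
    "gbinv": "Invertebrate", "gbmam": "Other Mammalian",
    "gbpat": "Patent", "gbphg": "Phage", "gbpln": "Plant",
    "gbpri": "Primate", "gbrna": "Structural RNA", "gbrod": "Rodent",
    "gbsyn": "Synthetic", "gbuna": "Unannotated", "gbvrl": "Viral",
    "gbvrt": "Other Vertebrate",
}


def database_kind(name):
    """Return 'peptide' or 'nucleotide' for a DATALIB name listed above."""
    if name in PEPTIDE:
        return "peptide"
    if name == "gb" or name in GENBANK_DIVISIONS:
        return "nucleotide"
    raise ValueError(f"not a listed DATALIB name: {name}")


print(database_kind("sw"), database_kind("gbpri"))
```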

    -----------------------------------
9.  Obtaining Complete Sequence Records
    -----------------------------------

Complete sequence records are not available from the BioSCAN E-Mail server.
They are available from the NCBI RETRIEVE server.  Sequence records
can be retrieved either by locus name or by accession number.  To obtain
full instructions on using the RETRIEVE server, send a help message to:

        retrieve@ncbi.nlm.nih.gov

Put the word 'HELP' on a single line in the body of the mail message.  No
subject line is needed.

     ---------------------------------
10.  Obtaining BioSCAN Client Software
     ---------------------------------

The BioSCAN hardware system can be accessed directly from sites on the
Internet without going through the mail server.  Direct access over the
Internet can result in much faster response time.  In addition, the output
from the client software can be piped directly into additional processing
steps without the intermediate step of receiving and saving the E-Mail
response.

The BioSCAN Client software package includes both complete application
programs and a subroutine library that can be used to develop your own
applications.  The package is known to run on SunOS 4.1 Unix and on
VAX/VMS.  It is usable on other versions of Unix supporting BSD-style
sockets with little or no modification.

The Client package is available via anonymous ftp from ftp.cs.unc.edu; the
file containing the current version of the package is
/pub/bioscan/Client_Beta_2.tar.Z.  Instructions for installing the package
are contained in the package itself and separately on ftp.cs.unc.edu in
the file /pub/bioscan/install.txt.

-----------------------------------------------------------------------------
file: /home/bioscan/mailserver/bioscan.hlp   Last Changed: September 29, 1993