++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                           BioSCAN                            +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

          UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
                Department of Computer Science
                    CB #3175 Sitterson Hall
                  Chapel Hill, NC 27599-3175
                      Tel: (919) 962-1740
                      Fax: (919) 962-1799

  o AUTOMATIC E-mail Server -- bioscan@cs.unc.edu
  o ANONYMOUS FTP Server    -- Host: ftp.cs.unc.edu (presently 152.2.128.159)
                            -- Path: /pub/bioscan
  o OTHER Information       -- bioscan-info@cs.unc.edu

==============================================================================
++++++++++++ UNC-Chapel Hill BioSCAN E-Mail Server Instructions ++++++++++++
==============================================================================

   ------------
1. Introduction
   ------------

BioSCAN (Biological Sequence Comparative Analysis Node) is a massively
parallel computer system designed for biological sequence similarity
analysis. This system performs rapid, rigorous searches of DNA, RNA, and
protein sequence databases and is compatible with popular software packages
such as BLAST and FASTA. BioSCAN uses a rigorous implementation of the
linear similarity methods used in the BLAST systems. It is also well suited
to multiscore applications using multiple PAM and BLOSUM scoring matrices.

The server allows users to send a specially formatted mail message
containing a DNA, RNA, or protein query sequence to the BioSCAN server at
UNC-Chapel Hill.
A search is then performed against the specified database, either as a
whole or on one of its divisions, using the special-purpose hardware
accelerator, and the results are returned in a mail message.

BioSCAN finds entry segments that are similar to query segments. Given a
score matrix, a query sequence, a database of entry sequences, and an
expectation, BioSCAN finds all entry sequences that have gapfree similarity
to the query sequence. Specifically, BioSCAN reports a result for an entry
sequence only if the least likely set of locally best gapfree alignments of
query segments and entry segments is not more likely than the expectation.

The query sequence and the entry sequences are strings of letters from an
alphabet that explicitly excludes gaps. A segment is a contiguous substring
of letters from a sequence. For example, "CAROL" and "LIN" are segments of
the query sequence "CAROLINA" but "CARLA" and "RONA" are not.

        CAROLINA
        CAROL
            LIN
        CAR L  A
          RO  NA

A gapfree alignment is the one-to-one pairwise mapping of each letter of
one segment with a letter of another segment. It lacks insertions or
deletions. For example, the unique gapfree alignment of the query segment
"CAROLIN" and the entry segment "MARYANN" is:

        Query   CAROLIN
        Entry   MARYANN

Here the aligned pairs of letters are arranged in vertical columns. It is
common to include a comment line between the aligned segments. For example,
the comment line below echoes identities:

        Query   CAROLIN
                 AR   N
        Entry   MARYANN

The score matrix assigns an integer to each aligned pair of letters. A
gapfree alignment score is the sum of the scores of the aligned pairs of
letters. An alignment is locally best if making it longer or shorter at
either end does not produce an alignment with a higher alignment score
[Sellers, 1980, 1984]. The expectation is the number of gapfree alignments
expected by chance when searching the database with the query sequence.
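
The definitions above can be illustrated with a small Python sketch. It is
not the BioSCAN implementation: the function names are hypothetical, the
score table (+1 for an identity, -1 for a mismatch) is an assumption chosen
only for readability, and the search for a best segment looks at a single
diagonal (both sequences aligned from their first letters) using an ordinary
maximum-sum scan.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy score table (assumed): +1 for an identity, -1 otherwise."""
    return match if a == b else mismatch


def is_segment(segment, sequence):
    """A segment is a contiguous substring of a sequence."""
    return segment in sequence


def gapfree_score(query_seg, entry_seg, score=pair_score):
    """Sum the scores of the aligned letter pairs (no insertions/deletions)."""
    if len(query_seg) != len(entry_seg):
        raise ValueError("a gapfree alignment pairs segments of equal length")
    return sum(score(q, e) for q, e in zip(query_seg, entry_seg))


def identity_line(query_seg, entry_seg):
    """Comment line that echoes identities and leaves other columns blank."""
    return "".join(q if q == e else " " for q, e in zip(query_seg, entry_seg))


def best_segment(query, entry, score=pair_score):
    """Highest-scoring gapfree segment pair on the offset-0 diagonal.

    Returns (score, start, end) with end exclusive. The highest-scoring
    segment is locally best: extending or trimming it cannot raise its score.
    """
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (q, e) in enumerate(zip(query, entry)):
        if run_score <= 0:          # a non-positive prefix never helps
            run_score, run_start = 0, i
        run_score += score(q, e)
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


if __name__ == "__main__":
    print(is_segment("CAROL", "CAROLINA"))          # True
    print(is_segment("CARLA", "CAROLINA"))          # False
    print(gapfree_score("CAROLIN", "MARYANN"))      # -1
    print(repr(identity_line("CAROLIN", "MARYANN")))  # ' AR   N'
    print(best_segment("CAROLIN", "MARYANN"))       # (2, 1, 3), i.e. "AR"
```

With a real substitution matrix such as PAM120, only pair_score would need
to change; the other functions depend on it solely through the score
argument.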

If results of a BioSCAN database search are reported or published, please
note that the computation was done at UNC-Chapel Hill using the BioSCAN
network server. We ask that the following reference be cited:

    Singh, R.K., Tell, S.G., White, C.T., Hoffman, D., Chi, V.L., &
    Erickson, B.W. (1993) "A scalable systolic multiprocessor system for
    analysis of biological sequences". Research on Integrated Systems:
    Proceedings of the 1993 Symposium, MIT Press, Cambridge, MA, 168-182.

   -----------------------------------
2. Accessing the BioSCAN E-Mail Server
   -----------------------------------

This section is based on the help document from NCBI for their BLAST E-mail
server.

To access the BioSCAN server, send an electronic mail message containing a
properly formatted request (as described below) to the following Internet
address:

        bioscan@cs.unc.edu

If your site is not on the Internet, you may have to reformat the address
slightly.

   --------------
3. Obtaining Help
   --------------

To receive the current set of instructions on using the BioSCAN E-mail
server, send a help message to the regular BioSCAN E-mail server address:

        bioscan@cs.unc.edu

Put the word "HELP" on a line by itself in the body of the mail message.

For further information on the BioSCAN project and other related matters,
send a mail message with your question to project members at the address:

        bioscan-info@cs.unc.edu

Also, if you experience a delay in response to your submission, you may
wish to send your e-mail address and a telephone number to
bioscan-info@cs.unc.edu with a description of your submission.

   ----------------------------------
4. Example of a BioSCAN E-Mail Search
   ----------------------------------

The format for BioSCAN queries is based on that accepted by the BLAST
e-mail service. The first four lines in the example below comprise a mail
message header that is automatically created by a mail program and bundled
with the message when it is sent.
Nothing needs to be entered for the Subject of a BioSCAN E-mail request;
the Subject is ignored by the E-mail server.

The actual search request begins with the mandatory directive 'PROGRAM' in
the first column followed by the value 'bscan' (the name of the program).
The program 'bscan' handles requests for both amino acid and nucleic acid
sequences. The next line contains the mandatory search directive 'DATALIB'
indicating which database is to be searched. Available databases include
'sw', 'gb', and 'pir'; for a complete list see Section 8. The third and
fourth lines contain the optional MATRIX and EXPECT directives with their
desired values. The fifth line contains the mandatory 'BEGIN' directive,
and the query sequence in FASTA/Pearson format follows on the next line.
Each line of information must be less than 80 characters in length.

    From: yourname@yourmachine.your.domain.name Tue Aug 15 21:36:38 1993
    Date: 15 Aug 1992 21:29:02-EDT
    To: bioscan@cs.unc.edu
    Subject:

    PROGRAM bscan
    DATALIB sw
    MATRIX PAM120
    EXPECT 0.01
    BEGIN
    >ENV_BAEVM ENV POLYPROTEIN (CONTAINS: OUTER MEMBRANE PROTEIN GP70 A)
    MGFTTKIIFL YNLVLVYAGF DDPRKAIELV QKRYGRPCDC SGGQVSEPPS DRVSQVTCSG
    KTAYLMPDQR WKCKSIPKDT SPSGPLQECP CNSYQSSVHS SCYTSYQQCR SGNKTYYTAT
    LLKTQTGGTS DVQVLGSTNK LIQSPCNGIK GQSICWSTTA PIHVSDGGGP LDTTRIKSVQ
    RKLEEIHKAL YPELQYHPLA IPKVRDNLMV DAQTLNILNA TYNLLLMSNT SLVDDCWLCL
    KLGPPTPLAI PNFLLSYVTR SSDNISCLII PPLLVQPMQF SNSSCLFSPS YNSTEEIDLG
    HVAFSNCTSI TNVTGPICAV NGSVFLCGNN MAYTYLPTNW TGLCVLATLL PDIDIIPGDE
    PVPIPAIDHF IYRPKRAIQF IPLLAGLGIT AAFTTGATGL GVSVTQYTKL SNQLISDVQI
    LSSTIQDLQD QVDSLAEVVL QNRRGLDLLT AEQGGICLAL QEKCCFYVNK SGIVRDKIKT
    LQEELERRRK DLASNPLWTG LQGLLPYLLP FLGPLLTLLL LLTIGPCIFN RLTAFINDKL
    NIIHAMVLTQ QYQVLRTDEE AQD

   ---------------------------
5. Formatting a Search Request
   ---------------------------

A search request consists of a mail message with a set of search parameters
identifying the program (e.g., 'bscan' for proteins as well as nucleic
acids), the database to be searched, values related to the search
parameters, and the query sequence to be used in the search.

Components of the mail message must be provided in this order: two
mandatory directives (PROGRAM and DATALIB); any optional parameters or
directives described below; another mandatory directive (BEGIN); and
finally the query sequence on the remaining lines. Each directive must be
specified on a separate line.

   -----------------
6. Server Directives
   -----------------

Below is a table of UNC BioSCAN E-mail server directives. Some must be
present in every search request; the others can be used to alter the
default behavior of the server.

Attribute definitions:

  Mandatory        = the directive must appear in every search request.
                     (Currently there are only 3 mandatory directives:
                     PROGRAM, DATALIB, and BEGIN)
  Numerical        = the directive uses a numerical type of argument
                     (e.g., "EXPECT 100").
  Text             = the directive uses a textual type of argument
                     (e.g., "DATALIB gb").
  Boolean          = the directive uses a Boolean type of argument
                     ("yes", "no", "true", "false", "1", or "0")
  ArgumentRequired = the directive requires an argument.

Server
Directive   Attributes        Explanation
---------   ----------        -----------
PROGRAM     Mandatory         The PROGRAM directive is used to specify the
            Text              particular BioSCAN program to execute.
                              Presently the only program available is
                              'bscan'; additional programs may be added in
                              the future. All message lines following the
                              PROGRAM directive are checked for validity by
                              the server. Any erroneous directives
                              appearing prior to the PROGRAM directive line
                              will be ignored. For this reason, it is
                              advised that the PROGRAM directive be the
                              very first line in every request, so that the
                              complete search request will be
                              validity-checked.

DATALIB     Mandatory         The DATALIB directive is used to indicate
            Text              which database should be searched (see the
            ArgumentRequired  list of databases in Section 8). Only one
                              database can be searched per mail message.
                              Databases can only be searched in their
                              entirety; the only subsets available to be
                              searched are the predefined ones listed in
                              Section 8.

EXPECT      Numerical         Statistical expectation from which the cutoff
            ArgumentRequired  score for reporting high-scoring segment
                              pairs (HSPs) is calculated; the default value
                              is 10, such that 10 matches satisfying the
                              cutoff score will be expected to occur by
                              chance alone. Fractional values are
                              acceptable.

MATRIX      Text              Specify an alternate substitution scoring
            ArgumentRequired  matrix. The default matrix for peptide
                              sequences is PAM120; the default for
                              nucleotide sequences is DPAM47I/
                              Alternate matrices that may be specified
                              include PAM40, PAM120 and PAM250.

BEGIN       Mandatory         This mandatory directive is not paired with
                              any value. It must appear after all other
                              parameters and immediately before the query
                              sequence.

The following directives are accepted by the BioSCAN E-Mail server for
compatibility with the NCBI BLAST server, but are presently ignored:

        HISTOGRAM
        DESCRIPTIONS
        ALIGNMENTS
        CUTOFF
        STRAND
        FILTER
        PATH
        SPLIT

No other directives or parameters besides those described above are
selectable through the E-mail server. Only those directives marked
Mandatory in the Attributes column are required to be present in a BioSCAN
E-mail request.

   -------------------------
7. The Query Sequence Format
   -------------------------

The query sequence begins on the line immediately following the BEGIN
directive and must be in FASTA/Pearson format. A sequence in FASTA/Pearson
format starts with a mandatory comment line with a greater-than sign '>' in
the first column, followed by an arbitrary name for the sequence, a space,
and an optional, arbitrary description.
The sequence actually begins on the second line, and continues on
subsequent lines as necessary. For the BioSCAN E-mail server, all lines
should be 80 characters or less in length.

    >MNKSV40 Monkey DNA fragment of unknown function, acquired by Simian virus
    ggttaaaatggtgatttttatgctttgtgtattttaccacttttttttttttaaggcaga
    ttcctttcaatcatctgagtgagcccagtgcgatctgaagggtccctacaggtggaagag
    gcagtggccaggatcgcggt

Sequences are expected to be represented using the standard IUB/IUPAC amino
acid and nucleic acid codes, with the exception that lower-case letters are
acceptable and are mapped into upper-case. Before submitting a request, any
numbers in the query sequence should either be removed or replaced by
appropriate letter codes (e.g., N for unknown nucleic acid or X for unknown
amino acid). Only one query sequence is allowed per mail message.

Mail programs typically allow the user to import a file containing a
sequence into the mail message. Assuming the sequence is already in
FASTA/Pearson format, the sequence file should be imported into the mail
message on the line after the 'BEGIN' directive. Please follow the format
in the example in Section 4. A blank line after the last line of the query
sequence is recommended since some mailers automatically append a signature
block and the server needs to identify the last line of the sequence.

   -------------------
8. Available Databases
   -------------------

The following databases are available for BioSCAN searching on the E-Mail
server. The names shown are for use in DATALIB directives, to select the
database to search. Predefined subsets of certain databases are also
available.

Peptide Sequence Databases:

  Name         Description
  ===========  =============
  sw           the last major release of the SWISS-PROT protein sequence
               database (no updates)
  pir          the last major release of the PIR(R) protein sequence
               database

Nucleotide Sequence Databases:

  Name         Description
  ===========  =============
  gb           All divisions of the last major release of the GenBank(R)
               nucleotide sequence database (does not include updates)

Each division of the GenBank nucleotide sequence database can be accessed
with the following names:

  gbbct        Bacterial Sequences
  gbest        Expressed Sequence Tag Sequences
  gbinv        Invertebrate Sequences
  gbmam        Other Mammalian Sequences
  gbpat        Patent Sequences
  gbphg        Phage Sequences
  gbpln        Plant Sequences
  gbpri        Primate Sequences
  gbrna        Structural RNA Sequences
  gbrod        Rodent Sequences
  gbsyn        Synthetic Sequences
  gbuna        Unannotated Sequences
  gbvrl        Viral Sequences
  gbvrt        Other Vertebrate Sequences

   -----------------------------------
9. Obtaining Complete Sequence Records
   -----------------------------------

Complete sequence records are not available from the BioSCAN E-Mail server.
They are available from the NCBI RETRIEVE server. Sequence records can be
retrieved either by locus name or by accession number. To obtain full
instructions on using the RETRIEVE server, send a help message to:

        retrieve@ncbi.nlm.nih.gov

Put the word 'HELP' on a single line in the body of the mail message. No
subject line is needed.

    ---------------------------------
10. Obtaining BioSCAN Client Software
    ---------------------------------

The BioSCAN hardware system can be accessed directly from sites on the
Internet without going through the mail server. Direct access over the
Internet can result in much faster response time. In addition, the output
from the client software can be piped directly into additional processing
steps, without the intermediate step of receiving and saving the E-Mail
response.

The BioSCAN Client software package includes both complete application
programs and a subroutine library that can be used to develop your own
applications. The package is known to run on SunOS 4.1 Unix and on VAX/VMS.
It is usable on other versions of Unix supporting BSD-style sockets with
little or no modification.

The Client package is available via anonymous ftp from "ftp.cs.unc.edu";
the file containing the current version of the package is
/pub/bioscan/Client_Beta_2.tar.Z. Instructions for installing the package
are contained in the package itself and separately on ftp.cs.unc.edu in the
file /pub/bioscan/install.txt.

-----------------------------------------------------------------------------
file: /home/bioscan/mailserver/bioscan.hlp
Last Changed: September 29, 1993