FASTA E-Mail Server on GenomeNet
The FASTA program was developed by Bill Pearson and David Lipman for
nucleic acid and protein sequence similarity searching of databases and is
made available for use on the GenomeNet E-mail server. If you use FASTA
as a research tool, we ask that the following reference be cited in your
paper:
Pearson, W.R. and Lipman, D.J. 1988. Improved Tools for
Biological Sequence Comparison. Proc. Natl. Acad. Sci.,
85: 2444-2448.
The GenomeNet FASTA server allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA server at the Supercomputer Laboratory, Institute for Chemical
Research, Kyoto University. A FASTA sequence similarity search is then
performed against the specified database using the FASTA program. The
results of the FASTA search will be returned to you in a reply mail
message.
Accessing the FASTA program
To access the program, send an electronic mail message containing the
formatted query sequence (as described below) to the following Internet
address:
fasta@genome.ad.jp
If you are not on the Internet, you may need to change the format of the
address. Consult your systems manager to determine the correct
address format.
Obtaining Help
If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "help" on
a single line. This document is then returned to you in a mail message.
Databases for use with FASTA
The following databases are currently available for FASTA searches:
Designator                 Database
----------                 --------
nr-nt                      Non-redundant nucleic acid sequence
                           database constructed from genbank,
                           embl, genbank-upd
genbank                    GenBank nucleic acid sequence database
                           (Latest quarterly release)
genbank-upd                Cumulative daily updates of GenBank
                           since latest quarterly release
genbank/primate            GenBank subdivisions
genbank/rodent             (Latest quarterly release)
genbank/other_mammalian
genbank/other_vertebrate
genbank/invertebrate
genbank/plant
genbank/bacterial
genbank/structural_rna
genbank/viral
genbank/phage
genbank/synthetic
genbank/unannotated
genbank/est
genbank/patent
embl                       EMBL nucleic acid sequence database
                           (Latest quarterly release)
embl/bacteriophage         EMBL subdivisions
embl/fungi                 (Latest quarterly release)
embl/invertebrate
embl/organelle
embl/other_mammalian
embl/other_vertebrate
embl/plant
embl/primate
embl/prokaryote
embl/rodent
embl/synthetic
embl/unannotated
embl/viral
nr-aa                      Non-redundant protein sequence database
                           constructed from swissprot, pir, prf,
                           genpept, genpept-upd
swissprot                  SWISS-PROT protein sequence database
pir                        PIR protein sequence database
prf                        PRF protein sequence database
pdbstr                     Protein Data Bank reorganized as
                           a sequence database
Formatting a Query
Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search, and the
query sequence to be used in the search. The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below. These lines are typed into the body of
the mail message in the order shown below:
Search
Parameter  Mandatory  Explanation
---------  ---------  -----------
PROGRAM    No         Specify 'tfasta' to compare a protein
                      sequence against a nucleic acid sequence
                      database translating in all 6 or 3 frames.
                      If omitted, 'fasta' is assumed.
DATALIB    Yes        This line specifies the database to be
                      searched (as described above) for the query
                      sequence and must be included in the message.
KTUP       No         This line identifies the Ktup value which
                      specifies the sensitivity of the search.
                      Values range between 3 and 6 for nucleic acid
                      searches and between 1 and 2 for protein
                      searches. Lower values specify more sensitive
                      searches but require more time to complete.
                      The default value for nucleic acids is 6 and
                      2 for proteins.
           No         Options may be specified as described in the
                      attached manual page. Note that the -Q
                      (quiet) option is always assumed.
BEGIN      Yes        This line must be included in the message.
                      It should be the last parameter and followed
                      on the next line with the query sequence.
                      See below for the format of the query
                      sequence.
The remainder of the message contains the query sequence in Pearson FASTA
format.
Preparing Files for Similarity Searches
Only one query sequence is allowed per mail message and your sequence
must be in FASTA format. If your sequence file is in another format,
you can use a text editor to change the file to FASTA format.
The format includes a mandatory comment line beginning with a greater-than
sign ">" followed by the name of the sequence, a space, and an optional
note about the sequence. The sequence data begin on the next line without
the greater-than sign. For example:
>AGREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa
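The format described above can be produced with a few lines of code instead
of a text editor. The following is a minimal sketch in Python, not part of
the FASTA distribution; the function name to_fasta and the default width of
60 residues per line are assumptions made for illustration. Any width that
keeps lines under 80 characters would do.

```python
def to_fasta(name, sequence, note="", width=60):
    """Return `sequence` as FASTA-format text (hypothetical helper).

    The first line is the mandatory comment line: '>' followed by the
    sequence name, a space, and an optional note. The residues follow
    on the next lines, wrapped at `width` characters.
    """
    # Remove blanks and line breaks so text pasted from another
    # format is joined into one continuous string before wrapping.
    residues = "".join(sequence.split())
    header = ">" + name + (" " + note if note else "")
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + lines) + "\n"
```

For example, to_fasta("AGREP4", raw_text, note="Monkey SV40-like genomic
segment") returns text that can be placed directly after the BEGIN line.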
Sending the Query Sequence
Use your local mail program to send your query sequence. Most
mail programs allow you to import a file containing your sequence into
the mail message. You should import your sequence file into the mail
message on the line after "BEGIN". Please follow the format in the
following example of a FASTA request PRECISELY, but note that the
program is case-insensitive, i.e. either upper or lower case letters
may be used.
Here is an example of a mail message sent for a FASTA search. Note that
the first four lines are a mail header that is automatically created
when you address a mail message. Nothing need be entered for the
Subject. NOTE: the text that you enter into the body of the message
begins with the "DATALIB" keyword below (do not add blank lines in the
message). Each line of information must be less than 80 characters in
length. Longer lines may be truncated.
From: somebody@someaddress.somewhere.ac.jp Wed May 27 16:34:06 1992
Date: Wed, 27 May 92 16:38:02 PST
To: fasta@genome.ad.jp
Subject:
DATALIB GenBank/other_mammalian
KTUP 4
-b 100 -d 20
BEGIN
>BOVPRL GenBank entry BOVPRL from gbmam file.907 nucleotides.
tgcttggctgaggagccataggacgagagcttcctggtgaagtgtgtttcttgaaatcat
caccaccatggacagcaaa
The example above requests that the fasta program run with KTUP=4
and that the numbers of similarity scores and alignments displayed be
100 and 20, respectively.
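A request body like the one in the example can also be assembled by a
script. The sketch below is an illustration only: the function name
build_fasta_request and its arguments are hypothetical, and it simply
follows the rules stated in this document (parameter order, no blank lines,
each line under 80 characters). It does not send mail; the returned text
would be handed to your local mail program.

```python
def build_fasta_request(datalib, fasta_text, program=None, ktup=None,
                        options=None):
    """Build the body of a mail message for the FASTA server (sketch)."""
    lines = []
    if program:                      # optional; 'fasta' is assumed if omitted
        lines.append("PROGRAM " + program)
    lines.append("DATALIB " + datalib)   # mandatory
    if ktup is not None:             # optional
        lines.append("KTUP " + str(ktup))
    if options:                      # optional, e.g. "-b 100 -d 20"
        lines.append(options)
    lines.append("BEGIN")            # mandatory, last parameter line
    lines.extend(fasta_text.strip().splitlines())

    # Enforce the two formatting rules given in the instructions.
    for line in lines:
        if not line.strip():
            raise ValueError("blank lines are not allowed in the message")
        if len(line) >= 80:
            raise ValueError("line must be less than 80 characters")
    return "\n".join(lines) + "\n"
```

Calling build_fasta_request("genbank/other_mammalian", query, ktup=4,
options="-b 100 -d 20") yields the same parameter lines as the example
message above.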
Handling the Results of a FASTA Search
When the results are returned, use your local mail program to retrieve
them. You can transfer the results of a FASTA search to a separate
disk file to free up space in your mail directory. Consult the
documentation for your local mail program for the commands to read
and transfer mail.
Retrieving Individual Entries Found in FASTA searches
Database entries can be retrieved by either entry name or accession
number. To use the GenomeNet database retrieval server, send an electronic
mail message to:
dbget@genome.ad.jp
To get started, send a mail message to the address above containing the
word "help" on a single line in the body of the mail message.
Last Update: 93/04/08
==========================================================================
FASTA Manual Page (Refer to the first form of fasta and tfasta only.)
FASTA/TFASTA/LFASTAv1.5(1)    USER COMMANDS    FASTA/TFASTA/LFASTAv1.5(1)
NAME
fasta - scan a protein or DNA sequence library for similar
sequences
tfasta - compare a protein sequence to a DNA sequence
library, translating the DNA sequence library `on-the-fly'.
lfasta - compare two protein or DNA sequences for local
similarity and show the local sequence alignments
plfasta - compare two sequences for local similarity and
plot the local sequence alignments
SYNOPSIS
fasta [-a -b # -c # -d # -[f|k] -g # -l FASTLIBS -r STATFILE
-m # -o -p # -Q -s SMATRIX -w # -1 ] query-sequence-file
library-file [ ktup ]
fasta [-Qacglmnoprswx] query-file @library-name-file
fasta [-Qacglmnoprswx] query-file "%PRMVI"
fasta [-acglmnoprswx] - interactive mode
tfasta [-abcdfgmoprsw3] protein-query-file DNA-library [ ktup ]
lfasta [-amnpswx] sequence-file-1 sequence-file-2 [ ktup ]
plfasta [-amnpsxv] sequence-file-1 sequence-file-2 [ ktup ]
DESCRIPTION
fasta is used to compare a protein or DNA sequence to all of
the entries in a sequence library. For example, fasta can
compare a protein sequence to all of the sequences in the
NBRF PIR protein sequence database. fasta will automatically
decide whether the query sequence is DNA or protein by
reading the query sequence as protein and determining
whether the `amino-acid composition' is more than 85%
A+C+G+T. fasta uses an improved version of the rapid
sequence comparison algorithm described by Lipman and
Pearson (Science, (1985) 227:1427); the improved version is
described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA,
(1988) 85:2444. The program can be invoked either with
command line arguments or in interactive mode. The optional
third argument, ktup, sets the sensitivity and speed of the
search. If ktup=2, similar regions in the two sequences
being compared are found by looking at pairs of aligned
residues; if ktup=1, single aligned amino acids are
examined. ktup can be set to 2 or 1 for protein sequences,
or from 1 to 6 for DNA sequences.
Sun Release 4.1 Last change: local 1
The default if ktup is not specified is 2 for proteins and 6
for DNA.
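The DNA-versus-protein decision and the default ktup described above can be
illustrated with a short Python sketch. This is not the fasta source code:
the function names are hypothetical, and how the real program treats
ambiguity codes or non-letter characters is not stated here, so this sketch
simply counts letters.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Guess whether `sequence` is DNA using the stated 85% A+C+G+T rule."""
    # Assumption: only alphabetic characters count as residues.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    # "More than 85%" is read as a strict inequality.
    return acgt / len(residues) > threshold


def default_ktup(sequence):
    """Return the documented default ktup: 6 for DNA, 2 for protein."""
    return 6 if looks_like_dna(sequence) else 2
```

A short protein made up mostly of A, C, G and T residues would be
misclassified by a rule of this kind, which is one reason the -n option
exists to force the query to be treated as DNA.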
fasta compares a query sequence to a sequence library which
consists of sequence data interspersed with comments, see
below. Normally fasta and tfasta search the libraries
listed in the file pointed to by the environment variable
FASTLIBS. The format of this file is described in the file
FASTA.DOC. tfasta compares a protein sequence to a DNA
sequence database, translating the DNA sequence library in 6
frames `on-the-fly' (3 frames with the -3 option). The
search uses the standard PAM250 scoring matrix, and uses a
ktup=2 by default. tfasta searches a DNA sequence database
in the standard text format described below.
lfasta and plfasta programs compare two sequences looking
for local sequence similarities. While fasta and tfasta
report only the best alignment between the query sequence
and the library sequence, lfasta and plfasta will report all
of the alignments between the two sequences with scores
greater than a cut-off value. lfasta shows the actual local
alignments between the two sequences and their scores, while
plfasta produces a plot of the alignments that looks similar
to a `dot-matrix' homology plot. On Unix systems, plfasta
generates tektronix output that can either be displayed on a
tektronix terminal or piped through the tek2ps program for
output on the laser printer. On MS-DOS systems, plfasta
uses the graphics capabilities of the computer screen
together with the *.BGI graphics device drivers supplied by
Borland with Turbo `C'.
The fasta programs use a standard text format sequence file.
Lines beginning with '>' or ';' are considered comments and
ignored; sequences can be upper or lower case; blanks, tabs
and unrecognizable characters are ignored. fasta expects
sequences to use the single letter amino acid codes; see
protcodes(1). Library files for fasta should have the form
shown below.
OPTIONS
fasta and the other programs can be directed to change the
scoring matrix, search parameters, output format, and
default search directories by entering options on the
command line (preceded by a `-', or a `/' for MS-DOS). All of
the options should precede the file name and ktup arguments.
Alternately, these options can be changed by setting
environment variables. The options and environment
variables are:
-1 Normally, the top scoring sequences are ranked by their
initn score. By using the -1 option, sequences are
ranked by their init1 score.
-a (SHOWALL) Modifies the display of the two sequences in
alignments. Normally, both sequences are shown only
where they overlap (SHOWALL=0); if -a or the environment
variable SHOWALL = 1, both sequences are shown in their
entirety.
-b # The number of similarity scores to be shown when the -Q
option is used. This value is usually calculated based
on the actual scores.
-c # (OPTCUT) The threshold for optimization with the -o
option. The OPTCUT value is normally calculated based
on sequence length.
-d # The number of alignments to be shown. Normally, fasta
shows the same number of alignments as similarity
scores. By using fasta -Q -b 200 -d 50, one would see
the top scoring 200 sequences and alignments for the 50
best scores.
-f | -k
(PAMFACT) This version of fasta uses a more sensitive
method for identifying initial regions. Instead of
using a constant factor (fact) for each match in a
ktup, it uses the scoring matrix (PAM) scores. While
this works well for protein sequences, it has not been
as carefully tested for DNA sequences, so by default,
this modification is used for proteins but not for DNA.
The -f option forces this option on. -k forces it off.
Setting the PAMFACT environment variable to 1 forces
the option on; PAMFACT=0 turns it off.
-g # (GAPCUT) Sets the threshold for joining the initial
regions for calculating the initn score.
-l # (FASTLIBS) The name of the library menu file. Normally
this will be determined by the environment variable
FASTLIBS. However, a library menu file can also be
specified with -l.
-m # (MARKX) =1,2,3. Alternate display of matches and
mismatches in alignments. MARKX=1 uses ":", ".", " " for
identities, conservative replacements, and
non-conservative replacements, respectively. MARKX=2 uses
" ", "x", and "X". MARKX=3 does not show the second
sequence, but uses the second alignment line to display
matches with a "." for identity, or with the
mismatched residue for mismatches. MARKX=3 is useful
for aligning large numbers of similar sequences.
-n Forces the query sequence to be treated as a DNA
sequence.
-o Causes fasta to perform a limited optimization on all
of the sequences in the library with initn scores
greater than OPTCUT. This slows the program down about
5-fold, but, when combined with ktup=1, provides an
extremely sensitive sequence comparison.
-Q Quiet option. This allows fasta and tfasta to search a
database and report the results without asking any
questions. fasta -Q file library > output can be put in
the background or run at a later time with the unix
'at' command. The number of similarity scores and
alignments displayed with the -Q option can be modified
with the -b (scores) and -d (alignments) options.
-r STATFILE Causes fasta to write out the sequence
identifier, superfamily number (if available), and
similarity scores to STATFILE for every sequence in the
library. These results are not sorted.
-s str
(SMATRIX) the filename of an alternative scoring matrix
file.
-v str
(LINEVAL) (plfasta only) plfasta and pclfasta can use
up to 4 different line styles to denote the scores of
local alignments. The scores that correspond to these
line styles can be specified with the environment
variable LINEVAL, or with the -v option. In either case, a
string with three numbers separated by spaces should be
given. This string must be surrounded by double
quotation marks. For example, LINEVAL="200 100 50" tells
plfasta to use solid lines for local alignments with
scores greater than 200, long dashed lines for scores
between 100 and 200, short dashed lines for scores
between 50 and 100, and dotted lines for scores less
than 50.
plfasta -v "200 100 50"
Normally, the values are 200, 100, and 50 for protein
sequence comparisons and 400, 200, and 100 for DNA
sequence comparisons.
-w # (LINLEN) output line length for sequence alignments.
(normally 60, can be set up to 200).
-x "offset1 offset2"
Causes fasta/lfasta/plfasta to start numbering the
aligned sequences starting with offset1 and offset2,
rather than 1 and 1. This is particularly useful for
showing alignments of promoter regions.
-3 tfasta only. Normally tfasta translates sequences in
the DNA sequence library in all six frames. With the
-3 option, only the three forward frames are searched.
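The way the three LINEVAL numbers given with the -v option divide scores
into four line styles can be sketched as follows. This is an illustration in
Python, not plfasta code; the function name line_style is hypothetical, and
the treatment of scores exactly equal to a threshold is an assumption, since
the description above does not say which side they fall on.

```python
def line_style(score, lineval="200 100 50"):
    """Map an alignment score to one of the four plfasta line styles.

    `lineval` is a string of three numbers separated by spaces, in the
    same form as the LINEVAL variable or the -v option argument.
    """
    high, mid, low = (int(value) for value in lineval.split())
    # Assumption: a score equal to a threshold falls in the lower band.
    if score > high:
        return "solid"
    if score > mid:
        return "long dashed"
    if score > low:
        return "short dashed"
    return "dotted"
```

With the protein defaults, a score of 150 maps to long dashed lines; with
the DNA defaults "400 200 100", the same score maps to short dashed lines.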
EXAMPLES
(1) fasta musplfm.aa $AABANK
Compare the amino acid sequence in the file musplfm.aa with
the complete PIR protein sequence library using ktup = 2.
Each "library" sequence (there need only be one) should
start with a comment line which starts with a '>', e.g.
>LCBO bovine preprolactin
WILLLSQ ...
>LCHU human ...
...
(2) fasta -a -w 80 musplfm.aa lcbo.aa 1
Compare the amino acid sequence in the file musplfm.aa with
the sequences in the file lcbo.aa using ktup = 1. Show both
sequences in their entirety, with 80 residues on each output
line.
(3) fasta
Run the fasta program in interactive mode. The program will
prompt for the file name for the query sequence, list
alternative libraries to be searched (if FASTLIBS is set), and
prompt for the ktup.
FILES
This version of fasta prompts for the library file to be
searched from a list of file names that are saved in the
file pointed to by the environment variable FASTLIBS. If
FASTLIBS = fastgb.list, then the file fastgb.list might have
the entries:
NBRF Protein$0P/u/lib/aabank.lib 0
GB Primate$1P@/u/lib/gpri.nam
GB Rodent$1R@/u/lib/grod.nam
GB Mammal$1M@/u/lib/gmammal.nam
Each line in this file has 4 fields: (1) the library name,
separated from the remaining fields by a '$'; (2) a 0 or a 1
indicating protein or DNA library respectively; (3) a single
letter that will be used to choose the library; (4) the
location of the library file itself. The library file name
can contain an optional library format specifier. fasta
recognizes the following library formats: 0 - Pearson/FASTA;
1 - Genbank tape; 2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT;
4 - Intelligenetics; 5 - NBRF/PIR VMS; 9 - Genbank floppy
disk. Note that this fourth field can contain an '@'
character, which indicates that the library file is an indirect
library file containing a list of library files, one per line.
An indirect library file can also contain a line beginning
with the symbol '<', followed by the directory where the
library files may be found, and a line beginning with a '>',
indicating the name of the index file (GENBANK compressed
floppy format files only). An indirect library file might
have the lines:
>glocus.idx (index file for GENBANK binary files)
gpri1.seq 9
gpri2.seq 9
gpri3.seq 9
...
grod1.seq 9
...
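The four-field layout of a FASTLIBS entry described above can be made
concrete with a small parser. The Python sketch below is an illustration
based only on the sample entries in this section; the function name
parse_fastlibs_line and the dictionary keys are hypothetical, and the real
fasta program may accept variations not covered here.

```python
def parse_fastlibs_line(line):
    """Split one FASTLIBS entry into its four fields (sketch)."""
    # Field 1: library name, separated from the rest by '$'.
    name, rest = line.rstrip("\n").split("$", 1)
    if len(rest) < 3 or rest[0] not in "01":
        raise ValueError("expected 0 or 1, a letter, and a location")
    is_dna = rest[0] == "1"          # Field 2: 0 = protein, 1 = DNA
    letter = rest[1]                 # Field 3: selection letter
    location = rest[2:].strip()      # Field 4: library file location
    indirect = location.startswith("@")
    if indirect:                     # '@' marks an indirect library file
        location = location[1:]
    parts = location.split()
    if not parts:
        raise ValueError("missing library file location")
    # An optional number after the file name is the library format.
    fmt = int(parts[1]) if len(parts) > 1 else None
    return {"name": name, "is_dna": is_dna, "letter": letter,
            "indirect": indirect, "path": parts[0], "format": fmt}
```

Applied to the entry "GB Primate$1P@/u/lib/gpri.nam", this reports a DNA
library chosen with the letter P whose location is an indirect library file.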
This version of fasta can also distinguish between normal
text library files (as shown above in EXAMPLE (2)), and DNA
libraries in the GENBANK compressed floppy disk format.
These latter files are binary files that are distributed by
Intelligenetics on floppy disks. Earlier versions of fasta
(and fastn before it) used different programs to read the
text library files (old fasta or ifastn) and the compressed
files (old fastgb and gfastn). These routines have been
combined in the current fasta.
You can use your own sequence files for fasta; just be
certain to put a '>' and comment as the first line before the
sequence. Only one library file type, the standard NBRF
library format, is supported by the VAX/VMS programs.
lfasta and plfasta do not require the '>' and comment line;
fasta does.
SEE ALSO
rdf2(1), protcodes(5), dnacodes(5)
AUTHOR
Bill Pearson
wrp@virginia.EDU
Sun Release 4.1 Last change: local 6