HELP [QUICKSEARCH]

Introduction
------------

Mail-QUICKSEARCH is based on the QUICKSEARCH and QUICKSHOW programs
developed by John Devereux as implemented in the GCG package [1]. These
programs have been considerably improved by Peter Rice at EMBL.
Mail-QUICKSEARCH allows you to perform very rapid comparisons of your
nucleic acid sequences against the EMBL and GenBank databases, including
the most recent entries. It answers the question: does this sequence
already occur in the database (with a small number of mismatches)?

Note
----

This is not only a new service; the underlying programs and algorithms
are also still experimental. Feel free to send us any comments,
proposals, ideas etc.!

How to Use Mail-QUICKSEARCH
---------------------------

Using Mail-QUICKSEARCH is simple. Send a properly formatted normal mail
message to

     QUICK@EMBL-Heidelberg.DE

and wait for the results to drop into your mailbox. Please don't send
interactive messages; the software can't handle them!

The Input Format
----------------

Since Mail-QUICKSEARCH is an automatic process without any human
intervention, it only understands a limited set of commands. You
therefore have to adhere to a well-defined syntax, which is easy to
learn.

Some general rules are:

  - Your mail message must contain only one command per line.
  - There is only one mandatory command, SEQ. All the other commands are
    optional, and default values will be used whenever they are not
    specified.
  - You can use both uppercase and lowercase characters, or mix them.
  - The order of the commands is not important, but make sure that SEQ
    is the last one, since everything following this line will be
    treated as a sequence (see below).
  - Blank lines or space characters are accepted.

Here is a list of valid commands that are accepted by Mail-QUICKSEARCH.
Please remember that the only required command is SEQ and the default
values are almost always adequate.

HELP          you know what it's for, don't you?

LIB libraryname
              "libraryname" can be one of the following (the default
              is ALL):

              ALL     all EMBL and GenBank entries
              GENEW   new EMBL and GenBank entries since latest
                      release only

WINDOW n      The values of WINDOW and STRINGENCY determine the
STRINGENCY n  sensitivity of a search. The default values are
              calculated from the length of your query sequence
              (WINDOW is length/20 minus 1 with a maximum of 15;
              STRINGENCY is WINDOW/2).

              Decreasing the window size and decreasing STRINGENCY
              (e.g. WINDOW 20, STRINGENCY 5) will increase the
              sensitivity of the comparison, i.e. you may find more
              distantly related sequences.

              Increasing the window size and increasing STRINGENCY
              (e.g. WINDOW 50, STRINGENCY 47) will decrease the
              sensitivity. Only very few mismatches will be
              tolerated. For exact matches use PERFECT instead.

PERFECT       This option reports only exact matches. It is
              equivalent to specifying a MATCH value of 100.

MATCH n       Only database entries that show overlaps of more than
              n% identity to your query sequence will be presented.
              The default value is 90.

BEST          Determines the algorithm used for the alignment of the
              query sequence to the database sequences. If this
              option is used, Mail-QUICKSEARCH will use a local
              homology algorithm [2] to align sequences. Default is
              a Needleman-Wunsch alignment [3] that finds the best
              overall alignment.

ONE           Only the strand given in your mail message is compared
              against the database. If you don't specify ONE then
              the complementary strand is searched as well.

TITLE string  QUICKSEARCH will use "string" as the Subject line of
              the mail message that contains the output of your job.
              Don't use quotes or double quotes!

SEQ           (MANDATORY) *Everything* following this line up to
                a) the end of the mail message, or
                b) a line starting with the word END
              will be treated as part of the sequence. Don't put
              sequence information on the same line on which you put
              the SEQ command or the END command. No special format
              of the query sequence is required.
              You may include numbering, but make sure to remove all
              sorts of comments and unrelated information like mail
              signatures if you don't use END!

Examples of input files:

TITLE This is a test using part of a human globin gene
SEQ
     201 ACAACTTTGA CTTTGAGAAA AGAGAGGTGG AAATGAGGAA AATGACTTTT
     251 CTGTATTAGA TTCCAGTAGA AAGAACTTTC ATCTTTCCCT CGTTTTTTTT
     301 GTTTTAAAAC ATCTATCTGG AGGCAGGACA AGTATGGTCG TTAAAAAGAT
     351 GCAGGCAGAA GGCATATATT GGCTCAGTCA AAGTGGGGAA CTTTGGTGGC
     401 CAAACATACA TTGCTAAGGC TATTCCTATA TCAGCTGGAC ACATATAAAA
     451 TGCTGCTAAT GCTTCATTAC AAACTTATAT CCTTTAATTC CAGATGGGGG
     501 CAAAGTATGT CCAGGGGTGA GGAACAATTG AAACATTTGG GCTGGAGTAG
     551 ATTTTGAAAG TCAGCTCTGT GTGTGTGTGT GTGTGTGCGC GCACGTGTGT
END

or:

WINDOW 30
STRINGENCY 10
ONE
MATCH 95
BEST
SEQ
agcgcgcgtcgtgcgtgcgtgcagatgacaaagtgacgtg
gacgatggcatgacgatacgatgcagatgacgatg

Restrictions
------------

The query sequence may not be longer than 100000 bases.
If a match longer than 32000 bases is found, only the first part may be
reported.
The maximum WINDOW size is 50.

The Mail-QUICKSEARCH output
---------------------------

After sending your query to EMBL you will receive two mail messages from
the Mail-QUICKSEARCH program. The first one is sent to you immediately
after your message is processed. If Mail-QUICKSEARCH had any problem
with your query, it will tell you so. Otherwise you will be notified
that your job has been successfully submitted to the QUICKSEARCH batch
queue, and that the results will be mailed to you after completion.

Here is an example (the > character at the beginning of a line indicates
that this line is taken from a sample file; it is not contained in the
messages that you receive):

> From: QUICK@EMBL-Heidelberg.DE
> Subject: Thanks for your call; here's the log ...
> To: JOEBIOL@EMBL-Heidelberg.DE
> Message-id:
> X-Organization: European Molecular Biology Laboratory, Heidelberg.
> X-Envelope-to: JOEBIOL
> X-VMS-To: in%"JOEBIOL@EMBL-Heidelberg.DE"
>
>
> TITLE This is a test using part of a human globin gene
> SEQ
>      201 ACAACTTTGA CTTTGAGAAA AGAGAGGTGG AAATGAGGAA AATGACTTTT
>
>      251 CTGTATTAGA TTCCAGTAGA AAGAACTTTC ATCTTTCCCT CGTTTTTTTT
>
>      301 GTTTTAAAAC ATCTATCTGG AGGCAGGACA AGTATGGTCG TTAAAAAGAT
>
>      351 GCAGGCAGAA GGCATATATT GGCTCAGTCA AAGTGGGGAA CTTTGGTGGC
>
>      401 CAAACATACA TTGCTAAGGC TATTCCTATA TCAGCTGGAC ACATATAAAA
>
>      451 TGCTGCTAAT GCTTCATTAC AAACTTATAT CCTTTAATTC CAGATGGGGG
>
>      501 CAAAGTATGT CCAGGGGTGA GGAACAATTG AAACATTTGG GCTGGAGTAG
>
>      551 ATTTTGAAAG TCAGCTCTGT GTGTGTGTGT GTGTGTGCGC GCACGTGTGT
> END
>
> * A QUICK batch job has been submitted to the QUICK batch queue.
> * The following parameters are used:
> *    Title: This is a test using part of a human globin gene
> *    Library to be searched: ALL
> *    Window: 15
> *    Stringency: 7
> *    Match: 90%
> *    Both strands searched
> *    All overlaps better than 90% will be reported
> *    A global alignment method will be used
> * The result file will be mailed to you after completion.

The second message that you will receive contains the results of your
query.

The Search Results
------------------

The first lines following the mail header contain information about the
sequence name, the date and the parameters used:

> QUICKMATCH of: JoeBiol_28007243.Quick  April 25, 1990 10:57
>
> ** MatchStringency: 0.90 **
>
> ! QUICKSEARCH of: Sys$Scratch:JoeBiol_28007243.Seq;  April 25, 1990 10:51
>
> Comparison Table: Gencoredisk:[Gcgcore.Rundata]Nwsgapdna.Cmp
>
> Gap Weight: 5.00  Gap Length Weight: 0.10 ..

Now, all hits better than your MATCH value will be shown as alignments
between the database sequence and your query sequence. The order of
these hits does *not* reflect the quality of the alignments!

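The parameter report in the acknowledgement message (Window: 15,
Stringency: 7 for the 400-base example query) follows from the default
rule given under WINDOW and STRINGENCY. The short Python sketch below
reproduces that arithmetic. It is only an illustration: the function
name is hypothetical, and truncating integer division is an assumption
that agrees with the sample log but is not stated anywhere in this help
file.

```python
def default_search_parameters(query_length):
    """Return (WINDOW, STRINGENCY) defaults for a query of this length.

    Rule from the help text: WINDOW is length/20 minus 1 with a maximum
    of 15; STRINGENCY is WINDOW/2.
    Assumption: both divisions truncate to whole numbers.
    """
    window = min(query_length // 20 - 1, 15)
    stringency = window // 2
    return window, stringency


# The 400-base globin example: 400/20 - 1 = 19, capped at 15; 15/2 -> 7.
print(default_search_parameters(400))  # (15, 7)
```

Very short queries (fewer than 40 bases) would give a WINDOW of zero or
less under this rule; how the real service handles that case is not
described here, so the sketch does not try to guess.
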
If there were no hits (nothing similar in the database, or your WINDOW
value was too low or STRINGENCY too high to find anything), you will get
the following message:

     *** No possible matches were found by QUICKSEARCH ***

If there were some hits, but none were good enough for your MATCH value
(there are often several "random" hits that mean nothing), you will see:

     *** No matches accepted at stringency 0.99 or better ***

where "0.99" comes from a MATCH 99 (percent) command, for example.

In the successful alignments, you will find the following values:

"Gaps:"     gives you the number of gaps introduced to produce the
            alignment.
"Quality:"  is the score obtained for this alignment (see Algorithm).
"Ratio:"    is the quality divided by the number of residues in the
            overlap region between the two sequences (usually the
            length of the shorter sequence).

> JoeBiol_28007243.Seq;2  Check: 5,507  length: 400  from: 1  to: 400
> JoeBiol_28007243.Seq;  Length: 400  April 25, 1990 10:50  Check: 5,507
>
> Empri:Ggagglog  Check: 7,760  length: 1,797  from: 1  to: 1,797
> Gorilla fetal A-gamma-globin gene. 1/86
> ID   GGAGGLOG   standard; DNA; 1797 BP.
> AC   X03112;
> DT   20-JAN-1986 (annotation)
> DE   Gorilla fetal A-gamma-globin gene
> KW   A-gamma-globin; direct repeat; gamma-globin; tandem repeat.
> OS   Gorilla gorilla (gorilla)
> OC   Eukaryota; Metazoa; Chordata; Vertebrata; Tetrapoda; Mammalia;
> OC   Eutheria; Primates.
> RN   [1] (bases 1-1797)
> RA   Scott A.F., Heath P., Trusko S., Boyer S.H., Prass W., Goodman M.,
  . . .
> Diagonal: 754  Range: -399/+400
> Gaps: 0  Quality: 379.0  Ratio: 0.947
>                .         .         .         .         .
>     1 ACAACTTTGACTTTGAGAAAAGAGAGGTGGAAATGAGGAAAATGACTTTT 50
>       ||||||||||||||||||| ||||| |||||||||||| |||||||||||
>   755 ACAACTTTGACTTTGAGAATAGAGAAGTGGAAATGAGGCAAATGACTTTT 804
>                .         .         .         .         .
>    51 CTGTATTAGATTCCAGTAGAAAGAACTTTCATCTTTCCCTCGTTTTTTTT 100
>       || |||||||||||||||||||||||||||||||||||||| ||||| ||
>   805 CTTTATTAGATTCCAGTAGAAAGAACTTTCATCTTTCCCTCATTTTTGTT 854
>                .         .         .         .         .
>   101 GTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCGTTAAAAAGAT 150
>       ||||||||||||||||||||||||||||||||||||||| ||||| ||||
>   855 GTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCATTAAACAGAT 904
>                .         .         .         .         .
>   151 GCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGGTGGC 200
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   905 GCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGGTGGC 954
>                .         .         .         .         .
>   201 CAAACATACATTGCTAAGGCTATTCCTATATCAGCTGGACACATATAAAA 250
>       |||||||| ||||||||||||||||||||||||||| |||||||||||||
>   955 CAAACATATATTGCTAAGGCTATTCCTATATCAGCTAGACACATATAAAA 1004
>                .         .         .         .         .
>   251 TGCTGCTAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATGGGGG 300
>       |||||| |||||||||||||||||||||||||||||||||||||||||||
>  1005 TGCTGCCAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATGGGGG 1054
>                .         .         .         .         .
>   301 CAAAGTATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGAGTAG 350
>       ||||| ||||||||||||||||||||||||||||||||||||||||||||
>  1055 CAAAGCATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGAGTAG 1104
>                .         .         .         .         .
>   351 ATTTTGAAAGTCAGCTCTGTGTGTGTGTGTGTGTGTGCGCGCACGTGTGT 400
>       |||||||||||||||| |||||||||||||||||||| | |     | ||
>  1105 ATTTTGAAAGTCAGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCAGCGT 1154
>       .
>       .
>       .
>
> JoeBiol_28007243.Seq;2  Check: 5,507  length: 400  from: 1  to: 400
> JoeBiol_28007243.Seq;  Length: 400  April 25, 1990 10:50  Check: 5,507
>
> Empri:Hsags01  Check: 1,418  length: 878  from: 1  to: 878
> Human A-gamma-S globin gene IVS-2 sequence. 8/84
> ID   HSAGS01    standard; DNA; 878 BP.
> AC   X00672;
> DT   15-AUG-1984 (first entry)
> DE   Human A-gamma-S globin gene IVS-2 sequence
> KW   globin.
> OS   Homo sapiens (human)
> OC   Eukaryota; Metazoa; Chordata; Vertebrata; Tetrapoda; Mammalia;
> OC   Eutheria; Primates.
> RN   [1] (bases 1-878)
> RA   Stoeckert C.J., Collins F.S., Weissman S.M.;
  . . .
> Diagonal: 200  Range: -399/+400
> Gaps: 0  Quality: 400.0  Ratio: 1.000
>                .         .         .         .         .
>     1 ACAACTTTGACTTTGAGAAAAGAGAGGTGGAAATGAGGAAAATGACTTTT 50
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   201 ACAACTTTGACTTTGAGAAAAGAGAGGTGGAAATGAGGAAAATGACTTTT 250
>                .         .         .         .         .
>    51 CTGTATTAGATTCCAGTAGAAAGAACTTTCATCTTTCCCTCGTTTTTTTT 100
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   251 CTGTATTAGATTCCAGTAGAAAGAACTTTCATCTTTCCCTCGTTTTTTTT 300
>                .         .         .         .         .
>   101 GTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCGTTAAAAAGAT 150
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   301 GTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCGTTAAAAAGAT 350
>                .         .         .         .         .
>   151 GCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGGTGGC 200
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   351 GCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGGTGGC 400
>                .         .         .         .         .
>   201 CAAACATACATTGCTAAGGCTATTCCTATATCAGCTGGACACATATAAAA 250
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   401 CAAACATACATTGCTAAGGCTATTCCTATATCAGCTGGACACATATAAAA 450
>                .         .         .         .         .
>   251 TGCTGCTAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATGGGGG 300
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   451 TGCTGCTAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATGGGGG 500
>                .         .         .         .         .
>   301 CAAAGTATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGAGTAG 350
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   501 CAAAGTATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGAGTAG 550
>                .         .         .         .         .
>   351 ATTTTGAAAGTCAGCTCTGTGTGTGTGTGTGTGTGTGCGCGCACGTGTGT 400
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   551 ATTTTGAAAGTCAGCTCTGTGTGTGTGTGTGTGTGTGCGCGCACGTGTGT 600
>       .
>       .
>       .
>
> JoeBiol_28007243.Seq;2  Check: 5,507  length: 400  from: 1  to: 400
> JoeBiol_28007243.Seq;  Length: 400  April 25, 1990 10:50  Check: 5,507
>
> Empri:Hsggl2  Check: 7,732  length: 1,628  from: 1  to: 1,628
> Human a gamma-globin gene. 3/83
> ID   HSGGL2     standard; DNA; 1628 BP.
> AC   V00513;
> DT   31-MAR-1983 (feature table expanded)
> DT   17-FEB-1981 (first entry)
> DE   Human a gamma-globin gene.
> KW   gamma-globin; germ line; globin.
> OS   Homo sapiens (human)
> OC   Eukaryota; Metazoa; Chordata; Vertebrata; Tetrapoda; Mammalia;
> OC   Eutheria; Primates.
> RN   [1] (bases 1-1628)
  . . .
> Diagonal: 750  Range: -399/+400
> Gaps: 1  Quality: 374.6  Ratio: 0.937
>                .         .         .         .         .
>     1 ACAACTTTGACTTTGAGAAAAGAGAGGTGGAAATGAGGAAAATGACTTTT 50
>           ||||||||||||||||||||||||||||||||||||||||||||||
>   751 ....CTTTGACTTTGAGAAAAGAGAGGTGGAAATGAGGAAAATGACTTTT 796
>                .         .         .         .         .
>    51 CTGTATTAGATTCCAGTAGAAAGAACTTTCATCTTTCCCTCGTTTTT... 97
>       || ||||||||| | |||||||||||||||| |||||||   |||||
>   797 CTTTATTAGATTTCGGTAGAAAGAACTTTCACCTTTCCCCTATTTTTGTT 846
>                .         .         .         .         .
>    98 .TTTGTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCGTTAAAA 146
>        || ||||||||||||||||||||||||||||||||||||||||||||||
>   847 ATTCGTTTTAAAACATCTATCTGGAGGCAGGACAAGTATGGTCGTTAAAA 896
>                .         .         .         .         .
>   147 AGATGCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGG 196
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   897 AGATGCAGGCAGAAGGCATATATTGGCTCAGTCAAAGTGGGGAACTTTGG 946
>                .         .         .         .         .
>   197 TGGCCAAACATACATTGCTAAGGCTATTCCTATATCAGCTGGACACATAT 246
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   947 TGGCCAAACATACATTGCTAAGGCTATTCCTATATCAGCTGGACACATAT 996
>                .         .         .         .         .
>   247 AAAATGCTGCTAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATG 296
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>   997 AAAATGCTGCTAATGCTTCATTACAAACTTATATCCTTTAATTCCAGATG 1046
>                .         .         .         .         .
>   297 GGGGCAAAGTATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGA 346
>       ||||||||||||||||||||||||||||||||||||||||||||||||||
>  1047 GGGGCAAAGTATGTCCAGGGGTGAGGAACAATTGAAACATTTGGGCTGGA 1096
>                .         .         .         .         .
>   347 GTAGATTTTGAAAGTCAGCTCTGTGTGTGTGTGTGTGTGTGCGCGCACGT 396
>       ||||||||||||||||||||||||||||||||||||||||| | |
>  1097 GTAGATTTTGAAAGTCAGCTCTGTGTGTGTGTGTGTGTGTGTGTGTGTCA 1146
>                .         .         .         .         .
>   397 GTGT.............................................. 400
>       | ||
>  1147 GCGTGTGTTTCTTTTAACGTCTTCAGCCTACAACATACAGGGTTCATGGT 1196
>       .
>       .
>       .

The first ten lines of each database entry are also shown, so that you
can identify the entry by accession number and ID. Only the part of the
database sequence that overlaps your query sequence is displayed.
Identities are indicated by a "|".

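Because the hits are not listed in order of quality, it can be handy to
pull the "Gaps / Quality / Ratio" lines out of a result message and sort
them yourself. The Python sketch below does this for the three sample
hits shown above. The function name and both regular expressions are
illustrative assumptions based only on the sample output in this help
file, not a description of any official output format. The sketch also
checks that each reported Ratio equals Quality divided by the overlap
length, which is 400 (the query length) for all three sample hits.

```python
import re

# Patterns inferred from the sample output only (assumptions).
ENTRY = re.compile(r"(\w+:\w+)\s+Check:")
STATS = re.compile(r"Gaps:\s*(\d+)\s+Quality:\s*([\d.]+)\s+Ratio:\s*([\d.]+)")

SAMPLE = """\
> Empri:Ggagglog Check: 7,760 length: 1,797 from: 1 to: 1,797
> Gaps: 0 Quality: 379.0 Ratio: 0.947
> Empri:Hsags01 Check: 1,418 length: 878 from: 1 to: 878
> Gaps: 0 Quality: 400.0 Ratio: 1.000
> Empri:Hsggl2 Check: 7,732 length: 1,628 from: 1 to: 1,628
> Gaps: 1 Quality: 374.6 Ratio: 0.937
"""


def parse_hits(text):
    """Collect one record per hit: entry name, gaps, quality, ratio."""
    hits = []
    entry = None
    for line in text.splitlines():
        name = ENTRY.search(line)
        if name:
            entry = name.group(1)
        stats = STATS.search(line)
        if stats and entry is not None:
            hits.append({
                "entry": entry,
                "gaps": int(stats.group(1)),
                "quality": float(stats.group(2)),
                "ratio": float(stats.group(3)),
            })
            entry = None
    return hits


# Best hit first.
ranked = sorted(parse_hits(SAMPLE), key=lambda h: h["ratio"], reverse=True)
for hit in ranked:
    # Ratio should be Quality / overlap length (400 bases in this sample).
    consistent = abs(hit["quality"] / 400 - hit["ratio"]) < 1e-3
    print(hit["entry"], hit["ratio"], "consistent" if consistent else "check")
```

For the sample this lists Empri:Hsags01 first (the exact match), then
Empri:Ggagglog, then Empri:Hsggl2.
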
Retrieving Database Entries
---------------------------

You can easily get a copy of matching sequences from the EMBL or GenBank
databases by using the EMBL File Server. You should always use the
accession number as given in the AC line of EMBL entries or the
Accession line of GenBank entries. Send a mail message to

     NETSERV@EMBL-Heidelberg.DE

containing one command per line. The general syntax is:

     GET NUC:accnumber        e.g.  GET NUC:J00179

If you are new to the EMBL File Server, send a mail message to
NETSERV@EMBL containing the line

     HELP

to get some introductory information. The File Server offers the latest
sequence data, several other databases and free software for molecular
biology.

The Algorithm
-------------

The QUICKSEARCH algorithm was developed by John Devereux in
collaboration with Chemical Abstracts Service (CAS) [4]. An exact
description of the underlying algorithm is beyond the scope of this help
file.

In general, each adjacent "word" of length 20 in the database is
assigned an offset from the beginning of the database. A "hash code" is
calculated for each word and the information is stored in a "hash table"
which can be searched very quickly.

The query sequence is compared to the database by looking up every
20-base region of the query sequence in the "hash table" of the
database. To allow for some errors, a "windowing" technique is used, and
a hit is counted when STRINGENCY or more words within a window of WINDOW
words match.

These "overlapping" regions are stored, and a second program,
QUICKMATCH, then aligns the identified overlapping database sequences to
the query sequence.

WINDOW and STRINGENCY determine the sensitivity of the search. WINDOW is
the number of consecutive 20-base sequences checked; STRINGENCY
determines the number of these 20-base sequences that must match to
define a "hit". WINDOW should be less than 1/20 of the search sequence
length, though 15 is large enough for long sequences.
STRINGENCY should be about half of WINDOW, though smaller values can be
used in many cases. This allows for several single-base differences
between the sequences being compared.

The alignment algorithm used by QUICKMATCH is determined by the BEST
parameter. By default, a Needleman-Wunsch alignment [3] is performed
that finds the best global alignment between two sequences. The scores
(reported as "Quality") are calculated by using a comparison table that
scores 1 for matches and 0 for mismatches, with gap weights of 5 and gap
length weights of 0.1.

If the BEST command is used in Mail-QUICKSEARCH then, instead of looking
for the best global alignment, the best local alignments are calculated
by a Smith-Waterman algorithm [2]. The comparison table used scores 1
for matches and -0.9 for mismatches, with gap weights of 5 and gap
length weights of 0.1.

Limitations and Considerations
------------------------------

Repetitive sequences or stretches of one nucleotide may strongly
influence the results of a QUICKSEARCH run. Make sure to remove such
regions from your query sequence.

You can use short query sequences, but QUICKSEARCH was designed to look
for sequences longer than 200 bases.

Problems
--------

Please send any inquiries, questions or comments to

     NETHELP@EMBL-Heidelberg.DE

Related services
----------------

QUICKSEARCH was developed for rapid searching of databases for identical
or closely related DNA sequences. It is therefore most useful for
comparing newly determined unknown DNA sequences to the database to find
out whether similar sequences already exist in the database.

QUICKSEARCH is not suited for the detection of more distantly related
sequences and cannot search for protein sequences. For this purpose use
FASTA instead.
An introduction to the EMBL Mail-FASTA service can be obtained by
sending a mail message to

     FASTA@EMBL-Heidelberg.DE

containing the line

     HELP

The software
------------

Our service is based on modifications of the original QUICKSEARCH and
QUICKSHOW programs in the GCG package. These new versions, called
NEWQUICKSEARCH and QUICKMATCH, were written by Peter Rice, EMBL
(RICE@EMBL-Heidelberg.DE). They are available free of charge from the
EMBL File Server by sending a mail message containing the command

     GET VAX_SOFTWARE:GCGQUICK.UUE

to

     NETSERV@EMBL-Heidelberg.DE

If you are new to the EMBL File Server you can obtain introductory help
by sending a mail message containing

     HELP
     HELP SOFTWARE
     HELP VAX_SOFTWARE

to

     NETSERV@EMBL-Heidelberg.DE

Literature
----------

[1] Devereux, J., Haeberli, P. and Smithies, O.
    A comprehensive set of sequence analysis programs for the VAX.
    Nucl. Acids Res. 12:387-395(1984).

[2] Smith, T.F. and Waterman, M.S.
    Identification of common molecular subsequences.
    J. Mol. Biol. 147:195-197(1981).

[3] Needleman, S.B. and Wunsch, C.D.
    A general method applicable to the search for similarities in the
    amino acid sequence of two proteins.
    J. Mol. Biol. 48:443-453(1970).

[4] Devereux, J. (Ph. D. thesis):
    A rapid method for identifying sequences in large nucleotide
    sequence databases. 1988 (reprints available from GCG)