GenMark Version 1.26ux Beta
M. Borodovsky, J. McIninch
------------------------

The GenMark server provides an e-mail based gene finding service using the
GenMark algorithm (based on non-stationary Markov chain models), which was
developed at the Georgia Institute of Technology (Atlanta, GA) and at the
Institute of Molecular Genetics (Moscow).

The GenMark server accepts messages containing DNA sequences in a very
simple format. The submitter may select, by species name, the matrix used by
the algorithm and may specify parameters affecting the complexity of the
analysis.

I. SENDING DATA TO GENMARK
------------------------------------------------------------------------------

Sending data to GenMark was designed to be as simple as possible. Sequences
to be analyzed by GenMark must be sent via e-mail to the following network
mail address:

     genmark@ford.gatech.edu

... with the word 'genmark' in the 'Subject:' field of the mail message (an
example follows later in this text).

There is only one mandatory keyword: the word 'data', which must precede the
actual text of the sequence to be analyzed. Make sure that no text follows
the keyword 'data' other than sequence information, because GenMark assumes
that any alphabetic characters it sees there are part of the sequence.

How you get the sequence text into your mail message will depend on the
computer setup that you use to send and receive electronic mail. If you are
unsure about how to include a text file in a mail message, contact your
computer coordinator or somebody versed in the nuances of sending e-mail.

Below is an example of sending a hypothetical sequence named JM082671 to
GenMark from a UNIX-based computer (note: the '%' symbol is the prompt used
by the system):

     % mail genmark@ford.gatech.edu
     Subject: genmark
     data
     ~r JM082671
     "JM082671" 100/7600 included.
     .
     (EOT)
     %

When GenMark receives this message, it will analyze it and send back a
minimal response.
In this case, it will use data based on E. coli (the default organism) to
analyze the sequence and send back an assessment of the open reading frames
that seem likely candidates as genes. You may, however, want more
information to get a better picture of what is going on. In that case, read
the section below about giving options and commands to GenMark.

II. OPTIONS AND COMMANDS
------------------------------------------------------------------------------

GenMark has a large number of options that you may activate by specifying
them in your e-mail message before the keyword 'data'. Read this section
carefully before using these options. GenMark is not case-sensitive with
regard to the options that it recognizes. These are all the keywords
recognized by GenMark (in order of relative importance), followed by
explanations for each keyword:

     data
     species
     title
     psgraph
     protein
     nucseq
     orflist
     address
     name
     via
     order
     window
     step
     threshold
     #
     ;

Explanations:

"DATA"
     This is the only mandatory keyword for messages sent to GenMark. It is
     interpreted by GenMark as an indication that all text that follows is
     nucleotide sequence information to be analyzed. Only letters are
     recognized, so any numbers, spaces, or punctuation will be filtered out
     before the sequence is analyzed.

"SPECIES"
     This permits the user to specify which matrix is used to analyze the
     submitted sequence. After the keyword 'species', place the name of the
     matrix to use (there is a list of available matrices at the end of this
     document). There may be several matrices available for a single
     species, or one matrix that represents several species. The default
     species is 'ecoli'.

"TITLE"
     This specifies the title you wish to give the sequence. This string
     appears in all responses from GenMark and is very important in
     identifying which response corresponds to which sequence. The default
     title is ''.

"PSGRAPH"
     This instructs GenMark to send graphical output of its results to you
     in the form of a PostScript file, which may be printed on any
     PostScript compatible laser printer or viewed with programs such as
     GhostScript. The 'psgraph' option also allows an optional argument
     '-r' (or 'region'), which instructs GenMark to draw gray bars
     indicating regions suggested for homology search (i.e., use 'psgraph
     region'). By default, this feature is turned off.

     NOTE: PostScript responses can be quite large; for every 2K of sequence
     you are likely to receive 12K of text. So, keep large sequences to a
     minimum when using this feature.

"PROTEIN"
     The option 'protein' instructs GenMark to report, or to forward for
     homology search, the translated segments of the sequence in which
     GenMark has predicted the presence of a coding region. 'protein' allows
     three arguments: 'region', 'orf', or 'send' (which may be abbreviated
     '-r', '-o', and '-s' respectively). The arguments may be combined. The
     argument 'region' instructs GenMark to reply with a list of the regions
     containing possible coding regions from stop codon to stop codon. The
     'orf' argument asks GenMark to reply with a list of open reading frames
     (splice site to splice site) where a coding region is predicted. The
     'send' argument instructs GenMark to send the open reading frames in
     which coding regions are predicted to an NCBI/NIH BLAST server for
     immediate homology search. By default, this option is turned off.
     'protein' by itself is treated the same as 'protein region'.

"NUCSEQ"
     This option operates EXACTLY as 'protein' does, except that it operates
     on nucleotide sequences instead of protein translations. This also
     means that the program 'blastn' is used rather than 'blastp' for
     homology searches initiated by the '-s' argument.

"ORFLIST"
     This allows the user to specify the format of the open reading frame
     list sent by GenMark.
     'orflist' permits three optional arguments: 'region', 'orf', or 'off'
     ('region' may be abbreviated '-r', and 'orf', '-o'). The argument 'off'
     instructs GenMark not to reply with an open reading frame list.
     'region' instructs GenMark to cite regions (from stop codon to stop
     codon) in which a coding region was predicted. 'orf' instructs GenMark
     to reply with a complete and detailed assessment of regions between
     splice sites that are predicted to contain coding regions. By default,
     GenMark treats all requests as if 'orflist orf' was specified.

"ADDRESS"
     This allows the user to specify an alternate e-mail address for GenMark
     to reply to. This may be important if you need to direct the mail
     somewhere else, or if GenMark has difficulty extracting your e-mail
     address from the header of the electronic mail it receives. By default,
     GenMark sends replies back to their origin.

"NAME"
     This allows the user to specify their name. The name, if available, is
     added to replies from GenMark. This may be important where a number of
     people are using the same e-mail address, since it identifies whom each
     reply belongs to.

"VIA"
     The 'via' option instructs GenMark to query a foreign computer
     regarding information that it has processed. Currently, the 'via'
     option supports two arguments: 'blastp' and 'blastn'. 'blastp' sends
     the results from GenMark to BLAST for a protein homology search, and
     'blastn' sends the results from GenMark to BLAST for a nucleotide
     homology search.

     'via blastp' is equivalent to using the options 'psgraph region',
     'protein send', and 'orflist region orf'.

     'via blastn' is equivalent to using the options 'psgraph region',
     'nucseq send', and 'orflist region orf'.

"ORDER"
     Allows the user to specify the order of the Markov chain employed by
     GenMark to analyze the sequence (see references). By default, GenMark
     will use the optimal order for the request.
     GenMark only permits certain orders of matrices; if your request does
     not fall within the range available for that species, the optimal order
     will be used. Generally speaking, it should be unnecessary to specify
     this parameter.

"WINDOW"
     This option allows the user to specify the length of the analysis
     window (in nucleotides) that GenMark uses in its algorithm. The default
     is 96 and is generally the best value. Smaller values will produce a
     higher rate of false signals, but may show smaller coding regions.
     Larger values may not permit GenMark to recognize smaller coding
     regions, but will clarify the coding signal of larger regions.
     Generally speaking, it should not be necessary to modify this value.

"STEP"
     This option allows the user to specify the step, in nucleotides, used
     by the algorithm to advance the scanning analysis window. By default,
     the step value is 12. It should not, generally, be necessary to modify
     this value.

     NOTE: Changing this value will alter the resolution of the graphical
     output. Decreasing this value increases the graph resolution and
     increases the size of the PostScript output.

"THRESHOLD"
     This option allows the user to specify a value between 0 and 1 which
     will be used to judge whether a given region should be predicted as
     coding. By default, this threshold is 0.6 (60%), meaning that any
     region yielding a sustained coding probability above 60% will be judged
     coding. The value 0.6 provides very good results; however, you may wish
     to alter this value to suit your taste.

"#", or ";"
     This marks a comment. The line is ignored.

III. AN EXAMPLE OF SENDING A SEQUENCE TO GENMARK USING EVERY OPTION
------------------------------------------------------------------------------

Here is an example of a hypothetical sequence, 'JM082671', being sent to
GenMark from a UNIX-based machine and employing all the options detailed
above (note: '%' is the prompt used by the system):

     % mail genmark@ford.gatech.edu
     Subject: genmark
     # NOTE: the '#' and ';' symbols must be at the beginning of
     # the line to indicate that that line is a comment.
     title Sequence A, JM082671
     species ecoli
     window 96
     step 12
     threshold 0.5
     name John Doe
     address j_doe@really.big.lab.com
     order 4
     protein region orf send
     nucseq region orf send
     psgraph region
     orflist orf region
     data
     ~r JM082671
     "JM082671" 100/7600 included.
     .
     (EOT)
     %

NOTE: GenMark is not case sensitive, and it does not care what order the
arguments to an option appear in.

IV. INTERPRETING THE RESULTS
------------------------------------------------------------------------------

Open Reading Frame Assessments

Assessments are given in the form of a chart which may differ depending on
the type of organism (e.g., eukaryote versus prokaryote). The first two
columns of the chart indicate the left and right ends of the open reading
frame. The third column indicates the region's orientation and frame. A
fourth column indicates the mean probability function in that interval (a
percent result based on Bayes' theorem).

For prokaryotic sequences, there are two more columns. For each ORF, a value
should appear in one of those columns; it is an evaluation of the
probability that that end of the region is the true start of that open
reading frame.

For eukaryotic sequences, there are two left and two right ends indicated
for each predicted coding region. They represent possible splice sites. The
probability value represents the mean probability function on the interval
between the innermost two splice sites.

Graphs

The graphs are laid out as six different reading frames.
The top three panels represent the sequence in the direct orientation, and
the bottom three represent the sequence in the complementary direction.

The y-axis of each panel ranges from 0 to 1 and gives the probability that a
given segment of the sequence is coding (based on statistics derived from an
appropriate database). The x-axis of each panel represents the nucleotide
positions within the sequence. The nucleotide positions are indicated at the
bottom of the page.

If you selected the '-r' or 'region' argument of the 'psgraph' option, you
will see grey bars on the page indicating regions in which GenMark has
predicted that a gene may reside. The probability function need not be very
strong for this indication to appear. These bars indicate the regions that
would be sent to BLAST if you used the 'send' or '-s' arguments to the
'protein' or 'nucseq' options.

If the graph was analyzed using prokaryotic data, then a series of marks
should appear at the 0.5 level in each panel denoting start and stop marks
in that frame. A 1.5mm stroke up from the 0.5 level indicates the codon ATG,
a 0.75mm stroke up from the 0.5 level indicates the codon GTG, and a 1.5mm
downward stroke from the 0.5 level indicates one of the stop codons: TGA,
TAG, or TAA. A horizontal bar along the 0.5 level indicates an open reading
frame extending from a start codon to a stop codon.

If the graph was analyzed using eukaryotic information, then a 1.5mm stroke
at the 0.5 level indicates a stop codon, a '<' symbol at the 0.5 level
indicates the beginning of a possible splice region, and a '>' symbol at the
0.5 level indicates the end of a possible splice region.

How you print PostScript graphics will depend on the computer and operating
system that you are using to print them. The text sent by GenMark is a
PostScript program which should be sent directly to a PostScript compatible
printer.
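
Because the PostScript program arrives inside a mail message, a saved copy
of the reply normally begins with mail headers, and the file handed to the
printer has to start at the line beginning '%!PS-Adobe'. The following is a
minimal sketch of that clean-up in Python. The function name and the sample
message are hypothetical and are not part of GenMark; the only detail taken
from this document is the '%!PS-Adobe' marker.

```python
def strip_mail_header(message_text: str, marker: str = "%!PS-Adobe") -> str:
    """Return message_text starting at the first line that begins with marker.

    Assumption: the PostScript program runs from that line to the end of
    the saved message, so everything before it can be discarded.
    """
    lines = message_text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.startswith(marker):
            # Keep the marker line and everything after it, unchanged.
            return "".join(lines[index:])
    raise ValueError(f"no line starting with {marker!r} was found")


if __name__ == "__main__":
    # Hypothetical saved reply: two header lines, a blank line, then PostScript.
    saved = (
        "From: genmark\n"
        "Subject: results\n"
        "\n"
        "%!PS-Adobe-2.0\n"
        "showpage\n"
    )
    print(strip_mail_header(saved), end="")
```

Raising an error when the marker is missing, rather than returning the text
unchanged, is a design choice of this sketch: it avoids sending a file that
is not PostScript to the printer.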
You may view PostScript files directly on your computer screen using a
number of public domain and commercial packages. A package called GhostView
is available in the public domain for a number of computers for viewing
PostScript files. Those running Solaris may want to use the utility
'pageview', which comes with the operating system; Mac users may want to use
a commercial package called Freedom of the Press; and Amiga users may want
to use a public domain utility called 'Post'.

Before printing the graphs, look at the file you have saved your mail
message to. There should be no information preceding the line that says
'%!PS-Adobe'. If there is, you must delete everything before that line or
the file will not print properly.

V. REGISTRATION
------------------------------------------------------------------------------

You may become a registered user of GenMark by sending a mail message to the
GenMark server with the word 'registration' in the subject line of the
message, or on the first line of your message. Please include your name,
e-mail address, organization, and research interests in the body of your
message.

VI. REPORTING ERRORS / CONTACTING THE AUTHORS
------------------------------------------------------------------------------

You may contact the authors of this program to ask questions, send
suggestions, report bugs, request future features, or inquire about site
licenses by sending electronic mail to:

     mb56@prism.gatech.edu

... or to:

     gt1619a@prism.gatech.edu

Appendix A. - MATRICES AVAILABLE FOR USE
------------------------------------------------------------------------------

This is a list of the matrices currently available for use by GenMark. The
matrix is specified using the 'species' option. Use 'species <name>', where
<name> is the name of one of the matrices listed below:

Matrix Name   Basis
-----------   --------------------------------------------------------------
ecoli         E. coli moderately to highly expressed genes (default)
hiexpress     E. coli very highly expressed (constitutive) genes
ecophage      Coliphage, plasmid, and horizontally transferred E. coli genes
human         Human DNA samples (small set)

Addendum A. - NOTICE OF LIMIT ON SIZE OF NCBI SIMILARITY SEARCHES
------------------------------------------------------------------------------

At the request of NCBI, we have placed certain limits on the size of
sequences we can accept for automatic analysis by BLAST. Currently,
sequences shorter than 5,000 base pairs are analyzed by GenMark and sent
immediately to the BLAST service for a similarity search. Sequences between
5,000 and 10,000 base pairs in length are held in a queue to be sent to
BLAST during the night. Any sequence longer than 10,000 base pairs is not
permitted to be sent to BLAST automatically; instead, its predicted coding
regions will be sent to you directly. Once you have received these
sequences, you may review them and manually send interesting sequences to
BLAST (the sequences you receive are already in the format accepted by
BLAST; just cut the appropriate sequence out and paste it into your BLAST
query).

We are sorry for any inconvenience...

                         - Dr M Borodovsky, mb56@prism.gatech.edu
                           J McIninch, gt1619a@prism.gatech.edu
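
The size limits in the addendum above can be checked before a request is
mailed. The following is a minimal sketch in Python, not part of GenMark:
the function names and return strings are hypothetical, and the treatment of
the exact boundary values (5,000 and 10,000 base pairs both fall in the
overnight queue) is an assumption read from the wording "less than 5,000",
"between 5,000 and 10,000", and "longer than 10,000". The length is counted
from letters only, following the description of the 'data' keyword.

```python
def count_bases(data_text: str) -> int:
    """Count the letters in the text that follows the 'data' keyword.

    Numbers, spaces, and punctuation are skipped, matching the statement
    that only letters are recognized as sequence.
    """
    return sum(1 for ch in data_text if ch.isalpha())


def blast_handling(length_bp: int) -> str:
    """Classify a sequence length against the stated BLAST size limits.

    Assumption: lengths of exactly 5,000 and exactly 10,000 base pairs
    are treated as belonging to the overnight queue.
    """
    if length_bp < 0:
        raise ValueError("length must not be negative")
    if length_bp < 5000:
        return "sent to BLAST immediately"
    if length_bp <= 10000:
        return "queued for BLAST overnight"
    return "not sent to BLAST; coding regions returned to submitter"


if __name__ == "__main__":
    # Hypothetical numbered sequence fragment; the digits are not counted.
    sample = "1 atggcgacc 10\n11 ttgacgtaa 19"
    length = count_bases(sample)
    print(length, blast_handling(length))
```

A submitter with a sequence just over a limit could use a check like this to
decide whether to split the sequence or to run the BLAST query by hand.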