Genome Information Research Center, Osaka Univ.

GenMark e-mail server


                                  GenMark
                            Version 1.26ux Beta
                         M. Borodovsky, J. McIninch
                          ------------------------

	The GenMark server provides an e-mail based gene finding service
	using the GenMark algorithm (based on non-stationary Markov chain
	models), which was developed at the Georgia Institute of Technology
	(Atlanta, GA) and at the Institute of Molecular Genetics (Moscow).

	The GenMark server accepts messages containing DNA sequences in
	a very simple format. The submitter may select the model used by
	the algorithm by species name and may specify parameters affecting
	the complexity of the analysis.


I.   SENDING DATA TO GENMARK
------------------------------------------------------------------------------
	Sending data to GenMark was designed to be as simple as possible.
Sequences to be analyzed by GenMark must be sent via e-mail to the following
network mail address:

	genmark@ford.gatech.edu

... with the word 'genmark' in the 'Subject:' field of the mail message (an
example follows later in this text). There is only one mandatory keyword,
the word 'data', which must precede the actual text of the sequence to be
analyzed. Make sure that nothing follows the keyword 'data' except sequence
information: GenMark assumes that any alphabetic characters it sees there
are part of the sequence.

	How you get the sequence text into your mail message will depend on
the computer setup that you are using to send and receive electronic mail. If
you are unsure about how to include a text file in a mail message, contact
your computer coordinator or somebody versed in the nuances of sending e-mail.
Below is an example of sending a hypothetical sequence named JM082671 to
GenMark from a UNIX-based computer (note that the '%' symbol is the prompt
used by the system):

% mail genmark@ford.gatech.edu
Subject: genmark
data
~r JM082671
"JM082671" 100/7600 included.
.
(EOT)
%

	When GenMark receives this message, it will analyze it and send back
a minimal response. In this case, it will use data based on E. coli (the
default organism) to analyze the sequence and send back an assessment of the
open reading frames that seem likely candidates as genes. You may, however,
wish to get more information to get a better picture of what's going on. In
that case, read the segment below about giving options and commands to
GenMark.
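
	The minimal message described above can also be composed by a
program. The following is only a sketch in Python: the function name
'minimal_request' and the sample sequence are made up for illustration, and
the code builds the message without sending it.

```python
from email.message import EmailMessage

# Address taken from this document; everything else here is illustrative.
GENMARK_ADDRESS = "genmark@ford.gatech.edu"


def minimal_request(sequence: str) -> EmailMessage:
    """Build the smallest valid request: subject 'genmark', then 'data',
    then the sequence text."""
    msg = EmailMessage()
    msg["To"] = GENMARK_ADDRESS
    msg["Subject"] = "genmark"
    # Nothing but sequence text may follow the 'data' keyword.
    msg.set_content("data\n" + sequence.strip() + "\n")
    return msg


# Made-up placeholder sequence, not real data.
msg = minimal_request("ATGAAACGCATTAGCACCACCATTACCACCACCATCACCTGA")
print(msg)
```

Sending the message is left to whatever mail setup you normally use.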


II.  OPTIONS AND COMMANDS
------------------------------------------------------------------------------
	GenMark has a large number of options that you may activate by
specifying them in your e-mail message before the keyword 'data'. Please read
this section very carefully before using these options. GenMark is not
case-sensitive with regard to the options that it recognizes.

	These are all the keywords recognized by GenMark (in order of relative
importance), followed by explanations for each keyword:

	data		species		title		psgraph
	protein		nucseq		orflist		address
	name		via		order		window
	step		threshold	#		;

Explanations:

"DATA"
	This is the only mandatory keyword for messages sent to GenMark.
	It is interpreted by GenMark as an indication that all text that
	follows is nucleotide sequence information to be analyzed. Only
	letters are recognized, so any numbers, spaces, or punctuation will
	be filtered out before the sequence is analyzed.
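
	The effect of this filtering can be previewed before a sequence is
	sent. The Python sketch below is an assumption about the behavior
	described above (keep letters, drop everything else); it is not
	GenMark's own code, and 'filter_sequence' is a made-up name.

```python
def filter_sequence(text: str) -> str:
    """Keep only ASCII letters, dropping digits, spaces and punctuation."""
    return "".join(ch for ch in text if ch.isascii() and ch.isalpha())


# A numbered, space-separated listing of the kind many programs produce.
raw = "1 atgc ggta 10\n11 ttaa,"
print(filter_sequence(raw))
```

	Note that any stray words after 'data' would survive such a filter,
	which is why only sequence text should follow the keyword.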

"SPECIES"
	This permits the user to specify what matrix is used to analyze
	the submitted sequence. After the keyword 'species' the user should
	place the name of the matrix he wants to use (there's a list of
	available matrices at the end of this document). There may be a
	number of matrices available for a single species, or even one
	matrix that represents several. The default species is 'ecoli'.

"TITLE"
	This specifies the title you wish to give the sequence. This string
	appears in all responses from GenMark and is very important in
	identifying which response corresponds to which sequence. The
	default title is ''.

"PSGRAPH"
	This instructs GenMark to send graphical output of its results to
	you in the form of a PostScript file, which may be printed on any
	PostScript compatible laser printer or viewed with programs such as
	GhostScript. The 'psgraph' option also accepts an optional argument
	'-r' (or 'region'), which instructs GenMark to draw gray bars
	indicating regions suggested for homology search (i.e., use
	'psgraph region'). By default, this feature is turned off.
	NOTE: PostScript responses can be quite large; for every 2K of
	sequence you are likely to receive 12K of text. So, keep large
	sequences to a minimum when using this feature.

"PROTEIN"
	The option 'protein' instructs GenMark to do something with trans-
	lated segments of the sequence wherein GenMark has predicted the
	presence of a coding region. 'protein' allows for three arguments:
	'region', 'orf', or 'send' (which may be abbreviated '-r', '-o',
	and '-s' respectively). The arguments may be combined. The argument
	'region' instructs GenMark to reply with a list of the regions
	containing possible coding regions from stop codon to stop codon.
	The 'orf' argument asks GenMark to reply with a list of open
	reading frames (splice site to splice site) where a coding region
	is predicted. The 'send' option instructs GenMark to send the
	open reading frames in which coding regions are predicted to an
	NCBI/NIH BLAST server for immediate homology search. By default,
	this option is turned off. 'protein' by itself is regarded to be
	the same as specifying 'protein region'.

"NUCSEQ"
	This option operates EXACTLY as 'protein' does with the exception
	that it operates with nucleotide sequences instead of protein
	translations. This also means that the program 'blastn' is used
	rather than 'blastp' for homology searches initiated by the '-s'
	argument.

"ORFLIST"
	This allows the user to specify the format of the open reading frame
	list sent by GenMark. 'orflist' permits three optional arguments:
	'region', 'orf', or 'off' ('region' may be abbreviated '-r', and
	'orf', '-o'). The argument 'off' instructs GenMark not to reply with
	an open reading frame list. 'region' instructs GenMark to cite
	regions (from stop codon to stop codon) in which a coding region was
	predicted. 'orf' instructs GenMark to reply with a complete and
	detailed assessment of regions between splice sites that are predicted
	to contain coding regions. By default, GenMark treats all requests
	as if 'orflist orf' was specified.

"ADDRESS"
	This allows the user to specify an alternate e-mail address for
	GenMark to reply to. This may be important if you need to direct the
	mail somewhere else, or if GenMark has a difficult time trying to
	extract your e-mail address from the header of the electronic mail
	it receives. By default, GenMark sends replies back to their origin.

"NAME"
	This allows the user to specify their name. The name, if available,
	is tacked onto replies from GenMark. This may be important for
	situations where a number of people are using the same e-mail address.
	This way, the reply may be readily identified as to whom it belongs.

"VIA"
	The 'via' option instructs GenMark to query a foreign computer
	regarding information that it has processed. Currently, the 'via'
	option supports two arguments: 'blastp' and 'blastn'. 'blastp'
	sends the results from GenMark to BLAST for a protein homology
	search, and 'blastn' sends the results from GenMark to BLAST for a
	nucleotide homology search. 'via blastp' is equivalent to using
	the options 'psgraph region', 'protein send', and
	'orflist region orf'. 'via blastn' is equivalent to using the
	options 'psgraph region', 'nucseq send', and 'orflist region orf'.
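
	The two equivalences above amount to a small lookup table. As an
	illustration only (the names 'VIA_EQUIVALENTS' and 'expand_via' are
	made up, and this is not how GenMark itself is implemented), they
	could be written in Python as:

```python
# Option lines that each 'via' argument stands for, as listed above.
VIA_EQUIVALENTS = {
    "blastp": ["psgraph region", "protein send", "orflist region orf"],
    "blastn": ["psgraph region", "nucseq send", "orflist region orf"],
}


def expand_via(argument: str) -> list[str]:
    """Return the option lines equivalent to 'via <argument>'."""
    # GenMark is not case-sensitive, so normalize before the lookup.
    return list(VIA_EQUIVALENTS[argument.lower()])


for line in expand_via("BLASTP"):
    print(line)
```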

"ORDER"
	Allows the user to specify the order of the Markov chain employed by
	GenMark to analyze the sequence (see references). By default, GenMark
	will use what is the optimal order for the request. GenMark will only
	permit the use of certain orders of matrices and if your request
	doesn't fall within the range available for that species, the optimal
	order will be used. Generally speaking, it should be unnecessary to
	specify this parameter.

"WINDOW"
	This option allows the user to specify the length of the analysis
	window (in nucleotides) that GenMark uses in its algorithm. The
	default is 96 and is generally the best value. Smaller values will
	produce a higher rate of false signals, but may show smaller coding
	regions. Larger values may not permit GenMark to recognize smaller
	coding regions, but will clarify the coding signal of larger regions.
	Generally speaking, it shouldn't be necessary to modify this value.

"STEP"
	This option allows the user to specify the step, in nucleotides, used
	by the algorithm to advance the scanning analysis window. By default,
	the step value is 12. It should not, generally, be necessary to
	modify this value. NOTE: Changing this value will alter the resolution
	of the graphical output. Decreasing this value increases the graph
	resolution and increases the size of the PostScript output.
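
	To see how 'window' and 'step' interact, the Python sketch below
	counts the window positions for a sequence of a given length. It
	assumes that only complete windows are scored, which this document
	does not state; 'window_starts' is a made-up name.

```python
def window_starts(seq_len: int, window: int = 96, step: int = 12) -> list[int]:
    """Return the 0-based start of every full window that fits in the sequence."""
    if seq_len < window:
        return []
    return list(range(0, seq_len - window + 1, step))


# Hypothetical 7,600-nucleotide sequence with the default settings.
print(len(window_starts(7600)))
# Halving the step roughly doubles the number of points on the graph.
print(len(window_starts(7600, step=6)))
```

	More window positions mean more points to plot, which is why a
	smaller step produces a larger PostScript file.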

"THRESHOLD"
	This option allows the user to specify a value between 0 and 1 which
	will be used to judge whether a given region should be predicted as
	coding. By default, this threshold is 0.6 (60%), meaning that any
	region yielding a sustained coding probability above 60% will be
	judged coding. The value 0.6 provides very good results; however, you
	may wish to alter this value to suit your taste.
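
	The idea of a "sustained" probability above the threshold can be
	illustrated with a short Python sketch. This document does not say
	how GenMark defines "sustained"; the 'min_run' parameter (a minimum
	number of consecutive window scores) is purely an assumption made
	for the example, as are the scores themselves.

```python
def coding_runs(probs, threshold=0.6, min_run=3):
    """Return (start, end) index pairs, end exclusive, of runs of at least
    min_run consecutive scores strictly above threshold."""
    runs = []
    start = None
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the score list.
    if start is not None and len(probs) - start >= min_run:
        runs.append((start, len(probs)))
    return runs


# Made-up window scores, not real GenMark output.
scores = [0.1, 0.7, 0.8, 0.9, 0.65, 0.2, 0.7, 0.3, 0.9, 0.9, 0.9]
print(coding_runs(scores))
print(coding_runs(scores, threshold=0.85))
```

	Raising the threshold leaves fewer and shorter runs; lowering it
	admits weaker signals.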

"#", or ";"
	This marks a comment. This line is to be ignored.


III. AN EXAMPLE OF SENDING A SEQUENCE TO GENMARK USING EVERY OPTION
------------------------------------------------------------------------------
	Here's an example of a hypothetical sequence, 'JM082671', being sent
to GenMark from a UNIX-based machine and employing all the options detailed
above (note, '%' is the prompt used by the system):

% mail genmark@ford.gatech.edu
Subject: genmark
# NOTE: the '#' and ';' symbols must be at the beginning of
# the line to indicate that the line is a comment.

title Sequence A, JM082671
species ecoli
window 96
step 12
threshold 0.5
name John Doe
address j_doe@really.big.lab.com
order 4
protein region orf send
nucseq region orf send
psgraph region
orflist orf region
data
~r JM082671
"JM082671" 100/7600 included.
.
(EOT)
%

NOTE: GenMark is not case-sensitive, and it doesn't care what order the
arguments to an option appear in.
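
	A message body like the one above can also be assembled by a
program. The Python sketch below is illustrative only: the function name
'build_request_body' is made up, the keyword set is copied from section II,
and the sequence is a placeholder. It places every option line before the
'data' keyword, as GenMark requires.

```python
# Keywords from the list in section II; 'data' is added by the function itself.
KNOWN_KEYWORDS = {
    "species", "title", "psgraph", "protein", "nucseq", "orflist",
    "address", "name", "via", "order", "window", "step", "threshold",
}


def build_request_body(options, sequence):
    """Return the message text: one option per line, then 'data', then the sequence."""
    lines = []
    for keyword, value in options:
        if keyword.lower() not in KNOWN_KEYWORDS:
            raise ValueError(f"unknown GenMark keyword: {keyword!r}")
        lines.append(f"{keyword} {value}".rstrip())
    lines.append("data")
    lines.append(sequence)
    return "\n".join(lines) + "\n"


body = build_request_body(
    [
        ("title", "Sequence A, JM082671"),
        ("species", "ecoli"),
        ("threshold", "0.5"),
        ("psgraph", "region"),
    ],
    "ATGAAACGCATTAGCTGA",  # made-up placeholder sequence
)
print(body)
```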


IV.  INTERPRETING THE RESULTS
------------------------------------------------------------------------------
Open Reading Frame Assessments

	Assessments are given in the form of a chart which may differ depend-
ing on the type of organism (eg, eukaryote versus prokaryote). The chart in-
dicates in the first two columns the left and right ends of the open reading
frame. The third column indicates the region's orientation and frame. A fourth
column indicates the mean probability function in that interval (a percent
result based on Bayes' theorem).

	For prokaryotic sequences, there are two more columns.  For each ORF,
a value should appear in one of those columns and is an evaluation of the
probability that that end of the region is the true start of that open reading
frame.

	For eukaryotic sequences, there are two left and two right ends
indicated for each predicted coding region. They represent possible splice
sites. The probability value represents the mean probability function on the
interval between the innermost two splice sites.

Graphs

	The graphs are laid out as six different reading frames. The top three
panels represent the sequence in the direct orientation, and the bottom three
represent the sequence in the complementary direction.
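
	For readers unfamiliar with the six reading frames, the Python sketch
below derives them from a sequence: three offsets on the direct strand and
three on its reverse complement. The order in which the frames are returned
is an assumption for illustration and need not match the order of the panels
on the page.

```python
# Translation table for complementing bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq: str) -> list[str]:
    """Return the sequence at offsets 0, 1, 2 on each strand."""
    rc = reverse_complement(seq)
    return [seq[i:] for i in range(3)] + [rc[i:] for i in range(3)]


# Made-up placeholder sequence.
for frame in six_frames("AAACCC"):
    print(frame)
```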

	The y-axis of each panel ranges from 0 to 1 and is considered the
probability that a given segment of the sequence is coding (based on stat-
istics derived from an appropriate database).

	The x-axis of each panel represents the nucleotide positions within
the sequence. The nucleotide positions are indicated at the bottom of the
page.

	If you selected the '-r' or 'region' option pertaining to the graph,
you will see grey bars on the page indicating regions in which GenMark has
predicted that a gene may reside. The probability function need not be very
strong for this indication to appear. These bars indicate the regions that
would be sent to BLAST if you used the 'send' or '-s' arguments to the
'protein' or 'nucseq' options.

	If the graph was analyzed using prokaryotic data, then a series of
marks should appear at the 0.5 level in each panel denoting start and stop
marks in that frame. A 1.5mm stroke up from the 0.5 level indicates the codon
ATG, a 0.75mm stroke up from the 0.5 level indicates the codon GTG, and a
1.5mm downward stroke from the 0.5 level indicates one of the stop codons:
TGA, TAG, or TAA. A horizontal bar along the 0.5 level indicates an open
reading frame extending from a start codon to a stop codon.
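
	The start and stop marks described above correspond to a simple scan
of each frame, codon by codon. The Python sketch below shows such a scan; it
illustrates the codon classes only and is not GenMark's drawing code. The
function name 'frame_marks', the 0-based positions, and the sample sequence
are made up for the example.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}


def frame_marks(seq: str, frame: int) -> list[tuple[int, str]]:
    """Return (0-based position, mark) pairs for start and stop codons
    found in the given frame (0, 1 or 2) of the direct strand."""
    seq = seq.upper()
    marks = []
    for pos in range(frame, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon == "ATG":
            marks.append((pos, "ATG start"))   # long upward stroke
        elif codon == "GTG":
            marks.append((pos, "GTG start"))   # short upward stroke
        elif codon in STOP_CODONS:
            marks.append((pos, "stop"))        # downward stroke
    return marks


# Made-up placeholder sequence.
print(frame_marks("ATGAAAGTGTAA", 0))
print(frame_marks("ATGAAAGTGTAA", 1))
```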

	If the graph was analyzed using eukaryotic information, then a 1.5mm
stroke at the 0.5 level indicates a stop codon, a '<' symbol at the 0.5 level
indicates the beginning of a splice region, and a '>' symbol at the 0.5 level
indicates the end of a possible splice region.

	Printing PostScript graphics will depend on the computer and
operating system that you are using to print them. The text sent by GenMark
is a PostScript program which should be sent directly to a PostScript
compatible printer. You may view PostScript files directly on your computer
screen using a number of public domain and commercial packages. A package
called GhostView is available in the public domain for a number of computers
for viewing PostScript files. Those running Solaris may want to use the
utility 'pageview', which comes with the operating system; Mac users may want
to use a commercial package called Freedom of the Press; and Amiga users may
want to use a public domain utility called 'Post'.

	Before printing the graphs, look at the file you have saved your mail
message to. There should be no information preceding the line that says
'%!PS-Adobe'. If there is, you must delete everything before that line, or
the file will not print properly.
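
	Removing the mail headers can be done in any text editor, or with a
few lines of code. The Python sketch below is one possible way to do it; the
function name 'strip_mail_preamble' and the sample text are made up for the
example.

```python
def strip_mail_preamble(text: str) -> str:
    """Drop every line before the first one that starts with '%!PS-Adobe'."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith("%!PS-Adobe"):
            return "".join(lines[i:])
    raise ValueError("no '%!PS-Adobe' line found")


# Made-up saved mail message: two header lines, then the PostScript program.
saved = "From: genmark\nSubject: results\n\n%!PS-Adobe\nshowpage\n"
print(strip_mail_preamble(saved))
```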


V.   REGISTRATION
------------------------------------------------------------------------------
	You may become a registered user of GenMark by sending a mail message
to the GenMark server with the word 'registration' in the subject line of the
message, or on the first line of your message.

	Please include your name, e-mail address, organization, and research
interests in the body of your message.


VI.  REPORTING ERRORS / CONTACTING THE AUTHORS
------------------------------------------------------------------------------
	You may contact the authors of this program to ask questions, send
suggestions, report bugs, request future features, or inquire about site
licenses by sending electronic mail to:

	mb56@prism.gatech.edu

... or to:

	gt1619a@prism.gatech.edu


Appendix A. - MATRICES AVAILABLE FOR USE
------------------------------------------------------------------------------
	This is a list of the matrices currently available for use by GenMark.
The matrix is specified using the 'species' option: use 'species' followed by
the name of one of the matrices listed below:

Matrix Name	Basis
-----------	--------------------------------------------------------------
ecoli		E. coli moderately to highly expressed genes (default)
hiexpress	E. coli very highly expressed (constitutive) genes
ecophage	Coliphage, plasmid, and horizontally transferred E. coli genes
human		Human DNA samples (small set)


Addendum A. - NOTICE OF LIMIT ON SIZE OF NCBI SIMILARITY SEARCHES
------------------------------------------------------------------------------
	At the request of NCBI, we have placed certain limits on the size of
sequences we can accept for automatic analysis by BLAST.

	Currently, sequences less than 5,000 base pairs long are analyzed by
GenMark and sent immediately to the BLAST service for a similarity search.
Sequences between 5,000 and 10,000 base pairs in length are held in a queue to
be sent to BLAST during the night. Any sequence longer than 10,000 base pairs
is not permitted to be automatically sent to BLAST.

	Sequences longer than 10,000 base pairs will have their predicted
coding regions sent to you directly. Once you have received these sequences,
you may review them and manually send interesting sequences to BLAST (the
sequences you receive are already in the format accepted by BLAST; just cut
the appropriate sequence out and paste it into your BLAST query).
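
	The size limits above can be summarized as a short Python function.
This is a sketch of the stated policy, not code from the server; the name
'blast_routing' is made up, and treating lengths of exactly 5,000 and exactly
10,000 base pairs as "queued" is an assumption, since the text does not say
which side of each limit they fall on.

```python
def blast_routing(length_bp: int) -> str:
    """Say what happens to a sequence of the given length when BLAST is requested."""
    if length_bp < 5000:
        return "sent to BLAST immediately"
    if length_bp <= 10000:
        return "queued for BLAST overnight"
    return "not sent to BLAST; coding regions returned to the user"


for n in (1200, 7600, 25000):
    print(n, "->", blast_routing(n))
```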

	We are sorry for any inconvenience...
				- Dr M Borodovsky, mb56@prism.gatech.edu
				  J McIninch, gt1619a@prism.gatech.edu