******** Announcement of the NetGene Mail-server: *********
DESCRIPTION:
The NetGene mail server is a service producing neural network
predictions of splice sites in vertebrate genes as described in:
Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of
Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal
of Molecular Biology, 220, 49-65.
ABSTRACT OF JMB ARTICLE:
Artificial neural networks have been applied to the prediction of
splice site location in human pre-mRNA. A joint prediction scheme
where prediction of transition regions between introns and exons
regulates a cutoff level for splice site assignment was able to
predict splice site locations with confidence levels far better than
previously reported in the literature. The problem of predicting
donor and acceptor sites in human genes is hampered by the presence
of numerous amounts of false positives - in the paper the
distribution of these false splice sites is examined and linked to a
possible scenario for the splicing mechanism in vivo. When the
presented method detects 95% of the true donor and acceptor sites it
makes less than 0.1% false donor site assignments and less than 0.4%
false acceptor site assignments. For the large data set used in this
study this means that on the average there are one and a half false
donor sites per true donor site and six false acceptor sites per true
acceptor site. With the joint assignment method more than a fifth of
the true donor sites and around one fourth of the true acceptor sites
could be detected without accompaniment of any false positive
predictions. Highly confident splice sites could not be isolated
with a widely used weight matrix method or by separate splice site
networks. A complementary relation between the confidence levels of
the coding/non-coding and the separate splice site networks was
observed, with many weak splice sites having sharp transitions in the
coding/non-coding signal and many stronger splice sites having more
ill-defined transitions between coding and non-coding.
INSTRUCTIONS:
To use the NetGene mail server:
1) Prepare a file with the sequence in a format similar to FASTA:
the first line must start with the symbol '>', and the next word on
that line is used as the sequence identifier. The following lines
should contain the actual sequence, consisting of the symbols A, T,
U, G, C and N. U is converted to T, and any other letter is converted
to N. All letters are converted to upper case. Numbers, blanks and
other non-letter symbols are skipped. Lines should not be longer
than 80 characters. The minimum length analyzed is 451 nucleotides,
and the maximum is 100000 nucleotides (your mail system may have a
lower limit for the maximum size of a message). Due to the non-local
nature of the algorithm, sites closer than 225 nucleotides to the
ends of the sequence will not be assigned.
2) Mail the file to netgene@virus.fki.dth.dk. The response time
depends on system load. If nothing else is running on the machine,
the speed is about 1000 nucleotides/min. It may take several hours
before you get the answer, so please do not resubmit a job if no
answer arrives within a short while.
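The input rules in step 1 can be checked locally before mailing a sequence.
The following Python sketch is not part of NetGene; the function names, the
default line width of 70 and the reading of the 225-nucleotide margin as
1-based inclusive positions are assumptions made for illustration.

```python
MIN_LENGTH = 451       # minimum length analyzed, per the instructions
MAX_LENGTH = 100_000   # maximum length analyzed, per the instructions
EDGE_MARGIN = 225      # sites closer than this to either end are not assigned
MAX_LINE = 80          # lines should not be longer than 80 characters


def normalize_sequence(raw: str) -> str:
    """Apply the stated clean-up rules locally (a sketch, not the server's code)."""
    cleaned = []
    for ch in raw:
        if not (ch.isascii() and ch.isalpha()):
            continue              # numbers, blanks and other non-letters are skipped
        ch = ch.upper()           # all letters are converted to upper case
        if ch == "U":
            ch = "T"              # U is converted to T
        elif ch not in "ATGCN":
            ch = "N"              # letters not mentioned are converted to N
        cleaned.append(ch)
    return "".join(cleaned)


def format_submission(identifier: str, raw: str, width: int = 70) -> str:
    """Return submission text: a '>identifier' line plus wrapped sequence lines."""
    if not identifier or any(c.isspace() for c in identifier):
        raise ValueError("identifier must be a single word")
    if len(identifier) >= MAX_LINE or not 0 < width <= MAX_LINE:
        raise ValueError(f"lines must not exceed {MAX_LINE} characters")
    seq = normalize_sequence(raw)
    if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
        raise ValueError(
            f"sequence length {len(seq)} is outside {MIN_LENGTH}..{MAX_LENGTH}"
        )
    lines = [">" + identifier]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def assignable_range(length: int) -> tuple[int, int]:
    """First and last position that can receive a site assignment.

    ASSUMPTION: positions are 1-based and a site needs at least EDGE_MARGIN
    nucleotides on each side; the announcement does not state the exact bounds.
    """
    return EDGE_MARGIN + 1, length - EDGE_MARGIN
```

Under this reading, a sequence of the minimum length (451 = 2 x 225 + 1) has
exactly one assignable position, which may be why 451 is the lower limit.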
REFERENCING AND FURTHER INFORMATION:
Publication of output from NetGene must be referenced as follows:
Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of
Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal
of Molecular Biology, 220, 49-65.
CONFIDENTIALITY:
Your submitted sequence will be deleted automatically as soon as
NetGene has finished processing it.
PROBLEMS AND SUGGESTIONS:
Should be addressed to:
Jacob Engelbrecht
e-mail: engel@virus.fki.dth.dk
Department of Physical Chemistry
The Technical University of Denmark
Building 206
DK-2800 Lyngby
Denmark
phone: +45 4288 2222 ext. 2478 (operator)
phone: +45 4593 1222 ext. 2478 (tone)
fax: +45 4593 4808
EXAMPLE:
A file test.seq is prepared in a text editor, with the following contents:
>HUMOPS
GGATCCTGAGTACCTCTCCTCCCTGACCTCAGGCTTCCTCCTAGTGTCACCTTGGCCCCTCTTAGAAGC
CAATTAGGCCCTCAGTTTCTGCAGCGGGGATTAATATGATTATGAACACCCCCAATCTCCCAGATGCTG
. Here come more lines with sequence.
.
.
The file is then sent to the NetGene mail server. On a Unix system, for example:
mail netgene@virus.fki.dth.dk < test.seq
The server returns an answer similar to this:
>From netgene@virus.fki.dth.dk Fri Mar 20 13:30 MET 1992
Received: by virus.fki.dth.dk
(16.7/16.2) id AA05624; Fri, 20 Mar 92 13:30:41 +0100
Date: Fri, 20 Mar 92 13:30:41 +0100
From: virus mail server
Return-Path:
To: engel@virus.fki.dth.dk
Subject: HUMOPS: NetGene splice site prediction
Status: RO
------------------------------------------------------------------------
NetGene
Neural Network Prediction of Splice Sites
Reference:
Brunak, S., Engelbrecht, J., and Knudsen, S. (1991). Prediction of
Human mRNA donor and acceptor sites from the DNA sequence. Journal of
Molecular Biology 220:49-65.
------------------------------------------------------------------------
Report ERRORS to Jacob Engelbrecht engel@virus.fki.dth.dk.
Potential splice sites are assigned by combining output from a local and
a global network. The prediction is made with two cutoffs: 1) Highly
confident sites (no or few false positives, on average 50% of the true
sites detected); 2) Nearly all true sites (more false positives - on
average of all positions 0.1% false positive donor sites and 0.4% false
positive acceptor sites, at 95% detection of true sites). The network
performance on sequences from distantly related organisms has not been
quantified. Due to the non-local nature of the algorithm sites closer
than 225 nucleotides to the ends of the sequence cannot be assigned.
Column explanations, field identifiers:
POSITION in your sequence (either first or last base in intron).
Joint CONFIDENCE level for the site (relative to the cutoff).
EXON INTRON gives 20 bases of sequence around the predicted site.
LOCAL is the site confidence from the local network.
GLOBAL is the site confidence from the global network.
------------------------------------------------------------------------
The sequence: HUMOPS contains 6953 bases, and has the following composition:
A 1524 C 2022 G 1796 T 1611
1) HIGHLY CONFIDENT SITES:
==========================
ACCEPTOR SITES:
POSITION  CONFIDENCE  INTRON     EXON        LOCAL  GLOBAL
4094            0.27  TGTCCTGCAG^GCCGCTGCCC   0.63    0.66
5167            0.20  TGCCTTCCAG^TTCCGGAACT   0.59    0.64
3812            0.17  CTGTCCTCAG^GTACATCCCC   0.68    0.54
3164            0.02  TCCTCCTCAG^TCTTGCTAGG   0.79    0.32
2438            0.01  TGCCTTGCAG^GTGAAATTGC   0.78    0.33
DONOR SITES:
POSITION  CONFIDENCE  EXON       INTRON      LOCAL  GLOBAL
3979            0.38  CGTCAAGGAG^GTACGGGCCG   0.92    0.74
2608            0.17  GCTGGTCCAG^GTAATGGCAC   0.85    0.54
4335            0.06  GAACAAGCAG^GTGCCTACTG   0.83    0.41
2) NEARLY ALL TRUE SITES:
=========================
ACCEPTOR SITES:
POSITION  CONFIDENCE  INTRON     EXON        LOCAL  GLOBAL
4094            0.55  TGTCCTGCAG^GCCGCTGCCC   0.63    0.66
3812            0.52  CTGTCCTCAG^GTACATCCCC   0.68    0.54
3164            0.49  TCCTCCTCAG^TCTTGCTAGG   0.79    0.32
5167            0.49  TGCCTTCCAG^TTCCGGAACT   0.59    0.64
2438            0.48  TGCCTTGCAG^GTGAAATTGC   0.78    0.33
4858            0.39  TCATCCATAG^AAAGGTAGAA   0.77    0.20
3712            0.36  CCTTTTCCAG^GGAGGGAATG   0.88   -0.01
4563            0.33  CCCTCCACAG^GTGGCTCAGA   0.81    0.05
5421            0.33  TTTTTTTAAG^AAATAATTAA   0.75    0.13
3783            0.29  TCCCTCACAG^GCAGGGTCTC   0.64    0.26
3173            0.25  GTCTTGCTAG^GGTCCATTTC   0.52    0.36
4058            0.24  CTCCCTGGAG^GAGCCATGGT   0.43    0.51
1784            0.22  TCACTGTTAG^GAATGTCCCA   0.68    0.08
6512            0.21  CCCTTGCCAG^ACAAGCCCAT   0.67    0.08
2376            0.20  CCCTGTCTAG^GGGGGAGTGC   0.61    0.16
1225            0.18  CCCCTCTCAG^CCCCTGTCCT   0.65    0.07
1743            0.13  TTCTCTGCAG^GGTCAGTCCC   0.62    0.03
3834            0.13  GGGCCTGCAG^TGCTCGTGTG   0.26    0.58
4109            0.13  TGCCCAGCAG^CAGGAGTCAG   0.29    0.54
6557            0.13  CATTCTGGAG^AATCTGCTCC   0.56    0.12
1638            0.11  CCATTCTCAG^GGAATCTCTG   0.62    0.00
247             0.10  GCCTTCGCAG^CATTCTTGGG   0.55    0.11
6766            0.09  CTATCCACAG^GATAGATTGA   0.64   -0.06
906             0.08  AATTTCACAG^CAAGAAAACT   0.61   -0.02
6499            0.08  CAGTTTCCAG^TTTCCCTTGC   0.55    0.06
378             0.07  GTACCCACAG^TACTACCTGG   0.24    0.52
3130            0.07  CTGTCTCCAG^AAAATTCCCA   0.51    0.12
4272            0.07  ACCATCCCAG^CGTTCTTTGC   0.58    0.00
4522            0.07  TGAATCTCAG^GGTGGGCCCA   0.51    0.12
5722            0.07  ACCCTCGCAG^CAGCAGCAAC   0.55    0.05
2316            0.06  CTTCCCCAAG^GCCTCCTCAA   0.40    0.27
2357            0.06  GCCTTCCTAG^CTACCCTCTC   0.39    0.28
2908            0.06  TTTGGTCTAG^TACCCCGGGG   0.51    0.10
4112            0.06  CCAGCAGCAG^GAGTCAGCCA   0.25    0.50
1327            0.05  TTTGCTTTAG^AATAATGTCT   0.52    0.06
844             0.04  GTTTGTGCAG^GGCTGGCACT   0.62   -0.11
1045            0.04  TCCCTTGGAG^CAGCTGTGCT   0.54    0.01
1238            0.03  CTGTCCTCAG^GTGCCCCTCC   0.50    0.06
2976            0.03  CCTAGTGCAG^GTGGCCATAT   0.62   -0.12
3825            0.03  CATCCCCGAG^GGCCTGCAGT   0.16    0.60
1508            0.02  TGAGATGCAG^GAGGAGACGC   0.43    0.16
2257            0.02  CTCTCCTCAG^CGTGTGGTCC   0.53    0.00
5712            0.02  ATCCTCTCAG^ACCCTCGCAG   0.51    0.05
2397            0.00  CCCTCCTTAG^GCAGTGGGGT   0.41    0.16
4800            0.00  CATTTTCTAG^CTGTATGGCC   0.47    0.07
5016            0.00  TGCCTAGCAG^GTTCCCACCA   0.59   -0.11
DONOR SITES:
POSITION  CONFIDENCE  EXON       INTRON      LOCAL  GLOBAL
3979            0.75  CGTCAAGGAG^GTACGGGCCG   0.92    0.74
2608            0.51  GCTGGTCCAG^GTAATGGCAC   0.85    0.54
4335            0.38  GAACAAGCAG^GTGCCTACTG   0.83    0.41
656             0.32  ACCCTGGGCG^GTATGAGCCG   0.56    0.66
5859            0.11  ACCAAAAGAG^GTGTGTGTGT   0.85    0.07
4585            0.09  GCTCACTCAG^GTGGGAGAAG   0.86    0.03
1708            0.06  TGGCCAGAAG^GTGGGTGTGC   0.85    0.01
6196            0.05  CCCAATGAGG^GTGAGATTGG   0.86   -0.01
667             0.03  TATGAGCCGG^GTGTGGGTGG   0.23    0.71
------------------------------------------------------------------------
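
For further processing, a returned report like the one above can be read
into records. The Python sketch below is an illustration only. It assumes
that the section headings and row layout match the example output exactly,
and the names Site and parse_report are hypothetical, not part of NetGene.

```python
import re
from typing import NamedTuple


class Site(NamedTuple):
    kind: str          # "acceptor" or "donor"
    cutoff: str        # "highly confident" or "nearly all"
    position: int      # POSITION column
    confidence: float  # joint CONFIDENCE relative to the cutoff
    context: str       # 20 bases, '^' marks the predicted junction
    local: float       # LOCAL network confidence
    global_: float     # GLOBAL network confidence


# ASSUMPTION: a data row is position, confidence, 10 bases + '^' + 10 bases,
# local score, global score, separated by whitespace (as in the example).
ROW = re.compile(
    r"^(\d+)\s+(-?\d+\.\d+)\s+([ACGTN]{10}\^[ACGTN]{10})"
    r"\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)$"
)


def parse_report(text: str) -> list[Site]:
    """Collect site rows, tracking which section each row belongs to."""
    sites = []
    cutoff = kind = None
    for line in text.splitlines():
        s = line.strip()
        if s.startswith("1) HIGHLY CONFIDENT SITES"):
            cutoff, kind = "highly confident", None
        elif s.startswith("2) NEARLY ALL TRUE SITES"):
            cutoff, kind = "nearly all", None
        elif s.startswith("ACCEPTOR SITES"):
            kind = "acceptor"
        elif s.startswith("DONOR SITES"):
            kind = "donor"
        else:
            m = ROW.match(s)
            if m and cutoff and kind:  # ignore rows outside a known section
                pos, conf, ctx, loc, glob = m.groups()
                sites.append(Site(kind, cutoff, int(pos), float(conf),
                                  ctx, float(loc), float(glob)))
    return sites
```

For example, filtering the result with `s.cutoff == "highly confident"`
would return only the sites from the first section of the report.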