******** Announcement of the NetGene Mail-server: *********

DESCRIPTION:

The NetGene mail server is a service producing neural network predictions of splice sites in vertebrate genes as described in:

Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal of Molecular Biology, 220, 49-65.

ABSTRACT OF JMB ARTICLE:

Artificial neural networks have been applied to the prediction of splice site location in human pre-mRNA. A joint prediction scheme where prediction of transition regions between introns and exons regulates a cutoff level for splice site assignment was able to predict splice site locations with confidence levels far better than previously reported in the literature. The problem of predicting donor and acceptor sites in human genes is hampered by the presence of numerous amounts of false positives - in the paper the distribution of these false splice sites is examined and linked to a possible scenario for the splicing mechanism in vivo. When the presented method detects 95% of the true donor and acceptor sites it makes less than 0.1% false donor site assignments and less than 0.4% false acceptor site assignments. For the large data set used in this study this means that on the average there are one and a half false donor sites per true donor site and six false acceptor sites per true acceptor site. With the joint assignment method more than a fifth of the true donor sites and around one fourth of the true acceptor sites could be detected without accompaniment of any false positive predictions. Highly confident splice sites could not be isolated with a widely used weight matrix method or by separate splice site networks.
A complementary relation between the confidence levels of the coding/non-coding and the separate splice site networks was observed, with many weak splice sites having sharp transitions in the coding/non-coding signal and many stronger splice sites having more ill-defined transitions between coding and non-coding.

INSTRUCTIONS:

In order to use the NetGene mail-server:

1) Prepare a file with the sequence in a format similar to the FASTA format: the first line must start with the symbol '>', and the next word on that line is used as the sequence identifier. The following lines should contain the actual sequence, consisting of the symbols A, T, U, G, C and N. U is converted to T, and letters not mentioned are converted to N. All letters are converted to upper case. Numbers, blanks and other non-letter symbols are skipped. The lines should not be longer than 80 characters. The minimum length analyzed is 451 nucleotides, and the maximum is 100000 nucleotides (your mail system may have a lower limit for the maximum size of a message). Due to the non-local nature of the algorithm, sites closer than 225 nucleotides to the ends of the sequence will not be assigned.

2) Mail the file to netgene@virus.fki.dth.dk. The response time will depend on system load. If nothing else is running on the machine, the speed is about 1000 nucleotides/min. It may take several hours before you get the answer, so please do not resubmit a job if you get no answer within a short while.

REFERENCING AND FURTHER INFORMATION

Publication of output from NetGene must be referenced as follows:

Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal of Molecular Biology, 220, 49-65.

CONFIDENTIALITY

Your submitted sequence will be deleted automatically immediately after processing by NetGene.
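
The input rules in step 1 of the INSTRUCTIONS can be sketched in Python, for checking a file locally before mailing it. This is an illustration only and not the server's own code: the function names normalize_submission and length_ok are hypothetical, and the handling of an empty identifier is an assumption. The sketch covers only the character conversion and the stated length limits; it does not model which positions the server will actually assign.

```python
MIN_LEN = 451      # minimum length analyzed, as stated in the instructions
MAX_LEN = 100000   # maximum length analyzed, as stated in the instructions


def normalize_submission(text):
    """Return (identifier, sequence) after applying the stated input rules.

    Hypothetical helper: mirrors the rules in the announcement, not the
    server's actual implementation.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line must start with '>'")
    # The next word after '>' is used as the sequence identifier.
    words = lines[0][1:].split()
    identifier = words[0] if words else ""  # assumption: empty if no word given
    out = []
    for line in lines[1:]:
        for ch in line:
            if not ch.isalpha():
                continue          # numbers, blanks and other non-letters are skipped
            ch = ch.upper()       # all letters are converted to upper case
            if ch == "U":
                ch = "T"          # U is converted to T
            elif ch not in ("A", "T", "G", "C", "N"):
                ch = "N"          # letters not mentioned are converted to N
            out.append(ch)
    return identifier, "".join(out)


def length_ok(sequence):
    """True if the sequence length lies within the stated limits."""
    return MIN_LEN <= len(sequence) <= MAX_LEN
```

For example, normalize_submission(open("test.seq").read()) would return the identifier and the cleaned sequence for the example file shown further down, and length_ok() on that sequence tells whether it is worth mailing.
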
PROBLEMS AND SUGGESTIONS:

Should be addressed to:

Jacob Engelbrecht
e-mail: engel@virus.fki.dth.dk
Department of Physical Chemistry
The Technical University of Denmark
Building 206
DK-2800 Lyngby
Denmark
phone: +45 4288 2222 ext. 2478 (operator)
phone: +45 4593 1222 ext. 2478 (tone)
fax: +45 4593 4808

EXAMPLE:

A file test.seq is prepared with an editor with the following contents:

>HUMOPS
GGATCCTGAGTACCTCTCCTCCCTGACCTCAGGCTTCCTCCTAGTGTCACCTTGGCCCCTCTTAGAAGC
CAATTAGGCCCTCAGTTTCTGCAGCGGGGATTAATATGATTATGAACACCCCCAATCTCCCAGATGCTG
.
Here come more lines with sequence.
.
.

This is sent to the NetGene mail-server, on a Unix system like this:

mail netgene@virus.fki.dth.dk < test.seq

In return an answer similar to this is produced:

>From netgene@virus.fki.dth.dk Fri Mar 20 13:30 MET 1992
Received: by virus.fki.dth.dk (16.7/16.2) id AA05624; Fri, 20 Mar 92 13:30:41 +0100
Date: Fri, 20 Mar 92 13:30:41 +0100
From: virus mail server
Return-Path:
To: engel@virus.fki.dth.dk
Subject: HUMOPS: NetGene splice site prediction
Status: RO

------------------------------------------------------------------------
NetGene Neural Network Prediction of Splice Sites

Reference: Brunak, S., Engelbrecht, J., and Knudsen, S. (1991). Prediction of Human mRNA donor and acceptor sites from the DNA sequence. Journal of Molecular Biology 220:49-65.
------------------------------------------------------------------------
Report ERRORS to Jacob Engelbrecht engel@virus.fki.dth.dk.

Potential splice sites are assigned by combining output from a local and a global network. The prediction is made with two cutoffs:

1) Highly confident sites (no or few false positives, on average 50% of the true sites detected);
2) Nearly all true sites (more false positives - on average of all positions 0.1% false positive donor sites and 0.4% false positive acceptor sites, at 95% detection of true sites).

The network performance on sequences from distantly related organisms has not been quantified.
Due to the non-local nature of the algorithm sites closer than 225 nucleotides to the ends of the sequence cannot be assigned.

Column explanations, field identifiers:

POSITION in your sequence (either first or last base in intron).
Joint CONFIDENCE level for the site (relative to the cutoff).
EXON INTRON gives 20 bases of sequence around the predicted site.
LOCAL is the site confidence from the local network.
GLOBAL is the site confidence from the global network.
------------------------------------------------------------------------

The sequence: HUMOPS contains 6953 bases, and has the following composition:

A 1524  C 2022  G 1796  T 1611

1) HIGHLY CONFIDENT SITES:
==========================

ACCEPTOR SITES:

POSITION  CONFIDENCE  INTRON     EXON        LOCAL  GLOBAL
    4094        0.27  TGTCCTGCAG^GCCGCTGCCC   0.63    0.66
    5167        0.20  TGCCTTCCAG^TTCCGGAACT   0.59    0.64
    3812        0.17  CTGTCCTCAG^GTACATCCCC   0.68    0.54
    3164        0.02  TCCTCCTCAG^TCTTGCTAGG   0.79    0.32
    2438        0.01  TGCCTTGCAG^GTGAAATTGC   0.78    0.33

DONOR SITES:

POSITION  CONFIDENCE  EXON       INTRON      LOCAL  GLOBAL
    3979        0.38  CGTCAAGGAG^GTACGGGCCG   0.92    0.74
    2608        0.17  GCTGGTCCAG^GTAATGGCAC   0.85    0.54
    4335        0.06  GAACAAGCAG^GTGCCTACTG   0.83    0.41

2) NEARLY ALL TRUE SITES:
=========================

ACCEPTOR SITES:

POSITION  CONFIDENCE  INTRON     EXON        LOCAL  GLOBAL
    4094        0.55  TGTCCTGCAG^GCCGCTGCCC   0.63    0.66
    3812        0.52  CTGTCCTCAG^GTACATCCCC   0.68    0.54
    3164        0.49  TCCTCCTCAG^TCTTGCTAGG   0.79    0.32
    5167        0.49  TGCCTTCCAG^TTCCGGAACT   0.59    0.64
    2438        0.48  TGCCTTGCAG^GTGAAATTGC   0.78    0.33
    4858        0.39  TCATCCATAG^AAAGGTAGAA   0.77    0.20
    3712        0.36  CCTTTTCCAG^GGAGGGAATG   0.88   -0.01
    4563        0.33  CCCTCCACAG^GTGGCTCAGA   0.81    0.05
    5421        0.33  TTTTTTTAAG^AAATAATTAA   0.75    0.13
    3783        0.29  TCCCTCACAG^GCAGGGTCTC   0.64    0.26
    3173        0.25  GTCTTGCTAG^GGTCCATTTC   0.52    0.36
    4058        0.24  CTCCCTGGAG^GAGCCATGGT   0.43    0.51
    1784        0.22  TCACTGTTAG^GAATGTCCCA   0.68    0.08
    6512        0.21  CCCTTGCCAG^ACAAGCCCAT   0.67    0.08
    2376        0.20  CCCTGTCTAG^GGGGGAGTGC   0.61    0.16
    1225        0.18  CCCCTCTCAG^CCCCTGTCCT   0.65    0.07
    1743        0.13  TTCTCTGCAG^GGTCAGTCCC   0.62    0.03
    3834        0.13  GGGCCTGCAG^TGCTCGTGTG   0.26    0.58
    4109        0.13  TGCCCAGCAG^CAGGAGTCAG   0.29    0.54
    6557        0.13  CATTCTGGAG^AATCTGCTCC   0.56    0.12
    1638        0.11  CCATTCTCAG^GGAATCTCTG   0.62    0.00
     247        0.10  GCCTTCGCAG^CATTCTTGGG   0.55    0.11
    6766        0.09  CTATCCACAG^GATAGATTGA   0.64   -0.06
     906        0.08  AATTTCACAG^CAAGAAAACT   0.61   -0.02
    6499        0.08  CAGTTTCCAG^TTTCCCTTGC   0.55    0.06
     378        0.07  GTACCCACAG^TACTACCTGG   0.24    0.52
    3130        0.07  CTGTCTCCAG^AAAATTCCCA   0.51    0.12
    4272        0.07  ACCATCCCAG^CGTTCTTTGC   0.58    0.00
    4522        0.07  TGAATCTCAG^GGTGGGCCCA   0.51    0.12
    5722        0.07  ACCCTCGCAG^CAGCAGCAAC   0.55    0.05
    2316        0.06  CTTCCCCAAG^GCCTCCTCAA   0.40    0.27
    2357        0.06  GCCTTCCTAG^CTACCCTCTC   0.39    0.28
    2908        0.06  TTTGGTCTAG^TACCCCGGGG   0.51    0.10
    4112        0.06  CCAGCAGCAG^GAGTCAGCCA   0.25    0.50
    1327        0.05  TTTGCTTTAG^AATAATGTCT   0.52    0.06
     844        0.04  GTTTGTGCAG^GGCTGGCACT   0.62   -0.11
    1045        0.04  TCCCTTGGAG^CAGCTGTGCT   0.54    0.01
    1238        0.03  CTGTCCTCAG^GTGCCCCTCC   0.50    0.06
    2976        0.03  CCTAGTGCAG^GTGGCCATAT   0.62   -0.12
    3825        0.03  CATCCCCGAG^GGCCTGCAGT   0.16    0.60
    1508        0.02  TGAGATGCAG^GAGGAGACGC   0.43    0.16
    2257        0.02  CTCTCCTCAG^CGTGTGGTCC   0.53    0.00
    5712        0.02  ATCCTCTCAG^ACCCTCGCAG   0.51    0.05
    2397        0.00  CCCTCCTTAG^GCAGTGGGGT   0.41    0.16
    4800        0.00  CATTTTCTAG^CTGTATGGCC   0.47    0.07
    5016        0.00  TGCCTAGCAG^GTTCCCACCA   0.59   -0.11

DONOR SITES:

POSITION  CONFIDENCE  EXON       INTRON      LOCAL  GLOBAL
    3979        0.75  CGTCAAGGAG^GTACGGGCCG   0.92    0.74
    2608        0.51  GCTGGTCCAG^GTAATGGCAC   0.85    0.54
    4335        0.38  GAACAAGCAG^GTGCCTACTG   0.83    0.41
     656        0.32  ACCCTGGGCG^GTATGAGCCG   0.56    0.66
    5859        0.11  ACCAAAAGAG^GTGTGTGTGT   0.85    0.07
    4585        0.09  GCTCACTCAG^GTGGGAGAAG   0.86    0.03
    1708        0.06  TGGCCAGAAG^GTGGGTGTGC   0.85    0.01
    6196        0.05  CCCAATGAGG^GTGAGATTGG   0.86   -0.01
     667        0.03  TATGAGCCGG^GTGTGGGTGG   0.23    0.71
------------------------------------------------------------------------
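
A returned report like the one above can be read back into a program with a small parser. The following Python sketch is an illustration under stated assumptions, not a tool supplied with NetGene: the names Site and parse_report are hypothetical, and the pattern assumes that every site row has the layout of the example (position, confidence, ten bases on each side of the '^' mark, then the local and global confidences) and that the section headings appear in upper case as shown.

```python
import re
from typing import List, NamedTuple, Optional


class Site(NamedTuple):
    level: Optional[str]   # "HIGHLY CONFIDENT SITES" or "NEARLY ALL TRUE SITES"
    kind: Optional[str]    # "ACCEPTOR" or "DONOR"
    position: int
    confidence: float
    left: str              # 10 bases before the '^' mark
    right: str             # 10 bases after the '^' mark
    local_conf: float
    global_conf: float


# One pattern that recognises section headings and site rows, so the
# parser works whether or not the report still has its line breaks.
TOKEN = re.compile(
    r"(?P<level>HIGHLY CONFIDENT SITES|NEARLY ALL TRUE SITES)"
    r"|(?P<kind>ACCEPTOR|DONOR) SITES:"
    r"|(?P<pos>\d+)\s+(?P<conf>-?\d+\.\d+)\s+"
    r"(?P<left>[ACGTN]{10})\^(?P<right>[ACGTN]{10})\s+"
    r"(?P<local>-?\d+\.\d+)\s+(?P<glob>-?\d+\.\d+)"
)


def parse_report(report: str) -> List[Site]:
    """Collect all site rows, tagged with the section they appear under."""
    level = kind = None
    sites = []
    for m in TOKEN.finditer(report):
        if m.group("level"):
            level = m.group("level")      # entering a new cutoff section
        elif m.group("kind"):
            kind = m.group("kind")        # entering an acceptor/donor table
        else:
            sites.append(Site(
                level, kind,
                int(m.group("pos")), float(m.group("conf")),
                m.group("left"), m.group("right"),
                float(m.group("local")), float(m.group("glob")),
            ))
    return sites
```

With the report text in a string, a call such as [s.position for s in parse_report(text) if s.kind == "DONOR" and s.level == "HIGHLY CONFIDENT SITES"] would list the highly confident donor positions.
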