About CONSERV

Overview

CONSERV is a program that finds contiguous conserved sequences among nucleic acid or amino acid sequences. It is fast enough for genome-scale data. Although it detects only exact matches, it is useful for tasks such as finding invariant conserved sequences among complete genomes.

It uses the generalized suffix tree data structure and Ukkonen's linear-time suffix tree construction algorithm.

It is written in C. It runs on Linux, Solaris, and Tru64 UNIX, and it is expected to run on other UNIX systems as well.

Download

CONSERV-20051029.tar.gz

Install

Linux or UNIX with gcc

% make

After running make, copy the two executable files "conserv" and "conserv_sort" to any location you like.

Solaris (Sparc v9 architecture) with Sun Workshop compiler

On Solaris with the 64-bit SPARC (v9) architecture, you can use Makefile.Solaris64bit instead of Makefile.

% make -f Makefile.Solaris64bit

Tru64 UNIX on DEC (Compaq, HP) Alpha (with cc)

% cc -O3 -o conserv conserv.c mfstlib3.c my_malloc.c ifilter.c case_ifilter.c only_ifilter.c xmask_ifilter.c
% cc -O3 -o conserv_sort conserv_sort.c mfstlib3.c my_malloc.c

Other environment

In a 64-bit environment, edit the Makefile to specify a 64-bit compiler and 64-bit compile options.

Applications

There are two programs: "conserv" and "conserv_sort".

conserv: Finds conserved sequences among biological sequences.
conserv_sort: Sorts the results of the conserv program.

The main application is "conserv". "conserv_sort" is a utility for sorting the results that conserv produces.

Example of usage

% conserv -L 100 -v 5 -n 10 -o out -i file1.fst -b file2.fst

(-L 100) reports conserved sequences of 100 bases or longer.
(-v 5) reports conserved sequences that occur in at least 5 sequences.
(-n 10) reports conserved sequences that occur at least 10 times in total.
(-o out) specifies the prefix of the output files.
(-i file1.fst) specifies an input file.
(-b file2.fst) specifies an input file. The option "-b" means that the program analyzes both the straight and the complementary strands of all sequences read from the file.
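
To make the meaning of the "-v" and "-n" thresholds concrete, here is a small Python sketch. It is NOT how CONSERV works internally: CONSERV uses a generalized suffix tree and reports sequences of length L or longer, while this sketch simply counts substrings of one fixed length k. The function name conserved_kmers and its parameters are made up for this illustration.

```python
from collections import defaultdict


def conserved_kmers(seqs, k, min_seqs, min_total):
    """Return {kmer: (n, v)} for exact k-mers that pass both thresholds.

    n = total number of occurrences (compare with "-n")
    v = number of distinct sequences containing the k-mer (compare with "-v")
    Naive illustration only; memory use grows with the input size.
    """
    hits = defaultdict(list)  # kmer -> [(sequence number, 1-based position)]
    for seq_no, seq in enumerate(seqs, start=1):
        for i in range(len(seq) - k + 1):
            hits[seq[i:i + k]].append((seq_no, i + 1))

    result = {}
    for kmer, where in hits.items():
        n = len(where)
        v = len({seq_no for seq_no, _ in where})
        if v >= min_seqs and n >= min_total:
            result[kmer] = (n, v)
    return result


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGT", "GGGG"]
    print(conserved_kmers(toy, k=4, min_seqs=2, min_total=3))
```

In the toy input, "ACGT" occurs three times in two sequences, so it passes min_seqs=2 and min_total=3, while "TACG" (two occurrences in two sequences) passes only the first threshold and is dropped.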

% conserv -x amino -L 10 -v 3 -n 10 -o out -g+ -i Ecoli.fst -g+ -i Vcho1.fst -i Vcho2.fst -g+ -i Bsub.fst

(-x amino) specifies amino acid mode.
(-L 10) reports conserved sequences of 10 residues or longer.
(-v 3) reports conserved sequences that occur in at least 3 sequence groups.
(-n 10) reports conserved sequences that occur at least 10 times in total.
(-o out) specifies the prefix of the output files.
(-g+ -i Ecoli.fst ...) The option "-g+" means "group mode". The program creates a new sequence group, and all sequences read after the option belong to that group. To start another group, specify "-g+" again. In the example,
- all sequences from Ecoli.fst belong to group No.1
- all sequences from Vcho1.fst and Vcho2.fst belong to group No.2
- all sequences from Bsub.fst belong to group No.3
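
The grouping rule described above can be sketched in Python as follows. This is only an illustration of the documented behavior, not the option parser used by conserv. The function name assign_groups is hypothetical, and the sketch assumes that "-g+" is written as a single argument (as in the example) and appears before the first input file.

```python
def assign_groups(args):
    """Map each input file to a group number, following the "-g+" rule.

    Every "-g+" starts a new group; files given with -i, -c, or -b
    join the group that is current at that point.
    """
    groups = {}
    current = 0  # no group has been started yet
    i = 0
    while i < len(args):
        if args[i] == "-g+":
            current += 1
        elif args[i] in ("-i", "-c", "-b") and i + 1 < len(args):
            groups[args[i + 1]] = current
            i += 1  # skip the file name that was just consumed
        i += 1
    return groups


if __name__ == "__main__":
    cmd = ("-x amino -L 10 -v 3 -n 10 -o out -g+ -i Ecoli.fst "
           "-g+ -i Vcho1.fst -i Vcho2.fst -g+ -i Bsub.fst").split()
    for name, group in assign_groups(cmd).items():
        print(f"group No.{group}: {name}")
```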

(to be written...)

Output file format

A single analysis produces three output files. For example, if you specify "-o out", you get the following three files with different suffixes.

out.seqinfo: Information about the input sequences.
out.position: Positions at which the detected conserved sequences occur.
out.fst: Conserved sequences (FASTA format)

Format of out.seqinfo

Each input sequence has a serial number.

(to be written...)

Format of out.position

An example of a single data block (not real data, shown only for explanation) is:

n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC
1 663374
2 984389
2 c68931 ( 2c 973574 )
3 c361877 ( 3c 680628 )
4 424938
5 c1096801 ( 5c 3542406 )
5 c925120 ( 5c 3714087 )

The first line "n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC" is the summary of the data block.

"n=7" means that the conserved sequence occurs 7 times in the input sequences.
"L=16" means that the conserved sequence is 16 bp in length.
"v=4" means that 4 sequences or sequence groups contain the conserved sequence.

Note that the sequence "CGGGGGTTCGAATCCC" is not shown if the detected conserved sequence is longer than the limit set with the "-S" option (default 50). Use out.fst to get the sequences.

The line "1 663374" means that the conserved sequence occurs at position 663374 of input sequence No. 1. Positions are 1-based: the first position of an input sequence is 1.

The line "2 c68931 ( 2c 973574 )" means that the conserved sequence occurs on the complementary strand of input sequence No. 2 at position 68931. Note that 68931 is a position number on the straight strand. In GenBank feature format, this location would be written as "complement(68931..68946)".
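
As a reading aid, the following Python sketch parses one data block like the example above and converts a hit into a GenBank-style location. It is a rough sketch based only on the example shown here, not a complete specification of the file format. The names parse_block and genbank_location are hypothetical, and the sketch assumes that the sequence is simply absent from the summary line when it is not shown. The value in parentheses is skipped because its meaning is not described in this document.

```python
import re

# Summary line, e.g. "n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC"
HEADER = re.compile(r"n=(\d+) L=(\d+) v=(\d+) -+ (\S+)(?: (\S+))?")
# Hit line, e.g. "1 663374" or "2 c68931 ( 2c 973574 )"
HIT = re.compile(r"(\d+) (c?)(\d+)")

EXAMPLE = """\
n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC
1 663374
2 984389
2 c68931 ( 2c 973574 )
3 c361877 ( 3c 680628 )
4 424938
5 c1096801 ( 5c 3542406 )
5 c925120 ( 5c 3714087 )
"""


def parse_block(lines):
    """Parse one data block: a summary line followed by hit lines."""
    m = HEADER.match(lines[0])
    n, length, v = (int(x) for x in m.group(1, 2, 3))
    block = {"n": n, "L": length, "v": v,
             "name": m.group(4), "seq": m.group(5), "hits": []}
    for line in lines[1:]:
        h = HIT.match(line.strip())
        # (sequence number, on complementary strand?, straight-strand position)
        block["hits"].append((int(h.group(1)), h.group(2) == "c", int(h.group(3))))
    return block


def genbank_location(pos, length, complement):
    """Format a 1-based start position and length as a GenBank location."""
    loc = f"{pos}..{pos + length - 1}"
    return f"complement({loc})" if complement else loc


if __name__ == "__main__":
    block = parse_block(EXAMPLE.splitlines())
    for seq_no, comp, pos in block["hits"]:
        print(seq_no, genbank_location(pos, block["L"], comp))
```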

(to be written...)

About conserv_sort

conserv_sort sorts output files.

Example of usage

% conserv_sort out.position

Specify an output file of the conserv program whose suffix is ".position".

Command-line Options

conserv

 Usage: conserv [option...] input files...

 -L nnn              minimum length (default 100)
 -v nnn              minimum number of sequences or groups (default 2)
 -n nnn              minimum appearance count (default 1)
 -i file.fst         read sequence(s) from file.fst
 -c file.fst         read sequence(s) from file.fst, using complemental strand
 -b file.fst         read sequence(s) from file.fst, using both strands
 -t FILTER[=option]  add a input filter, or change filter settings
 -o file             specify output file prefix
 -C | -B             using {complemental | both} strands in all input sequence
                     (except files specifyed with -i or -c or -b options)
 -g +                group mode enabled, and change group
 -g -                group mode disabled, (1 sequence 1 group)
 -g f                group mode enabled, 1 file 1 group

 -x mix              mixed output format (default: separate)
 -x {fopen|popen}    input files are {normal files(default)|popen}
 -a | -F             append | overwrite output file
                     (default: not append nor overwrite)
 -S nnn              maximum length to show sequence in output file
                     (0 don't show, -1 show all) (default 50)

 Input files:  *.fst
 Output files: file.seqinfo file.position file.fst (normal),
            or file (-x mix)

 default loaded input filter: case=upper

 Available input filters:
     case: convert case (uppercase or lowercase)
         case={upper|lower|-}
             upper       uppercase
             lower       lowercase
             -           not convert (disable this filter)
     xmask: mask specified letters
         xmask[=[length][,letters]]
             length      minimum length of repeat
             letters     letters to be masked
         default options: 10,XN (xmask=10,XN)
     only: restricts to specifyed characters
         only[=characters]]
             characters   mask characters
         default options: ACGTUacgtu (only=ACGTUacgtu)

conserv_sort

 Options:
  -h | --help      print help message
  -verbose         verbose message to stderr
  -more-verbose    more verbose
  -force-overwrite overwrite output files
  -out-path=DIR    output files to DIR (default: same as input files)
  -out-basename=X  basename of output filenames
  -out-ext=X       string added to output filenames (default=.sorted)
  -input-ext=X     input file extension (default=.position)
  -mix | -mixed    mixed mode input
  -DNA             DNA mode: delete same position data
  -sort={+|-|0}    sort order of whole data.   +:ascend -:descend 0: don't sort
  -sortkey=XY...   sort keys(X) and orders(Y): X={L|v|n|s|p|c} Y={+|-|0}
  -isort={+|-|0}   sort order of single data.  +:ascend -:descend 0: don't sort
  -isortkey=XY...  sort keys(X) and orders(Y): X={s|p|c} Y={+|-|0}
  --               end of options (using if capital letter of filename is '-')

  -out-stdout      output to stdout
  -out-mixed={yes|no|auto} output mixed mode (default=auto)
  -ext-fst=.FST    fst file extension (default=.fst)
  -ext-seqinfo=.SEQINFO seqinfo file extension (default=.seqinfo)
  -out-ext-position=.POSITON output position file extension
                             (default: same as input file)
  -out-ext-fst=.FST output fst file extension (default: same as -ext-fst=)
  -out-ext-seqinfo=.SEQINFO output seqinfo file extension
                            (default: same as -ext-seqinfo=)
  -out-ext-mixed=.MIXED output mixed file extension
                        (default=mixed or same as input file)

  sort keys:
    L     L(length)
    v     v(number of hit sequences or sequence groups)
    n     n(number of hits)
    s     seq. No.
    p     position
    c     strand(complement or not)

 Default sort options:
  -sort=+ -sortkey=L-v-n-s+p+c+ -isort=+ -isortkey=s+p+c+

 Input files:   file.position and file.fst
                or file and file.fst
                or stdin (only for mixed mode)
 Output files:  file.sorted.position and file.sorted.fst
                or file.sorted and file.sorted.fst
                or stdout (only for mixed mode)
                ("file" can be changed with option -out-basename=)
                (".sorted" can be changed with option -out-ext=)
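
The sort key notation (for example "L-v-n-s+p+c+") can be read as a list of key/order pairs. The Python sketch below shows one plausible reading, in which earlier keys take priority over later ones. It is not the conserv_sort implementation, and how conserv_sort combines -sort with -sortkey is not described here. The names parse_sortkey and sort_blocks are hypothetical, and each data block is assumed to be a dictionary that has a value for every key used in the specification.

```python
def parse_sortkey(spec):
    """Split e.g. "L-v-n-s+p+c+" into [("L", "-"), ("v", "-"), ...]."""
    if len(spec) % 2 != 0:
        raise ValueError("sort key spec must be key/order pairs: " + spec)
    pairs = [(spec[i], spec[i + 1]) for i in range(0, len(spec), 2)]
    for key, order in pairs:
        if key not in "Lvnspc" or order not in "+-0":
            raise ValueError("bad key/order pair: " + key + order)
    return pairs


def sort_blocks(blocks, spec):
    """Sort dictionaries by several keys; earlier keys have priority.

    Python's sort is stable, so sorting by the last key first and the
    first key last gives a multi-key ordering.
    """
    ordered = list(blocks)
    for key, order in reversed(parse_sortkey(spec)):
        if order == "0":
            continue  # "0" means: do not sort by this key
        ordered.sort(key=lambda b: b[key], reverse=(order == "-"))
    return ordered


if __name__ == "__main__":
    blocks = [
        {"L": 16, "v": 4, "n": 7},
        {"L": 20, "v": 2, "n": 2},
        {"L": 16, "v": 5, "n": 9},
    ]
    for b in sort_blocks(blocks, "L-v-n-"):
        print(b)
```

With "L-v-n-", the longest conserved sequences come first, and ties in length are broken by the larger v and then the larger n.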

(More details will be written...)

License

GPL

References

Naohisa Goto, Ken Kurokawa, and Teruo Yasunaga (2007) Analysis of invariant sequences in 266 complete genomes. Gene, 401, 172-180.

Naohisa Goto: ngoto (at) gen-info.osaka-u.ac.jp

Last modified: Tue Jun 24 21:10:44 JST 2008