CONSERV is a program that finds contiguous conserved sequences among nucleic acid or amino acid sequences. It is fast enough for genome-scale data. Although it detects only exact matches, it is very useful for finding invariant conserved sequences among complete genomes and similar data sets.
It uses the generalized suffix tree data structure and Ukkonen's linear-time suffix tree construction algorithm.
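To make the goal concrete, the following Python sketch reports every substring of a fixed length that occurs in at least a given number of input sequences. It is a brute-force illustration only: it indexes fixed-length windows in a dictionary and does not build the generalized suffix tree that CONSERV uses. The function name and parameters (shared_exact_substrings, min_len, min_seqs) are hypothetical and are not part of CONSERV.

```python
from collections import defaultdict


def shared_exact_substrings(seqs, min_len, min_seqs):
    """Return {substring: [(seq_no, position), ...]} for every window of
    exactly min_len letters that occurs in at least min_seqs sequences.

    Sequence numbers and positions are 1-based.
    """
    hits = defaultdict(list)
    for seq_no, seq in enumerate(seqs, start=1):
        for start in range(len(seq) - min_len + 1):
            window = seq[start:start + min_len]
            hits[window].append((seq_no, start + 1))
    # Keep only windows seen in enough distinct sequences.
    return {
        window: places
        for window, places in hits.items()
        if len({seq_no for seq_no, _ in places}) >= min_seqs
    }


if __name__ == "__main__":
    seqs = ["TTACGTACGG", "CCACGTAGGA", "ACGTATTTTT"]
    for window, places in sorted(shared_exact_substrings(seqs, 5, 3).items()):
        print(window, places)
```

This approach stores every window of every sequence, so its memory use grows quickly with the window length and the input size. That cost is one reason a suffix tree is a better fit for genome-scale input.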
It is written in the C language. It runs on Linux, Solaris, and Tru64 UNIX, and should run on any UNIX system.
% make
After running make, copy the two executable files "conserv" and "conserv_sort" to any directory you like.
On Solaris on the 64-bit SPARC architecture, you can use Makefile.Solaris64bit instead of Makefile.
% make -f Makefile.Solaris64bit
% cc -O3 -o conserv conserv.c mfstlib3.c my_malloc.c ifilter.c case_ifilter.c only_ifilter.c xmask_ifilter.c
% cc -O3 -o conserv_sort conserv_sort.c mfstlib3.c my_malloc.c
In a 64-bit environment, edit the Makefile to specify a 64-bit compiler and 64-bit compile options.
There are two programs: "conserv" and "conserv_sort".
The main application is named "conserv". "conserv_sort" is a utility that sorts the results of the conserv program.
% conserv -L 100 -v 5 -n 10 -o out -i file1.fst -b file2.fst
% conserv -x amino -L 10 -v 3 -n 10 -o out -g+ -i Ecoli.fst -g+ -i Vcho1.fst -i Vcho2.fst -g+ -i Bsub.fst
There are three output files for a single analysis. For example, if you specify "-o out", you get the following three files with different suffixes: out.seqinfo, out.position, and out.fst.
Each input sequence has a serial number.
(to be written...)
An example of a single data block (note that it is not real data but only for explanation) is:
n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC
1 663374
2 984389
2 c68931 ( 2c 973574 )
3 c361877 ( 3c 680628 )
4 424938
5 c1096801 ( 5c 3542406 )
5 c925120 ( 5c 3714087 )
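The first line "n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC" summarizes the data block: n is the number of hits, L is the length of the conserved sequence, and v is the number of hit sequences or sequence groups.
Note that the sequence "CGGGGGTTCGAATCCC" is not shown if the detected conserved sequence is longer than the limit set with the -S option (default 50). Use out.fst to get the sequences.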
The line "1 663374" means that the conserved sequence exists at the position 663374 of the input sequence No. 1. Note that the first position of the input sequence is 1.
The line "2 c68931 ( 2c 973574 )" means that the conserved sequence exists on the complementary strand of the input sequence No. 2 at the position 68931. Note that 68931 is a position number on the forward strand. In GenBank feature format, it would be written as "complement(68931..68946)".
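The line formats described above can be read with a few lines of Python. The sketch below is an illustration based only on the example block in this document, not a parser shipped with CONSERV. The names parse_block and genbank_location, the dictionary keys, and the regular expression are assumptions. The GenBank-style end coordinate is computed as position + L - 1, which matches the "complement(68931..68946)" example for L=16.

```python
import re

# One position line: "<seq no> [c]<position>", optionally followed by
# "( <seq no>c <position on the complementary strand> )".
POSITION_RE = re.compile(r"^(\d+)\s+(c?)(\d+)(?:\s+\(\s*\d+c\s+(\d+)\s*\))?$")

# The explanatory (not real) data block from this document.
EXAMPLE_BLOCK = """\
n=7 L=16 v=4 ---- No_23 CGGGGGTTCGAATCCC
1 663374
2 984389
2 c68931 ( 2c 973574 )
3 c361877 ( 3c 680628 )
4 424938
5 c1096801 ( 5c 3542406 )
5 c925120 ( 5c 3714087 )
"""


def parse_block(text):
    """Parse one data block into a summary dict and a list of hit dicts."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header = lines[0].split()
    # The first three header fields are "n=...", "L=...", "v=...".
    summary = {key: int(value) for key, value in
               (field.split("=") for field in header[:3])}
    hits = []
    for line in lines[1:]:
        match = POSITION_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized position line: {line!r}")
        seq_no, comp, pos, comp_pos = match.groups()
        hits.append({
            "seq_no": int(seq_no),
            "complement": comp == "c",
            "position": int(pos),
            "complement_position": int(comp_pos) if comp_pos else None,
        })
    return summary, hits


def genbank_location(hit, length):
    """Format a hit as a GenBank-style location in forward-strand coordinates."""
    start = hit["position"]
    span = f"{start}..{start + length - 1}"
    return f"complement({span})" if hit["complement"] else span


if __name__ == "__main__":
    summary, hits = parse_block(EXAMPLE_BLOCK)
    for hit in hits:
        print(hit["seq_no"], genbank_location(hit, summary["L"]))
```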
(to be written...)
% conserv_sort out.position
Specify an output file of conserv whose suffix is ".position".
Usage: conserv [option...] input files...
  -L nnn              minimum length (default 100)
  -v nnn              minimum number of sequences or groups (default 2)
  -n nnn              minimum appearance count (default 1)
  -i file.fst         read sequence(s) from file.fst
  -c file.fst         read sequence(s) from file.fst, using complemental strand
  -b file.fst         read sequence(s) from file.fst, using both strands
  -t FILTER[=option]  add a input filter, or change filter settings
  -o file             specify output file prefix
  -C | -B             using {complemental | both} strands in all input sequence
                      (except files specifyed with -i or -c or -b options)
  -g +                group mode enabled, and change group
  -g -                group mode disabled, (1 sequence 1 group)
  -g f                group mode enabled, 1 file 1 group
  -x mix              mixed output format (default: separate)
  -x {fopen|popen}    input files are {normal files(default)|popen}
  -a | -F             append | overwrite output file
                      (default: not append nor overwrite)
  -S nnn              maximum length to show sequence in output file
                      (0 don't show, -1 show all) (default 50)
Input files:  *.fst
Output files: file.seqinfo file.position file.fst (normal), or file (-x mix)
default loaded input filter: case=upper
Available input filters:
  case: convert case (uppercase or lowercase)
    case={upper|lower|-}
      upper       uppercase
      lower       lowercase
      -           not convert (disable this filter)
  xmask: mask specified letters
    xmask[=[length][,letters]]
      length      minimum length of repeat
      letters     letters to be masked
    default options: 10,XN (xmask=10,XN)
  only: restricts to specifyed characters
    only[=characters]]
      characters  mask characters
    default options: ACGTUacgtu (only=ACGTUacgtu)
Options:
  -h | --help                 print help message
  -verbose                    verbose message to stderr
  -more-verbose               more verbose
  -force-overwrite            overwrite output files
  -out-path=DIR               output files to DIR (default: same as input files)
  -out-basename=X             basename of output filenames
  -out-ext=X                  string added to output filenames (default=.sorted)
  -input-ext=X                input file extension (default=.position)
  -mix | -mixed               mixed mode input
  -DNA                        DNA mode: delete same position data
  -sort={+|-|0}               sort order of whole data. +:ascend -:descend 0: don't sort
  -sortkey=XY...              sort keys(X) and orders(Y): X={L|v|n|s|p|c} Y={+|-|0}
  -isort={+|-|0}              sort order of single data. +:ascend -:descend 0: don't sort
  -isortkey=XY...             sort keys(X) and orders(Y): X={s|p|c} Y={+|-|0}
  --                          end of options (using if capital letter of filename is '-')
  -out-stdout                 output to stdout
  -out-mixed={yes|no|auto}    output mixed mode (default=auto)
  -ext-fst=.FST               fst file extension (default=.fst)
  -ext-seqinfo=.SEQINFO       seqinfo file extension (default=.seqinfo)
  -out-ext-position=.POSITON  output position file extension (default: same as input file)
  -out-ext-fst=.FST           output fst file extension (default: same as -ext-fst=)
  -out-ext-seqinfo=.SEQINFO   output seqinfo file extension (default: same as -ext-seqinfo=)
  -out-ext-mixed=.MIXED       output mixed file extension (default=mixed or same as input file)
sort keys:
  L  L(length)
  v  v(number of hit sequences or sequence groups)
  n  n(number of hits)
  s  seq. No.
  p  position
  c  strand(complement or not)
Default sort options:
  -sort=+ -sortkey=L-v-n-s+p+c+ -isort=+ -isortkey=s+p+c+
Input files:  file.position and file.fst
              or file and file.fst
              or stdin (only for mixed mode)
Output files: file.sorted.position and file.sorted.fst
              or file.sorted and file.sorted.fst
              or stdout (only for mixed mode)
  ("file" can be changed with option -out-basename=)
  (".sorted" can be changed with option -out-ext=)
(More details will be written...)
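The sort key notation (a key letter followed by an order character, repeated) can be illustrated with a short Python sketch. It is not the conserv_sort implementation. It assumes that keys listed earlier take priority over later ones and that order "0" means the key is ignored. It also does not model the separate -sort and -isort options. The function names parse_sortkey and sort_records and the record layout (dictionaries keyed by the sort key letters) are hypothetical.

```python
def parse_sortkey(spec):
    """Split a key string such as "L-v-n-s+p+c+" into (key, order) pairs."""
    if len(spec) % 2 != 0:
        raise ValueError(f"expected key/order pairs, got {spec!r}")
    pairs = []
    for i in range(0, len(spec), 2):
        key, order = spec[i], spec[i + 1]
        if key not in "Lvnspc" or order not in "+-0":
            raise ValueError(f"bad key/order pair: {key}{order}")
        pairs.append((key, order))
    return pairs


def sort_records(records, spec):
    """Sort dicts keyed by L, v, n, ... according to a sort key string.

    Keys with order "0" are skipped. Sorting runs from the last key to the
    first, which gives earlier keys priority because Python's sort is stable.
    """
    result = list(records)
    for key, order in reversed(parse_sortkey(spec)):
        if order == "0":
            continue
        result.sort(key=lambda rec: rec[key], reverse=(order == "-"))
    return result


if __name__ == "__main__":
    blocks = [
        {"L": 16, "v": 4, "n": 7},
        {"L": 20, "v": 2, "n": 2},
        {"L": 16, "v": 5, "n": 9},
    ]
    # Longest first, then most sequences, then most hits.
    for block in sort_records(blocks, "L-v-n-"):
        print(block)
```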
GPL
Naohisa Goto, Ken Kurokawa, and Teruo Yasunaga (2007) Analysis of invariant sequences in 266 complete genomes. Gene, 401, 172-180.