- Unix system with RepeatMasker 3.2.0 or higher installed
A Perl script included in the COSEG package depends on RepeatMasker release 3.2.0 or higher. RepeatMasker installation instructions can be found
- Sequence Search Engine
COSEG uses alignment data generated by a sequence search engine.
Cross_match and WUBlast are supported by COSEG.
- GraphViz Program [optional]
COSEG outputs tree data in the GraphViz format. Installing the GraphViz
program will allow you to convert this format into many different
graphics file formats for viewing.
- Download COSEG
Latest Released Version: 8/3/2010:
Previous Version: 8/8/2008:
- Unpack Distribution
Unpack the distribution in your home directory or in a temporary location (e.g. /tmp).
- cd /mytmp/location/
- gunzip coseg-#.#.#.tar.gz
- tar xvf coseg-#.#.#.tar
To compile the C program run:
See the tutorial below for running COSEG on the provided example.
- Edit the Makefile and change INSTALLDIR to the location where you
want the files installed.
- Run: "make install"
ALU Example Run
- Given the sample ALU dataset provided by Alkes Price in the
original distribution of his code ( files have been renamed ):
- ALU.seqs: Human sequence data from alignment to the AluSx consensus
- ALU.ins : Insertion sequences from the alignments
- ALU.cons: AluSx consensus
Note: positions 116-135 (inclusive) are
lowercase, while all remaining positions are uppercase.
Lowercase marks positions within the consensus
that have lower quality and should not be considered
in this analysis.
- Run analysis:
runcoseg.pl -d -m 50 -c ALU.cons -s ALU.seqs -i ALU.ins
NOTE: In this run we must tell coseg to treat lowercase
characters in the consensus as excluded (blacklisted) positions
by using the "-d" flag.
- Create png/svg files:
The visualization output of coseg is in the GraphViz format. To
convert this data into a different graphics format you
can use the "dot" program in the GraphViz package. The dot program
will create a hierarchical graph from the data.
dot ALU.seqs.tree.viz -Tpng -o ALU.seqs.tree.viz.png
dot ALU.seqs.tree.viz -Tsvg -o ALU.seqs.tree.viz.svg
NOTE: Most modern browsers will display the SVG graphics
format without a plugin.
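Before running with "-d", it can help to confirm which consensus positions are actually lowercase. The following is a minimal Python sketch, not part of COSEG: the function name lowercase_ranges and the toy sequence are illustrative assumptions, and the consensus is passed in as a plain string rather than read from ALU.cons, since the file layout is not described here.

```python
import re
from typing import List, Tuple


def lowercase_ranges(consensus: str) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) ranges of lowercase runs.

    Hypothetical helper: it only mirrors the convention described above,
    where lowercase consensus positions are excluded from the analysis.
    """
    # m.start() is 0-based, so add 1; m.end() is exclusive in 0-based terms,
    # which makes it the inclusive end position in 1-based terms.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", consensus)]


# Toy consensus (made up for illustration): positions 9-12 are lowercase.
toy_consensus = "ACGTTGCA" + "acgt" + "GGCC"
print(lowercase_ranges(toy_consensus))  # prints [(9, 12)]
```

Applied to the AluSx consensus described above, a single range of (116, 135) would be expected if the file follows the stated convention.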
The following output files use the aligned sequence file
name as a prefix.
- *.log -- A log of mutation sites found and the order of the
- *.subfamilies.seq -- name, count, P-value and consensus sequence of
each subfamily found by our algorithm. (For subfamilies not in the
original scaffold, we also include in parentheses the P-value of
the scaffold subfamily from which it is derived).
- *.assign -- for each of the elements, lists the
subfamily to which the algorithm has assigned it.
- *.tree.viz -- evolutionary tree of the subfamilies, in GraphViz format.
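Since every output file shares the sequence file name as its prefix, a quick check after a run can list which expected files are present. This is a rough Python sketch under assumptions: the helper names are hypothetical, and the spelling of the ".subfamilies.seq" suffix is assumed, so check it against the files your run actually produces.

```python
from pathlib import Path

# Suffixes taken from the output list above; ".subfamilies.seq" is an
# assumed spelling and should be verified against a real run.
OUTPUT_SUFFIXES = (".log", ".subfamilies.seq", ".assign", ".tree.viz")


def expected_outputs(seq_file):
    """Build the expected output file names for a given sequence file."""
    return [str(seq_file) + suffix for suffix in OUTPUT_SUFFIXES]


def missing_outputs(seq_file):
    """Return the expected output files that do not exist on disk."""
    return [name for name in expected_outputs(seq_file) if not Path(name).exists()]


for name in expected_outputs("ALU.seqs"):
    print(name)
```

After the ALU example run, missing_outputs("ALU.seqs") returning an empty list would suggest that all four files were written.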
Running Using Your Own Data
- Cross_match a reference sequence against a genome or database.
cross_match line1copies consensus -M 25p41g.matrix \
-gap_init -25 -gap_ext -5 -minscore 200 \
-minmatch 6 -alignments -bandwidth 50 -word_raw > LINE1
The example file LINE1 included in this distribution was created
using the command line above and can be used directly in the following
- Determine the consensus range to use for analysis (e.g. 298-797 bp)
- Create input files for Alkes Price's programs:
preprocessAlignments.pl -maxEdgeGap 10
This will create 3 new files: LINE1.seqs
NOTE: Use the -w flag to preprocessAlignments if you use WUBlast
to perform the alignments.
- Run analysis:
runcoseg.pl -t -m 50 -c LINE1.cons -s LINE1.seqs
NOTE: Two flags are new relative to the ALU example. The first, "-a",
selects the newer P-value calculation developed by Andy Siegel; it does
not appear in the command above, and the change list at the end of this
document states that Siegel's method is now the default. The second
flag, "-t", indicates we want to use 3 bp co-segregating mutations as
well as 2 bp co-segregating mutations when developing subfamilies.
- Create png/svg files
dot LINE1.seqs.tree.viz -Tpng -o LINE1.seqs.tree.viz.png
dot LINE1.seqs.tree.viz -Tsvg -o LINE1.seqs.tree.viz.svg
- Open up a web browser and point it at the file LINE1.seqs.tree.viz.svg.
Most browsers support zooming in on SVG files. If you want to render the .viz
file larger by default, simply edit the *.viz file and change the
to something larger. I.e:
- Improved code documentation
- In earlier versions, the single mutation significance cutoff ( SIGMATHRESH ) was pre-calculated for Alkes Price's Alu analysis and hardcoded. This version calculates the correct sigma cutoff using the length of the input sequence.
- Fixed a bug in the implementation of Siegel's P-value calculation which caused a segfault -- found by Neal Platt.
- Switched the default P-value method to Andy Siegel's method and provided a new "-k" switch to use Alkes Price's method.
- Fixed bug where the program was exiting when calculations fell below the precision of the machine ( epsilon ). Message given was "Below epsilon..." and the runcoseg.pl script moved on even though coseg failed.