###
### COSEG
###

Background
==========

  A program to identify repeat subfamilies using significant
 co-segregating ( 2-3 bp ) mutations.

   This program is derived from three C programs and several perl
 scripts written by Alkes Price as part of an analysis of Alu 
 elements in the human genome ( Whole-genome analysis of Alu repeat
 elements reveals complex evolutionary history, Alkes L. Price,
 Eleazar Eskin, and Pavel Pevzner, 2004 Genome Research ). The program
 was first adapted for use with other repeat families and then
 extended to support consideration of three co-segregating mutations
 using Alkes statistical model.  In 2008 with the help of Andy Siegel
 an alternative statistical model was developed and the codebase
 repackaged into the single source file.

 There are a few caveats:  

  - The input sequences must be full length alignments to a *single*
    reference sequence.

  - The longer the sequence the harder this problem is to solve.  Shorter
    families or short subregions are recommended for this process.

 Current Authors: Robert Hubley <rhubley@systemsbiology.org>
                  Andy Siegel
                  Arian Smit

 Original Code Authors: Alkes Price

 The original code can be found here:
            http://www.cse.ucsd.edu/~ppevzner/download/alucode.tar.gz


Installation
============

  1. Build COSEG c program
         make
  2. Edit preprocessAlignments.pl
      
       Change line 114:

              use lib "/usr/local/RepeatMasker";
  
       To point to your RepeatMasker installation.


 
ALU Example Run:

  1. Given the sample ALU dataset provided by Alkes Price in the
     original distribution of his code ( files have been renamed ):

            ALU.seqs: Human sequence data from alignment to AluSx consensus )
            ALU.ins : Insertion sequences from the alignments 
            ALU.cons: AluSx consensus 
                      Note: that positions 116-135 (inclusive) are 
                      lowercase while all remaining positions are uppercase.
                      This is how we encode positions within the consensus
                      that have lower quality and should not be considered
                      in this analysis.

  2. Run analysis:

         runcoseg.pl -k -d -m 50 -c ALU.cons -s ALU.seqs -i ALU.ins

         NOTE: In this run we must tell coseg to treat lower case
               characters in the consensus as a blacklist designation
               by using the "-d" flag.  Also we are using the 
               original pValue calculation as defined by Alkes et al.
               by specifying the "-k" flag to closely approximate the
               functioning of Alkes original program.
 
  3. Create png/svg files
 
       The visualization output of coseg is in the GraphViz format.  To 
       convert this data into a graphics different graphics format you
       can use the "dot" program in the GraphViz package. The dot program
       will creae hierarchical graph from the data.

       ie:
         dot ALU.seqs.tree.viz -Tpng -o ALU.seqs.tree.viz.png

         dot ALU.seqs.tree.viz -Tsvg -o ALU.seqs.tree.viz.svg

         NOTE: Most modern browsers will display the SVG graphics
               format without a plugin.

Output Files  
============

  The following files are named after the aligned sequence file
  name as a prefix.

  *.log -- A log of mutation sites found and the order of the
           clustering.
  *.subfamililies.seq -- name, count, P-value and consensus sequence of
         each subfamily found by our algorithm.  (For subfamilies not in the
         original scaffold, we also include in parentheses the P-value of
         the scaffold subfamily from which it is derived).
  *.assign -- for each of the elements, lists the
         subfamily to which the algorithm has assigned it.
  *.tree.viz -- evolutionary tree of the subfamilies, in GRAPHVIZ format.


Input Files 
===========

  *.cons -- Consensus File:
    The consensus used to generate the aligned seqeunce file.  This is
    in fasta format.

  *.seqs -- Aligned Sequences File:
    Near full length aligned sequences to the consensus file ( one per line ).
    Deletions are represented by 1 or more "-" characters.  Insertions of 
    any length are represented by a single "+" character.

  *.ins -- Inserts File:
    Inserts pulled from the alignments are stored in this file.  The 
    aligned sequence file ordering is used in this file.  Each insert
    is encoded as:

            [<position>:<sequence>][<position>:<sequence>]..

    ie.
             3:AA 10:AGTTTA

  A script is provided to build these files automatically given either
  wu-blast or cross_match alignment files -- see below.



Running using your own data
===========================

  1. Cross_match a reference sequence against a genome or database. 

       cross_match line1copies consensus -M 25p41g.matrix 
                   -gap_init -25 -gap_ext -5 -minscore 200 
                   -minmatch 6 -alignments -bandwidth 50 -word_raw > LINE1

       The example file LINE1 included in this distribution was created
       using the command line above and can be used directly in the following
       steps.

  2. Determine consensus range to use for analysis ( ie. 298 - 797 bp )

  3. Create input files to alkes programs:

       preprocessAlignments.pl -maxEdgeGap 10
                                -minConsRange 298
                                -maxConsRange 797
                                -alignments LINE1

       This will create 3 new files: LINE1.seqs
                                     LINE1.ins
                                     LINE1.cons
      
       NOTE: Use the -w flag to preprocessAlignments if you use WUBlast
             to perform the alignments.

  4. Run analysis:

       runcoseg.pl -t -m 50 -c LINE1.cons -s LINE1.seqs 
                   -i LINE1.ins

       NOTE: In this examplewe use the "-t" flag to indicate we want to use
             3 bp co-segregating mutations as well as 2bp co-segregating 
             mutations when developing subfamilies.  Also we use the
             new pValue calculation ( default ) by leaving out the "-k"
             flag.

  5. Create png/svg files
 
       dot LINE1.seqs.tree.viz -Tpng -o LINE1.seqs.tree.viz.png

       dot LINE1.seqs.tree.viz -Tsvg -o LINE1.seqs.tree.viz.svg

  6. Open up a web browser and point it at the file LINE1.seqs.tree.viz.svg.
     Most browsers support zooming in on svg files.  If you want to render .viz
     file larger by default simply edit the *.viz file and change the
     line:
             size="8,10";

     to something larger. I.e:
    
             size="16,20";



Version History
---------------

  0.2.1:  - Improved code documentation
          - Single mutation significance cutoff ( SIGMATHRESH ) was 
            pre-calculated for Alkes Alu analysis and hardcoded.  This
            version calculates the correct sigma cutoff using the length
            of the input sequence.
          - Fixed bug with implementation of Siegel's pValue
            calculation which caused a segfault -- found by Neal Platt.
          - Switched default pvalue method to Andy Siegel's method and
            provided a new "-k" switch to use Alkes Price's method.
          - Fixed bug where the program was exiting when calculations 
            fell below the precision of the machine ( epsilon ). Message
            given was "Below epsilon..." and the runcoseg.pl script
            moved on even though coseg failed.

