The Institute for Systems Biology DupMasker Download

Prerequisites
  1. Unix system with RepeatMasker 3.2.0 or higher installed
    • DupMasker is included in the RepeatMasker 3.2.0 (and higher) release. RepeatMasker installation instructions can be found here
  2. Sequence Search Engine
    DupMasker/RepeatMasker use a sequence search engine to perform their searches. Currently WUBlast is the only engine supported by both programs.
  3. Duplicon Database
    The duplicon database developed by Jiang Z. et al. is an essential component of this system. It is available for download as dupliconlib-20080314.tar.gz.

    Details on how this database was constructed may be found in:

    Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE.
    "Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution."
    Nat Genet. 2007 Nov;39(11):1361-8. Epub 2007 Oct 7.

Installation
  1. Install DupMasker Database
    Download the duplicon database and unpack it in the RepeatMasker program directory.
    • cp dupliconlib-20080314.tar.gz /usr/local/RepeatMasker
    • cd /usr/local/RepeatMasker
    • gunzip dupliconlib-20080314.tar.gz
    • tar xvf dupliconlib-20080314.tar
    • rm dupliconlib-20080314.tar
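
The unpacking steps above can also be scripted. The following is a minimal Python sketch, not part of the DupMasker distribution: the function name unpack_duplicon_library is hypothetical, and the paths in the usage comment are the ones used in the steps above. Unlike the shell steps, it extracts the archive straight into the target directory and does not copy or delete the .tar.gz file.

```python
import tarfile
from pathlib import Path


def unpack_duplicon_library(archive, repeatmasker_dir):
    """Extract a gzip-compressed tar archive into the RepeatMasker directory.

    Hypothetical helper for illustration; returns the list of member names
    that were in the archive.
    """
    dest = Path(repeatmasker_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"RepeatMasker directory not found: {dest}")
    # "r:gz" reads the gzip layer and the tar layer in one pass, which
    # replaces the separate gunzip and tar steps.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()


# Example usage (paths taken from the installation steps above):
#   unpack_duplicon_library("dupliconlib-20080314.tar.gz",
#                           "/usr/local/RepeatMasker")
```

Only unpack archives obtained from a trusted source, since extraction writes whatever paths the archive contains.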
Output Format
The *.duplicons file format mimics the RepeatMasker *.out file format (which in turn is based on the cross_match file format). The specific fields are described below:

Forward Strand Annotation:

 SW    perc perc perc qry   qry   qry  qry   subj           subj  subj subj 
 score div. del. ins. seq   begin end (left) seq            begin end (left) 
 ---------------------------------------------------------------------------
 2334  8.44 0.00 3.25 chr1  127   737 (8222) SD1132...       1    298 (14)     
Reverse Strand Annotation:
 SW    perc perc perc qry   qry   qry  qry     subj       subj  subj subj 
 score div. del. ins. seq   begin end (left) C seq       (left) end  begin  
 -------------------------------------------------------------------------
 2334  8.44 0.00 3.25 chr1  127   737 (8222) C SD1132...  (14)  298  1       
  • SW score = Smith-Waterman score of the match (complexity-adjusted).
  • perc div. = %substitutions in matching region.
  • perc del. = %deletions (in query seq rel to subject) in matching region.
  • perc ins. = %insertions (in query seq rel to subject) in matching region.
  • qry seq = id of query sequence.
  • qry begin = starting position of match in query sequence.
  • qry end = ending position of match in query sequence.
  • qry (left) = no. of bases in query sequence past the ending position of match (so 0 means that the match extended all the way to the end of the query sequence).
  • C = present only when the match is found on the reverse strand; forward-strand lines omit this column.
  • subj seq = id of the duplicon.
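  • subj (left) = no. of bases in the subject sequence outside the match: past the ending position of the match on the forward strand, or in the complement of the subject sequence prior to the beginning of the match on the reverse strand.
  • subj end = on the reverse strand, starting position of match in subject sequence (using top-strand numbering); on the forward strand, ending position of match in subject sequence.
  • subj begin = on the reverse strand, ending position of match in subject sequence; on the forward strand, starting position of match in subject sequence.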
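
To make the two column layouts concrete, here is a minimal Python sketch of a line parser for *.duplicons files. It is an illustration based only on the two layouts shown above, not code from DupMasker: the names DupliconHit and parse_duplicons_line are hypothetical, the "+" strand label is a convention of this sketch, and real files may contain variations it does not handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DupliconHit:
    sw_score: int
    perc_div: float
    perc_del: float
    perc_ins: float
    query_id: str
    query_begin: int
    query_end: int
    query_left: int
    strand: str        # "C" for reverse strand, "+" (sketch convention) otherwise
    subject_id: str
    subject_begin: int  # value of the "subj begin" column
    subject_end: int    # value of the "subj end" column
    subject_left: int


def _unparen(token: str) -> int:
    """Convert a '(left)' style token such as '(8222)' to an integer."""
    return int(token.strip("()"))


def parse_duplicons_line(line: str) -> Optional[DupliconHit]:
    """Parse one data row; return None for headers, rulers, and blank lines."""
    fields = line.split()
    # Data rows begin with the integer SW score and have at least 12 columns.
    if len(fields) < 12 or not fields[0].isdigit():
        return None
    score, div, dele, ins, qid, qbeg, qend, qleft = fields[:8]
    if fields[8] == "C":
        # Reverse layout: C, subj seq, (left), end, begin
        if len(fields) < 13:
            return None
        strand = "C"
        sid, sleft, send, sbeg = fields[9:13]
    else:
        # Forward layout: subj seq, begin, end, (left)
        strand = "+"
        sid, sbeg, send, sleft = fields[8:12]
    return DupliconHit(
        int(score), float(div), float(dele), float(ins),
        qid, int(qbeg), int(qend), _unparen(qleft),
        strand, sid, int(sbeg), int(send), _unparen(sleft),
    )
```

The parser tells the layouts apart by checking whether the ninth column is the literal "C", so it would misread a forward-strand hit against a duplicon whose id is exactly "C".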
Example Run
In this example we first downloaded the 173 kb sequence AC097264.4 from GenBank, then ran DupMasker on it:
  • [DupMaskerPath]/DupMasker AC097264.4
The output generated:
  • AC097264.4.dupout - An intermediate file of results obtained by searching the masked sequence against the duplicons library.
  • AC097264.4.duplicons - The final output of fully-extended duplicons.
The duplicons file is then run through the dupliconToSVG.pl script:
  • [DupMaskerPath]/util/dupliconToSVG.pl AC097264.4.duplicons
This produces an SVG graphical representation of the region:
  • AC097264.4.duplicons.1.svg. Note: Some browsers can view this file directly. In addition, Firefox can zoom in and out on SVG files.
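
The two commands of the example run can be chained from a script. The sketch below is a hedged illustration: build_commands and run_example are hypothetical names, and the directory passed in stands for [DupMaskerPath] in the commands above. It uses only the command forms shown in this example and passes no extra options.

```python
import subprocess
from pathlib import Path


def build_commands(dupmasker_path, sequence_file):
    """Return the two command lines from the example run, in order."""
    root = Path(dupmasker_path)
    seq = str(sequence_file)
    return [
        # Step 1: search the sequence against the duplicon library.
        [str(root / "DupMasker"), seq],
        # Step 2: render the resulting *.duplicons file as SVG.
        [str(root / "util" / "dupliconToSVG.pl"), seq + ".duplicons"],
    ]


def run_example(dupmasker_path, sequence_file):
    """Run both steps, stopping at the first command that fails."""
    for cmd in build_commands(dupmasker_path, sequence_file):
        subprocess.run(cmd, check=True)


# Example usage (the directory is an assumption; substitute your own):
#   run_example("/usr/local/RepeatMasker", "AC097264.4")
```

Keeping command construction separate from execution makes it easy to inspect the exact commands before running them.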
Release Notes
RM-open-3.2.3
  • A missing parameter to wublast in WUBlastSearchEngine.pm caused DupMasker to produce negative values in the *.duplicons file. This bug appeared only when the input file contained NCBI accession identifiers such as "gi|238332|gb|AC839293.1|".
  • Also in this release dupliconToSVG.pl was added to the RepeatMasker/util directory.
RM-open-3.2.2
  • DupMasker now supports the GFF output format. Use the -gff option to generate a *.duplicons.gff file in addition to the *.duplicons file.
  • A new utility has been written to convert the *.duplicons file (e.g. hg18-chr1.duplicons) into a Scalable Vector Graphics visualization of the duplication blocks (e.g. hg18-chr1.duplicons.1.svg). This utility was originally a separate download and is now included in the RepeatMasker package (see above). Run the program to view the documentation.
RM-open-3.2.0
  • First release of DupMasker.
DupMasker is licensed under the Open Source License v2.1.
Institute for Systems Biology
This server is made possible by funding from the National Human Genome Research Institute (NHGRI grant # R01 HG002939-01) 2003.