Dfam Consensus Tool

Overview

The Dfam_consensus database is an open collection of repetitive DNA consensus sequence models and their corresponding seed alignments. It is directly compatible with the RepeatMasker program and with any consensus-based search tool. It is freely available and distributed under the Creative Commons Zero ( "CC0" ) license. The dfamConsensusTool.pl script is a command-line utility that helps with the submission of new families to the Dfam_consensus database and is distributed as part of the RepeatModeler package. This utility provides the following basic features:

Account Registration - Dfam_consensus submitters must have an account in order to submit families to the editors. The registration process is quick and can be completed on the website ( http://www.repeatmasker.org/Dfam_consensus/#/login ) or through the dfamConsensusTool.pl tool itself.
Data Validation - The basic data format for Dfam/Dfam_consensus is a variant of the Stockholm format. This tool can be used to validate a particular Stockholm file prior to submission.
Data Submission - The primary role of this tool is to provide a quick and reliable method for uploading a single curated family or a complete library of curated families to the database.

The tool can be found in the RepeatModeler "util/" directory ( RepeatModeler open-1.0.9 and above ).

What is meant by "curated" families?

Dfam_consensus is a library of curated Transposable Element families. The definition of "curated" in this context is somewhat of a moving target. Currently it involves several tasks not performed or not performed well by off-the-shelf de-novo repeat identification programs. These include:

Generation of Full-Length Families - In many cases de-novo repeat discovery programs are not able to automatically produce full-length models for families. A curator must determine the true boundaries and extend these elements by hand. This also involves removing the duplicate fragments of the newly extended family from the de-novo generated library.
Dfam_consensus/RepBase Redundancy - New families should be checked against public databases ( the current Dfam_consensus, RepBase, etc. ) to ensure that duplicates are not submitted. Often the library for a new species will contain ancestral families that have been previously characterized and should not be duplicated.
LTR/Internal Segmentation - Long Terminal Repeat (LTR) sequences are frequently collapsed by de-novo repeat discovery programs into a single LTR with a fragment of the internal sequence to one side. It is largely a manual task to determine the boundaries of the LTR portion and internal portion of these elements.
Orientation - If the new family is related to something already well-characterized in Dfam_consensus, RepBase, etc., the new family should be given the same orientation as its characterized relative.
Annotation - Aspects of repeat biology, when known, should be included in the description of the new family. For example, this may include the Target Site Duplication (TSD) length, the relationship to known families or subfamilies, or the average divergence in the species.

Some research groups consider the above tasks perfunctory and will have many datasets ready for direct import into Dfam_consensus. Others may simply run RepeatModeler ( or other repeat discovery programs ) and use the resulting library as-is to mask ( or crudely annotate ) repetitive sequences. The latter datasets, while not ready for Dfam_consensus, are still very useful as the starting point for generating curated libraries. As such, we highly recommend making these datasets available to the public through the Transposable Element Raw Dataset Repository.

Tool Usage

Account Registration

Use of this tool requires an account at the Dfam_consensus site. Registration can be started from the tool itself. Example:

./dfamConsensusTool.pl -register

Dfam_consensus Tool - version 1.0.9
-----------------------------------
Registration to submit to Dfam_consensus is a two step process:
  1. Submit your account preferences using the form below.
  2. Send an email to help@dfam.org to request "submitter" access
     to the system. Once this has been approved you will be able to
     begin uploading and viewing the status of sequences using this
     tool.
Email: mister_jones@gmail.com
Full Name: John Jones
Password: *******

Registration successful!  Now you can send an email
to "help@dfam.org" and request "submitter" access to
Dfam_consensus.

Data Validation

When the tool is used to submit data to Dfam_consensus, several forms of data validation are performed prior to the upload. Validation can also be performed independently using the "-validate" option, which is useful when working with large datasets. Example:

./dfamConsensusTool.pl -validate test.stk

Dfam_consensus Tool - version 1.0.9
-----------------------------------
File test.stk contains 1 families.
[test] Stockholm Format: valid

Logging into server [www.repeatmasker.org:10010] to further
validate the input file.  If you do not already have
an account you can register by running this tool with
the "-register" option.
Username: mister_jones@gmail.com
Password: *******

Login at Tue Apr  4 12:39:48 2017
[test] Identifier Length: valid
[test] Custom Identifiers: valid
[test] Unique Identifiers: valid
[test] NCBI Clade Names: valid

File passes validation

Data Submission

The primary purpose of this tool is the submission of data to the database. If you are an authorized submitter, the "-upload" option will upload the dataset to the Dfam_consensus database. Once approved by the editors, the sequences will appear in the next Dfam_consensus release.

example:

./dfamConsensusTool.pl -upload test.stk

Dfam_consensus Tool - version 1.0.9
-----------------------------------
File test.stk contains 1 families.
[test] Stockholm Format: valid

Logging into server [www.repeatmasker.org:10010]. If
you do not already have an account you can register
by running this tool with the "-register" option.
Username: submitter@test.com
Password: 

Login at Tue Apr  4 12:41:29 2017
[test] Identifier Length: valid
[test] Custom Identifiers: valid
[test] Unique Identifiers: valid
[test] NCBI Clade Names: valid

File passes validation
Working on Jumbo1
  - building consensus...
  - uploading to the server...

Data Submission Format

The dfamConsensusTool.pl script accepts a variant of the Stockholm format. This format is generated automatically by the new RepeatModeler program ( version 1.0.9 and above ) and can be produced from other formats with short translation scripts.

Dfam_consensus Minimal Example

  # STOCKHOLM 1.0
  #=GF ID    Jumbo1
  #=GF DE    A very common DNA transposon with two deletion products Jumbo1A
  #=GF DE    and Jumbo1B. 
  #=GF SE    Predicted; RepeatModeler
  #=GF TP    Interspersed_Repeat;Unknown
  #=GF OC    Colius striatus
  #=GF SQ    2
  #=GC RF                         xxxx.xxx.x.x...x.x.x
  colStr1:KK551537:690782-691123  .GCT.GGGAT.G...CTGCG
  colStr1:KK551537:1038-2834      AGCTTGGG.TTGTACC.G.G
  //
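
A translation script of the kind mentioned above might look like the following Python sketch. It is not part of RepeatModeler: the function name to_dfam_stockholm and its parameters are hypothetical, and it assumes the alignment is already available as ( identifier, gapped sequence ) pairs of equal length with "." marking gaps. It writes only the fields shown in the minimal example, plus an AU line when an author is supplied, and it leaves out the optional #=GC RF line.

```python
def to_dfam_stockholm(family_id, description, source, classification,
                      clades, aligned, author=None):
    """Render one family as a Stockholm record (illustrative sketch only).

    aligned is a list of (sequence_identifier, gapped_sequence) pairs.
    """
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    # Every row of an alignment must span the same number of columns.
    if len({len(seq) for _, seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    lines = [
        "# STOCKHOLM 1.0",
        f"#=GF ID    {family_id}",
        f"#=GF DE    {description}",
    ]
    if author is not None:
        lines.append(f"#=GF AU    {author}")
    lines.append(f"#=GF SE    {source}")
    lines.append(f"#=GF TP    {classification}")
    # One OC line per clade; multiple OC records are allowed.
    lines += [f"#=GF OC    {clade}" for clade in clades]
    lines.append(f"#=GF SQ    {len(aligned)}")

    # Pad identifiers so the sequence columns line up.
    width = max(len(name) for name, _ in aligned) + 2
    lines += [f"{name.ljust(width)}{seq}" for name, seq in aligned]
    lines.append("//")
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file and checked with the "-validate" option before any upload is attempted.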

Dfam_consensus Complex Example

  # STOCKHOLM 1.0
  #=GF ID    Jumbo1
  #=GF DE    A very common DNA transposon with two deletion products Jumbo1A
  #=GF DE    and Jumbo1B. 
  #=GF SE    Predicted; RepeatModeler
  #=GF TP    Interspersed_Repeat;Transposable_Element;DNA;TIR;hAT;Tip100
  #=GF OC    Eutheria
  #=GF OC    Metatheria
  #=GF RN    [1]
  #=GF RM    382832
  #=GF RN    [2]
  #=GF RM    28834
  #=GF DR    DPTEdb; af31b_dna; 
  #=GF SQ    2
  #=GC RF                         xxxx.xxx.x.x...x.x.x
  hg38:chr12:690782-691123        .GCT.GGGAT.G...CTGCG
  hg38:chr2:38282-38399           AGCTTGGG.TTGTACC.G.G
  //
  # STOCKHOLM 1.0
  #=GF ID    Jumbo2
  ... 
  //
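
Because every family ends with a "//" line, a multi-family file can be split into per-family records before any further checks. The Python sketch below illustrates one way to do this. The function names split_families and family_ids are hypothetical, and this is not a description of how dfamConsensusTool.pl itself counts families.

```python
def split_families(text):
    """Split Stockholm text into records, one list of lines per family."""
    records, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines between or inside records
        current.append(stripped)
        if stripped == "//":
            # "//" terminates the current family.
            records.append(current)
            current = []
    if current:
        raise ValueError("last record is missing its '//' terminator")
    return records


def family_ids(records):
    """Return the value of each '#=GF ID' line, in file order."""
    ids = []
    for record in records:
        for line in record:
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[0] == "#=GF" and parts[1] == "ID":
                ids.append(parts[2])
    return ids
```

For a file laid out like the complex example, len(split_families(text)) gives the number of families and family_ids lists their names, which makes it easy to spot duplicate identifiers within a library.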

Recommended features

  #=GF  
  
  Compulsory fields:
     ID   Identification:             One word name for family.
     DE   Definition:                 Short description of family.
     AU   Author:                     Authors of the entry.
     SE   Source of seed:             The source suggesting the seed members 
                                      belong to one family.
     TP   Type/Classification:        Classification of family -- Presently we use the
                                      new unified RepeatMasker classification 
                                      hierarchy.
     OC   Clade:                      Organism (clade, etc.) Multiple OC records are
                                      allowed.  [ valid NCBI Taxonomy names only ]
     SQ   Sequence:                   Number of sequences in alignment.
  
  Optional fields:
     RN   Reference Number:           Reference Number.
     RM   Reference PubMed:           Pubmed reference number.
     RT   Reference Title:            Reference Title. 
     RA   Reference Author:           Reference Author
     RL   Reference Location:         Journal location. 
     DR   Database Reference:         Reference to external database. 

  #=GC
  
  Optional fields:
     RF        Reference annotation   Often the consensus DNA is used as a reference
                                      or simple "x" for match columns and "." for 
                                      insert columns.
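
The field list above lends itself to a quick local check before running the tool. The following Python sketch is an illustration only: the names COMPULSORY, missing_fields and sq_matches_alignment are hypothetical, the check covers just the presence of the compulsory #=GF tags and the SQ count, and it does not replace the validation performed by dfamConsensusTool.pl and the server.

```python
# Tags listed above as compulsory.
COMPULSORY = ("ID", "DE", "AU", "SE", "TP", "OC", "SQ")


def missing_fields(record_lines):
    """Return the compulsory #=GF tags that do not occur in one record."""
    present = set()
    for line in record_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#=GF":
            present.add(parts[1])
    return [tag for tag in COMPULSORY if tag not in present]


def sq_matches_alignment(record_lines):
    """Check that the SQ value equals the number of alignment rows."""
    declared = None
    rows = 0
    for line in record_lines:
        stripped = line.strip()
        if not stripped or stripped == "//":
            continue
        if stripped.startswith("#"):
            # Header and annotation lines; only SQ is of interest here.
            parts = stripped.split()
            if parts[:2] == ["#=GF", "SQ"] and len(parts) == 3 and parts[2].isdigit():
                declared = int(parts[2])
            continue
        rows += 1  # any other non-empty line is treated as a sequence row
    return declared == rows
```

Applied to the minimal example earlier in this section, missing_fields reports AU, since that example carries no author line, while sq_matches_alignment returns True.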

Sequence Identifiers

Sequence identifiers in a Dfam_consensus Stockholm file should contain, at a minimum, a globally identifiable sequence identifier and a sequence range. For example, the NCBI identifier and range:

NC_000001.11:3823-3833

is sufficient to uniquely identify a range of bases on this publicly accessible sequence. The assembly and organism are easily determined using facilities at the NCBI website. For other types of identifiers it may be necessary to also identify the assembly from which the sequence is drawn. For example, the UCSC "chr1" sequence identifier is not unique and requires the addition of the assembly to narrow it down:

hg19:chr1:3828-3839

The general format is:

[<assembly_identifier>:]<sequence_identifier>:<start_pos>-<end_pos>

Where:

assembly_identifier :  [Optional] Character string ( "-", ":", and whitespace are not allowed ).
sequence_identifier :  Character string ( "-", ":", and whitespace are not allowed ).
start_pos           :  Numeric start position 1-based. 
end_pos             :  Numeric end position (inclusive) 1-based.

The start/end positions are in increasing order for forward strand sequences and in decreasing order for reverse strand sequences. Because the order of the two positions encodes the strand, a single-base range ( start equal to end ) is not permitted.
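
The identifier rules above can be expressed as a small parser. The Python sketch below is one possible reading of the format, not code taken from dfamConsensusTool.pl: the names SeqRange and parse_identifier are hypothetical, and the check that positions are at least 1 is an assumption drawn from the 1-based coordinate convention.

```python
import re
from typing import NamedTuple, Optional

# [<assembly_identifier>:]<sequence_identifier>:<start_pos>-<end_pos>
# Identifiers may not contain "-", ":" or whitespace.
_ID_RE = re.compile(
    r"(?:(?P<assembly>[^\s:\-]+):)?"
    r"(?P<sequence>[^\s:\-]+):"
    r"(?P<start>\d+)-(?P<end>\d+)"
)


class SeqRange(NamedTuple):
    assembly: Optional[str]  # None when no assembly prefix is given
    sequence: str
    start: int
    end: int
    strand: str              # "+" for forward, "-" for reverse


def parse_identifier(text):
    """Parse a sequence identifier and infer the strand from the range."""
    match = _ID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid sequence identifier: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start < 1 or end < 1:
        raise ValueError("positions are 1-based and must be at least 1")
    if start == end:
        raise ValueError("single-base ranges are not permitted")
    # Increasing order means forward strand, decreasing means reverse.
    strand = "+" if start < end else "-"
    return SeqRange(match["assembly"], match["sequence"], start, end, strand)
```

With this sketch, parse_identifier("hg19:chr1:3828-3839") yields the assembly "hg19", the sequence "chr1" and a forward strand, while "NC_000001.11:3823-3833" parses with no assembly.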

Institute for Systems Biology
This server is made possible by funding from the National Human Genome Research Institute (NHGRI grant # R01 HG002939).