Overview
The Dfam_consensus database is an open collection of repetitive DNA consensus sequence models and corresponding seed alignments. It is directly compatible with the RepeatMasker program and any consensus-based search tools. It is freely available and distributed under the Creative Commons Zero ("CC0") license. The dfamConsensusTool.pl script is a command-line utility to aid with submission of new families to the Dfam_consensus database and is distributed as part of the RepeatModeler package. This utility provides the following basic features:
- Account Registration - Dfam_consensus submitters must have an account in order to submit families to the editors. The registration process is quick and can be conducted on the website ( http://www.repeatmasker.org/Dfam_consensus/#/login ) or through the dfamConsensusTool.pl tool itself.
- Data Validation - The basic data format for Dfam/Dfam_consensus is a variant of the Stockholm format. This tool can be used to validate a particular Stockholm file prior to submission.
- Data Submission - The primary role of this tool is to provide a quick and reliable method for uploading a single curated family or a complete library of curated families to the database.
What is meant by "curated" families?
Dfam_consensus is a library of curated Transposable Element families. The definition of "curated" in this context is somewhat of a moving target. Currently it involves several tasks not performed or not performed well by off-the-shelf de-novo repeat identification programs. These include:
- Generation of Full-Length Families - In many cases de-novo repeat discovery programs are not able to automatically produce full-length models for families. A curator must determine the true boundaries and extend these elements by hand. This also involves removing the duplicate fragments of the newly extended family from the de-novo generated library.
- Dfam_consensus/RepBase Redundancy - New families should be checked against public databases ( current Dfam_consensus, and RepBase etc ) to ensure that duplicates are not submitted. Often the library for a new species will contain ancestral families that have been previously characterized and should not be duplicated.
- LTR/Internal Segmentation - Long Terminal Repeat (LTR) sequences are frequently collapsed by de-novo repeat discovery programs into a single LTR with a fragment of the internal sequence to one side. It is largely a manual task to determine the boundaries of the LTR portion and internal portion of these elements.
- Orientation - If the new family is related to something already well-characterized in Dfam_consensus, RepBase etc, the new family should be given the same orientation as its relative.
- Annotation - Aspects of repeat biology, when known, should be included in the description of the new family. For example, this may include the Target Site Duplication (TSD) length, the relationship to known families or subfamilies, or the average divergence in the species.
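As an illustration of the orientation task above, the following is a minimal Python sketch of one crude way to decide whether a new consensus should be reverse-complemented to match a known relative. It is not how dfamConsensusTool.pl or any curator works; the shared k-mer heuristic, the function names, and the default k of 8 are all assumptions made for this example. In practice an alignment against the known family would be the more reliable check.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-length substrings of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient_like(new_family: str, known_family: str, k: int = 8) -> str:
    """Return new_family in whichever orientation shares more k-mers
    with known_family (hypothetical heuristic, ties keep the input as-is)."""
    reference = kmers(known_family, k)
    flipped = reverse_complement(new_family)
    forward_hits = len(kmers(new_family, k) & reference)
    reverse_hits = len(kmers(flipped, k) & reference)
    return flipped if reverse_hits > forward_hits else new_family
```

A sequence that is already in the same orientation as the known family is returned unchanged; one that matches better on the opposite strand is returned reverse-complemented.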
Tool Usage
Account Registration
Use of this tool requires an account at the Dfam_consensus site. Registration can be initiated by using the tool itself. example:

    ./dfamConsensusTool.pl -register

    Dfam_consensus Tool - version 1.0.9
    -----------------------------------
    Registration to submit to Dfam_consensus is a two step process:
      1. Submit your account preferences using the form below.
      2. Send an email to help@dfam.org to request "submitter" access
         to the system. Once this has been approved you will be able to
         begin uploading and viewing the status of sequences using this
         tool.

    Email: mister_jones@gmail.com
    Full Name: John Jones
    Password: *******

    Registration successful! Now you can send an email to "help@dfam.org"
    and request "submitter" access to Dfam_consensus.

Data Validation
When the tool is used to submit data to Dfam_consensus, several forms of data validation are performed prior to the upload. Validation can also be performed independently using the "-validate" option. This is useful when working with large datasets. example:

    ./dfamConsensusTool.pl -validate test.stk

    Dfam_consensus Tool - version 1.0.9
    -----------------------------------
    File test.stk contains 1 families.
    [test] Stockholm Format: valid
    Logging into server [www.repeatmasker.org:10010] to further
    validate the input file. If you do not already have an account
    you can register by running this tool with the "-register" option.
    Username: mister_jones@gmail.com
    Password: *******
    Login at Tue Apr 4 12:39:48 2017
    [test] Identifier Length: valid
    [test] Custom Identifiers: valid
    [test] Unique Identifiers: valid
    [test] NCBI Clade Names: valid
    File passes validation

Data Submission
The primary operation of this tool is the submission of data to the database. If you are an authorized submitter, the "-upload" option will upload the dataset to the Dfam_consensus database. When approved by the editors, the sequences will appear in the next Dfam_consensus release.
example:
    ./dfamConsensusTool.pl -upload test.stk

    Dfam_consensus Tool - version 1.0.9
    -----------------------------------
    File test.stk contains 1 families.
    [test] Stockholm Format: valid
    Logging into server [www.repeatmasker.org:10010].
    If you do not already have an account you can register by
    running this tool with the "-register" option.
    Username: submitter@test.com
    Password:
    Login at Tue Apr 4 12:41:29 2017
    [test] Identifier Length: valid
    [test] Custom Identifiers: valid
    [test] Unique Identifiers: valid
    [test] NCBI Clade Names: valid
    File passes validation
    Working on Jumbo1
      - building consensus...
      - uploading to the server...
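Before uploading a complete library, it can help to see which families a file contains, in the same spirit as the tool's "File test.stk contains 1 families." message. The following Python sketch lists the family identifiers in a multi-record Stockholm text. It is not part of dfamConsensusTool.pl; the function name is hypothetical, and it assumes each record carries a "#=GF ID" line and ends with a line containing only "//", as in the examples in this document.

```python
def list_family_ids(stockholm_text: str) -> list:
    """Return the #=GF ID value of each record in a Stockholm text.

    A record is assumed to end at a line containing only "//".
    A record without an ID line contributes None to the list.
    """
    ids = []
    current_id = None
    for line in stockholm_text.splitlines():
        parts = line.strip().split(None, 2)
        if parts[:2] == ["#=GF", "ID"]:
            # Keep the ID value if present, otherwise leave it unset.
            current_id = parts[2] if len(parts) > 2 else None
        elif parts == ["//"]:
            ids.append(current_id)
            current_id = None
    return ids
```

To use it on a file, read the file contents into a string (for example with `open("test.stk").read()`) and pass that string to `list_family_ids`; the length of the returned list is the number of records found.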
Data Submission Format
The dfamConsensusTool.pl script accepts a variant of the Stockholm format. This format is automatically generated by the new RepeatModeler program ( version 1.0.9 and above ) and can be generated from other formats using simple translation scripts.

Dfam_consensus Minimal Example

    # STOCKHOLM 1.0
    #=GF ID    Jumbo1
    #=GF DE    A very common DNA transposon with two deletion products Jumbo1A
    #=GF DE    and Jumbo1B.
    #=GF SE    Predicted; RepeatModeler
    #=GF TP    Interspersed_Repeat;Unknown
    #=GF OC    Colius striatus
    #=GF SQ    2
    #=GC RF                            xxxx.xxx.x.x...x.x.x
    colStr1:KK551537:690782-691123     .GCT.GGGAT.G...CTGCG
    colStr1:KK551537:1038-2834         AGCTTGGG.TTGTACC.G.G
    //

Dfam_consensus Complex Example

    # STOCKHOLM 1.0
    #=GF ID    Jumbo1
    #=GF DE    A very common DNA transposon with two deletion products Jumbo1A
    #=GF DE    and Jumbo1B.
    #=GF SE    Predicted; RepeatModeler
    #=GF TP    Interspersed_Repeat;Transposable_Element;DNA;TIR;hAT;Tip100
    #=GF OC    Eutheria
    #=GF OC    Metatheria
    #=GF RN    [1]
    #=GF RM    382832
    #=GF RN    [2]
    #=GF RM    28834
    #=GF DR    DPTEdb; af31b_dna;
    #=GF SQ    2
    #=GC RF                      xxxx.xxx.x.x...x.x.x
    hg38:chr12:690782-691123     .GCT.GGGAT.G...CTGCG
    hg38:chr2:38282-38399        AGCTTGGG.TTGTACC.G.G
    //
    # STOCKHOLM 1.0
    #=GF ID    Jumbo2
    ...
    //

Recommended features
    #=GF

      Compulsory fields:
        ID   Identification:       One word name for family.
        DE   Definition:           Short description of family.
        AU   Author:               Authors of the entry.
        SE   Source of seed:       The source suggesting the seed members belong to one family.
        TP   Type/Classification:  Classification of family. Presently we use the new unified RepeatMasker classification hierarchy.
        OC   Clade:                Organism (clade, etc.) Multiple OC records are allowed. [ valid NCBI Taxonomy names only ]
        SQ   Sequence:             Number of sequences in alignment.

      Optional fields:
        RN   Reference Number:     Reference Number.
        RM   Reference PubMed:     Pubmed reference number.
        RT   Reference Title:      Reference Title.
        RA   Reference Author:     Reference Author.
        RL   Reference Location:   Journal location.
        DR   Database Reference:   Reference to external database.

    #=GC

      Optional fields:
        RF   Reference annotation: Often the consensus DNA is used as a reference or simple "x" for match columns and "." for insert columns.

Sequence Identifiers
Sequence identifiers in a Dfam_consensus Stockholm file should contain, at a minimum, a globally identifiable sequence identifier and a sequence range. For example, the NCBI identifier and range:
- NC_000001.11:3823-3833
- hg19:chr1:3828-3839
- [<assembly_identifier>:]<sequence_identifier>:<start_pos>-<end_pos>
    assembly_identifier : [Optional] Character string ( "-", ":", and whitespace are not allowed ).
    sequence_identifier : Character string ( "-", ":", and whitespace are not allowed ).
    start_pos           : Numeric start position 1-based.
    end_pos             : Numeric end position (inclusive) 1-based.

The start/end positions are in increasing order for forward strand sequences and decreasing order for reverse strand sequences. Because the strand is encoded by this ordering, a single-base range (start equal to end) is not permitted.
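The field list and identifier rules above can be turned into a rough local pre-check to run before "-validate". The Python sketch below is only an illustration under stated assumptions: the function names, the regular expression, and the comparison of the SQ value against the number of alignment rows are this example's own reading of the text, not the checks dfamConsensusTool.pl actually performs. Note also that the examples in this document omit the AU field even though it is listed as compulsory, so the real tool may be more lenient on that point.

```python
import re

# Compulsory #=GF tags as listed under "Recommended features".
COMPULSORY_TAGS = ("ID", "DE", "AU", "SE", "TP", "OC", "SQ")

# [<assembly_identifier>:]<sequence_identifier>:<start_pos>-<end_pos>
# Identifier parts may not contain "-", ":" or whitespace.
IDENTIFIER_RE = re.compile(r"^(?:([^\s:-]+):)?([^\s:-]+):(\d+)-(\d+)$")


def parse_identifier(identifier: str) -> dict:
    """Split a sequence identifier into its parts and infer the strand."""
    match = IDENTIFIER_RE.match(identifier)
    if match is None:
        raise ValueError(f"malformed identifier: {identifier!r}")
    assembly, sequence, start, end = match.groups()
    start, end = int(start), int(end)
    if start < 1 or end < 1:
        raise ValueError(f"positions are 1-based: {identifier!r}")
    if start == end:
        raise ValueError(f"single-base range not permitted: {identifier!r}")
    return {
        "assembly": assembly,  # None when the optional part is absent
        "sequence": sequence,
        "start": start,
        "end": end,
        "strand": "+" if start < end else "-",
    }


def check_record(record_text: str, compulsory=COMPULSORY_TAGS) -> list:
    """Return a list of problems found in one Stockholm record (empty if none)."""
    problems = []
    tags_seen = set()
    declared_count = None
    identifiers = []
    for line in record_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "//":
            continue
        if fields[0] == "#=GF":
            if len(fields) >= 2:
                tags_seen.add(fields[1])
            if len(fields) >= 3 and fields[1] == "SQ" and fields[2].isdigit():
                declared_count = int(fields[2])
        elif not fields[0].startswith("#"):
            # Any non-comment, non-markup line is treated as an alignment row.
            identifiers.append(fields[0])
    for tag in compulsory:
        if tag not in tags_seen:
            problems.append(f"missing #=GF {tag}")
    if declared_count is not None and declared_count != len(identifiers):
        problems.append(
            f"SQ says {declared_count} but {len(identifiers)} sequences found"
        )
    for identifier in identifiers:
        try:
            parse_identifier(identifier)
        except ValueError as error:
            problems.append(str(error))
    return problems
```

Run against the minimal example above, `check_record` reports only the missing AU field. Passing a different tuple as `compulsory` relaxes or tightens the tag check without changing the identifier checks.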
This server is made possible by funding from the National Human Genome Research Institute (NHGRI grant # RO1 HG002939).