Interspersed Repeat Protein Masking Documentation
DNA transposons: 1188 entries, 573k amino acids
LINE-like elements: 847 entries, 689k amino acids
LTR retrotransposons: 1912 entries, 1733k amino acids
Rolling circles: 52 entries, 61k amino acids
About 870 of these proteins are entries extracted from the nr GenBank databases. Most of the remaining peptides are translations of interspersed repeat consensus sequences. Many of the coding elements in repeat libraries contain frameshifts and stop codons that reflect errors in the consensus. To represent these elements, some 660 interrupted coding regions have so far been translated using TFASTY or GeneWise, guided by the most closely related proteins in the database.
Very similar proteins are represented by only one sequence (e.g. there is only one HIV-1 pol entry). A cut-off of 90% similarity and/or 85% identity has been used, but where many of the differences appear to be due to ambiguities or errors, more distantly matching proteins have been excluded as well.
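The redundancy rule above can be illustrated with a minimal sketch. It is not the procedure used to build the database. It assumes the sequences are already aligned to a common length (with "-" for gaps), reads "and/or" as "either threshold suffices", and scores similarity with a small, hypothetical table of conservative substitution groups where a real pipeline would more likely use a substitution matrix. All function names are made up for illustration.

```python
# Hypothetical conservative-substitution groups (an assumption for this
# sketch); a real pipeline would score similarity with a substitution matrix.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]
GROUP_OF = {aa: i for i, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(a, b):
    """Return (identity, similarity) fractions for two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns in which either sequence has a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(x == y for x, y in columns)
    similar = sum(
        x == y or (x in GROUP_OF and GROUP_OF.get(x) == GROUP_OF.get(y))
        for x, y in columns
    )
    return identical / len(columns), similar / len(columns)


def is_redundant(a, b, min_identity=0.85, min_similarity=0.90):
    """True if either cut-off from the text is met."""
    identity, similarity = identity_and_similarity(a, b)
    return identity >= min_identity or similarity >= min_similarity


def pick_representatives(aligned):
    """Greedily keep a sequence only if it is not redundant with one already kept."""
    kept = []
    for name, seq in aligned.items():
        if not any(is_redundant(seq, aligned[k]) for k in kept):
            kept.append(name)
    return kept
```

With this greedy scheme the first member of each group of near-identical proteins becomes its representative, so the input order decides which entry survives. The manual exclusion of more distant matches described above is a curation judgment and is not modelled here.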
The protein database is still very much under development, but it functions well for the purpose of masking repetitive DNA that could otherwise give rise to spurious matches in translated database searches.
As most proteins are classified to the same level at which RepeatMasker classifies transposable elements, the database is also already useful for classifying transposable elements. This will be its central role in the interspersed repeat identification program that we are developing. In the future, phylogenetic comparisons of transposable elements based on their coding regions can be performed using this database, though we will need to (and will) curate it significantly before this can be done properly.
We will make the database available as part of the RepeatMasker package as soon as we're convinced that the most egregious errors and inconsistencies have been eliminated.
A surprisingly large number of genes are derived from transposable element proteins. In 2001 I identified 50 in the human genome, and it is more than likely that other genomes are similar in this respect, if not more so, considering that most other genomes are exposed to a much larger variety of transposable elements and DNA transposons in general. So, be aware that some genes may be (partially) masked.
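One way to act on this warning is to measure how much of each gene of interest was masked. The sketch below is only an illustration. It assumes the masked sequence is available as a plain string, that masked positions appear as N or X (or optionally as lowercase letters), and that gene coordinates are 0-based with an exclusive end. The function names and the 10% threshold are hypothetical.

```python
def masked_fraction(masked_seq, start, end, mask_chars="NX", count_lowercase=False):
    """Fraction of masked_seq[start:end] that is masked (0-based, end exclusive)."""
    region = masked_seq[start:end]
    if not region:
        raise ValueError("empty region")
    masked = sum(
        base in mask_chars or (count_lowercase and base.islower())
        for base in region
    )
    return masked / len(region)


def flag_masked_genes(masked_seq, genes, threshold=0.1):
    """Return {gene name: masked fraction} for genes masked above the threshold."""
    flagged = {}
    for name, (start, end) in genes.items():
        fraction = masked_fraction(masked_seq, start, end)
        if fraction > threshold:
            flagged[name] = fraction
    return flagged
```

For example, flag_masked_genes("ACGTNNNNACGTACGTACGT", {"g1": (0, 10), "g2": (10, 20)}) reports only g1, with a masked fraction of 0.4. Genes flagged this way can then be searched again against the unmasked sequence.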