De-novo Detection and Annotation of Transposable Elements with RepeatModeler2 and RepeatMasker
Author: Clément Goubert | Contact: cgoubert@arizona.edu | Last Update: 11/12/24
Goals:
- Discover the transposable element (TE) families present in the Saguaro genome (RepeatModeler2)
- Annotate the TEs and other repeats present in the genome assembly (RepeatMasker)
For this workshop we will use a toy dataset made of a single contig (~2.1 Mb) of the Saguaro (Carnegiea gigantea) genome assembly (Copetti et al., 2023).
Pipeline Overview
RepeatModeler2
RepeatModeler2 is a pipeline that searches for repeat families in genomic assemblies. It uses a sampling and clustering approach to identify repeated sequences and assemble them into families. RepeatModeler2 also performs a classification step using homology to a database of known sequences.
Run RepeatModeler
mkdir data
docker run -it -v $(pwd)/data:/data dfam/tetools
Make sure you are in your home directory (type cd) before running these commands, since the volume mapping described below assumes that the data directory is ~/data.
-v specifies the directories to be connected between the container and the host machine. In this case, our ~/data directory will be accessible via /data in the container. Anything we write in /data in the container will be saved in ~/data on the host machine.
-it indicates that we will use an interactive session; it is also possible to use a shell script with a list of commands to be run in the container.
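The non-interactive alternative mentioned above could look like the sketch below. It writes the RepeatModeler steps of this workshop to a script inside the shared directory and defines a helper that would run it in the container. The script name run_repeatmodeler.sh and the helper run_in_container are hypothetical, and the sketch assumes that the dfam/tetools image executes a command given after the image name and provides bash.

```shell
# Write the workshop commands to a script inside the shared directory.
# The file name run_repeatmodeler.sh is a made-up example.
mkdir -p data
cat > data/run_repeatmodeler.sh <<'EOF'
#!/bin/bash
set -euo pipefail
cd /data
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_TE_Workshop/Cgig_v2_SGP5p_21.fa
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
RepeatModeler -database Saguaro -LTRStruct -threads 2
EOF

# Hypothetical helper: same volume mapping as the interactive command,
# but without -it, and with the script passed as the command to run.
# Assumption: the image accepts a command after its name.
run_in_container() {
  docker run -v "$(pwd)/data:/data" dfam/tetools bash /data/run_repeatmodeler.sh
}

# Uncomment on a machine where Docker is available:
# run_in_container
```

Because the script lives in ~/data on the host, it is visible as /data/run_repeatmodeler.sh in the container, and all outputs end up in ~/data as in the interactive session.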
cd /data
# download the genome file
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_TE_Workshop/Cgig_v2_SGP5p_21.fa
# build the index for the search
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
# run RepeatModeler with the LTR module
RepeatModeler -database Saguaro -LTRStruct -threads 2
This step should take ~10 minutes using 2 CPUs.
At this stage, we have an automatically classified “TE library”. RepeatModeler outputs the library in two formats: .fasta (consensus [average] sequence of each family) and .stk (each family is represented by a multiple sequence alignment of representative copies, in Stockholm format).
Output files:
We are primarily interested in Saguaro-families.fa and Saguaro-families.stk
less Saguaro-families.fa
less Saguaro-families.stk
.fasta libraries can be used directly with RepeatMasker.
.stk libraries are more representative of the intra-family diversity, can be converted back to .fasta, and are important to save if you wish to deposit your newly built repeat library at Dfam.
To get a quick count of the families in the library:
grep -c '>' Saguaro-families.fa
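To go one step beyond the total count, the families can be tallied by classification. The sketch below assumes that the library headers follow the usual name#Class/Superfamily convention (for example, a header starting with ">rnd-1_family-0#LTR/Gypsy"); the function name count_by_class and the label "NoClass" for headers without a "#" are hypothetical.

```shell
# Count families per classification, reading the text after "#" in each header.
# Assumption: headers look like ">name#Class/Superfamily optional description".
count_by_class() {
  awk '/^>/ {
         n = split($1, part, "#")    # part[2] holds the classification when present
         cls = (n > 1 && part[2] != "") ? part[2] : "NoClass"
         count[cls]++
       }
       END { for (c in count) printf "%d\t%s\n", count[c], c }' "$1" |
    sort -k1,1nr -k2,2               # largest counts first, then by class name
}

# Usage inside /data:
# count_by_class Saguaro-families.fa
```

Each output line gives a count followed by the classification, which makes it easy to see how many families were left as Unknown by the automatic classification.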
RepeatModeler2 is based on a random sampling of the input assembly, so results can vary slightly between runs. To get reproducible results, one can use the flag -srand. From the built-in help:
-srand #
Optionally set the seed of the random number generator to a known
value before the batches are randomly selected ( using Fisher Yates
Shuffling ). This is only useful if you need to reproduce the sample
choice between runs. This should be an integer number.
RepeatMasker
RepeatMasker is the leading tool used to annotate repeated elements in a genomic sequence. It performs a competitive search of the TE families represented in a database (or “TE library”) and uses finely tuned search parameters optimized for transposable elements and repetitive DNA.
The default search engine of RepeatMasker is RMBlastn, a more sensitive version of Blastn designed for TEs. Alternatively, RepeatMasker can search for TEs using HMM profile models with nhmmer, based on the multiple sequence alignments of representative copies; this approach is slower but more sensitive. HMM models can be found at Dfam.
Run RepeatMasker
RepeatMasker -s -a -gff -lib Saguaro-families.fa -pa 1 Cgig_v2_SGP5p_21.fa
-s is for “slow” search, which provides adequate sensitivity in most cases.
-a requests RepeatMasker to output alignments, which are necessary to create repeat landscapes (see below).
-pa indicates how many parallel instances of the aligner we wish to use. There is a catch: RepeatMasker has 2 main search tools, RMBlastn (by default) and nhmmer. RMBlastn automatically uses 4 cores per process, while nhmmer will use 2 cores. Thus, -pa should be chosen based on the number of CPUs available and the number of cores required per process. Here, with -pa 1, the minimum possible, RepeatMasker will use 4 cores for this run.
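The arithmetic behind the choice of -pa can be sketched as a small helper: divide the available cores by the cores used per process (4 for RMBlastn and 2 for nhmmer, as stated above) and never go below 1. The function name pa_for_cores is hypothetical.

```shell
# Suggest a -pa value from the number of cores and the cores used per process.
# Second argument defaults to 4 (RMBlastn); pass 2 when searching with nhmmer.
pa_for_cores() {
  cores=$1
  per_proc=${2:-4}
  pa=$(( cores / per_proc ))   # integer division rounds down
  if [ "$pa" -lt 1 ]; then
    pa=1                       # -pa 1 is the minimum possible
  fi
  echo "$pa"
}

# Usage (nproc reports the available cores on most Linux systems):
# RepeatMasker -s -a -gff -lib Saguaro-families.fa -pa "$(pa_for_cores "$(nproc)")" Cgig_v2_SGP5p_21.fa
```

Rounding down keeps the run from requesting more cores than the machine has, except on machines with fewer cores than one process uses, where -pa 1 is still the lowest setting available.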
There is a lot of useful information in the built-in help of RepeatMasker; you can see the extent of it by typing RepeatMasker -h
or reading here.
Output files:
Downstream analyses:
Summarize results per families
buildSummary.pl Cgig_v2_SGP5p_21.fa.out > Cgig_v2_SGP5p_21.fa.summary.txt
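As a rough cross-check of the summary, the annotated base pairs can be added up per repeat class directly from the .out file. This is only a sketch: it assumes the standard whitespace-separated RepeatMasker .out layout (query begin and end in columns 6 and 7, class/family in column 11), and it does not correct for overlapping annotations, so the totals can be slightly higher than those reported by buildSummary.pl. The function name bp_per_class is hypothetical.

```shell
# Sum the annotated length (end - begin + 1) per repeat class/family.
# Header and blank lines are skipped because their first field is not a number.
bp_per_class() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 15 { bp[$11] += $7 - $6 + 1 }
       END { for (c in bp) printf "%d\t%s\n", bp[c], c }' "$1" |
    sort -k1,1nr -k2,2         # largest totals first
}

# Usage inside /data:
# bp_per_class Cgig_v2_SGP5p_21.fa.out
```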
Create a “Repeat Landscape” (relative age of TE family)
calcDivergenceFromAlign.pl -s Cgig.divsum Cgig_v2_SGP5p_21.fa.align
createRepeatLandscape.pl -div Cgig.divsum -t "Saguaro repeat landscape" -g 2086968 > Saguaro.landscape.html
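The value given to -g above is taken here to be the assembly length in base pairs, which matches the ~2.1 Mb contig used in this workshop. For another assembly, it could be computed from the FASTA file with a sketch like the one below; the function name assembly_length is hypothetical.

```shell
# Total number of bases in a FASTA file: skip header lines, strip whitespace
# (including any carriage returns), and add up the remaining characters.
assembly_length() {
  awk '!/^>/ { gsub(/[[:space:]]/, ""); total += length($0) }
       END { print total + 0 }' "$1"
}

# Usage inside /data:
# createRepeatLandscape.pl -div Cgig.divsum -t "Saguaro repeat landscape" -g "$(assembly_length Cgig_v2_SGP5p_21.fa)" > Saguaro.landscape.html
```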
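In a new terminal window on your local machine, copy the landscape file over to view it in a browser (replace user and IP with your own login and server address):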
scp user@IP:~/data/Saguaro.landscape.html .
See also
TE primers
- A Field Guide to Eukaryotic Transposable Elements
- A unified classification system for eukaryotic transposable elements
Ab initio TE discovery pipelines
- REPET (great alternative to RepeatModeler2 and RepeatMasker)
- EarlGrey (based on RepeatModeler2)
- EDTA (based on RepeatModeler, some reported issues with classification)
- HiTE (new, not tested sorry!)
Curation of Repeat Libraries
- A beginner’s guide to manual curation of transposable elements
- Curation Guidelines for de novo Generated Transposable Element Families
- MCHelper
- TEtrimmer