De-novo Detection and Annotation of Transposable Elements with RepeatModeler2 and RepeatMasker

Author: Clément Goubert | Contact: cgoubert@arizona.edu | Last Update: 11/12/24

Goals:

For this workshop we will use a toy dataset made of a single contig (~2.1 Mb) of the Saguaro (Carnegiea gigantea) genome assembly (Copetti et al., 2023).

[Figure: Saguaro]

Pipeline Overview

[Figure: pipeline overview]

RepeatModeler2

RepeatModeler2 is a pipeline that searches for repeat families in genome assemblies. It uses a sampling and clustering approach to identify repeated sequences and assemble them into families. RepeatModeler2 also performs a classification step based on homology to a database of known sequences.

Run RepeatModeler

mkdir data
docker run -it -v $(pwd)/data:/data dfam/tetools

Make sure you are in the home directory of the container before running the command (type cd).

-v specifies the directories to be connected between the container and the host machine. In this case, our ~/data directory will be accessible via /data in the container. Anything we write in /data in the container will be saved in ~/data on the host machine.

-it indicates that we will use an interactive session; it is also possible to use a shell script with a list of commands to be run in the container.
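As a minimal sketch of the non-interactive route, the commands used below could be collected in a script and handed to the container. The script name run_rm2.sh and the function name run_in_container are hypothetical, and the sketch assumes that the image runs whatever command is given after its name; treat it as a starting point rather than a tested recipe.

```shell
# Write the commands to a script inside the shared directory.
# The file name run_rm2.sh is a hypothetical choice for illustration.
mkdir -p data
cat > data/run_rm2.sh <<'EOF'
#!/bin/bash
set -euo pipefail
cd /data
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_TE_Workshop/Cgig_v2_SGP5p_21.fa
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
RepeatModeler -database Saguaro -LTRStruct -threads 2
EOF
chmod +x data/run_rm2.sh

# Wrap the docker call in a function so nothing is launched until it is called.
run_in_container() {
    # Assumption: the image executes the command given after its name.
    # -w sets the working directory inside the container.
    docker run -v "$(pwd)/data:/data" -w /data dfam/tetools bash /data/run_rm2.sh
}
```

Calling run_in_container from the host would then run the whole sequence without opening an interactive session; the outputs still land in the data directory.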

cd /data
# download the genome file
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_TE_Workshop/Cgig_v2_SGP5p_21.fa
# build the index for the search
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
# run RepeatModeler with the LTR module
RepeatModeler -database Saguaro -LTRStruct -threads 2

This step should take ~10 minutes using 2 CPUs

At this stage, we have an automatically classified “TE library”. RepeatModeler outputs the library in two formats: .fasta (consensus [average] sequence of each family) and .stk (each family is represented by a multiple sequence alignment of representative copies, in Stockholm format).

Output files:

[Figure: RepeatModeler2 output files]

We are primarily interested in Saguaro-families.fa and Saguaro-families.stk

less Saguaro-families.fa
less Saguaro-families.stk

.fasta libraries can be used directly with RepeatMasker.

.stk libraries are more representative of the intra-family diversity, can be converted back to .fasta, and are important to save if you wish to deposit your newly built repeat library at Dfam.

To have a quick count:

grep -c '>' Saguaro-families.fa
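The same idea can be pushed a little further to count families per classification. The sketch below assumes that each header has the form >name#Class/Superfamily (the usual layout of classified RepeatModeler libraries); the function name count_by_class and the toy file are made up for illustration.

```shell
# Count families per classification: the text after '#' in the first
# word of each header. Headers without '#' are reported as Unclassified.
count_by_class() {
    grep '^>' "$1" | awk '{
            split($1, h, "#")
            cls = (h[2] == "") ? "Unclassified" : h[2]
            n[cls]++
        }
        END { for (c in n) print c "\t" n[c] }' | sort
}

# Toy library with made-up sequences, only to show the expected layout.
cat > toy-families.fa <<'EOF'
>rnd-1_family-0#LTR/Gypsy
ACGTACGTAC
>rnd-1_family-1#LTR/Gypsy
TTGACCAGTA
>rnd-1_family-2#DNA/hAT
GGCATTACGA
>rnd-1_family-3#Unknown
CCATGGTACC
EOF

count_by_class toy-families.fa
# DNA/hAT    1
# LTR/Gypsy  2
# Unknown    1
```

On the real library, count_by_class Saguaro-families.fa should give a quick view of how many families were left as Unknown.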

RepeatModeler2 is based on a random sampling of the input assembly, so results can vary slightly between runs. To get reproducible results, one can use the flag -srand.

-srand #
        Optionally set the seed of the random number generator to a known
        value before the batches are randomly selected ( using Fisher Yates
        Shuffling ). This is only useful if you need to reproduce the sample
        choice between runs. This should be an integer number.
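A possible way to use this flag, sketched under the assumption that the Saguaro database built earlier is in the current directory: fix the seed, keep a record of the exact command next to the results, and launch only when RepeatModeler is available. The seed value 42 and the file name repeatmodeler_command.txt are arbitrary choices for illustration.

```shell
# Hypothetical seed; any integer works as long as it is reused between runs.
SEED=42
CMD="RepeatModeler -database Saguaro -LTRStruct -threads 2 -srand $SEED"

# Record the exact command (including the seed) for later reference.
echo "$CMD" > repeatmodeler_command.txt

# Launch only where RepeatModeler is installed (e.g. inside the container).
if command -v RepeatModeler >/dev/null 2>&1; then
    $CMD
else
    echo "RepeatModeler not found; command saved in repeatmodeler_command.txt"
fi
```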

RepeatMasker

RepeatMasker is the leading tool used to annotate repeated elements in a genomic sequence. It performs a competitive search of the TE families represented in a database (or “TE library”) and uses finely tuned search parameters optimized for transposable elements and repetitive DNA.

The default search engine of RepeatMasker is RMBlast, a more sensitive version of Blastn designed for TEs. Alternatively, RepeatMasker can search TEs using profile HMMs with nhmmer, based on the multiple sequence alignments of representative copies; this approach is slower but more sensitive. HMM models can be found at Dfam.

Run RepeatMasker

RepeatMasker -s -a -gff -lib Saguaro-families.fa -pa 1 Cgig_v2_SGP5p_21.fa

-s is for “slow” search, which provides adequate sensitivity in most cases.

-a requests RepeatMasker to output alignments, which are necessary to create repeat landscapes (see below).

-pa indicates how many parallel instances of the aligner we wish to use. There is a catch: RepeatMasker has 2 main search tools, RMBlast (the default) and nhmmer. RMBlast automatically uses 4 cores per process, while nhmmer uses 2 cores. Thus, -pa should be chosen based on the number of CPUs available and the number of cores required per process. Here, with -pa 1, the minimum possible, RepeatMasker will use 4 cores for this run.

There is a lot of useful information in the built-in help of RepeatMasker, and you can check the extent of it by typing RepeatMasker -h or reading here.
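The arithmetic behind this choice can be sketched as a small helper: divide the CPU budget by the cores used per process (4 for RMBlast, 2 for nhmmer, as stated above) and never go below 1. The function name max_pa is hypothetical.

```shell
# Largest -pa value that fits a CPU budget.
# $1 = number of CPUs available, $2 = cores used per search-engine process.
max_pa() {
    cpus=$1
    cores_per_process=$2
    pa=$((cpus / cores_per_process))
    # -pa cannot be lower than 1, even if one process exceeds the budget.
    if [ "$pa" -lt 1 ]; then
        pa=1
    fi
    echo "$pa"
}

max_pa 16 4   # RMBlast on 16 CPUs: prints 4
max_pa 16 2   # nhmmer on 16 CPUs: prints 8
max_pa 2 4    # prints 1; the run would still use 4 cores
```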

Output files:

[Figure: RepeatMasker output files]

Downstream analyses:

Summarize results per family

buildSummary.pl Cgig_v2_SGP5p_21.fa.out > Cgig_v2_SGP5p_21.fa.summary.txt
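For a quick look without the helper script, the .out table can also be summarized directly. The sketch below assumes the standard RepeatMasker .out layout (three header lines, query begin and end in columns 6 and 7, repeat class/family in column 11). The function name bp_by_class is hypothetical and the toy rows are invented for illustration. Overlapping annotations are counted twice here, so totals may differ from those of buildSummary.pl.

```shell
# Sum annotated base pairs per repeat class/family from a .out file.
bp_by_class() {
    awk 'NR > 3 && NF >= 15 { bp[$11] += $7 - $6 + 1 }
         END { for (c in bp) print c "\t" bp[c] }' "$1" | sort
}

# Toy .out file with made-up hits, only to show the column layout.
cat > toy.fa.out <<'EOF'
   SW   perc perc perc  query     position in query     matching        repeat        position in repeat
score   div. del. ins.  sequence  begin end    (left)   repeat          class/family  begin end (left)  ID

 1200   10.5  1.2  0.8  contig21   1001 1500 (2085468) + rnd-1_family-0 LTR/Gypsy         1  500    (0)   1
  800   15.0  0.0  2.1  contig21   3000 3199 (2083769) C rnd-1_family-2 DNA/hAT         (50) 200      1   2
 1500    5.2  0.5  0.0  contig21   5001 5300 (2081668) + rnd-1_family-1 LTR/Gypsy        20  319    (0)   3
EOF

bp_by_class toy.fa.out
# DNA/hAT    200
# LTR/Gypsy  800
```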

Create a “Repeat Landscape” (relative age of TE families)

calcDivergenceFromAlign.pl -s Cgig.divsum Cgig_v2_SGP5p_21.fa.align
createRepeatLandscape.pl -div Cgig.divsum -t "Saguaro repeat landscape" -g 2086968 > Saguaro.landscape.html
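The value passed to -g is the genome size in bp (here, the length of our single contig). For another assembly, one way to obtain it is to sum the lengths of all sequence lines in the FASTA file; the function name fasta_length and the toy file below are hypothetical.

```shell
# Total number of bases in a FASTA file (header lines are skipped).
fasta_length() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Toy FASTA file, only to check the helper.
cat > toy-genome.fa <<'EOF'
>contig_a
ACGTACGTAC
ACGTA
>contig_b
GGGCCC
EOF

fasta_length toy-genome.fa   # prints 21
```

The result could then be passed along, for example with -g "$(fasta_length Cgig_v2_SGP5p_21.fa)".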

In a new terminal window

scp user@IP:~/data/Saguaro.landscape.html .

See also

TE primers

ab initio TE discovery pipelines

Curation of Repeat Libraries

TE databases

TE tools