De-novo Detection and Annotation of Transposable Elements with `RepeatModeler2` and `RepeatMasker`

Author: Clément Goubert | Contact: cgoubert@arizona.edu | Last Update: 11/12/24

Goals:

Discover the transposable element (TE) families present in the Saguaro genome (RepeatModeler2)
Annotate the TEs and other repeats present in the genome assembly (RepeatMasker)

For this workshop we will use a toy dataset made of a single contig (~2.1 Mb) of the Saguaro (Carnegiea gigantea) genome assembly (Copetti et al., 2023).


Pipeline Overview

[Figure: pipeline overview]

RepeatModeler2

RepeatModeler2 is a pipeline that searches for repeat families in genome assemblies. It uses a sampling and clustering approach to identify repeated sequences and assemble them into families. RepeatModeler2 also performs a classification step using homology to a database of known sequences.

Run RepeatModeler

mkdir data
docker run -it -v $(pwd)/data:/data dfam/tetools

Make sure you are in your home directory on the machine hosting the container before running these commands (type cd).

-v specifies the directories to be connected between the host machine and the container. In this case, our ~/data directory will be accessible via /data in the container. Anything we write to /data in the container will be saved in ~/data on the host machine.

-it indicates that we will use an interactive session; it is also possible to use a shell script with a list of commands to be run in the container.
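
As a hedged sketch of the scripted alternative mentioned above, the RepeatModeler2 commands of this section could be saved in a script inside the shared directory and handed to the container. The file name run_rm2.sh is hypothetical, and the example assumes the image runs a command given after its name, which has not been verified here. For that reason the docker command is only printed, not executed.

```shell
# Sketch: prepare a non-interactive RepeatModeler2 run.
# Assumption: run_rm2.sh is a hypothetical file name chosen for this example.
mkdir -p data
cat > data/run_rm2.sh <<'EOF'
#!/bin/bash
# Stop at the first failing command
set -euo pipefail
cd /data
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
RepeatModeler -database Saguaro -LTRStruct -threads 2
EOF
chmod +x data/run_rm2.sh

# Assumption: the image accepts a command after its name.
# The command is printed only; run it once the genome file is in ./data
docker_cmd="docker run --rm -v $(pwd)/data:/data dfam/tetools bash /data/run_rm2.sh"
echo "$docker_cmd"
```

Dropping -it and adding --rm is a design choice for batch use: no terminal is attached, and the stopped container is removed when the script finishes.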

cd /data
# download the genome file
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_TE_Workshop/Cgig_v2_SGP5p_21.fa
# build the index for the search
BuildDatabase -name Saguaro Cgig_v2_SGP5p_21.fa
# run RepeatModeler with the LTR module
RepeatModeler -database Saguaro -LTRStruct -threads 2

This step should take ~10 minutes using 2 CPUs.

At this stage, we have an automatically classified “TE library”. RepeatModeler2 outputs the library in two formats: FASTA (.fa; the consensus [average] sequence of each family) and Stockholm (.stk; each family is represented by a multiple sequence alignment of representative copies).

Output files:

[Figure: RepeatModeler2 output files]

We are primarily interested in Saguaro-families.fa and Saguaro-families.stk

less Saguaro-families.fa
less Saguaro-families.stk

The .fa (FASTA) libraries can be used directly with RepeatMasker. The .stk libraries are more representative of the intra-family diversity, can be converted back to FASTA, and are important to save if you wish to deposit your newly built repeat library at Dfam.
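
Since both files describe the same families, one simple sanity check is to confirm that they contain the same number of entries. The sketch below uses two tiny made-up files (demo-families.fa and demo-families.stk, both hypothetical) and assumes that each Stockholm alignment starts with a "# STOCKHOLM" line, as the format specifies. On real data, replace the demo file names with Saguaro-families.fa and Saguaro-families.stk.

```shell
# Made-up FASTA library with two families
cat > demo-families.fa <<'EOF'
>rnd-1_family-0#LTR/Gypsy
ACGTACGT
>rnd-1_family-1#Unknown
TTGACA
EOF

# Made-up Stockholm library with the same two families
cat > demo-families.stk <<'EOF'
# STOCKHOLM 1.0
#=GF ID rnd-1_family-0
seq1 ACGTACGT
seq2 ACGAACGT
//
# STOCKHOLM 1.0
#=GF ID rnd-1_family-1
seq1 TTGACA
//
EOF

# One '>' header per family in FASTA, one '# STOCKHOLM' line per alignment
n_fa=$(grep -c '^>' demo-families.fa)
n_stk=$(grep -c '^# STOCKHOLM' demo-families.stk)
echo "FASTA families: $n_fa, Stockholm alignments: $n_stk"
```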

To have a quick count:

grep -c '>' Saguaro-families.fa
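
Beyond the total count, it can be useful to see how the families are distributed across classes. The following sketch assumes that each header follows the layout >name#Class/Family (worth confirming with less on your own library) and runs on a small made-up file, demo-classes.fa, which is hypothetical.

```shell
# Made-up library: headers assumed to follow the >name#Class/Family layout
cat > demo-classes.fa <<'EOF'
>rnd-1_family-0#LTR/Gypsy ( example description )
ACGTACGT
>rnd-1_family-1#Unknown
TTGACA
>ltr-1_family-2#LTR/Gypsy
GGGCCC
>rnd-1_family-3#DNA/hAT
ATATAT
EOF

# Split the first header word on '#', tally the class, sort by count (descending)
class_counts=$(awk '/^>/ { split($1, id, "#"); n[id[2]]++ }
                    END { for (c in n) print n[c], c }' demo-classes.fa |
               sort -k1,1nr -k2,2)
echo "$class_counts"
```

With the real library, pointing the awk command at Saguaro-families.fa should list the number of families per class, including those left as Unknown.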

RepeatModeler2 is based on a random sampling of the input assembly, so results can vary slightly between runs. To get reproducible results, use the -srand flag:

-srand #
        Optionally set the seed of the random number generator to a known
        value before the batches are randomly selected ( using Fisher Yates
        Shuffling ). This is only useful if you need to reproduce the sample
        choice between runs. This should be an integer number.

RepeatMasker

RepeatMasker is the leading tool used to annotate repeated elements in a genomic sequence. It performs a competitive search of the TE families represented in a database (or “TE library”) and uses finely tuned search parameters optimized for transposable elements and repetitive DNA.

The default search engine of RepeatMasker is RMBlast (rmblastn), a more sensitive version of blastn designed for TEs. Alternatively, RepeatMasker can search for TEs using profile HMMs with nhmmer, based on the multiple sequence alignments of representative copies; this approach is slower but more sensitive. HMM models can be found at Dfam.

Run RepeatMasker

RepeatMasker -s -a -gff -lib Saguaro-families.fa -pa 1 Cgig_v2_SGP5p_21.fa

-s is for “slow” search, which provides adequate sensitivity in most cases.

-a requests RepeatMasker to output alignments, which are necessary to create repeat landscapes (see below).

-pa indicates how many parallel instances of the aligner we wish to use. There is a catch: RepeatMasker has two main search tools, RMBlast (the default) and nhmmer. RMBlast automatically uses 4 cores per process, while nhmmer uses 2 cores. Thus, -pa should be chosen based on the number of CPUs available and the number of cores required per process. Here, with -pa 1 (the minimum possible), RepeatMasker will use 4 cores for this run.
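
The arithmetic above can be sketched as a small helper. The function name pa_for_cores is hypothetical; it simply divides the available cores by the cores used per search process (4 for RMBlast, 2 for nhmmer, as stated above) and never returns less than 1.

```shell
# Hypothetical helper: suggest a -pa value
# $1 = CPU cores available, $2 = cores used by each search process
pa_for_cores() {
    pa=$(( $1 / $2 ))
    # -pa cannot go below 1, even when fewer cores are available
    if [ "$pa" -lt 1 ]; then pa=1; fi
    echo "$pa"
}

pa_for_cores 16 4   # RMBlast on a 16-core machine
pa_for_cores 16 2   # nhmmer on the same machine
pa_for_cores 2 4    # fewer cores than one RMBlast process uses
```

On Linux, the first argument could be taken from nproc, for example pa_for_cores "$(nproc)" 4.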

There is a lot of useful information in the built-in help of RepeatMasker; you can check the extent of it by typing RepeatMasker -h or by reading the RepeatMasker documentation.

Output files:

[Figure: RepeatMasker output files]

Downstream analyses:

Summarize results per family

buildSummary.pl Cgig_v2_SGP5p_21.fa.out > Cgig_v2_SGP5p_21.fa.summary.txt
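
For a quick look at the raw annotation table itself, the .out file can also be tallied by hand. The sketch below is not a replacement for buildSummary.pl: it assumes the standard .out layout (three header lines, query begin and end in columns 6 and 7, class/family in column 11), runs on a made-up file (demo.fa.out, hypothetical), and does not correct for overlapping annotations, so its totals may differ from the summary.

```shell
# Made-up RepeatMasker .out file: two header lines, one blank line, three hits
cat > demo.fa.out <<'EOF'
   SW   perc perc perc  query     position in query    matching        repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat          class/family  begin  end  (left)  ID

 1200   10.5  1.0  0.5  contig1   101   400   (1600) + rnd-1_family-0  LTR/Gypsy     1      300  (50)    1
  800    5.0  0.0  0.0  contig1   501   700   (1300) C rnd-1_family-3  DNA/hAT       (10)   200  1       2
  950   12.0  2.0  1.0  contig1   901   1000  (1000) + rnd-1_family-0  LTR/Gypsy     301    400  (0)     3
EOF

# Skip the header, add up the annotated length (end - begin + 1) per class
bp_per_class=$(awk 'NR > 3 && NF >= 15 { bp[$11] += $7 - $6 + 1 }
                    END { for (c in bp) printf "%s\t%d\n", c, bp[c] }' demo.fa.out |
               sort -k2,2nr)
echo "$bp_per_class"
```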

Create a “Repeat Landscape” (relative age of TE families)

calcDivergenceFromAlign.pl -s Cgig.divsum Cgig_v2_SGP5p_21.fa.align
createRepeatLandscape.pl -div Cgig.divsum -t "Saguaro repeat landscape" -g 2086968 > Saguaro.landscape.html
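
The -g option of createRepeatLandscape.pl takes the genome size in base pairs. For another assembly, this number can be obtained by summing the lengths of all sequence lines in the FASTA file. A minimal sketch on a made-up file (demo_genome.fa, hypothetical):

```shell
# Made-up assembly with two short contigs
cat > demo_genome.fa <<'EOF'
>contig1
ACGTACGTAC
GGCC
>contig2
TTTTAA
EOF

# Sum the length of every non-header line (Ns, if any, are counted too)
genome_size=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' demo_genome.fa)
echo "$genome_size"
```

Running the same awk command on Cgig_v2_SGP5p_21.fa should give the value to pass to -g.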

In a new terminal window on your local machine, copy the landscape file so you can open it in a web browser:

scp user@IP:~/data/Saguaro.landscape.html .

See also

TE primers

ab-inito TE discovery pipelines

Curation of Repeat Libraries

TE databases

TE tools