Practical 3: Genotyping of polymorphic mobile elements in a TE graph-genome with GraffiTE

Tutorial Author: Clément Goubert (goubert.clement@gmail.com)

Last Update: Oct 31st, 2025

Background

TE insertion polymorphisms

ME1 ME2

Figure 1. Different use cases for pME.

shortreads

Figure 2. Overall strategies used for genotyping TE insertions using high-throughput sequence data. From Chen et al., 2022

Table 1. Main methods available for pME calling Methods

pangenomes

Figure 3. Pangenomes and Structural Variation.

Description

GraffiTE is a pipeline that finds polymorphic mobile elements insertions (pME) by comparing genome assemblies or long read datasets to a reference genome, and then genotypes the discovered polymorphisms in read sets using a graph-genome representation of the pME. GraffiTE is made as a toolbox, containing different modules that can be combined to make the best of the user's data. GraffiTE is build as a Nextflow pipeline and relies on a Singularity image to handle the dependencies.

The pipeline is divided in 4 main steps:

In this tutorial, we will demonstrate the basic usage of GraffiTE by searching for pME between the individual HG002 and the reference genome GRCh38.p14 (hg38). We will first compare a diploid assembly of HG002 to the reference genome hg38 to identify candidate pME. The list of candidate pME will be used to create a TE graph-genome, where each pME sequence (TE-present haplotype) is represented a bubble (alternative path) in the graph. To validate each candidate pME, we will then map short reads (Illumina) onto the TE graph-genome, and infer their genotypes (i.e. heterozygosity) by comparing the read support between TE-absent (short paths) and TE-present (bubbles, long path) in the graph.

Most of these steps are automated in GraffiTE, and could be done using a single command. For clarity, we will break down the analysis in two major commands: 1) de-novo search for candidate pME and 2) graph-genotyping of each pME. Finally, we will discuss the outputs and how to interpret them.

Command 1: detection of candidate pME between the diploid assembly of HG002 and the reference genome (steps 1 through 3)


Command 2: Graph-genotyping of the pME (step 4)

Objectives

At the end of this tutorial, you should be able to:

Protocol

Overview

In order to speed-up the analyses during this tutorial, we will focus on the chromosome 22 only.

The following workflow represent the main steps and command of the pipeline

You can click here to reveal a more detailed workflow

Detailed Workflow

Red paths indicate the workflow of the first command, while blue path indicate the workflow for the second command.

Input data

First, let's create a directory for this practical

mkdir practical3
cd practical3

The directory /home/ubuntu/Share/Practical_3/input contains all the files we will need for the execution of the program. We will first create a symlink to it, and check its content.

ln -s /home/ubuntu/Share/Practical_3/input/ input
tree input/
input/
├── HG002.Illumina.fastq
├── HG002.assemblies.csv
├── HG002.mat.cur.20211005_chr22.fasta
├── HG002.pat.cur.20211005_chr22.fasta
├── HG002.reads.csv
├── hg38.p14.chr22.fa
└── human_DFAM3.6.fasta
  1. Assemblies list: HG002.assemblies.csv

This file indicate which assemblies to use, their path and name

cat input/HG002.assemblies.csv
path,sample
/home/ubuntu/Share/Practical_3/input/HG002.mat.cur.20211005_chr22.fasta,HG002.mat
/home/ubuntu/Share/Practical_3/input/HG002.pat.cur.20211005_chr22.fasta,HG002.pat
  1. Reference genome: hg38.p14.chr22.fasta

  2. Reads file list: HG002.reads.csv

This file indicate which read set to use, their path and sample name In this case, there is only one read sample:

cat input/HG002.reads.csv
path,sample
/home/ubuntu/Share/Practical_3/input/HG002.Illumina.fastq,HG002

With short-reads, GraffiTE uses the program Pangenie which maps read k-mers to the graph genome. Pangenie requires the input reads to be within a single file, thus in this example, R1 and R2 sets were concatenated. Due to time constraint, we will use a small read set, but will explore the results produced with a real read set with 10X coverage.

  1. TE library: human_DFAM3.6.fasta

This file contains the consensus sequences for human repeats (from DFAM)

Step-by-step instructions

These commands are needed to pull the GraffiTE repository (to do only once!)

apptainer remote add --no-login SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloud

If you get FATAL: SylabsCloud is already a remote it's all good, this means this was already set up.

1. Search for pME candidates using genomic assemblies

nextflow run cgroza/GraffiTE --assemblies input/HG002.assemblies.csv --reference input/hg38.p14.chr22.fa --TE_library input/human_DFAM3.6.fasta --out results --mammal --genotype false --cpu 1 --stSort_t 1 --stSort_m "1G"

--stSort_t 1 --stSort_m "1G": here we restrict the memory usage to spare the server, but there no need to use them in a normal setting. You may see a warning about the Singularity cache directory, but this is not an issue.

While the program is running, we can look at the command in details:

Nextflow subdivides each step of the pipeline into processes. While the pipeline runs, the status of each process is indicated.

You can click here to reveal the command 1 divided by `Nextflow` process (rather than steps)

Command 1


Before going further we can take a brief look at the pipeline progress by following a variant through the annotation process (we will give a detailed description of the outputs in the [dedicated section](#outputs)):

2. Genotype candidate pME in a TE graph-genome with short-reads

nextflow run cgroza/GraffiTE  --graffite_vcf results/3_TSD_search/pangenome.vcf --reference input/hg38.p14.chr22.fa --reads input/HG002.reads.csv --out results

However we will not actually run the command in order to keep the server alive!

You can click here to reveal the command 2 divided by Nextflow process

Command 2

Outputs

Since we didn't actually run the genotyping command, we have generated a "real" run using 10X of short reads. Let create a symlink to these results so it will look as if you had run the complete analysis.

ln -s /home/ubuntu/Share/Practical_3/results/4_Genotyping/ results/

1. Output organization

GraffiTE will output the results in the ./tutorial directory (--out tutorial). Each step of the pipeline produces at least one file in variant call format (VCF):

tree results/
results/
├── 1_SV_search
│   ├── svim-asm_individual_VCFs
│   │   ├── HG002.mat.vcf   # INS/DEL for the maternal genome vs reference
│   │   └── HG002.pat.vcf   # INS/DEL for the paternal genome vs reference
│   ├── svim-asm_variants.vcf         # merged VCF containing all INS or DEL
│   └── vcfs.txt                      # list of each individual VCF giving their order in the "Support Vector"
├── 2_Repeat_Filtering
│   ├── genotypes_repmasked_filtered.vcf  # Filtered VCF containing INS or DEL with TE(s) >= 80% of the sequence
│   └── repeatmasker_dir                  # This directory contains the details of the RepeatMasker analysis
│       ├── ALL.onecode.elem_sorted.bak
│       ├── OneCode_LTR.dic
│       ├── indels.fa.cat
│       ├── indels.fa.masked
│       ├── indels.fa.onecode.out
│       ├── indels.fa.out
│       ├── indels.fa.out.length
│       ├── indels.fa.out.log.txt
│       ├── indels.fa.tbl
│       └── onecode.log
├── 3_TSD_search
│   ├── TSD_full_log.txt # Detailed test ouptout of the TSD search
│   ├── TSD_summary.txt  # Summary (1 line per pME) output of the TSD search
│   └── pangenome.vcf    # The fully annotated VCF with candidate pME and TSD(s).
└── 4_Genotyping
    ├── GraffiTE.merged.genotypes.vcf  # Final annotated VCF with individual genotypes (1 genotype column per read set)
    ├── HG002_genotyping.vcf.gz        # Individual VCF for each read set (here only one was used)
    └── HG002_genotyping.vcf.gz.tbi    # Index for each individual VCF

2. Genotypes' VCF

Let's explore in more detail the file 4_Genotyping/GraffiTE.merged.genotypes.vcf. VCFs files start with a header with each line starting with a #.

grep 'CHROM' results/4_Genotyping/GraffiTE.merged.genotypes.vcf
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	<sample>

In our case, there is only one <sample> column. If more read sets are used (for example for multiple individuals), there will be one <sample> column per read set.

The details for each column, and each field within columns can be found in the main documentation.

3. Output Example

To make sense of it, let's look at some entries in the output VCF. We will use the provided output file, generated with a complete set of reads (10X coverage)

chr22	10536797	HG002.pat.svim_asm.DEL.1	GAAATATTTTTCCCTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACAAGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGCGGTGGCAGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAAGCGGAGCTTGCAGTGAGCCGAGATTGCGCCACTGCAGTCCGCAGTCCGGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAA	G	. PASSUK=41;MA=0;AF=0.5;AK=41,0;CIEND=0,0;CIPOS=0,0;CHR2=chr22;END=10537125;SVLEN=-328;SVMETHOD=SURVIVOR1.0.7;SVTYPE=DEL;SUPP_VEC=11;SUPP=2;STRANDS=+-;n_hits=1;match_lengths=313;repeat_ids=AluYb8;matching_classes=SINE/Alu;fragmts=1;RM_hit_strands=+;RM_hit_IDs=298;total_match_length=313;total_match_span=0.951368;mam_filter_1=None;mam_filter_2=None;TSD=AATATTTTTCCCT,aatatttttccct	GT:GQ:GL:KC	1/1:10000:-109.113,-100.6,0:20

This case represent a "classic" pME in the human population. Using the INFO column flags, we can see that this is an element from the AluYb8 family (repeat_ids=AluYb8), which belongs to the SINE/Alu order (matching_classes=SINE/Alu). The polymorphism length is 328 bp (SVLEN=-328), and represent a deletion relative to the reference genome (SVTYPE=DEL). The recognized TE sequence covers 95% of the variant sequence (total_match_span=0.951368).

The insertion strand is + (RM_hit_strands=+). GraffiTE later found a TSD for it: AATATTTTTCCCT (TSD=AATATTTTTCCCT,aatatttttccct). The TSD field represent both 5' and 3' TSD in case there are some mismatches.

STRANDS=+- should be ignored.

This polymorphism was initially discovered in both the maternal and paternal genomes during the first step of the pipeline according to the support vector (SUPP_VEC=11). The order in which the samples are represented in the support vector is given by their order in the file 1_SV_search.vcfs.txt.

Finally, upon read mapping, the polymorphism is confirmed for each haplotype (GENOTYPE column: 1/1). In the genotype field, 0 represent the reference allele, and 1 the alternative allele (fields REF and ALT respectively). Because the SVTYPE= is a DEL (Deletion), this means that HG002 is homozygous for the absence of this Alu element.

chr22	10643867	HG002.pat.svim_asm.INS.3	T	TAAGAAAAATGTGGTTTTGATTTGCATTTCTCTGATGGCCAGTGATGATGAGCATTTCTTCATGTGTTTTTTGGCTGCATAAATGTCTTCTTTTGAGAAGTGTCTGTTCATGTCCTTCGCCCACTTTTTGATGGGGTTGTTTGTTTTTTTCTTGTAAATTTGTTTGAGTTCATTGTAGATTCTGGATATTAGCCCTTTGTCAGATGAGTAGGTTGCGAAAATTTTCTCCCATGTTGTAGGTTGCCTGTTCACTCTGATGGTAGTTTCTTTTGCTGTGCAGAAGCTCTTTAGTTTAATTAGAACTCATAGTTAAAACCACTATGAGATATCATCTCACACCAGTTAGAATGGCAATCATTAAAAAGTCAGGAAACAACAGGTGCTGGAGAGGATGCGGAGAAATAGGAACACTTTTACATTGTTGGTGGGACTGTAAACTAGTTCAACCATTGTGGAAGTCAGTGTGGCGATTCCTCAGGGATCTAGAACTAGAAATACCATTTGACCCAGCCATCCCATTACTGGGTATATACCCAAATGAGTATAAATCATGCTGCTATAAAGACACATGCACATGTATGTTTATTGCGGCACTATTCACAATAGCAAAGACTTGGAACCAACCCAAATGTCCAACAATGATAGACTGGATTAAGAAAATGTGGCACATATACACCATGGAATACTATGCAGACATAAAAAATGATGAGTTCATATCCTTTGTAGGGACATGGATGAAATTGGAAACCATCATTCTCAGTAAACTATCGCAAGAACAAAAAACCAAACACCGCATATTCTCACTCATAGGTGGGAATTGAACAATGAGATCACATGGACACAGGAAGGGGAATATCACACTCTGGGGACTGTGGTGGGGTTGGGGGAGGGGGGAGGGATAGCATTGGGAGATATACCTAATGCTAGATGACACATTAGTGGGTGCAGCGCACCAGCATGGCACATGTATACATATGTAACTAACCTGCACAATGTGCACATGTACCCTAAAACTTAGAGTATAATAAAAAAAAAAAAA	.	PASS	UK=92;MA=0;AF=0.5;AK=10,82;CIEND=-7,0;CIPOS=0,0;CHR2=chr22;END=10643867;SVLEN=1038;SVMETHOD=SURVIVOR1.0.7;SVTYPE=INS;SUPP_VEC=11;SUPP=2;STRANDS=+-;n_hits=2;match_lengths=296,731;repeat_ids=L1HS,L1HS;matching_classes=LINE/L1,LINE/L1;fragmts=1,1;RM_hit_strands=C,+;RM_hit_IDs=659,659;total_match_length=1027;total_match_span=0.98845;mam_filter_1=5P_INV:plus;mam_filter_2=None;TSD=AGAAAAAT,agaaaaat	GT:GQ:GL:KC	0/1:10000:-126.12,0,-72.03:16

This example is similar to the previous one, but with two major differences:

You can notice that in this particular case, GraffiTE reports n_hits=2. Indeed, when this type of insertion occurs, RepeatMasker -- the program used to identify TEs -- produces two consecutive hits for the same sequence (repeat_ids=L1HS,L1HS;matching_classes=LINE/L1,LINE/L1) however, we can see that the first hit appears to be on the complementary strand (C) while the second on the positive strand (+). In reality, the TE is inserted in the + strand, as we can see the poly-A tail in 3'. Because we used the argument --mammal, GraffiTE automatically search for this specific pattern and indicate mam_filter_1=5P_INV:plus.

Post-processing examples

As we have seen in the previous examples, it is important to form hypotheses about what type of pME we expect in order to reduce the number of false positives. In humans, only 3 superfamilies of TEs are known to be active: Alu, LINE-1 and SVA. Additionally, one family of endogenous retrovirus, HERV-K is known to display a small amount of polymorphisms between individuals. Other ERVs are fixed in the human lineage.

Furthermore, we need to take into account a special case for SVA elements. These elements contain a region with a variable number of tandem repeats (VNTR). Upon insertion, the VNTR region can later undergo changes in copy number, in a way comparable to microsatellites. Thus, even when a given individual is homozygous for the presence of an SVA insertion, one allele may be longer than the other due to variability in the VNTR region (see here for a clinical example). Using the --mammal argument, we asked GraffiTE to distinguish between insertion polymorphism (TE either present or absent) and VNTR polymorphism. These cases are reported with the flag mam_filter_2=VNTR_ONLY.

Accordingly, we will take these hypothesis into account while parsing the results.

1 AluSc8 SINE/Alu 1 AluSg4 SINE/Alu 1 AluSp SINE/Alu 1 AluSx3 SINE/Alu 1 AluSx4 SINE/Alu 2 AluSz SINE/Alu 1 AluSz6 SINE/Alu 1 AluY SINE/Alu 11 AluYa5 SINE/Alu 10 AluYb8 SINE/Alu 1 AluYh3 SINE/Alu 1 AluYi6 SINE/Alu 1 AluYk11 SINE/Alu 1 AluYk2 SINE/Alu 1 AluYk4 SINE/Alu 4 L1HS LINE/L1 1 L1HS,L1HS LINE/L1,LINE/L1 <-- notice here, 2 hits, one for the reverse (inversion), and 1 for the forward bit. 1 L1M4 LINE/L1 1 L1PA13 LINE/L1 1 L1PA3 LINE/L1 1 SVA_D Retroposon/SVA 2 SVA_F Retroposon/SVA


or summarized per superfamily:

```sh
grep -v '#' results/4_Genotyping/GraffiTE.merged.genotypes.vcf | \
 grep 'Alu\|L1\|SVA\|ERVK' | grep 'n_hits=1;\|5P_INV' | grep -v 'VNTR_ONLY' | \
 sed 's/repeat_ids=/\t/g;s/;matching_classes=/\t/g;s/;fra/\t/g' | \
 cut -f 10 | sort | uniq -c
      7 LINE/L1
      1 LINE/L1,LINE/L1
      3 Retroposon/SVA
     35 SINE/Alu

Cleaning the Nextflow work directory

Nextflow creates a work/ directory which includes all the intermediate files. This is useful to resume an interrupted job. When successful, it is recommended to remove this directory to save on storage.

rm -rf work/

🎉 Congratulations! You now know how to use GraffiTE!