Introduction to Pangenomics: Building a Genome-Graph with pggb
Pangenome 101
A pangenome is a collection of genomes belonging to a species or a group of species. By definition a pangenome represents the genetic diversity of the model studied and ought to replace the concept of reference genome
In practice a pangenome can be as simple as a collection of fasta sequences. However, a collection of sequence by itself does not directly replace a reference genome. Thus, a new data structure is necessary: a graph-genome. In a graph-genome, shared segments between samples are collapsed into single paths (set of nodes), while variation between samples is represented by bubbles (alternative paths of nodes). The nodes of the graphs are connected by vertices.
A new collection of tools has been built to construct, manipulate and interpret graph-genomes. The graph genome by itself can be useful to study the genomic variation among a set of assemblies or haplotypes. Furthermore, it is also possible to use a graph-genome as a reference and genotype samples by mapping reads onto it. Graph-genomes have become the gold standard to study structural variation: indeed, by representing explicitly all the alleles present in the pangenome, read mapping is more accurate, especially for large or complex variants.
There are essentially two distinct ways to build a pangenome graph: either (i) by using a genome sequence as backbone or (ii) by comparing all the sequences to each other (all vs all comparison). The former is less computationally intensive and may be useful if the quality of the assemblies is heterogeneous, while the later may be less biased – and thus more representative – at the cost of more intensive calculations. In this workshop we will use pggb
(PanGenome Graph Builder) a reference-free framework that combines cutting-edge software for graph building, smoothing, visualization, quality control, mapping and variant calling. It is fairly easy to use and can perform a lot of complex commands with a single job.
Data prep
In this workshop, we will be aligning 10 haplotypes of the MHC locus HLA-DQA1. MHC loci are highly polymorphic, with a lot of copy number and structural variants.
To work with pggb
, it is highly recommended to name the input sequences according to the panSN nomenclature. Here are the headers of the input sequences:
>REFhg38#1#gi568815592
>COX#1#gi568815529
>DBB#1#gi568815551
>MANN#1#gi568815561
>MCF#1#gi568815564
>QBL#1#gi568815567
>SSTO#1#gi568815569
>Alt8#1#gi398303864
>CHM1#1#gi528476637
>HuRef#1#gi157734152
The format is: [sample_name] [separator] [haplotype] [separator] [seq_name]
The recommended separator character is #
; [haplotype]
is used to indicate the haplotype number for the sample. Here, we use 1 haplotype per sample, so this number is always 1
. Finally, here [seq_name] is the GenBank accession; in practice, it may be a chromosome or contig number belonging to the specified [haplotype]. This notation is helpful for example for pggb
to find a the path of a given assembly in the graph.
Running PGGB
# create a working directory and move within it
mkdir data
cd data
# download the toy dataset
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_Pangenome_Workshop/DQA1-3117.fa
# launch the docker container in interactive mode
docker run -it -v $(pwd):/data cosimichele/pggb
# use samtools to index the input fasta containing the haplotypes
samtools faidx /data/DQA1-3117.fa
# launch PGGB
pggb -i /data/DQA1-3117.fa -p 70 -s 500 -n 10 -t 4 -V 'REFhg38#1#gi568815592' -o /data/DQA1_pangenome -M
# -p, --map-pct-id PCT percent identity for mapping/alignment [default: 90]
# -s, --segment-length N segment length for mapping [default: 5000]
# -n, --n-haplotypes N number of haplotypes
# -t, --threads N number of compute threads to use in parallel steps [default: 8]`
# -V, --vcf-spec SPEC specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
# the paths matching ^REF are used as a reference, while the sample haplotypes
# are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
# a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
# If LEN is specified and greater than 0, the VCFs are decomposed, filtering
# sites whose max allele length is greater than LEN. [default: off]
While the program runs, let’s look at the pggb
workflow:
Source: Garrison et al., 2024
Pack the results and transfer to your machine:
tar -zcvf /data/DQA1_pangenome.tar.gz /data/DQA1_pangenome/ # create an archive of the results
exit # exit the docker container
exit # quit the VM
# On your local machine:
scp userID@IP:~/data/DQA1_pangenome.tar.gz . # copy the output to your machine
tar -xzvf DQA1_pangenome.tar.gz # extract the archive
Interpretation
Pangenome graph
The main output of our analysis is the graph-genome, stored in GFA
format. Yours should looks like DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.gfa
. The GFA can be further used to augment the pangenome (add new haplotypes to it), map reads, analyze variants and more…
pggb
also automatically produces some visualizations, for example: DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.og.viz_multiqc.png
:
We can already notice that the reference sequence (REFhg38) is twice as long as the other one. Also, it seems that the haplotype
HuRef
has both a large insertion and a large deletion relative to the other haplotypes.
Another useful visualisation is DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.og.viz_uncalled_multiqc.png
. It indicates whether some
haplotypes have uncalled bases (“N” residues):
We see that the large insertion in
HuRef
is actually a long stretch of uncalled bases (in green). Indeed, we can see it in its fasta sequence>HuRef#1#gi157734152
.
Variant calling
When we launched the analysis, we provided the option -V 'REFhg38#1#gi568815592'
. It asks pggb
to produce a VCF file
(https://samtools.github.io/hts-specs/VCFv4.2.pdf) (Variant Call Format), projecting the paths in the graph relative to a reference path, here REFhg38#1#gi568815592
. This can be useful in order to anchor each variant in a common coordinate system, allowing comparison between analyses. VCF are also the main format to store genotypes after mapping reads to the pangenome, and can be used for downstream analyses (population genetics, association studies,…).
VCF files contains two main sections:
- a header, with lines starting with a #
. Header should provide details about the reference used, the command used to generate it, and details the flags present in the INFO
field of each variant.
- a body, which list 1 variant per line. This section has 9 columns plus 1 more column per sample (minus the reference).
Let’s look at the first variant for example:
grep -A1 '#CHROM' DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.REFhg38#1#gi568815592.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Alt8 CHM1 COX DBB HuRef MANN MCF QBL SSTO
REFhg38#1#gi568815592 22 >4>7 A G 60 . AC=4;AF=0.444444;AN=9;AT=>4>6>7,>4>5>7;NS=9;LV=0 GT 0 0 1 0 0 0 1 1 1
This variant is a SNP, positioned at the 22nd residue of the reference sequence
REFhg38#1#gi568815592
. The reference allele isA
and the alternative allele isG
. In the genotypes fields (columns 10+),0
means “identical to the reference”, or allele 0, while1
refers to the first alternative allele (there are only two alleles here, but there can be more). Note that since we used haplotype, there is only one allele per sample
- Numbers preceded with a “>” represent nodes in the graph.
AC
: allele count in genotypes, for each ALT allele, in the same order as listed (4)AF
: allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes (0.444444)AN
: total number of alleles in called genotypes (include the ref allele, 9)NS
: Number of samples with data (9, no missing data)LV
: Level in the snarl tree (0=top level). A snarl is a site of variation in the graph (a bubble). Snarls can be nested, and if the level is > 0, the parent snarl is indicated withPS
.AT
: Allele Traversal as path in graph. This is the path of each allele in the graph. We will use it later for visualization.
Let’s now look at a more “interesting” allele. In my VCF, I have a large variant with the ID
“>831>1116”. The command below isolate this line and format the line in a more explicit way:
awk '$3==">831>1116"' DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.REFhg38#1#gi568815592.vcf | cut -f 4,5,8 | awk '{print "Reference sequence: "; print $1"\n"; gsub(",","\n\n", $2); print "Alt sequences: "; print $2"\n"; gsub(";","\n\n",$3); sub("AT=","AT=\n",$3); gsub(",>","\n\n>",$3); print $3}' | sed "s/^,//g;s/>/,/g"
Reference sequence:
CTACAGTACATTGTATCTGTTCCCTTACCTACCTGACTCTTCCACTATTCAGTTTGTTCCTTAATAGTAGACCATGCCTGATTGGTGTTTTACACATCTCCGGCTATGTCTGACACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGACAAAAAAGTCCAGTTTGGCTCATTTGGCTGGAGTCGTACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGACAAGCACACTGCCCATTAGAGGAAAAAGTGTGATATAAGTGTTGAGTCAGAATGCTGTAGACATTTAGTAATCTCCTTCACAGGAAAAAAAAAAAAAAGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTCATTTCAGAAGAAATCAGTTGCTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAGGCATTGTTTATTTATTATATATTTTGTAATGTTATTACCAATCTTGTTATACTCTTTCTTATACCCTACAATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTAAAAAAAAAAAAAAGAAAGATCTCTGTGTAGAGTGTCCTATTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACCAATGAAAGAATTAAGTGAAAGATAAATCTCAGGAAGCAGAGGGAAGTAAACCTAATTTCTGACTAAGAAAGCTAAATACTATGATAACTCATTCATTCCTTCTTTTGTTCAATTACATTATTTAATCATAAGTCCATGACATGCCAGGCACTCAGGAAATAGTAAAAATTGGACATGCGATATTCTGCCCTTGTGTAGCGCACACTAGAGTGGGAAAGAAAGTGCACTTTTAACTGGACAACTACCAACATGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTG
Alt sequences:
CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTTCCACTCTTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATTGGTGTTTTACACATCCCCTGCTATGTCTGACACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACAACGGTAAAAGGCTTACACGTCTTGACAGGAATGTCCAGTTCGGCTCATTTGGCTGGAGCCACATTGCACGGCTGCCATTCTGCTCTGGCATCCTCAGACAAGCACACTGCCCATTAGAGGAAAAAGGGTTTAATTTACCTGAGTCCTCAAGTGAATATAAGTGTTAAGTCAGAACACTGTAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAAAGGTGGGGGGAAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTCTTCATTTCAGAAGAAATCAGTTGCTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTGAAGGAGGCATTGTTTATTTATTATATATTTTGTAATGTTATTACCATTCTTGTTATACTCTTCCTTATACCCTACAATTGTTAGCAGACATTATTTTAAATTAATAAGATCCTGCATGCTTTTCTTTTTTTTAAAAAAAGAAAGACCTCTGTGTAGAGTGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGAAAATACAATCAGTTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATGAATCTTAGGAAGCAGAAGGAAGTAAACCTAATCTCTGACTAAGAAAGCTAAATACCATAATAACTCATTCATTCCTTCTTTTGTTCAATTACATTATTTAATCATAAGTCCGTGATGTGCCAGGCACTCAGGAAATAGTAAAAACTGGACATGTGATATTCTGCCCTTGTGTAGCGCACATTATAGTGGGAAAGAAAGCGCAATTTTAACCGGACAACTACCAACAATAAGAGTGGAGGAAGCAGGGGTTGGAAATGTCCACAGGCTG
CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTCCCACTATTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATTGGTGTTTTACACTTCCCCTGCTATGTCTGATACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGACCAGAATGTTCAGTTTGGCTCATTTGGCTGGAGTCATACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGAGAAGCACACTGCCCATTAGAGGAAAAAGGGTGAATATAAATGTTGAGTCAGAACACTGCAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAGGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTTCATTTCAGAAGAAATCAGTTACTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAAGCATTATTTATTTATCATATATTTTGTAATGTTATTACCGTTCTTGTTATACTCTTTCTTATACCCTACCATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTTTCTAAAAAAAGAAAGATCTCTGTGTAGAATGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATAAACCTTAGGAAGCAGAGGGAAGTTAATCTATGACTAAGAAAGTTAAGTACTCTGATAACTCATTCATTCCTTCTTCTGTTCATTTACATTATTTAATCACAAGTCCATGATGTGCCAGGCACTCAGGAAATAGTGAAAATCGGACACGCGATATTCTGCCCTTGTGTAGCACACACTGTAGTGGGAAAGAAAGTGCACTTTTAACTGGACAACTATCAACACGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTT
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNG
CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTCCCACTATTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATCGGTGTTTTACACATCCCCTGCTATGTCTGATACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGAGCAGAATGTTCAGTTTGGCTCATTTGGCTGGAGTCATACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGAGAAGCACACTGCCCATTAAAGGAAAAAGGGTGAATACAAATGTTGAGTCAGAACACTGCAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAAAGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTTCATTTCAGAAGAAATCAGTTACTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAAGCATTATTTACTTATCATATATTTTGTAATGTTATTACCCTTCTTGTTATACTCTTTCTTATACCCTACCATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTTTCTAAAAAAAGAAAGATCTCTGTGTAGAATGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATAAACCTTAGGAAGCAGAGGGAAGTTAATCTATGACTAAGAAAGTTAAGTACTCTGATAACTCATTCATTCCTTCTTTTGTTCATTTACATTATTTAATCACAAGTCTATGATGTGCCAGGCACTCAGGAAATAGTGAAAATTGGACACGCGATATTCTGCCCTTGTGTAGCACACACCGTAGTGGGAAAGAAAGTGCACTTTTAACCGGACAACTATCAACACGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTT
AC=2,2,1,2
AF=0.222222,0.222222,0.111111,0.222222
AN=9
AT=
,831,832,835,836,838,839,840,842,844,845,847,848,850,851,853,854,856,857,859,860,862,863,864,866,867,868,871,873,875,877,878,880,881,883,884,886,887,889,891,892,894,895,897,898,900,901,903,904,906,907,909,911,912,914,916,917,918,920,921,923,925,926,927,929,930,932,933,934,935,937,938,940,942,943,945,947,948,949,950,952,953,955,956,958,959,961,962,965,967,968,969,971,973,974,975,977,978,980,981,986,987,989,991,992,994,995,996,998,999,1001,1002,1004,1005,1007,1009,1010,1011,1013,1015,1016,1018,1019,1021,1022,1023,1024,1026,1027,1029,1030,1032,1033,1035,1036,1037,1040,1041,1042,1044,1046,1047,1049,1050,1052,1053,1055,1056,1058,1060,1061,1063,1064,1065,1068,1069,1071,1072,1073,1075,1077,1078,1079,1082,1084,1086,1087,1088,1090,1091,1093,1095,1096,1098,1099,1101,1103,1104,1105,1107,1108,1110,1111,1113,1114,1116
,831,832,834,836,838,839,841,842,843,845,847,848,850,851,852,854,855,857,859,860,861,863,865,866,868,871,872,874,875,876,878,880,881,882,884,885,887,888,890,892,893,895,896,898,900,901,903,904,905,907,908,910,911,912,914,916,917,919,920,922,923,925,926,928,929,931,932,934,935,937,938,939,940,942,944,945,947,948,950,952,953,955,956,958,959,961,962,965,966,968,970,971,973,974,976,977,979,980,982,984,985,986,988,989,991,992,993,995,997,998,1000,1001,1003,1004,1006,1007,1008,1010,1012,1013,1015,1016,1017,1019,1020,1022,1023,1024,1025,1027,1029,1030,1032,1033,1035,1036,1038,1040,1041,1043,1044,1046,1047,1049,1050,1052,1053,1055,1057,1058,1059,1061,1063,1064,1066,1068,1069,1071,1072,1074,1075,1077,1078,1080,1082,1084,1085,1087,1089,1090,1092,1093,1094,1096,1098,1099,1102,1104,1106,1107,1109,1110,1112,1113,1114,1116
,831,832,834,836,837,839,840,842,843,845,847,848,849,851,852,854,855,857,858,860,862,863,864,866,867,868,870,871,874,875,876,878,879,881,883,884,886,887,888,891,892,894,895,897,898,899,901,903,904,905,907,909,910,911,912,914,915,917,918,920,922,923,924,926,928,929,931,932,935,936,938,940,941,942,943,945,946,948,949,950,951,953,954,956,958,959,960,962,964,966,968,969,971,972,974,975,977,978,980,982,983,985,986,987,989,990,992,993,995,996,998,999,1001,1002,1004,1005,1007,1008,1010,1011,1013,1014,1016,1017,1019,1021,1022,1024,1025,1027,1028,1030,1031,1033,1034,1036,1037,1039,1041,1042,1044,1045,1047,1048,1050,1051,1053,1055,1056,1058,1059,1061,1062,1064,1065,1067,1069,1070,1072,1073,1075,1076,1078,1079,1082,1083,1085,1087,1088,1090,1091,1093,1095,1096,1097,1099,1100,1103,1104,1105,1107,1108,1110,1111,1113,1115,1116
,831,833,1114,1116
,831,832,834,836,837,839,840,842,843,845,846,848,850,851,852,854,855,857,858,860,862,863,864,866,867,868,869,871,874,875,876,878,879,881,883,884,886,887,888,891,892,894,895,897,898,899,901,902,904,905,907,909,910,911,913,914,915,917,918,920,922,923,924,926,928,929,931,932,934,935,937,938,940,941,942,943,945,946,948,949,950,951,953,954,956,957,959,960,962,963,966,968,969,971,972,974,975,977,978,980,982,983,985,986,987,989,990,992,993,995,996,998,999,1001,1002,1004,1005,1007,1008,1010,1011,1013,1014,1016,1017,1019,1021,1022,1024,1025,1027,1028,1030,1031,1033,1034,1036,1037,1039,1041,1042,1044,1046,1047,1048,1050,1051,1053,1054,1056,1058,1059,1061,1062,1064,1065,1068,1069,1070,1072,1073,1075,1076,1078,1079,1081,1083,1085,1087,1088,1090,1091,1093,1094,1096,1097,1099,1100,1103,1104,1105,1107,1108,1110,1111,1113,1115,1116
NS=9
LV=0
Visualization with Bandage
Bandage
is a GFA
visualizer, initially developed to represent assembly graphs, but nothing stops us from loading our little pangenome into it. A word of caution though, this is not recommended for large, chromosome level representation if there are too many nodes and vertices to represent. It is thus mainly useful when zooming on a specific locus (GFA can be subdivided in subgraph with gfatools
)
Open Bandage
, then click File > Load graph
and select your .gfa
. Next press Draw graph
to produce a layout. You will probably have to zoom out to see the whole graph, and also increase the node width so you can see it clearly. If the layout is messy, you can generate a new one by click on Draw graph
again.
Let’s now zoom on the big bubble, this is our large variant in the VCF
. We can now annotate each haplotype (each path) using the information provided in the VCF. We start with the haplotypes of the non-reference genome. We find them, after the reference path, in the AT
field of the VCF. Earlier I replaced the “>” with commas, so we can directly copy-and-paste the path into Bandage
(do not include the first comma). Finally, color the path of the reference genome.
The yellow path is the one of
HuRef
and includes 898 “Ns”. This is probably because when this assembly was made, a longer sequence (based on the reference) might have been expected and thus the gap was padded with “Ns”. Ideally, these should be removed, creating a shortcut in the graph indicating a large deletion inHuRef
relative tohg38
.
We can zoom in the graph to reveal smaller variants: