Introduction to Pangenomics: Building a Genome-Graph with pggb

Pangenome 101

A pangenome is a collection of genomes belonging to a species or a group of species. By definition a pangenome represents the genetic diversity of the model studied and ought to replace the concept of reference genome

pangenome

In practice a pangenome can be as simple as a collection of fasta sequences. However, a collection of sequence by itself does not directly replace a reference genome. Thus, a new data structure is necessary: a graph-genome. In a graph-genome, shared segments between samples are collapsed into single paths (set of nodes), while variation between samples is represented by bubbles (alternative paths of nodes). The nodes of the graphs are connected by vertices.

graphgenome

Source Eizenga et al., 2020 – Ann. Rev.

A new collection of tools has been built to construct, manipulate and interpret graph-genomes. The graph genome by itself can be useful to study the genomic variation among a set of assemblies or haplotypes. Furthermore, it is also possible to use a graph-genome as a reference and genotype samples by mapping reads onto it. Graph-genomes have become the gold standard to study structural variation: indeed, by representing explicitly all the alleles present in the pangenome, read mapping is more accurate, especially for large or complex variants.

There are essentially two distinct ways to build a pangenome graph: either (i) by using a genome sequence as backbone or (ii) by comparing all the sequences to each other (all vs all comparison). The former is less computationally intensive and may be useful if the quality of the assemblies is heterogeneous, while the later may be less biased – and thus more representative – at the cost of more intensive calculations. In this workshop we will use pggb (PanGenome Graph Builder) a reference-free framework that combines cutting-edge software for graph building, smoothing, visualization, quality control, mapping and variant calling. It is fairly easy to use and can perform a lot of complex commands with a single job.

Data prep

In this workshop, we will be aligning 10 haplotypes of the MHC locus HLA-DQA1. MHC loci are highly polymorphic, with a lot of copy number and structural variants.

To work with pggb, it is highly recommended to name the input sequences according to the panSN nomenclature. Here are the headers of the input sequences:

>REFhg38#1#gi568815592
>COX#1#gi568815529
>DBB#1#gi568815551
>MANN#1#gi568815561
>MCF#1#gi568815564
>QBL#1#gi568815567
>SSTO#1#gi568815569
>Alt8#1#gi398303864
>CHM1#1#gi528476637
>HuRef#1#gi157734152

The format is: [sample_name] [separator] [haplotype] [separator] [seq_name]
The recommended separator character is #; [haplotype] is used to indicate the haplotype number for the sample. Here, we use 1 haplotype per sample, so this number is always 1. Finally, here [seq_name] is the GenBank accession; in practice, it may be a chromosome or contig number belonging to the specified [haplotype]. This notation is helpful for example for pggb to find a the path of a given assembly in the graph.

Running PGGB

# create a working directory and move within it
mkdir data
cd data
# download the toy dataset
wget https://www.repeatmasker.org/~cgoubert/UofA_Cyverse_Workshops/2024_Pangenome_Workshop/DQA1-3117.fa
# launch the docker container in interactive mode
docker run -it -v $(pwd):/data cosimichele/pggb
# use samtools to index the input fasta containing the haplotypes
samtools faidx /data/DQA1-3117.fa
# launch PGGB
pggb -i /data/DQA1-3117.fa -p 70 -s 500 -n 10 -t 4 -V 'REFhg38#1#gi568815592' -o /data/DQA1_pangenome -M
# -p, --map-pct-id PCT        percent identity for mapping/alignment [default: 90]
# -s, --segment-length N      segment length for mapping [default: 5000]
# -n, --n-haplotypes N        number of haplotypes
# -t, --threads N             number of compute threads to use in parallel steps [default: 8]`  
# -V, --vcf-spec SPEC         specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
#                                the paths matching ^REF are used as a reference, while the sample haplotypes
#                                are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
#                                a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
#                                If LEN is specified and greater than 0, the VCFs are decomposed, filtering
#                                sites whose max allele length is greater than LEN. [default: off]

While the program runs, let’s look at the pggb workflow:

pggbwf

Source: Garrison et al., 2024

Pack the results and transfer to your machine:

tar -zcvf /data/DQA1_pangenome.tar.gz /data/DQA1_pangenome/ # create an archive of the results
exit # exit the docker container
exit # quit the VM
# On your local machine:
scp userID@IP:~/data/DQA1_pangenome.tar.gz . # copy the output to your machine
tar -xzvf DQA1_pangenome.tar.gz # extract the archive

Interpretation

Pangenome graph

The main output of our analysis is the graph-genome, stored in GFA format. Yours should looks like DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.gfa. The GFA can be further used to augment the pangenome (add new haplotypes to it), map reads, analyze variants and more…

pggb also automatically produces some visualizations, for example: DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.og.viz_multiqc.png:

viz1

We can already notice that the reference sequence (REFhg38) is twice as long as the other one. Also, it seems that the haplotype HuRef has both a large insertion and a large deletion relative to the other haplotypes.

Another useful visualisation is DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.og.viz_uncalled_multiqc.png. It indicates whether some haplotypes have uncalled bases (“N” residues):

viz2

We see that the large insertion in HuRef is actually a long stretch of uncalled bases (in green). Indeed, we can see it in its fasta sequence >HuRef#1#gi157734152.

Variant calling

When we launched the analysis, we provided the option -V 'REFhg38#1#gi568815592'. It asks pggb to produce a VCF file(https://samtools.github.io/hts-specs/VCFv4.2.pdf) (Variant Call Format), projecting the paths in the graph relative to a reference path, here REFhg38#1#gi568815592. This can be useful in order to anchor each variant in a common coordinate system, allowing comparison between analyses. VCF are also the main format to store genotypes after mapping reads to the pangenome, and can be used for downstream analyses (population genetics, association studies,…).

VCF files contains two main sections: - a header, with lines starting with a #. Header should provide details about the reference used, the command used to generate it, and details the flags present in the INFO field of each variant. - a body, which list 1 variant per line. This section has 9 columns plus 1 more column per sample (minus the reference).

Let’s look at the first variant for example:

grep -A1 '#CHROM' DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.REFhg38#1#gi568815592.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Alt8    CHM1    COX DBB HuRef   MANN    MCF QBL SSTO
REFhg38#1#gi568815592   22  >4>7    A   G   60  .   AC=4;AF=0.444444;AN=9;AT=>4>6>7,>4>5>7;NS=9;LV=0    GT  0   0   1   0   0   0   1   1   1

This variant is a SNP, positioned at the 22nd residue of the reference sequence REFhg38#1#gi568815592. The reference allele is A and the alternative allele is G. In the genotypes fields (columns 10+), 0 means “identical to the reference”, or allele 0, while 1 refers to the first alternative allele (there are only two alleles here, but there can be more). Note that since we used haplotype, there is only one allele per sample

Let’s now look at a more “interesting” allele. In my VCF, I have a large variant with the ID “>831>1116”. The command below isolate this line and format the line in a more explicit way:

awk '$3==">831>1116"' DQA1-3117.fa.16a4a17.11fba48.5180c7d.smooth.final.REFhg38#1#gi568815592.vcf | cut -f 4,5,8 | awk '{print "Reference sequence: "; print $1"\n"; gsub(",","\n\n", $2); print "Alt sequences: "; print $2"\n"; gsub(";","\n\n",$3); sub("AT=","AT=\n",$3); gsub(",>","\n\n>",$3); print $3}' | sed "s/^,//g;s/>/,/g"
Reference sequence:
CTACAGTACATTGTATCTGTTCCCTTACCTACCTGACTCTTCCACTATTCAGTTTGTTCCTTAATAGTAGACCATGCCTGATTGGTGTTTTACACATCTCCGGCTATGTCTGACACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGACAAAAAAGTCCAGTTTGGCTCATTTGGCTGGAGTCGTACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGACAAGCACACTGCCCATTAGAGGAAAAAGTGTGATATAAGTGTTGAGTCAGAATGCTGTAGACATTTAGTAATCTCCTTCACAGGAAAAAAAAAAAAAAGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTCATTTCAGAAGAAATCAGTTGCTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAGGCATTGTTTATTTATTATATATTTTGTAATGTTATTACCAATCTTGTTATACTCTTTCTTATACCCTACAATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTAAAAAAAAAAAAAAGAAAGATCTCTGTGTAGAGTGTCCTATTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACCAATGAAAGAATTAAGTGAAAGATAAATCTCAGGAAGCAGAGGGAAGTAAACCTAATTTCTGACTAAGAAAGCTAAATACTATGATAACTCATTCATTCCTTCTTTTGTTCAATTACATTATTTAATCATAAGTCCATGACATGCCAGGCACTCAGGAAATAGTAAAAATTGGACATGCGATATTCTGCCCTTGTGTAGCGCACACTAGAGTGGGAAAGAAAGTGCACTTTTAACTGGACAACTACCAACATGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTG

Alt sequences:
CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTTCCACTCTTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATTGGTGTTTTACACATCCCCTGCTATGTCTGACACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACAACGGTAAAAGGCTTACACGTCTTGACAGGAATGTCCAGTTCGGCTCATTTGGCTGGAGCCACATTGCACGGCTGCCATTCTGCTCTGGCATCCTCAGACAAGCACACTGCCCATTAGAGGAAAAAGGGTTTAATTTACCTGAGTCCTCAAGTGAATATAAGTGTTAAGTCAGAACACTGTAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAAAGGTGGGGGGAAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTCTTCATTTCAGAAGAAATCAGTTGCTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTGAAGGAGGCATTGTTTATTTATTATATATTTTGTAATGTTATTACCATTCTTGTTATACTCTTCCTTATACCCTACAATTGTTAGCAGACATTATTTTAAATTAATAAGATCCTGCATGCTTTTCTTTTTTTTAAAAAAAGAAAGACCTCTGTGTAGAGTGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGAAAATACAATCAGTTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATGAATCTTAGGAAGCAGAAGGAAGTAAACCTAATCTCTGACTAAGAAAGCTAAATACCATAATAACTCATTCATTCCTTCTTTTGTTCAATTACATTATTTAATCATAAGTCCGTGATGTGCCAGGCACTCAGGAAATAGTAAAAACTGGACATGTGATATTCTGCCCTTGTGTAGCGCACATTATAGTGGGAAAGAAAGCGCAATTTTAACCGGACAACTACCAACAATAAGAGTGGAGGAAGCAGGGGTTGGAAATGTCCACAGGCTG

CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTCCCACTATTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATTGGTGTTTTACACTTCCCCTGCTATGTCTGATACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGACCAGAATGTTCAGTTTGGCTCATTTGGCTGGAGTCATACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGAGAAGCACACTGCCCATTAGAGGAAAAAGGGTGAATATAAATGTTGAGTCAGAACACTGCAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAGGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTTCATTTCAGAAGAAATCAGTTACTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAAGCATTATTTATTTATCATATATTTTGTAATGTTATTACCGTTCTTGTTATACTCTTTCTTATACCCTACCATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTTTCTAAAAAAAGAAAGATCTCTGTGTAGAATGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATAAACCTTAGGAAGCAGAGGGAAGTTAATCTATGACTAAGAAAGTTAAGTACTCTGATAACTCATTCATTCCTTCTTCTGTTCATTTACATTATTTAATCACAAGTCCATGATGTGCCAGGCACTCAGGAAATAGTGAAAATCGGACACGCGATATTCTGCCCTTGTGTAGCACACACTGTAGTGGGAAAGAAAGTGCACTTTTAACTGGACAACTATCAACACGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTT

CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNG

CTACAGTACATTGTATCTGCTCCCTTACCTACCTGACTCTCCCACTATTCAGTTTGTTCCTTAATGGTAGACCATGCCTGATCGGTGTTTTACACATCCCCTGCTATGTCTGATACTTGTGGATGCTCAGAAAGTGGGGAAGGAAGGAAAGATACGATGGTAAAAGGCTTACACATGTCTTGAGCAGAATGTTCAGTTTGGCTCATTTGGCTGGAGTCATACTGCATGGCTGCCATTCTGCTCTGGCATCCTCAGAGAAGCACACTGCCCATTAAAGGAAAAAGGGTGAATACAAATGTTGAGTCAGAACACTGCAGACATTTAGTAACCTCCTTCAGAGGAAAAAAAAAGGTGGGGGGAATGACAGAAATCCAAAAACTAGTAGAGCTTCCACTTTTTCATTTCAGAAGAAATCAGTTACTCTCCTCTAAGGACCATTACTATTAACAAAACAGAGACCTTAGAAGGAAGCATTATTTACTTATCATATATTTTGTAATGTTATTACCCTTCTTGTTATACTCTTTCTTATACCCTACCATTGTTAGCAGAAATTATTTTAAATTAATAAGATCCTGCATGCTTTTCCTTTTTCTAAAAAAAGAAAGATCTCTGTGTAGAATGTCCTGTTCTGAGCCAGTCCTGAGAGGAAAGGAAGTATAATCAATTTGTTATTAACTGATGAAAGAATTAAGTGAAAGATAAACCTTAGGAAGCAGAGGGAAGTTAATCTATGACTAAGAAAGTTAAGTACTCTGATAACTCATTCATTCCTTCTTTTGTTCATTTACATTATTTAATCACAAGTCTATGATGTGCCAGGCACTCAGGAAATAGTGAAAATTGGACACGCGATATTCTGCCCTTGTGTAGCACACACCGTAGTGGGAAAGAAAGTGCACTTTTAACCGGACAACTATCAACACGAAGAGGGGAGGAAGCAGGGGCTGGAAATGTCCACAGACTT

AC=2,2,1,2

AF=0.222222,0.222222,0.111111,0.222222

AN=9

AT=
,831,832,835,836,838,839,840,842,844,845,847,848,850,851,853,854,856,857,859,860,862,863,864,866,867,868,871,873,875,877,878,880,881,883,884,886,887,889,891,892,894,895,897,898,900,901,903,904,906,907,909,911,912,914,916,917,918,920,921,923,925,926,927,929,930,932,933,934,935,937,938,940,942,943,945,947,948,949,950,952,953,955,956,958,959,961,962,965,967,968,969,971,973,974,975,977,978,980,981,986,987,989,991,992,994,995,996,998,999,1001,1002,1004,1005,1007,1009,1010,1011,1013,1015,1016,1018,1019,1021,1022,1023,1024,1026,1027,1029,1030,1032,1033,1035,1036,1037,1040,1041,1042,1044,1046,1047,1049,1050,1052,1053,1055,1056,1058,1060,1061,1063,1064,1065,1068,1069,1071,1072,1073,1075,1077,1078,1079,1082,1084,1086,1087,1088,1090,1091,1093,1095,1096,1098,1099,1101,1103,1104,1105,1107,1108,1110,1111,1113,1114,1116

,831,832,834,836,838,839,841,842,843,845,847,848,850,851,852,854,855,857,859,860,861,863,865,866,868,871,872,874,875,876,878,880,881,882,884,885,887,888,890,892,893,895,896,898,900,901,903,904,905,907,908,910,911,912,914,916,917,919,920,922,923,925,926,928,929,931,932,934,935,937,938,939,940,942,944,945,947,948,950,952,953,955,956,958,959,961,962,965,966,968,970,971,973,974,976,977,979,980,982,984,985,986,988,989,991,992,993,995,997,998,1000,1001,1003,1004,1006,1007,1008,1010,1012,1013,1015,1016,1017,1019,1020,1022,1023,1024,1025,1027,1029,1030,1032,1033,1035,1036,1038,1040,1041,1043,1044,1046,1047,1049,1050,1052,1053,1055,1057,1058,1059,1061,1063,1064,1066,1068,1069,1071,1072,1074,1075,1077,1078,1080,1082,1084,1085,1087,1089,1090,1092,1093,1094,1096,1098,1099,1102,1104,1106,1107,1109,1110,1112,1113,1114,1116

,831,832,834,836,837,839,840,842,843,845,847,848,849,851,852,854,855,857,858,860,862,863,864,866,867,868,870,871,874,875,876,878,879,881,883,884,886,887,888,891,892,894,895,897,898,899,901,903,904,905,907,909,910,911,912,914,915,917,918,920,922,923,924,926,928,929,931,932,935,936,938,940,941,942,943,945,946,948,949,950,951,953,954,956,958,959,960,962,964,966,968,969,971,972,974,975,977,978,980,982,983,985,986,987,989,990,992,993,995,996,998,999,1001,1002,1004,1005,1007,1008,1010,1011,1013,1014,1016,1017,1019,1021,1022,1024,1025,1027,1028,1030,1031,1033,1034,1036,1037,1039,1041,1042,1044,1045,1047,1048,1050,1051,1053,1055,1056,1058,1059,1061,1062,1064,1065,1067,1069,1070,1072,1073,1075,1076,1078,1079,1082,1083,1085,1087,1088,1090,1091,1093,1095,1096,1097,1099,1100,1103,1104,1105,1107,1108,1110,1111,1113,1115,1116

,831,833,1114,1116

,831,832,834,836,837,839,840,842,843,845,846,848,850,851,852,854,855,857,858,860,862,863,864,866,867,868,869,871,874,875,876,878,879,881,883,884,886,887,888,891,892,894,895,897,898,899,901,902,904,905,907,909,910,911,913,914,915,917,918,920,922,923,924,926,928,929,931,932,934,935,937,938,940,941,942,943,945,946,948,949,950,951,953,954,956,957,959,960,962,963,966,968,969,971,972,974,975,977,978,980,982,983,985,986,987,989,990,992,993,995,996,998,999,1001,1002,1004,1005,1007,1008,1010,1011,1013,1014,1016,1017,1019,1021,1022,1024,1025,1027,1028,1030,1031,1033,1034,1036,1037,1039,1041,1042,1044,1046,1047,1048,1050,1051,1053,1054,1056,1058,1059,1061,1062,1064,1065,1068,1069,1070,1072,1073,1075,1076,1078,1079,1081,1083,1085,1087,1088,1090,1091,1093,1094,1096,1097,1099,1100,1103,1104,1105,1107,1108,1110,1111,1113,1115,1116

NS=9

LV=0

Visualization with Bandage

Bandage is a GFA visualizer, initially developed to represent assembly graphs, but nothing stops us from loading our little pangenome into it. A word of caution though, this is not recommended for large, chromosome level representation if there are too many nodes and vertices to represent. It is thus mainly useful when zooming on a specific locus (GFA can be subdivided in subgraph with gfatools)

Open Bandage, then click File > Load graph and select your .gfa. Next press Draw graph to produce a layout. You will probably have to zoom out to see the whole graph, and also increase the node width so you can see it clearly. If the layout is messy, you can generate a new one by click on Draw graph again. bandage_load

Let’s now zoom on the big bubble, this is our large variant in the VCF. We can now annotate each haplotype (each path) using the information provided in the VCF. We start with the haplotypes of the non-reference genome. We find them, after the reference path, in the AT field of the VCF. Earlier I replaced the “>” with commas, so we can directly copy-and-paste the path into Bandage (do not include the first comma). Finally, color the path of the reference genome. bandage_color

The yellow path is the one of HuRef and includes 898 “Ns”. This is probably because when this assembly was made, a longer sequence (based on the reference) might have been expected and thus the gap was padded with “Ns”. Ideally, these should be removed, creating a shortcut in the graph indicating a large deletion in HuRef relative to hg38.

We can zoom in the graph to reveal smaller variants: bandage_zoom