The Practical Haplotype Graph (PHG) is a pan-genome database designed to impute genotypes from low-coverage random sequence data. It uses a graph of haplotypes to represent the genetic variation within a breeding program and can merge genotypes from whole-genome sequence (WGS) and marker technologies. The PHG takes a pan-genome approach to marker identification and relies on there being a limited number of recombination events in each generation (Mace et al., 2009; Bouchet et al., 2017). Because it stores haplotypes and genomic variants for a given set of individuals, the PHG can serve both as a database representing program diversity and as an imputation tool.
In building a PHG database, moderate- to high-coverage WGS reads are aligned to the reference genome (Figure 1A) and collapsed into common, consensus haplotypes based on sequence divergence (Figure 1B, C). These consensus haplotypes are important for the PHG because they reduce the complexity of the graph, make it possible to associate traits with shared haplotypes, and fill in (impute) missing sequence information, while maintaining unique haplotypes. Once created, consensus haplotypes can be used to predict genotypes for new individuals. Skim sequence data are aligned to consensus haplotypes to find the best path through the graph, and SNP variants from the predicted haplotype path can be written to a VCF file. The result is a set of genome-wide SNP variant calls for each taxon, imputed from skim sequence (Figure 1D). More information on building a PHG can be found at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/Home.
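The collapsing step described above can be illustrated with a small sketch. This is not the PHG's own clustering code: the greedy grouping rule, the function names (`divergence`, `collapse`, `consensus`), the `max_div` threshold, and the toy sequences are all assumptions made for illustration. It groups aligned sequences from one reference range by pairwise divergence and then builds a majority-rule consensus, which also shows how a missing call (`N`) in one taxon can be filled in from the other members of its group.

```python
from collections import Counter
from typing import Dict, List


def divergence(a: str, b: str) -> float:
    """Fraction of jointly called sites at which two aligned sequences differ."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "N" or y == "N":  # skip sites missing in either sequence
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def collapse(haplotypes: Dict[str, str], max_div: float) -> List[List[str]]:
    """Greedily group taxa whose sequence is within max_div of a group's first member."""
    groups: List[List[str]] = []
    for taxon, seq in haplotypes.items():
        for group in groups:
            if divergence(seq, haplotypes[group[0]]) <= max_div:
                group.append(taxon)
                break
        else:
            groups.append([taxon])  # no close group found: start a new one
    return groups


def consensus(seqs: List[str]) -> str:
    """Majority base at each site, ignoring missing calls."""
    out = []
    for column in zip(*seqs):
        calls = [c for c in column if c != "N"]
        out.append(Counter(calls).most_common(1)[0][0] if calls else "N")
    return "".join(out)


# Hypothetical aligned sequences for a single reference range.
haps = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTAN",  # last site missing
    "taxonC": "ACGTTCGTAC",
    "taxonD": "TTGTACCTAG",
}
groups = collapse(haps, max_div=0.05)
first_consensus = consensus([haps[t] for t in groups[0]])
```

Here taxonA and taxonB fall into one group, and the consensus for that group carries a called base at the site that was missing in taxonB.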
We built a PHG for a sorghum breeding program using the program's 24 most important founder lines. Using the sorghum PHG, we demonstrated that PHG imputation accuracy exceeds that of Beagle for low-coverage skim sequence (Figure 2) and that the PHG produces genotype calls accurate enough for genomic prediction. Prediction accuracies from PHG-imputed SNPs and haplotypes match those of other genotyping technologies (Figure 3).
The PHG paper is in press at The Plant Genome. An earlier version is available on bioRxiv at https://www.biorxiv.org/content/10.1101/775221v2.
Figure 1: PHG database and haplotype creation. (A) WGS data are aligned to a reference genome and loaded into the PHG database. (B) A set of designated reference ranges is chosen and input data are condensed to produce consensus haplotypes at each reference range. Colors are used to indicate sequence similarity across taxa and only a single reference range is shown. In this range, 10 genotypes are condensed into 3 major consensus haplotypes, with minor differences within a consensus haplotype maintained as variant sites. (C) Unique consensus haplotypes are built for each reference range across the genome. Reference ranges are indicated as red regions of the black reference genome bar. (D) Low-coverage sequence data are aligned to the consensus haplotypes and a hidden Markov model links reference ranges across the genome to predict genome-wide haplotypes (black edges reflect HMM edge weights between haplotypes, with the most likely haplotype based on the low coverage sequence data highlighted as a bold, dashed line).
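The path-finding step in panel D can be sketched as a Viterbi search. The sketch below is a simplified illustration under stated assumptions, not the PHG's implementation: it assumes the same haplotype labels exist at every reference range, a single `switch_prob` for moving between haplotypes at adjacent ranges, and a toy emission model in which each read hits the true haplotype with probability `1 - read_error`. The function name, parameters, and read counts are hypothetical.

```python
import math
from typing import Dict, List


def viterbi_path(
    read_counts: List[Dict[str, int]],
    switch_prob: float = 0.01,
    read_error: float = 0.05,
) -> List[str]:
    """Most likely haplotype at each reference range, given per-range read counts."""
    states = sorted(read_counts[0])
    n = len(states)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (n - 1)) if n > 1 else float("-inf")
    log_hit = math.log(1.0 - read_error)
    log_miss = math.log(read_error / (n - 1)) if n > 1 else float("-inf")

    def log_emission(counts: Dict[str, int], state: str) -> float:
        # Reads on `state` count as hits, reads on any other haplotype as misses.
        hits = counts[state]
        misses = sum(counts.values()) - hits
        return hits * log_hit + (misses * log_miss if misses else 0.0)

    def log_transition(prev: str, cur: str) -> float:
        return log_stay if prev == cur else log_switch

    # Uniform prior over haplotypes at the first reference range.
    score = {s: log_emission(read_counts[0], s) - math.log(n) for s in states}
    backpointers: List[Dict[str, str]] = []
    for counts in read_counts[1:]:
        new_score: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log_transition(p, s))
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev] + log_transition(best_prev, s) + log_emission(counts, s)
            )
        score = new_score
        backpointers.append(pointer)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Hypothetical skim-sequence read counts at five reference ranges.
counts_by_range = [
    {"h1": 2, "h2": 0, "h3": 0},
    {"h1": 0, "h2": 0, "h3": 0},  # no reads: resolved by neighbouring ranges
    {"h1": 1, "h2": 0, "h3": 0},
    {"h1": 0, "h2": 3, "h3": 0},
    {"h1": 0, "h2": 2, "h3": 0},
]
path = viterbi_path(counts_by_range)
```

The second range has no reads at all, yet the path still assigns it a haplotype because the transition weights favour staying on the haplotype supported at the neighbouring ranges. This is the property that makes imputation from very low coverage possible.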
Figure 2: PHG haplotype (A) and SNP (B) error rates compared to GBS data with the 24-taxa founder PHG database (purple) and the 398-taxa diversity PHG database (green) for a random path through the PHG and a range of sequence coverages. Beagle SNP calling accuracy at each coverage level is included in B (blue). Solid red horizontal lines in both plots represent the error rate for a naïve imputation where the algorithm always imputes the major allele for the Chibas taxa. Black horizontal lines in B represent the best-case imputation results: minimum achievable error rate of the PHG (dotted), GBS (dot-dash), and Beagle (dashed).
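The naïve baseline in Figure 2 can be made concrete with a short sketch. It assumes the major allele at each site is determined from the same set of taxa that is scored, which the caption does not specify; the function name and toy genotypes are hypothetical. The error rate is the fraction of non-missing calls that differ from the site's major allele.

```python
from collections import Counter
from typing import Dict


def major_allele_error_rate(genotypes: Dict[str, str]) -> float:
    """Error rate when every call is imputed as the site's most common allele."""
    errors = total = 0
    for column in zip(*genotypes.values()):
        calls = [c for c in column if c != "N"]  # ignore missing calls
        if not calls:
            continue
        major_count = Counter(calls).most_common(1)[0][1]
        errors += len(calls) - major_count  # every non-major call is an error
        total += len(calls)
    return errors / total if total else 0.0


# Hypothetical haploid calls for four taxa at five sites.
toy = {
    "t1": "AACGT",
    "t2": "AACGT",
    "t3": "ATCGA",
    "t4": "GACNT",
}
rate = major_allele_error_rate(toy)  # 3 errors out of 19 called sites
```

Any imputation method worth using should fall below this rate, which is why the figure draws it as a reference line.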
Figure 3: Five-fold cross-validation prediction accuracies for the Chibas training population (n=207, 10 iterations). (A) Prediction accuracies are the same when using GBS, PHG SNPs with 0.1x and 0.01x sequence coverage, or PHG haplotypes. (B) Prediction accuracies do not change when using rhAmpSeq markers alone or rhAmpSeq markers with additional markers from the PHG, regardless of how many rhAmpSeq markers are included.