Calling pangenes from plant genome alignments confirms presence-absence variation

Consistent gene annotation in crops is becoming harder as genomes for new cultivars are frequently published. Gene sets from recently sequenced accessions have different gene identifiers to those on the reference accession, and might be of higher quality due to technical advances. For these reasons there is a need to define pangenes, which represent all known syntenic orthologues for a gene model and can be linked back to the original annotation sources. A pangene set effectively summarizes our current understanding of the coding potential of a crop and can be used to inform gene model annotation in new cultivars. Here we present an approach (get_pangenes) to identify and analyze pangenes that is not biased towards the reference annotation. The method involves computing Whole Genome Alignments (WGA), which are used to estimate gene model overlaps. After a benchmark on Arabidopsis, rice, wheat and barley datasets, we find that minimap2 performs better than the GSAlign WGA algorithm. Our results show that pangenes recapitulate known phylogeny-based orthologies while adding extra core gene models in rice. More importantly, get_pangenes can also produce clusters of genome segments (gDNA) that overlap with gene models annotated in other cultivars. By lifting-over CDS sequences, gDNA clusters can help refine gene models across individuals and confirm or reject observed gene Presence-Absence Variation. A collection of flowering-related genes from the barley pangenome are discussed in detail. Documentation and source code are available at https://github.com/Ensembl/plant-scripts.

syntenic orthologues for a gene model and can be linked back to the original annotation sources. A pangene set effectively summarizes our current understanding of the coding potential of a crop and can be used to inform gene model annotation in new cultivars. Here we present an approach (get_pangenes) to identify and analyze pangenes that is not biased towards the reference annotation. The method involves computing Whole Genome Alignments (WGA), which are used to estimate gene model overlaps. After a benchmark on Arabidopsis, rice, wheat and barley datasets, we find that minimap2 performs better than the GSAlign WGA algorithm. Our results show that pangenes recapitulate known phylogeny-based orthologies while adding extra core gene models in rice. More importantly, get_pangenes can also produce clusters of genome segments (gDNA) that overlap with gene models annotated in other cultivars. By lifting-over CDS sequences, gDNA clusters can help refine gene models across individuals and confirm or reject observed gene Presence-Absence Variation. A collection of flowering-related genes from the barley pangenome are discussed in detail. Documentation and source code are available at https://github.com/Ensembl/plant-scripts.

INTRODUCTION
For a growing number of crops and plants there are now multiple genome assemblies available in public repositories. These data are driving the analysis of the pangenome, the union of all known genomes of a species. For instance, recently published pangenome reports include staple crops wheat and barley (Jayakodi et al., 2020;Walkowiak et al., 2020). While these efforts have greatly advanced our understanding of the variability of genomes within species, they have also prompted a new class of problems, those related to annotating and naming genes across cultivars. Different strategies are possible. For instance, the barley pangenome consortium lifted-over gene models from three genotypes (Morex, Barke, HOR10350) to all other assemblies. This procedure biases the gene space to that of the reference cultivars. In contrast, in other species, fresh gene annotations have been produced for different individuals or sampled populations . In this case, care should be taken to follow the same annotation protocols throughout to avoid inflating the number of population-specific genes (Weisman et al., 2022) or to conserve gene identifiers.
In this context it is useful to define a pangene, a gene model or allele found in some or all individuals of a species in a similar genomic location. A pangene should integrate additional naming schemes, e.g. so that a cluster of gene models can share a common identifier that links back to their original gene identifiers. A pangene set defines our current understanding of the total coding potential of a species and can assist in gene model curation, by providing a pool of possible gene models for assessment.
Pangenes can be produced by a variety of approaches, such as iterative mapping and assembly (Golicz et al., 2015), local alignments of nucleotide sequences , molecular phylogenies of chromosome-sorted proteins (Lovell et al., 2022) or as a secondary product of genome graphs (Sheikhizadeh et al., 2016;Li et al., 2020;Guarracino et al., 2022). Whatever the approach, a common use case for pangenes is to capture Presence Absence Variation (PAV) at the gene level. However, previous work has observed that absent gene models are rarely caused by complete sequence deletions; instead, they might not be expressed in certain conditions or the underlying genomic regions might contain genetic variants such that the criteria for calling a gene model are not satisfied. For instance, sequence variants in introns or splice sites can reduce evidence for a gene model (Lovell et al., 2021). Mapping transcript isoforms from orthologous loci is also a useful way to determine whether a gene model is intact (Kirilenko et al., 2023).
Here we present an approach to identify and analyze pangenes in sets of plant genomes which can explicitly confirm or reject PAVs by lifting-over viable gene models on candidate genomic segments. This approach requires computing pairwise Whole Genome Alignments (WGAs), which are then used to estimate gene model overlaps across individuals. Finally, pairs of overlapping genes are iteratively merged to produce pangene clusters. The algorithm produces pangene clusters that are not biased towards the reference annotation and that can optionally be used to refine individual gene model annotation with information from all cultivars. We benchmark this approach on diverse datasets that cover monocots and dicots, as well as small and large genomes.

Genome sequences and genesets
A total of four datasets were used in this study (Arabidopsis ACK2, rice3, chr1wheat10 and barley20) of increasing size and complexity. They are listed in Table 1, where ACK stands for the Ancestral Crucifer Karyotype (Lysak et al., 2016) and chr1wheat10 for chromosome 1 in ten different hexaploid wheats. Note that ACK2 and rice3 include different species (Arabidopsis lyrata and Oryza nivara); the others include cultivars of the same species. The ranges of Average Nucleotide Identity (ANI) among genomes in each dataset are indicated in Table 3.

Protocol for calling pangene clusters based on WGA evidence
The repository https://github.com/Ensembl/plant-scripts/tree/master/pangenes contains the documentation and source code for calling pangenes. The main script (get_pangenes.pl), illustrated in Figure 1, sequentially runs the scripts _cut_sequences.pl, _collinear_genes.pl and _cluster_analysis.pl. These tasks can be performed serially on a computer (default) but can also run in batches over a high performance computer cluster. Four types of sequences (cDNA, CDS [amino and nucleotide] and gDNA) are cut so that they can be subsequently added to pangene clusters. cDNA and CDS sequences are cut with GffRead with arguments -w, -y and -x (Pertea & Pertea, 2020). Genomic segments (gDNA) are cut with bedtools getfasta (Quinlan & Hall, 2010). WGAs in PAF format are computed with minimap2 with parameters --cs -x asm20 --secondary=no -r1k,5k (Li, 2018) or GSAlign with parameters -sen -no_vcf -fmt 1 (Lin & Hsu, 2020). Unlike minimap2, GSAlign provides ANI estimates.
Feature overlap is computed with bedtools intersect with parameters -f 0.5 -F 0.5 -e (Quinlan & Hall, 2010) after converting WGAs to BED files, which requires parsing the CIGAR strings contained in PAF files. When features are actual gene models, strandedness is also required. With the exception of dataset ACK2, which includes two genomes with low ANI, the analyses presented in tables 2 and 3 were obtained with get_pangenes.pl and optional

Pangenome terminology
We will often use pangenome-related terms to describe the pangene clusters output by our protocol. We define occupancy as the number of genomes represented in a cluster. Core clusters contain sequences from all analyzed genomes. Soft-core clusters contain sequences from 95% of the input genomes. Finally, in this paper shell clusters are those with less occupancy than soft-core clusters after excluding singletons (occupancy=1).

Dotplots
The _dotplot.pl script can be used to make a genome-wide dotplot of collinear gene models resulting from a pairwise WGA stored in TSV format. This is done in two steps: i) the TSV file is converted to a PAF file, ii) the dotplot is produced with R package pafr, available at https://github.com/dwinter/pafr.

BUSCO analysis
In order to evaluate the completeness of the core and soft-core collections of pangenes produced by get_pangenes.pl, the corresponding protein FASTA files containing all known isoforms of genes were analyzed with BUSCO v5.4.3 (Manni et al., 2021). The poales_odb10 lineage was selected for all datasets except ACK2, where brassicales_odb10 was used instead.

Venn diagrams
Comparisons of CDS pangene sets produced with both WGA algorithms were carried out with script compare_clusters.pl from the GET_HOMOLOGUES-EST software (

Ensembl orthologues and InterPro annotations
High-confidence orthogroups produced with Ensembl Compara (Herrero et al., 2016) were retrieved with script ens_syntelogs.pl (Contreras-Moreira et al., 2022) from Ensembl Plants (Yates et al., 2022). These are derived from phylogenetic trees of aligned protein sequences from most plant species in Ensembl and have extra supporting collinearity evidence. Only pairs of orthologues with WGA score ≥ 50% and Gene Order Conservation (GOC) ≥ 75% were taken. In other words, only genes with ≥ 50% exonic coverage in Whole Genome Alignments and conserved neighbor gene order were retrieved and considered collinear. Note that GOC allows for inversions and gene insertions.
For ACK2 and rice3 genomes, annotated InterPro protein domains were retrieved from Biomart (Kinsella et al., 2011). These were taken to curate pangene clusters where all their sequences shared at least one InterPro domain.

Lifting-over gene models on genomic segments
The script check_evidence.pl uses precomputed collinearity evidence, stored in a TSV file, and lift-over alignments to project cDNA/CDS sequences on a reference genomic sequence.
Briefly, collinear genome sequences are extracted with bedtools getfasta and then pre-clustered cDNA/CDS sequences are mapped to them with GMAP, a software tool that efficiently connects exons while accurately defining splice sites and jumping intervening introns (Wu & Watanabe, 2005). GMAP is run with parameters -t 1 -2 -z sense_force -n 1 -F.
Increasing the verbosity of the script produces the actual GMAP sequence alignments, which are useful to inspect failed lift-over attempts (ie partially aligned proteins, premature stop codons, length not multiple of 3).

Flowering genes
A collection of 26 genes relevant in barley breeding due to their roles in flowering control and spike architecture was compiled. Pangene clusters were produced for the union of the barley20 dataset and two more high-quality gene annotations: MorexV3 (Mascher et al., 2021) and BaRTv2 (Coulter et al., 2022). The former is the IPK annotation from http://doi.org/10.5447/ipk/2021/3 (35,826 gene models, assembly GCA_904849725.1) and the latter the JHI annotation from https://ics.hutton.ac.uk/barleyrtd/bart_v2_18.html (39,281 gene models, assembly ERS4201450). The script get_pangenes.pl was run with arguments -s '^chr\d+H' -H -t 0. The resulting clusters were aligned with Clustalx 2.1 for manual curation (Larkin et al., 2007).

A protocol for calling pangenes based on Whole Genome Alignments
The first result from this work is the design of a protocol for calling pangenes in a series of related genomes. A pangene is defined as a gene model found within a homologous region in a set of genomes. In order to find pangenes, WGAs are computed, which in turn produce pairs of collinear genomic segments. Collinear evidence is stored in TSV files and can be produced by two WGA algorithms: minimap2 and GSAlign. The protocol is represented as a flowchart in Figure 1A. As illustrated schematically in Figure 1B, WGAs are used to project the coordinates of gene models across assemblies. By default, a pair of genes are said to be collinear when at least half the length of one matches the other in genomic space ( Figure   1C). Finally, clusters of genes (pangenes) emerge by merging pairs of collinear genes from different input taxa ( Figure 1D).
An example collinear region of Oryza sativa Japonica group (bottom) and Oryza nivara (dataset rice3) as displayed in the Ensembl Plants genome browser is shown in Figure 2, together with a summary of the supporting WGA evidence. Besides five 1-to-1 collinear gene pairs, it can be seen that the gene ONIVA01G00100 was mapped to two consecutive O.
sativa models (Os01g0100100, Os01g0100200) and that two O.sativa genes map to unannotated genome segments in O. nivara. In a nutshell, this figure shows that WGA evidence allows matching of long genes to split genes if they are collinear, as well as matching annotated gene models to homologous regions in other genomes, even if the genes in question failed to be annotated.

Benchmark on several plant datasets
A systematic benchmark of get_pangenes.pl was performed with the four datasets in Table 1.
In order to describe the performance several variables were collected. As computing WGAs is costly, particularly for large genomes, the maximum amount of RAM consumed by pairwise genome alignments was recorded. The size of the collinear fragments produced by WGA is captured in two variables, N50 and the percentage of fragments that contain blocks of 3 or more genes. Several further variables were also calculated to describe the clusters of collinear genes produced by the protocol: the total number of clusters, the number of pangenes present in all genotypes (core clusters), the pangenes present in 95% of genotypes (soft-core clusters) and the percentage of complete BUSCOs, which are universal single-copy orthologs. The results are summarized on  Table   S1 shows that the number of hits per gene is similar for both algorithms, with minimap2 failing to map more genes than GSAlign. Despite these differences, the numbers of pangenes clustered using WGA evidence from both algorithms are comparable: 86.3% of all clusters are identical (95.5% for core clusters, see Supplemental Figure S3).
As an extra test, clusters of cDNA and CDS sequences resulting from the rice3 analysis were aligned locally to compute their sequence identity both at the nucleotide and protein level.
Note that these clusters contain all isoforms annotated, so often there might be several sequences for the same gene. The results, plotted in Supplemental Figure S4, yielded median sequence identities of 99.6% for nucleotides (cDNA and CDS). For protein sequences the median values are 98.3% (GSAlign) and 98.1 % (minimap2). This means that annotated sequences clustered together are nearly identical, although, as seen in Supplemental Figure  S5, that does not guarantee that the same protein sequence is always encoded by clustered genes, as a result of divergent gene model annotation. Note that we also found some cases, 273/30,705 for minimap2 and 320/30,129 for GSAlign, where cDNA sequences of the same cluster could not be aligned. These occur when overlapping genes do not share exons.
Nevertheless, these pangenes are not filtered out by default as such gene models could encode loss of function alleles that might be valuable to capture. Table 3 also contain a summary of BUSCO analysis. BUSCOs are sets of universal single-copy orthologs tailored to different taxa and are typically used to estimate the completeness of genome assemblies. In this context, BUSCOs provide a biologically meaningful metric based on expected gene content (Manni et al., 2021). The ACK2 dataset contains the two most divergent genomes, with 84.7% nucleotide identity (see Table 3). The core pangenes found across A. thaliana and A. lyrata contain a higher percentage of complete BUSCOs with minimap2 than with GSAlign (94.1% vs 74.9%). In the rice3 dataset the nucleotide identity rises to 96% and the core pangene set contains ca. 85% complete BUSCOs with both WGA algorithms. As BUSCO analysis does not make sense for a single chromosome, the wheat dataset was left out. As for the barley20 dataset, the nucleotide identity is generally higher than that of rice3 and the benchmark produced core sets with close 86.6% (minimap2) and 61.6 % (GSAlign) complete BUSCOs. This number increased to 95.3% and 82.5% when all soft-core pangenes are considered, revealing a superior performance of minimap2 in this dataset. To put all these BUSCO scores in perspective please see the scores of individual input genome annotations in Supplemental Table S2.

Tables 2 and
Finally, Table 3 also shows the Average Nucleotide Identity values computed by the GSAlign algorithm for pairwise genome alignments. These values are useful to measure the divergence of the genomes being compared.

Comparison to ancestral karyotype and Ensembl orthogroups
Additional analyses were carried out to gain insights into the performance of our protocol by comparing our results to data produced independently.
First, we estimated its recall on the ACK2 dataset, which represents the most difficult scenario tested due to having the lowest nucleotide identity. This experiment counted the number of collinear genes identified by minimap2 and GSAlign within 23 blocks of the  Figure S6, demonstrate that our protocol is able to cluster together overlapping gene models which might be split or incomplete. This would illustrate why pangene clusters are more likely to group together multiple sequences from the same species.
Note that incomplete gene models, or clusters with sequences of contrasting length, would in turn explain why some clusters contain sequences that do not encode a common protein domain (see Supplemental Figure S7). The table also shows that over six thousand shell pangene clusters are produced that pair an annotated gene model with an overlapping genomic segment (gDNA) from a different species. Supplemental Table S4 shows how core and shell clusters are represented as a BED-like pangene matrix produced by get_pangenes.pl .

Confirming gene Presence-Absence Variation
A use case of pangenome analysis of plants and other organisms is to find genes which might be present and functional only in some individuals or populations. These would be annotated by our pipeline as shell genes. As discussed in the previous section, the get_pangenes.pl protocol can produce gDNA sequence clusters that contain the genomic intervals of annotated gene models plus overlapping unannotated segments from other species. This feature takes advantage of pre-computed WGAs, which match genomic segments whether they harbor gene models or not. Such clusters can effectively be used to lift-over or project gene models from the species where they are annotated to other individuals. In particular, CDS or cDNA isoform sequences from annotated genes can be mapped to the matching genomic segments and the resulting alignment will directly confirm whether exon/intron boundaries and the embedded coding sequence are conserved. When conserved, the segment likely contains an overlooked gene model; when not, it probably contains a gene fragment or a pseudogene. A flowchart in Supplemental Figure S7 summarizes how the script check_evidence.pl retrieves the WGA evidence supporting a pangene cluster, defines consensus and outlier isoform sequences, and then projects consensus CDS or cDNA sequences on candidate genomic segments with GMAP (See Materials and Methods). Note that, in addition to missing gene models, lifting-over can also merge split gene models or, conversely, divide gene models that might have been merged during gene annotation. Either way, when the projection succeeds and a complete open reading frame is aligned, a patch GFF file is created that conveys the genomic coordinates of the projected gene.
Using shell CDS clusters of occupancy >9 resulting from the minimap2 analysis of dataset barley20 we carried out a survey to see how often the different scenarios (missing, split, merged gene) occur in a real dataset. Out of 41,655 clusters, our approach detected 74 cases where a long model could be potentially corrected, 30 cases of incorrectly split genes and 9,839 potentially missing genes. We selected one candidate missing gene to illustrate the most common situation, pangene Horvu_MOREX_1H01G011400. In this example the original pangene grouped together gene models from 13 barley genotypes, supported by the WGA evidence shown in Supplemental Table 5. When CDS nucleotide sequences from those 13 cultivars were aligned to candidate genomic segments of the remaining genotypes, a perfect match was found in the genome sequence of OUN333. The encoded lifted-over protein sequence is shown at the bottom of the multiple alignment in Figure 3, being identical to others in the cluster. The resulting patch GFF file that would add this gene model to OUN333 is shown in Figure 3B. This test case suggests the combination of WGA evidence and gene model projection could be a powerful way to refine gene annotation across individuals of the same species. Moreover, as shown in the example, lift-over alignments can be used to confirm or reject observed PAV. In this case we can hypothesize this gene model is actually present in cultivar OUN333, and probably missing in the remaining genotypes.
Additional data beyond sequence evidence would be required to fully characterize such genes, such as expression data and, ultimately, proteomics evidence.

Curation of barley flowering genes
The protocol for pangene clustering was further tested with an increased set of barley assemblies and gene annotations. In this final experiment the barley20 dataset plus the high-quality MorexV3 and BaRTv2 gene annotations were pooled and a subset of 26 genes known to regulate flowering and spike architecture extracted from the resulting clusters. The curated results are summarized on Table 5, where it can be seen that in all cases these genes were found in 20 or more barley annotations. The results in this gene survey were curated and we found some cases (HvFT3/Ppd-H2, HvLUX, HvGRP7b, HvLHY) where a gene was missing from a cluster in some cultivars but there was a candidate genomic region harboring part of that sequence. That would be the case for cultivars Igri and HOR3081 and locus HvFT3/Ppd-H2. As explained in the previous section, we lifted-over the collinear CDS nucleotide sequences from all the other cultivars and we obtained identical matches but only for exon 4 (see Supplemental Figure S9), replicating previous observations that exons 1 to 3 of this gene have been deleted in some genotypes (Kikuchi et al., 2009).
Another interesting case was locus HvPRR37/Ppd-H1, which was found to be absent also in cultivar Igri, despite the gene being cloned in this cultivar with accession AY970701.1 (Turner et al., 2005). As it turns out, the Igri genome assembly placed this gene model (Horvu_IGRI_Un01G026500) outside of the pseudo-chromosomes; instead it is located in chrUn, and therefore it cannot be collinear to the homologous genes in other cultivars when only homologous chromosomes are compared (option -s).
We also found several instances where genes were found in inverted genome regions, known to be valuable to reconstruct the history of crops (Zhou et al., 2023). In the case of HvCEN this observation matches previous reports (Jayakodi et al., 2020), but we found other cases such as HvELF3 or HvLUX. These examples show the value of using Whole Genome Alignments for the definition of pangenes, as the strand of genes conveys chromosomal and evolutionary information.
Finally, in this set of genes used by barley breeders we also observed instances of gene models found to be split in some annotations (see Supplemental Figure S10) or genes with extra tandem copies. An example of the former is cluster HORVU.MOREX.r3.3HG0311160, which corresponds to barley locus HvOS2, which encodes MADS-box protein ODDSOC2 (Greenup et al., 2010), is depicted in Figure 4.
Altogether, these results highlight the challenges of consistently annotating gene models in individuals of the same species and confirm that analyzing soft-core, instead of core genes, is probably a good idea to tolerate the situations described in this section.

ACKNOWLEDGEMENTS
We are grateful to all members of the PanOryza consortium for feedback during the development of this work. We thank the Gramene team for continuous support and cooperation, as well as members of the Ensembl team for developing and maintaining the     Table 5. Survey of flowering-related pangenes in a collection of 22 barley genotypes, including the barley20 dataset plus the MorexV3 and BaRTv2 high-quality annotations. Column 'Occup' indicates how many annotations contain a gene model, with the number in brackets being the number of matched genomic segments in cases where a gene was absent in some cultivars. Column 'Split' tells how many gene models are split in each pangene cluster with respect to the mode gene model. Column 'Tand' says how many extra tandem gene models are in each cluster. Column 'Inv' states how many genes were found inverted in Whole Genome Alignments. matrices, as well as to estimate pan-, soft-core-, and core-genomes. GSAlign also produces Average Nucleotide Identity (ANI) matrices. Several tasks can be fine-tuned by customizing an array of parameters, of which alignment coverage is perhaps the most important. B) WGA of genomes A and B produces BED-like files that are intersected with gene models from B. Intersected coordinates are then used to transform B gene models to the genomic space of A. Finally, overlapping A gene models on the same strand are defined as collinear genes.
C) Feature overlap is computed from WGAs and gene coordinates from source GFF files.
When checking the overlap of A and B gene models, strandedness is required. Overlaps can also be estimated between gene models annotated in one assembly and matched genomic segments from others. D) Making greedy clusters by merging pairs of collinear genes. This algorithm has a key parameter, the maximum distance (in genes) among sequences of the same species that go in a cluster (default=5). Its effect is illustrated on the right side, where gene g34 is left unclustered for having too many intervening genes. sativa. This evidence can be used to identify collinear genes that take equivalent positions in different genomes, as illustrated with gene models ONIVA01G00130 and Os01g0100500, which overlap over 2.4kb (white rectangle). The example shows that overlapping gene models might share only some exons. The table below shows the collinear gene models identified based on minimap2 ang GSAlign alignments, together with the corresponding overlapped based pairs. Horvu_MOREX_1H01G011400. This cluster contains isoforms from 13 gene models, but none from genotype OUN333. The last sequence is encoded by a CDS sequence lifted-over from cultivar HOR3081 on the genome of OUN3, spanning 3 exons (exon boundaries are marked with asterisks. B) Patch GFF file with the coordinates of the exons lifted-over from gene model Horvu_3081_1H01G015200. The underlying CDS nucleotide sequence was aligned with 411 matches, no indels and no mismatches with check_evidence.pl -f. in other barley assemblies and annotation sets. In this example the BarkeBaRT2v18 gene is split in two partial models. Note that white gene models might not be collinear as they could be encoded in a different genome fragment. Figure generated with script check_evidence.pl and pyGenomeViz (https://github.com/moshi4/pyGenomeViz).