KegAlign: Optimizing pairwise alignments with diagonal partitioning

Our ability to generate sequencing data and assemble it into high-quality complete genomes has advanced rapidly in recent years. These data promise to advance our understanding of organismal biology and to answer longstanding evolutionary questions. Multiple genome alignment is a key tool in this quest. It is also the area that is lagging: today we can generate genomes faster than we can construct and update multiple alignments containing them. The bottleneck is the considerable computational time required to generate accurate pairwise alignments between divergent genomes, an unavoidable precursor to multiple alignments. This step is typically performed with lastZ, a very sensitive yet equally slow tool. Here we describe KegAlign, an optimized GPU-enabled pairwise aligner. It combines a new parallelization strategy, diagonal partitioning, with the latest features of modern GPUs. With KegAlign a typical human/mouse alignment can be computed in under 6 hours on a machine containing a single NVIDIA A100 GPU and 80 CPU cores, without the need for any pre-partitioning of input sequences: a ~150× improvement over lastZ. While other pairwise aligners can complete this task in a fraction of that time, none achieves the sensitivity of KegAlign's main alignment engine, lastZ, and thus they may not be suitable for comparing divergent genomes. In addition to providing the source code and a Conda package for KegAlign, we also provide a Galaxy workflow that can be readily used by anyone.


Introduction
International efforts such as the Earth BioGenome Project aim to produce reference genomes for all ~1.8 million known eukaryotic species over the next decade [1][2][3][4]. Such initiatives, combined with continuing improvements in sequencing techniques and assembly methods, ensure steady growth in the number of fully sequenced genomes. However, a sequenced genome is just a file with little utility on its own. Comparison to other genomes is critical for understanding its functional landscape and for identifying constrained (functional) and rapidly changing (evolving) regions.
Performing such comparisons in practice means aligning sequences and involves three steps: (1) generation of pairwise local alignments; (2) filtering and ordering of local alignment blocks; and (3) generation of multiple alignment blocks from filtered and ordered local alignments. The first step necessarily involves local alignments, because frequent and ubiquitous structural changes (duplications, inversions, deletions, and insertions at multiple scales, as well as chromosomal fissions and fusions) disrupt the linear order of genomes. Once local alignments are computed, they are post-processed to remove spurious hits and grouped into chains and nets [5] to define larger syntenic regions. A chain is an ordered sequence of local alignments; a net is a hierarchical collection of chains covering the nucleotides of a given genomic region exactly once [6]. Pairwise alignments representing homology anchors between multiple species are then further combined into multiple sequence alignments (MSAs) using different methods (see [7,8]).

A number of purpose-built tools were developed in the early-to-mid 2000s for this task: BLAT [6], CHAOS [7], PatternHunter [8], MUMmer [9], and blastZ [9]. A significantly improved derivative of blastZ, lastZ [10], is now used as the primary local-alignment engine for many production-level multiple alignment pipelines, including multiZ [11] and ProgressiveCactus [12], due to its superb sensitivity across a wide range of sequence divergence levels (see Results). However, lastZ was not built for speed: recent benchmarks put the runtime of a mammalian whole-genome alignment at around 2,700 CPU hours (Table 1). Reproducing the standard UCSC multiZ alignment of 100 vertebrates using the human genome as the reference would therefore take around ~30 CPU years. This makes lastZ the main bottleneck preventing MSA generation at scale and complicates our ability to keep up with the rapid pace of large genome sequencing initiatives.
Because of this limitation, currently only a few research groups in the world are able to routinely generate whole-genome alignments. There are two reasons for this. First, because alignment with lastZ is slow, it is parallelized by dividing input sequences into smaller fragments and running the resulting jobs on separate nodes of a large computing cluster. This approach creates logistical challenges stemming from the need to manage hundreds of thousands of distinct alignment runs, a job requiring considerable systems engineering and administration expertise. Second, appropriate computational infrastructure is needed to compute the alignments. Access to modern clusters remains a hurdle (both financially and logistically) for many research groups even in developed countries, as it requires expertise in resource procurement, configuration, and maintenance, along with fiscal resources.
Our goal was to make pairwise genome alignment universally accessible by optimizing it so that it can be performed on a single compute node within several hours. To achieve this goal we initially chose a GPU-accelerated replacement for lastZ called SegAlign [13]. It accelerates the 'Seed' and 'Filter' stages of the alignment process (for a detailed review of the seed-filter-extend paradigm see [10]). However, after initial trials we discovered that SegAlign does not utilize hardware resources effectively. Specifically, all hardware resources, both CPU and GPU, often wait for a few long-running CPU threads to complete their work. In extreme cases, a single CPU thread can work for over 93% of the total runtime while all other hardware resources, both CPUs and GPUs, stay idle.
This problem cannot be fixed by adding more cores or GPUs: doing so exacerbates the problem, resulting in an even larger fraction of idling resources. We also observed that this underutilization is input dependent. Highly similar genomes are more affected, because more work is done during gapped alignment in the 'extend' stage, resulting in a CPU bottleneck rather than a GPU one.
In this manuscript we refactor SegAlign by introducing diagonal partitioning and leveraging the latest NVIDIA GPU features: Multi-Instance GPU (MIG) [25] and Multi-Process Service (MPS) [26]. The new tool, KegAlign, is distributed via a Conda package and through our Galaxy system, allowing production-scale alignment generation. This service is freely and readily available to anyone in the world with an Internet connection.

Results and Discussion
lastZ is still the most sensitive tool for aligning divergent genomes

lastZ has been around for ~20 years. In this timeframe new tools such as minimap2 [14] or fastga [15] have been developed that outperform lastZ dramatically in terms of speed (Table 2). So why use it? The key advantage of lastZ is its ability to find alignments between highly divergent sequences. To demonstrate this we created a set of "homologous" sequence pairs of varying lengths (1, 2, 5, and 10 thousand nucleotides) with divergence varying from 0% to 40% in 1% increments (164 sequence pairs in total; see Methods). Random 1,000-nucleotide flanks were added to the 5' and 3' ends of each sequence. Fig. 1A shows the relationship between alignment coverage (the fraction of the sequence that is included in the alignment) and divergence/length. Only lastZ consistently generates end-to-end alignments across all divergence and length categories. To verify that these lastZ advantages hold for real data, we generated whole-genome alignments between the hg38 and mm39 builds of the human and mouse genomes, respectively. We then computed the fraction of coding exons covered by alignments. In both cases overlapping alignments and exons were merged to eliminate redundancy (see Methods). lastZ alignments covered the absolute majority of protein-coding exons (Fig. 1B), suggesting that it is well suited for comparing genomes from evolutionarily divergent species. Since the genomes of thousands of species sequenced within the framework of the Earth BioGenome Project and other initiatives are expected to span the full range of evolutionary distances, lastZ remains a critical tool for the generation of pairwise alignments.

GPU-accelerated aligners exist but require optimization
Goenka et al. implemented an accelerated version of lastZ called SegAlign [13]. It uses a graphics processing unit (GPU) to speed up the main lastZ bottleneck (Fig. 2): the Seed-and-Filter stages of the alignment process. The two stages are parallelized on the GPU and send their output to the Gapped Extension stage, which is performed by an unchanged lastZ executable. This improves the speed of pairwise alignment by a factor of 5-10× [13].
Figure 2. A high-level representation of SegAlign's pipeline. Blue and green boxes represent steps performed on CPU and GPU, respectively. SegAlign repeatedly takes a chunk of the input genomes, performs preprocessing on the CPU to make it amenable to running on the GPU, and performs the Seed and Extend steps to obtain a segment file, which contains a list of coordinates. Whenever a segment file is output, lastZ is called to perform the Extend stage (i.e., gapped alignment using the Y-drop algorithm on the coordinates [10]).
As shown in Fig. 2 SegAlign separates the alignment process into GPU-and CPU-bound components.
The amount of computation dedicated to each component depends on the sequence divergence between the two genomes being aligned. For inputs with high divergence (e.g., human vs. mouse) the GPU becomes the bottleneck, since there are fewer matching alignments for gapped extension. Conversely, for inputs with low divergence (e.g., human vs. chimpanzee), the CPU becomes the bottleneck because there are many more alignments for gapped extension.

The CPU underutilization problem
SegAlign splits input sequences into smaller, equally sized segments and processes them in parallel as separate tasks. However, using equally sized segments does not necessarily lead to equal amounts of work for each segment pair. The difference in work manifests in the Gapped-Extension part of the alignment and is caused by the final number of alignments in a given region of both genomes. Imbalanced partitions can lead to significant underutilization of hardware resources due to what we call tail latency. For example, in the case of human vs. chimpanzee alignments (Fig. 3), the CPU is idle over 50% of the total run time. For pairs of genomes with low divergence, such as human/chimp, some regions can have large numbers of seeds, which can lead to a disproportionately large amount of time spent aligning these regions (10,000 times longer than the median runtime in some cases; Fig. 4; also see Supp. Fig. 1). Fig. 4 illustrates this disparity of runtimes across regions for a human-chimp alignment. As a result, the entire alignment does not complete until the most time-consuming pair of chunks is completed, a typical example of the tail latency problem. Adding additional resources does not help, since each pair of regions is assigned to a single CPU, and additional CPUs stay idle waiting for these pairs of chunks to finish (Fig. 3).

The GPU underutilization problem
We observed two distinct cases of GPU underutilization. The first case can be observed in a human-chimp alignment in Fig. 3A, where GPU utilization sharply decreases after ~2 hours. This case is due to GPU data starvation (not enough incoming work to fully utilize the GPU) and is caused by two factors. First, CPU starvation: the large amount of CPU work in the gapped-extend stage monopolizes all of the CPU compute, delaying the preprocessing required to begin work on the GPU (Pre-process stage in Fig. 2). Second, communication deadlock: the inter-process communication buffer (a UNIX pipe) runs out of space because work arrives much faster than the CPUs can execute it. This blocks lastZ output commands (written to the stdout buffer) in SegAlign, preventing additional GPU work from beginning, as the SegAlign process waits for free space in the communication buffer before continuing. These issues not only increase runtime; they also cause the GPU to idle for long periods of time, up to several hours, preventing other users and applications (potentially other alignments) from utilizing the GPU. In the second case of GPU underutilization, we observe that GPU utilization ranges between 90% and 100% for all types of inputs even without GPU starvation, as seen in Fig. 3B in the first ~2-hour period. This is caused by small inefficiencies in the GPU kernels (code) used in SegAlign. While this could be solved by rewriting the GPU kernels more efficiently, we instead chose to take advantage of the new NVIDIA GPU features MIG and MPS to increase GPU utilization without changing SegAlign's code [25,26].

Solving the CPU underutilization challenge with diagonal partitioning
SegAlign, which runs on the GPU, performs the seed and extend steps of the alignment process (Fig. 2) and outputs segment files that contain High-scoring Ungapped Alignment Pairs (HSPs). In SegAlign each segment file contains all of the HSPs from a square chunk of the alignment matrix, which by default is 10×10 MB (set using the lastz_interval_size parameter). A lastZ process is run on a single CPU for each segment file generated by SegAlign. However, the number of HSPs within a chunk can range from one to over 10 million and thus can cause dramatic variability in runtimes, as illustrated in Supp. Fig. 1. This, in turn, leads to a tail latency problem in which the GPUs and most CPU cores stay idle while a few CPUs (running lastZ threads) remain occupied. As can be seen from Fig. 4, nearly all lastZ executions complete in less than 10 minutes, yet the longest execution takes over 27 hours. Thus the entire runtime is bounded from below by the longest execution time (Fig. 3).
To solve the tail latency problem, we limit the maximum work for a given lastZ run by restricting the number of HSPs in a segment file. If the number of HSPs is above a specified threshold, the segment file is split in two, and so on recursively. However, naively splitting segments (blocking, row-wise partitioning, etc.) leads to overlapping alignments and increases output file size (by up to several times), as illustrated in Fig. 5. One issue is that lastZ keeps track of previously extended anchors (an anchor is a point residing on an HSP where ungapped alignment begins) in the input segment file. If an extension passes through another anchor (chosen from the segment), that anchor is not extended later. If these segments/anchors are in separate files (and thus processed by separate lastZ runs), then both anchors are extended, leading to overlapping alignments. Furthermore, in lastZ the dynamic programming tiebreakers for forward and reverse extensions are different (each anchor is extended both forward and backward), so the additional segments/anchors being extended can also lead to additional alignments due to different (but equally scoring) paths being taken. It is best to partition such that anchors that would be located on the same alignment are also located in the same segment file. Thus, a fundamentally different approach to partitioning is necessary.
We observe that dependent anchors occur across diagonal regions: since alignments are extended diagonally, dependent anchors occur only along the alignment direction. Anchors are extended along a forward diagonal for forward alignments and a reverse diagonal for reverse alignments, with few horizontal and vertical movements corresponding to gaps. Therefore, we can choose our partition shape to follow this diagonal, minimizing the number of dependent alignments that land in different partitions and lead to straddling alignments (additional and duplicate alignments caused by violating dependencies when partitioning). This results in a partition as shown in Fig. 6, where each colored region indicates a partition with an equal number of anchors. Ideally we want the same amount of computation in each partition, so that each partition takes roughly the same amount of time. We empirically observe that, in practice, the number of segments is directly proportional to the time it takes for a partition to complete; we therefore use the number of segments as a heuristic for the amount of work per partition.
Additionally, each row in a segment file does not contain the exact coordinate of the respective anchor; it contains the start and end positions, for both the query and reference genomes, that delimit the region of likely homology, and the optimal alignment may include only part of the HSP [24]. To account for this possibility, the HSP is reduced to a single point between these two coordinates, called the anchor, at which ungapped alignment will begin. lastZ computes, along each HSP in the segment file, the point that the optimal alignment is most likely to pass through. However, since we do not want to introduce additional overhead when partitioning by calculating anchor positions, we assume that the anchor lies at the exact midpoint of the two coordinates and partition accordingly. While our diagonal partitioning scheme does not completely eliminate straddling alignments, it reduces them to a negligible number, as shown below. They mainly occur in regions of the input genomes where many anchors are packed into a small area, resulting in narrow diagonal segment widths; this, in turn, increases the chance of alignments crossing partition boundaries due to the vertical and horizontal movements caused by gaps. SegAlign's partitioning scheme for the Seed and Extend steps also introduces a small number of straddling alignments, which are part of the additional alignments reported in SegAlign's paper [13].
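The scheme above can be sketched as follows (a minimal illustration in Python; the HSP record, field names, and helper functions are ours for exposition, not KegAlign's actual code, which operates on segment files). Forward anchors share a diagonal when x − y is equal, and reverse anchors when x + y is equal, so sorting by the appropriate diagonal key and cutting every K anchors yields diagonal bands:

```python
from dataclasses import dataclass

@dataclass
class HSP:
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int
    strand: str  # '+' for forward alignments, '-' for reverse

def diagonal_key(hsp: HSP) -> int:
    # Approximate the anchor as the midpoint of the HSP coordinates
    # (lastZ chooses the true anchor later; the midpoint is our stand-in).
    rx = (hsp.ref_start + hsp.ref_end) // 2
    qy = (hsp.query_start + hsp.query_end) // 2
    # Forward alignments extend along the positive diagonal (x - y constant);
    # reverse alignments extend along the negative diagonal (x + y constant).
    return rx - qy if hsp.strand == '+' else rx + qy

def diagonal_partition(hsps, max_per_partition):
    """Group HSPs into diagonal bands holding at most max_per_partition
    anchors each. Forward and reverse anchors are partitioned separately,
    since they extend along different diagonals."""
    partitions = []
    for strand in ('+', '-'):
        band = sorted((h for h in hsps if h.strand == strand), key=diagonal_key)
        for i in range(0, len(band), max_per_partition):
            partitions.append(band[i:i + max_per_partition])
    return partitions
```

Each resulting partition corresponds to one segment file, and anchors on nearby diagonals, which are the ones most likely to lie on the same alignment, end up in the same partition.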
Diagonal partitioning has the additional benefit of reducing the total amount of CPU work performed in lastZ, as shown in Fig. 7, which further reduces runtime. This is because lastZ maintains a data structure holding all ungapped portions of the alignments found so far, which is queried during extension of the current anchor to determine bounds. Operations on this data structure have quadratic time complexity, so limiting the number of HSPs in a segment file makes operations on it more efficient.
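To see why, treat the cost of one lastZ run on a segment file as roughly quadratic in its HSP count (a simplification of the real data structure's behavior): splitting one file of N HSPs into k files of N/k each reduces the total from N² to N²/k.

```python
def quadratic_cost(n_hsps: int) -> int:
    # Simplified stand-in for the roughly quadratic cost of operations on
    # lastZ's structure of previously found ungapped alignment portions.
    return n_hsps * n_hsps

n, k = 100_000, 10                  # one big segment file vs. ten smaller ones
unsplit = quadratic_cost(n)         # N^2
split = k * quadratic_cost(n // k)  # k * (N/k)^2 = N^2 / k
assert unsplit == k * split         # a k-fold reduction in total work
```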
Diagonal partitioning introduces a new parameter: the maximum number of segments to include in each partition. The choice that minimizes runtime depends on the two input genomes (their size and similarity) and on the number of CPU cores available. The larger and more similar the input genomes, the larger the optimal maximum segment file size (around 15,000 for human-primate alignments; see Fig. 7). Determining the optimal value of this parameter without experimentation can be difficult. Therefore, we introduce a method to choose it dynamically at runtime that does not require explicit communication between partitions. Whenever a segment file is made, our partitioning scheme sets the maximum segment file size to the upper quartile of the original sizes (i.e., before partitioning) of previously generated segment files. Each segment file is written to a shared folder, so explicitly keeping track of segment sizes is not required. We empirically found that this gives a value within a 10% error margin of the optimal maximum segment file size. Additionally, to obtain the number of lines in a segment file, we divide its file size by the size of a single line from the segment file being partitioned. This closely approximates the line count without reading through the entire file.
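This heuristic can be sketched as follows (function names and the file layout are illustrative, not KegAlign's actual code; segment-file rows are assumed to be near-fixed-width, as stated above):

```python
import os

def line_size_of(path: str) -> int:
    """Byte size of one row of a segment file. Rows are close to
    fixed-width, so the first line is a good proxy for all of them."""
    with open(path, 'rb') as f:
        return len(f.readline())

def estimate_num_segments(path: str) -> int:
    # Approximate the line count from the file size on disk instead of
    # reading through the whole file.
    return os.path.getsize(path) // max(1, line_size_of(path))

def dynamic_max_segments(previous_sizes):
    """Choose the partition threshold as the upper quartile of the
    original (pre-partitioning) segment-file sizes seen so far.
    Assumes at least one size has been observed."""
    ranked = sorted(previous_sizes)
    return ranked[(3 * len(ranked)) // 4]
```

Since every segment file lands in the shared folder anyway, `previous_sizes` can be gathered by listing that folder, so no coordination between partitions is needed.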

Solving GPU underutilization problem
Above we listed three issues with GPU utilization: CPU starvation, communication deadlock, and the inability to fully utilize all available GPU resources. To address CPU starvation, we decreased the priority of the lastZ threads (using the Linux nice command) so that the CPU preprocessing work required for GPU work to begin is not delayed by the multiple threads running gapped extension. To solve the communication deadlock, we first identified the root cause: the UNIX pipe buffers used for communication between SegAlign and lastZ running out of space. We simply increased the maximum buffer size. For the last problem, the inability to fully utilize all available GPU resources, we use NVIDIA's MIG and MPS features [25,26] to run multiple SegAlign instances in parallel on the same GPU. This ensures that at least one SegAlign instance fully utilizes GPU resources whenever the others do not. To accomplish this, we split the input genomes across chromosomes and use the longest-processing-time-first algorithm [16] to assign chromosomes to similarly sized bins, running a SegAlign instance on each pair of bins. For example, whole genomes contain many chromosomes of different sizes, and we assign subsets of them to bins so that each bin contains around 200 million base pairs (depending on the size of the genome, the number of bins may vary). Splitting across chromosomes, rather than splitting individual chromosomes in half, has two benefits. First, it prevents additional straddling alignments, since alignments do not extend across multiple chromosomes. Second, it does not require recalculating alignment coordinates afterwards, which would be necessary had we split individual chromosomes. We tried different combinations of MIG instance sizes, MPS processes per MIG instance, and partition sizes. We found that seven MIG instances with up to four MPS processes each give the best runtimes for smaller inputs. For large genome inputs, three or fewer instances are used if GPU memory is insufficient. Care must be taken that GPU memory does not run out when all possible SegAlign instances are running on the GPU. This can be done by limiting the total number of SegAlign instances running on a GPU and by limiting the bin sizes for each genome. We used a bin size of around 200 million base pairs, which results in roughly 12 GB of GPU memory used by each SegAlign instance. While this method has the downside of using more GPU memory, devices that support MIG are datacenter GPUs, which typically have over 80 GB of GPU memory, most of which goes unused when running a single SegAlign instance. For cases where MIG is unavailable, we found that using MPS alone with up to four instances gives a similar speedup.
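The bin-assignment step can be sketched as follows (a simplified illustration; the function name and target-bin-size parameter are ours, not KegAlign's). Longest-processing-time-first sorts chromosomes by decreasing length and repeatedly places the next one into the currently lightest bin:

```python
import heapq

def lpt_bins(chrom_sizes, target_bin_bp=200_000_000):
    """Assign chromosomes (name -> length in bp) to bins of roughly
    target_bin_bp each using the longest-processing-time-first greedy
    heuristic. Whole chromosomes are kept intact, so no alignment
    coordinates need recalculating afterwards."""
    n_bins = max(1, round(sum(chrom_sizes.values()) / target_bin_bp))
    heap = [(0, i) for i in range(n_bins)]  # (current bin load, bin id)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    # Place chromosomes largest-first into the currently lightest bin.
    for name, size in sorted(chrom_sizes.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        bins[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return bins
```

A SegAlign instance is then launched for each pair of bins (one bin from each genome), with the number of simultaneous instances capped by available GPU memory.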

Evaluation across evolutionary distances
Fig. 8 shows runtimes for aligning multiple vertebrate genomes using SegAlign with default parameters and SegAlign with our diagonal partitioning optimization. All results were obtained on 32 Intel Xeon Gold 6330 CPUs and 1 NVIDIA A100 GPU, with 200 GB of memory. Genomes with low phylogenetic distance, such as human (hg38 and hs1) aligned with the chimpanzee (panTro6) and gorilla (gorGor6) genomes, benefit the most from our diagonal partitioning strategy. For genomes with high phylogenetic distance, such as human aligned with mouse (mm10 and mm39) or chicken (galGal6), there is little to no improvement, because the amount of work in the gapped-extension step, which produces the final alignments, depends directly on how similar the aligned genomes are. For the MIG/MPS results, both low and high phylogenetic distance genome pairs benefit from this optimization, but to a much lesser extent than from diagonal partitioning; we observe a 10 to 20% improvement for all genomes. However, this optimization is complementary to our diagonal partitioning approach, and we obtain the best results when combining both strategies.

Democratizing genome alignment with the KegAlign workflow
To make our work useful to the broad research community we incorporated KegAlign into the Galaxy system developed by our group [17]. It consists of two components: a KegAlign tool and a specialized version of lastZ called "batched-lastZ". The KegAlign tool performs the "Seed-and-Filter" stage of alignment generation and is configured (see Methods) to run on a GPU-equipped node. The tool outputs a "keg": a compressed tar archive containing partition files with the coordinates of segments for gapped extension and a list of lastZ commands to process these segments. The number of commands and the number of segments depend on the size of the input genomes and their divergence. For example, aligning human genome build hg38 against build hs1 generates 5,490 partitions containing a total of 18,227,861 segments. For each of the 5,490 partitions there is a corresponding lastZ command. The batched-lastZ tool accepts the keg archive as input and executes all of the lastZ commands listed there (5,490 in the above example). In Galaxy the two tools are combined into the workflow shown in Supp. Fig. 2. The workflow accepts the following inputs: a target sequence, one or more query sequences, and an optional substitution scoring matrix (by default the HOXD70 scores are used [18]). It outputs alignments in a variety of user-specified formats, including MAF (default), BAM, and a variety of configurable tab-delimited flavors.
The performance of the workflow's two main components, KegAlign and batched-lastZ, is shown in Table 3. The tools run on ACCESS-CI infrastructure as described in Methods and are accessible to any user of the https://usegalaxy.org instance.
In addition to running the tools via the graphical user interface (GUI), they can be run programmatically using Galaxy's planemo framework [19] against an existing public Galaxy instance as follows: planemo run ca68ad0d7420c615 job.yml --engine external_galaxy --galaxy_url https://usegalaxy.org --galaxy_api_key YOUR_API_KEY. Here the workflow id (ca68ad0d7420c615) corresponds to the KegAlign workflow shown in Supp. Fig. 2.

Discussion
To our knowledge, there is no prior work on partitioning segment files. Several methods do take the diagonal extension property of genome alignment into account, but they focus on decreasing the search space of dynamic programming rather than on partitioning data. Several papers [20,21] use the wavefront algorithm for solving dynamic programming tables, which takes advantage of the fact that k-mers (or seeds) are extended in a diagonal direction. Wu et al. [22] introduce a method called diagonalization, which also exploits the diagonal extension property of k-mers and uses it to limit the search space for "oligomer chaining", a method for approximate alignment.
The advantage of our approach is that it offers a practical solution to pairwise alignment of divergent genomes that can be readily used via a programmatic or GUI route. Our next task will be developing a Galaxy-based service allowing submission of pairwise alignments to be performed en masse on our infrastructure using KegAlign workflows. It will then be possible to chain the alignments and proceed to generating multiple alignments using either the ProgressiveCactus or multiZ tools, relieving the current bottleneck and enabling multiple genome sequencing efforts to rapidly generate large multispecies alignments. This service will not be limited to KegAlign and will also offer minimap2, fastga, and other pairwise aligners.

Figure 1 .
Figure 1. A. Comparison of three pairwise aligners on a set of simulated sequences with different divergences (60-100) and lengths (1,000, 2,000, 5,000, and 10,000 nucleotides). The Y-axis (0-1) indicates the fraction of the sequence covered by alignments generated by a given tool. This calculation excludes the randomly added flanks. B. Fraction of human protein-coding exons (hg38) overlapping with alignments to the mouse genome (mm39) produced by each of the three mappers. Coordinates of exons and alignment blocks were merged to avoid redundancy (see Methods).

Figure 3 .
Figure 3. CPU and GPU utilization of 32 CPUs and 1 A100 GPU while aligning the human (hs1) and chimpanzee (panTro6) genomes. A. Severe CPU and GPU underutilization in the original SegAlign implementation; this alignment completes in 38 hours. B. A SegAlign implementation modified by our group allows the same alignment to finish in 8.5 hours.

Figure 5 .
Figure 5. Example showing how straddling alignments can form. Each panel represents a dot plot, commonly used to depict whole-genome alignments. Panel (a) shows the behavior of two anchors located on a positive alignment (direction `+' in their segment file rows). Anchor 1 is extended in both directions and crosses over anchor 2, which as a result is not extended. Panel (b) shows the two dependent anchors located in separate partitions (i.e., segment files) as a result of a column-wise partitioning strategy, depicted by the dotted line. Each anchor is processed independently in parallel, so neither anchor has knowledge of the other. This leads to two separate extensions, denoted by the red and blue lines. Extensions with the same direction across the same region (regions 1 and 3) result in overlapping alignments. Extensions with different directions across the same region (region 2) result in additional alignments: both alignments follow two equally scoring paths, caused by different tiebreakers in the dynamic programming matrices used to calculate them.

Figure 6 .
Figure 6. Example of diagonal partitioning for forward and reverse alignments. In (a), two genomes of length 10 base pairs are shown, with anchor locations marked as "+" and "-": "+" represents forward anchors, which will be extended along the positive diagonal, and "-" represents reverse anchors, which will be extended along the negative diagonal. (b) and (c) show separate partitions of at most 5 anchors (also referred to as segments) each, represented as different colors, for the segments in (a), resulting in a total of 8 partitions, each of which will be written into a separate segment file. Since partitions are aligned with the direction of extension, the chance of straddling alignments is minimized. However, straddling alignments are still possible in regions where many anchors are close together, such as in (b) where the blue partition ends and the red partition begins.

Figure 7 .
Figure 7. Parameter tuning results for row-wise and diagonal partitioning schemes. Results obtained using 1 GPU and 32 CPUs. Output file size rapidly increases when using row partitioning as the maximum segment file size is decreased. The decrease in runtime is due to two factors: (i) a decrease in the longest lastZ execution time (Longest lastZ), and (ii) a decrease in the total amount of CPU work.

Figure 8 .
Figure 8. Results for multiple whole-genome vertebrate inputs. Genomes with low phylogenetic distance (high similarity) show a large decrease in runtime from segment splitting, while genomes with high phylogenetic distance show little to no improvement. MIG/MPS gives a moderate improvement in all cases. The best results are obtained by combining the two methods.

Table 1 .
Historic data from 4,780 pairwise alignments generated between 2005 and 2020 by the UCSC Genome Browser team.

Table 2 .
CPU time required for aligning chromosome 1 from assembly hg38 against chromosome 1 from assemblies hs1 and mm39. Alignment was performed using a single thread on an Intel i9 workstation. A single thread was used to make the comparison with lastZ "fair", as lastZ supports only a single thread. Time was measured with the UNIX time command; "real" time is reported here.