Introgressed Manihot glaziovii Alleles in Modern Cassava Germplasm Benefit Important Traits and Are Under Balancing Selection

Introgression of alleles from wild relatives has often been adaptive, usually for disease resistance, in plant breeding. However, the significance of historical hybridization events in modern breeding is often not clear. Cassava (Manihot esculenta) is among the most important staple foods in the world, sustaining hundreds of millions of people in the tropics, especially in sub-Saharan Africa. Widespread genotyping makes cassava a model for clonally-propagated root and tuber crops in the developing world and provides an opportunity to study the modern benefits and consequences of historical introgression. We detected large introgressed M. glaziovii genome-segments in a collection of 2742 modern cassava landraces and elite germplasm, the legacy of 1930’s era breeding to combat epidemics disease. African landraces and improved varieties were on average 3.8% (max 13.6%) introgressed. Introgressions accounted for significant (mean 20%, max 56%) portion of the heritability of tested traits M. glaziovii alleles on the distal 10Mb of chr. 1 increased dry matter and root number. On chr. 4, introgressed alleles in a 20Mb region improved harvest index and brown streak disease tolerance. Three cycles of selection initially doubled the introgression frequency on chr. 1. Later stage variety trials selectively excluded homozygotes which indicates a heterozygous advantage. We show that maintaining large recombination-suppressed introgressions in the heterozygous state allows the accumulation of deleterious mutations. We conclude that targeted recombination of introgression segments would therefore increase the efficiency of cassava breeding by allowing simultaneous fixation of beneficial alleles and purging of genetic load. Significance Statement Crosses to wild relatives have often been adaptive for crop breeding, but their modern importance is usually poorly understood. Cassava (Manihot esculenta) is an important staple crop, feeding hundreds of millions in the developing world, and is a model for vegetatively-propagated non-inbred crops. In the 1930’s, crossing to M. glaziovii averted mosaic disease epidemic in Africa. We reveal that large genome segments, the legacy of those crosses, benefit a number of traits including yield in modern cassava and are consistently favored during selection. Elite cultivars are almost exclusively heterozygous for wild alleles; homozygotes are rejected during early stage trials, suggesting inbreeding depression. More recombination around beneficial wild alleles will allow purging of genetic load and increase genetic gain in cassava.

vegetatively-propagated root and tuber crops (13-16). 23 The history of cassava breeding includes periodic tapping of 24 wild con-generic relatives as sources of useful genetic variation 25 (7, 17). In the early 20th century, cassava production in Africa

Significance Statement
Crosses to wild relatives have often been adaptive for crop breeding, but their modern importance is usually poorly understood. Cassava (Manihot esculenta) is an important staple crop, feeding hundreds of millions in the developing world, and is a model for vegetatively-propagated non-inbred crops. In the 1930's, crossing to M. glaziovii averted mosaic disease epidemic in Africa. We reveal that large genome segments, the legacy of those crosses, benefit a number of traits including yield in modern cassava and are consistently favored during selection. Elite cultivars are almost exclusively heterozygous for wild alleles; homozygotes are rejected during early stage trials, suggesting inbreeding depression. More recombination around beneficial wild alleles will allow purging of genetic load and increase genetic gain in cassava.

D R A F T
leading to the distribution of mosaic tolerant varieties to Introgression-associated population structure. In order to de-   (Table S1). The IDMs were 97 distributed similarly across the genome compared to the rest 98 of the markers in our dataset (Fig. S3).

99
Principal components analysis (PCA) on the IDM markers 100 revealed a pattern of relatedness in introgression regions (Fig. 101  1A) that is distinct from that of the rest of the genome (Fig. 102 1B) or overall (Fig. 1C). We coded the dosages for the IDMs to 103 count the M. glaziovii diagnostic allele. The resulting loadings 104 (eigenvector coefficients) for markers on PC1 (21% variance 105 explained) are strongest for IDMs on the last 10 Mb of chr. 106 1, while the strongest loadings on PC2 (9%) are at IDMs 107 spanning the majority of chr. 4 (Fig. 1D).

108
Introgression frequency divergence among populations. The 109 genome-wide proportion of M. glaziovii alleles per clone ranged 110 from 1.3% to 13.6% (mean 3.8%) among the African breeding 111 germplasm as a whole (GG+LG+NR+UG; Fig. 2B; Tables 112 S2-3). On a genome-wide basis, there are not large differences 113 among populations in the mean frequency of introgressed al-114 leles. The breeding populations GG (4.2%), NR and UG 115 (4.1%) have similar levels of introgression, while the L. Ameri-116 can collection was the least introgressed (1.8%) and the local 117 germplasm (LG, 3%) were intermediate. We note that in 118 the CIAT collection, the Brazilian accession BRA534 appears 119 to be an outlier with 34% M. glaziovii alleles. We excluded 120 BRA534 when comparing CIAT to other populations.

121
The largest introgressions detected were apparently con-122 tiguous segments of chr. 1 approx. 25Mb to the end ( 10Mb 123 total) and chr. 4 from 5Mb to 25Mb ( Fig. 2A). When we 124 isolate the introgressions on chrs. 1 and 4, which appear to 125 be the same as were previously identified (29), we observe 126 more striking differences. The frequency of the chr. 1 segment 127 was on average greater in the W. African breeding germplasm 128 GG (0.2) and NR (0.21) than in the E. African population 129 UG (0.14). In contrast, the introgression on chr. 4 was more 130 common in UG (0.23) compared to GG (0.15) or NR (0.11). 131 Samples from the IITA local germplasm collection (LG) were 132 less likely to contain introgressions on either chrs. 1 (0.10) 133 or 4 (0.08) and the L. American samples from CIAT showed 134 almost no evidence of introgression (<0.02) on both chrs. 1 135 and 4 (Fig. 2B, Tables S2-3).

136
Ongoing selection for M. glaziovii alleles. We compared the 137 introgressions detected in local germplasm and landraces of 138 cassava (LG) to IITA improved varieties (GG) and three 139 successive generations of genomic selection progeny (C1, C2, 140 and C3), which descend from parents selected initially from 141 the GG. The most notable changes we observed were on chrs. 142 1 and 4 (Fig. 3A). Genome-wide, the average proportion of M. 143 glaziovii alleles per individual increased from 0.03 in LG to 144 0.042 in GG and then more than doubled in the GS progeny 145 with C1 at 0.095, C2 and C3 at 0.12 ( Fig. 3B; Table S2-3). 146 Most of this change was due to increases on chr. 1, which rose 147 from 0.1 in the LG to 0.2 in the GG and maxed out at 0.34 in 148 the C3. In contrast, the chr. 4 region appears to have stayed 149 steady around 15% from GG through C2 and even slightly 150 decreases from C2 to C3 (Fig. 3B, Tables S2-3).

151
Most introgressed LG and GG were heterozygous for M. 152 glaziovii haplotypes, with a mean homozygosity rate of only 1% 153 genome-wide ( Fig. 2A, Fig. 3C). Genomic selection appears to 154 have steadily increased the homozygosity rate on chr. 1 from 155

156
The near absence of homozygotes in the elite germplasm (GG) 157 and the gradual increase due to select that we observed, led 158 us to investigate further. 159 We hypothesized that post-genotyping performance-based  Table S2).    (Table S5).

201
To these data, we fit linear mixed-models with two random- between introgressed and non-introgressed regions. If unad-215 dressed, LD between SNPs in these regions could lead to 216 non-independent estimates of genetic variance when fitting the 217 model described above. The variance arising from introgres-218 sion regions might then be captured by the non-introgression 219 regions and vice versa (35-37). Using the procedure described 220 in the methods, we reclassified 1413 SNPs, primarily located 221 on chromosomes 1 and 4 (Fig. S7), that were more similar in 222 the kinship they measured to the IDM than to the non-IDM 223 (Figs. S5-7; Tables S1, S4). Redesignating these SNPs as tag-224 IDM reduced the correlation of IDM and non-IDM kinships 225 from 0.37 to 0.30. We therefore included tag-IDM in the IDM 226 kinship matrices used in all subsequent analyses.

227
Per-trial analyses. The second preliminary analysis we did was 228 to analyze each trial separately, in part in order to check 229 the quality of the data before combining into a larger, multi-230 trial analysis. Based on a likelihood ratio test, we chose to 231 remove 31 trait-trials that did not show evidence of significant 232 genetic variance (pLRT null > 0.05). We removed an additional 233 six trait-trials without any significant genetic variance from 234 marker-estimated covariances. Lastly, 53 more trait-trials 235 were excluded because, based upon the Akaike Information 236 Criterion (AIC), the genomic model fit the data worse than 237 the IID model (Tables S6 & S10).

238
Multi-trial analyses. We combined the remaining trials for each 239 trait (within Institute) to achieve large overall sample sizes 240 (max per Institute: 25924 IITA, 2881 NaCRRI, 6641 NRCRI) 241 and increase the average number of replications per clone (max 242 per Institute: 16.76 IITA, 6.89 NaCRRI, 8.11 NRCRI; Table 243 S11). We fit the three models for each trait and analyzed each 244 breeding program's data separately.

245
Ten out of 19 Trait-Institute analyses had significant genetic 246 variance from introgressions. In fact, introgression regions ap-247 pear important for every trait except cassava bacterial blight 248 and mosaic disease severity. In all of these cases, the PARTI-249 TIONED model had an AIC more than 2 units smaller than 250 the non-partitioned one. The proportion of the heritability 251 accounted for by significant introgressions was as high as 56% 252 (mean 20%, median 15%, min 3%; Fig. 4, Tables S12-13).

253
Comparison to random samples. One third of the SNPs in our 254 study were classified as IDMs (including tag-IDMs). We com-255 pared the variance explained by our IDM-defined partition, 256 to three random genome partitions of the same size (Table 257  S7). For the random samples, the correlation of GRM's was 258 >0.99, but was only 0.30 for the IDM-defined partition (Table 259  S8). The IDM-defined partition explained an average of 20% 260 of the total genetic variance, in comparison to 37% for the 261 random partitions, which is closer to proportional with the 262 total number of markers (Fig. S8, Table S12-14). The populations can be compared in B and C by looking down the columns and using the vertical lines, which represent the mean values for that group and region, as a visual aid. For A through C, each row (y-axis) is an individual cassava clone and the vertical panels represent five populations: IITA Genetic Gain (GG) and three successive generations of genomic selection progeny (C1, C2 and C3), descended originally from GG. Rows (clones) are aligned across A-C and sorted within population based on the genome-wide proportion M. glaziovii (left column of B). At the bottom, D shows how the introgression frequency and homozygosity rate per individual (y-axis) for the C1, C2 and C3 relates to the cumulative number of field plots planted (as of Jan. 2019) per clone (x-axis). The number of field plots per clone is meant as a proxy representing the level of advancement through variety development stages of the breeding process. For illustrative purposes, we highlight C1, C2 and C3 clones with >50 field plots in panels B and C with black diamonds. For D as in B and C, we break down the proportions in D by region and use the same color coding: genome-wide (orange), chr. 1 region (dark red), chr. 4 region (dark blue). Impact of introgressions on genomic prediction. Genomic se-329 lection (GS) is becoming an important part of modern cassava 330 breeding (16, 38). We investigated the impact of introgres-331 sion regions on genomic prediction accuracy, which is directly 332 proportional to their contribution to breeding gains during 333 GS, all other things being equal. We did ten replications of 334 five-fold cross-validation for each trait-institute dataset. We 335 tested five prediction models: non-partitioned (ALL markers), 336 genome-partitioned and IDM-null models. For the "genome-337 partitioned" and "IDM-null" models, we divided markers into 338 two kinship matrices either randomly or based on whether a 339 SNP was an IDM or not.

340
The accuracy of partitioned models was almost identical 341 to the non-partitioned model for both the IDM-based and 342 the random genome-partitions. However, removing the IDM-343 based component from the model tended to reduce accuracy, 344 especially in the NaCRRI data, on average 0.004 (max 0.04) 345 relative to the PARTITIONED and 0.005 (max 0.03) relative to 346 the ALL models (Fig. 6A). These comparisons provide a means 347 to measure the importance of introgression regions in ongoing 348 GS. In contrast to the IDM-based genome partition, removing 349 the equivalent random components decreased accuracy an 350 average 0.001 (compared to ALL) and -0.001 (compared to 351 PARTITIONED) (Fig. S15, Tables S19-20).

352
Finally, we observed that the size of the impact on predic-353 tion accuracy (measured from comparing ALL and IDMnull 354 models) scaled with the h 2 IDM with a correlation of 0.41 for 355 the IDM-based genome partition and -0.09 for the random 356 partition (Fig. 6B). The original impetus for interspecific hybridization at Amani 361 (circa 1930's) was to combat cassava mosaic disease (17, 18, 362 20, 22, 23). We observed consistent and beneficial M. glaziovii 363 allelic effects, however, we found neither a beneficial effect 364 nor a significant genetic variance for CMD. In a previous 365 article, focused on GWAS for CMD, we noted an absence of 366 major effect QTL other than CMD2, a dominant, possibly 367 multi-allelic locus (39). We verify here that the protection 368 derived from the CMD2 locus did not arise from introgression 369 as the only associated GWAS result on chr. 12 indicated 370   (Fig. 6; Fig. S15). This suggests that while introgressions are 424 clearly still important, having uniformly beneficial effects at 425 QTL (Fig. 5) and nearly doubling in frequency during three 426 cycles of GS (Fig. 3), their physical size is disproportionate 427 to their true economic value.

428
One theoretical prediction about introgressed alleles under 429 selection with suppressed recombination is that they can result 430 in the accumulation genetic load due to linkage drag (10, 11).

431
This is especially a concern for vegetatively-propagated non-432 inbred crops like cassava (33). We observed that putatively 433 deleterious alleles in introgression regions accumulated rela-434 tively faster (both LG to GG and GG to C1) compared to the 435 rest of the genome. We further observed balancing selection 436 in the form of an M. glaziovii homozygote deficit from variety 437 trials.

438
In clonally-propagated crops, selection for advancement 439 during variety trials is using total genetic merit rather than 440 breeding value based on performance in a series of field trials 441 with progressively more replicates, locations and increasing 442 plot size. The GS progeny that we analyzed (and thus the 443 sample in which we observed an initial increase in M. glaziovii 444 homozygosity) represent those that successfully germinated 445 and were vigorous enough early on to warrant genotyping. We 446 discovered that M. glaziovii homozygotes were excluded from 447 later stage field trials early in the process (Fig. 3D). This indi-448 cates there may be phenotypically-expressed negative effects of 449 these introgressions, which may be related to the accumulation 450 of homozygous deleterious mutations we observed. Linkage 451 drag in adaptive introgression regions has been proposed to 452 explain balancing polymorphism regions in cases including 453 human-Neanderthal hybridization (2) and wing mimicry in 454 Heliconius butterflies (43,44).

455
Introgression-associated inbreeding depression is thus a crit-456 ical topic for future investigation. At present, cassava breeders 457 are maintaining introgression heterozygosity at great cost, 458 through a multi-stage selection process. First, favored crosses 459 between heterozygous parents generates many unsuitable off-460 spring, homozygous for introgressions and suffering some as yet 461 unquantified inbreeding depression. Subsequently, field evalua-462 tions are required to identify and purge these individuals. We 463 propose that targeted recombination of introgression segments 464 would increase the rate of gain and sustainability of cassava 465 breeding by allowing simultaneous fixation of beneficial alleles 466 and purging of genetic load.

467
Taken together, our results point to the continued impor-468 tance of wild alleles in cassava, one of the most important 469 staple foods in the developing world, and a model for other 470 clonally-propagated root and tuber crops. We present an 471 example of both the benefits and consequences of historical 472 introgression for modern crop breeding. Our methods and 473 the breeding implications we highlight will therefore provide a 474 valuable example for other crops.