New Perspective on GWAS: East Asian Populations from the Viewpoint of Selection Pressure and Linear Algebra

Genome-wide association studies (GWAS) are useful for comparing the characteristics of different human population groups. However, genomes can change rapidly when selection pressure is strong, as during a pandemic, and the genetic information related to the immune system is thought to be especially sensitive to such diseases. It may therefore be necessary to conduct not only the standard whole-genome GWAS but also a more detailed, chromosome-focused GWAS. In this study, we compared chromosomes that carry immune system genes with chromosomes that are not thought to be related to the immune system, and analyzed GWAS results for the SNPs on each chromosome to examine the differences. To keep the sample conditions as similar as possible, we limited the comparisons and analyses to a few groups whose population movements are easy to interpret, and we kept the sample sizes as close as possible. We selected 403 East Asian individuals: 104 Japanese in Tokyo (JPT), 103 Han Chinese in Beijing (CHB), 105 Han Chinese in southern China (CHS), and 91 Koreans (KOR). Principal component analysis (PCA) and Manhattan plots were used to analyze and compare the results. The Japanese, Chinese, and Korean populations formed distinctly different groups, with major differences observed. The validity of PCA and Manhattan plots is also discussed.


Background
Rapid advances in DNA sequencing technologies enabled the International Human Genome Sequencing Consortium to announce the completion of the human genome sequence in 2003 [1]. Subsequently, a global whole-genome sequencing project (the 1000 Genomes Project) was initiated in 2008 with the goal of characterizing the genetic diversity of the world's population. As a result, genome information on more than 2,500 individuals worldwide is presently available as open data, including Japanese and Chinese individuals in East Asian populations [2][3]. More data are now available from other populations, including Koreans and Mongolians [4][5], which have not been covered by the 1000 Genomes Project.
These data are being analyzed mainly by linear algebraic methods, such as Genome Wide Association Studies (GWAS), and are beginning to reveal genomic characteristics of European populations and details of human inter- and intra-regional migrations from the past to the present [7]. Recently, such studies have been conducted not only on European populations but also on East Asian populations such as the Japanese, Chinese, and Korean populations [8][9][10][11][12][13].

Issues of Previous Studies
Most previous GWAS that compared human groups have focused on whole genomes or on genes associated with specific infectious diseases. Few studies have analyzed these associations at the chromosomal level, even though genetic recombination actually occurs on chromosomes. There are also several linear algebraic issues.
First, whole-genome GWAS do not usually assume that genes change significantly over time under strong selection pressure; however, immune-related regions of the genome, such as the Human Leukocyte Antigen (HLA) system, are expected to change significantly in response to pandemics. Therefore, when the results of Principal Component Analysis (PCA) on Single Nucleotide Polymorphisms (SNPs) obtained by GWAS are used to estimate group associations back in time, there is no guarantee that the data are consistent from the past to the present. This may reduce the accuracy of the analysis results.
Second, in previous studies, the sample sizes of the groups being compared often differed. PCA on SNPs, which is often employed in GWAS, determines the scale (the principal components) so that the variance of all samples taken together is maximized. The principal components that serve as the scale for comparing all groups are therefore weighted toward the group with the largest sample size, which does not necessarily give a scale that is optimal for every group. Thus, it is not uncommon for studies that compare the same groups with different sample sizes to yield different results.
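The sample-size effect described above can be illustrated with a minimal sketch on toy two-dimensional data (not genotype data). The function names `first_pc_direction` and `toy_sample` and the data layout are assumptions made for this illustration only: group A varies only along the x-axis and group B only along the y-axis, so the direction of the first principal component shows which group dominates the pooled variance.

```python
import math

def first_pc_direction(points):
    """Unit vector of the first principal component for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def toy_sample(n_a, n_b):
    """Hypothetical data: group A varies along x, group B along y."""
    group_a = [(1.0, 0.0), (-1.0, 0.0)] * (n_a // 2)
    group_b = [(0.0, 1.0), (0.0, -1.0)] * (n_b // 2)
    return group_a + group_b

# Same two groups, opposite sample-size imbalance
pc1_a_large = first_pc_direction(toy_sample(100, 20))
pc1_b_large = first_pc_direction(toy_sample(20, 100))
print(pc1_a_large, pc1_b_large)
```

With these toy inputs, the first component aligns with the x-axis when group A is larger and with the y-axis when group B is larger, even though the groups themselves are unchanged. Real SNP data are far higher-dimensional, so this is only a qualitative picture of the effect.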

Aim of This Study
To address the above issues, this study adopted the following alternative research approaches.
First, comparisons were limited as much as possible to a few groups whose population movements can be easily interpreted. Second, chromosomes carrying immune system genes that are likely to change frequently were selected, along with chromosomes that do not, and the results of PCA on the SNPs obtained by GWAS were compared for each chromosome.
Specifically, for the East Asian populations of Japanese, Chinese, and Korean subjects that are currently available as open data, chromosomes that contain genes closely related to infectious diseases and other chromosomes were analyzed by PCA and Manhattan plot. The following advantages can also be expected from the data used in this study.

Genome Data Used
1) The geographical proximity and small number of groups make it relatively easy to verify the results of the analysis.
2) China has a long history, and events that may affect the genome (such as population movements, wars, and pandemics) are available in historical books and other written records, and can be compared with GWAS results.
3) Population sizes of these countries are relatively large (Japan 126m, China 1,380m, Korea 52m) and group members are stable.

Target Chromosomes
Based on the aforementioned reasons, the following chromosomes were selected for analysis, referring to the genome map created by the Japanese Ministry of Education, Culture, Sports, Science and Technology [15].
1) Chromosome 1, which is a common chromosome and is assumed to contain few genes related to the immune system.
2) Chromosome 6, which contains HLA genes, and is the center of the human immune system.
3) Chromosome 9, which contains the gene for ABO blood group, and has been studied in relation to many infectious diseases including COVID-19 [16][17].
4) Chromosome 12, which contains ALDH2 (Human Aldehyde Dehydrogenase 2), the "alcohol-sensitivity" gene; information encoded on this chromosome tends to be similar among people in Japan, southern China, and Korea, while also being considered to be distinct from other human groups [18].

Analysis Methods
PCA was conducted for the above chromosomes, and the first and second principal components were used to calculate the Mahalanobis distances from JPT to CHB, CHS, and KOR. Similarly, a Manhattan plot analysis was conducted for each chromosome, comparing the Japanese population with the others. The software used was PLINK v1.9. Chromosome 6, which contains the genes of HLA, a major immune system, was divided into a first half, in which the HLA genes are present, and a second half, in which they are absent, and PCA was performed on each portion. The numbers of SNPs in the first half and the second half were set equal.
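As a minimal sketch of the distance calculation described above, the following code computes a Mahalanobis distance between two groups in the plane of the first two principal components. The text does not state which covariance matrix was used, so the pooled within-group covariance here is an assumption, as are the function names and the toy coordinates.

```python
import math

def centroid(points):
    """Mean (PC1, PC2) position of a group."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def pooled_covariance(group_a, group_b):
    """Pooled within-group 2x2 covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for group in (group_a, group_b):
        cx, cy = centroid(group)
        for x, y in group:
            sxx += (x - cx) ** 2
            sxy += (x - cx) * (y - cy)
            syy += (y - cy) ** 2
    dof = len(group_a) + len(group_b) - 2
    return sxx / dof, sxy / dof, syy / dof

def mahalanobis_between_groups(group_a, group_b):
    """Mahalanobis distance between the centroids of two groups."""
    ax, ay = centroid(group_a)
    bx, by = centroid(group_b)
    dx, dy = bx - ax, by - ay
    sxx, sxy, syy = pooled_covariance(group_a, group_b)
    det = sxx * syy - sxy ** 2
    # d^2 = [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical (PC1, PC2) scores for two small groups
jpt_scores = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
other_scores = [(4.0, 0.0), (6.0, 0.0), (4.0, 2.0), (6.0, 2.0)]
distance = mahalanobis_between_groups(jpt_scores, other_scores)
print(distance)
```

In practice the (PC1, PC2) scores would come from the PCA output for each chromosome, and the same function would be applied to the JPT-CHB, JPT-CHS, and JPT-KOR pairs.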

Chromosome 1
In the first and second principal components, where the differences were largest, the Japanese, Korean, and Chinese groups were clearly separated. The groups of Beijing (CHB) and southern China (CHS) also had their own characteristics, but as a whole they constituted one group (Fig. 1). Manhattan plot showed no noticeable difference (Fig. 2). Note: Dots above the solid red line were statistically significant.
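To make the reading of a Manhattan plot concrete, the sketch below computes the height of one dot, -log10(p), from a basic allelic chi-square test between two groups, together with a significance line. The choice of test, the Bonferroni-corrected line, the function names, and the allele counts are all assumptions for illustration; the text does not specify how the red line in the figures was set.

```python
import math

def allelic_chi_square(alt_a, total_a, alt_b, total_b):
    """Chi-square statistic (1 df) for a 2x2 table of allele counts."""
    ref_a = total_a - alt_a
    ref_b = total_b - alt_b
    alt = alt_a + alt_b
    ref = ref_a + ref_b
    if alt == 0 or ref == 0:
        return 0.0  # monomorphic SNP: no difference can be tested
    n = total_a + total_b
    return n * (alt_a * ref_b - alt_b * ref_a) ** 2 / (total_a * total_b * alt * ref)

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def manhattan_height(chi2):
    """Height of the dot on the plot, -log10(p), guarded against p = 0."""
    return -math.log10(max(p_value_1df(chi2), 1e-300))

def bonferroni_line(n_snps, alpha=0.05):
    """Assumed significance line: -log10(alpha / number of SNPs tested)."""
    return -math.log10(alpha / n_snps)

# Hypothetical counts: 104 JPT individuals (208 alleles) vs. 103 CHB (206 alleles)
chi2 = allelic_chi_square(alt_a=120, total_a=208, alt_b=60, total_b=206)
height = manhattan_height(chi2)
threshold = bonferroni_line(n_snps=100_000)
print(height > threshold)
```

A SNP is drawn above the line when its height exceeds the threshold; a SNP with identical allele frequencies in both groups has height zero.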

Chromosome 6
The PCA results for the first half of the chromosome, where the HLA genes are located, were almost identical for the Japanese, Chinese, and Korean groups, with individual differences larger than group differences (Fig. 3). The second half of the chromosome showed the distinct characteristics of the Japanese, Chinese, and Korean groups, but the differences were smaller than those for chromosome 1 (Fig. 4).

Fig. 4. PCA Result of Chromosome 6 (Second Half)
The Manhattan plot at chromosome 6 showed substantial differences in HLA positions, suggesting that the genes of HLA, a major immune system, had been significantly altered and mutated, resulting in extremely large differences between Japanese, Chinese, and Korean groups (Fig. 5).

Chromosome 9
As with chromosome 1, the Japanese, Chinese, and Korean groups were separated (Fig. 6), although the differences were smaller. Manhattan plot showed no noticeable difference in ABO position (Fig. 7). Note: Dots above the solid red line were statistically significant.

Chromosome 12
As with chromosome 1, the Japanese, Chinese, and Korean groups were separated.
However, the plot of the first and second principal components showed that three individual subjects had values far from their groups (Fig. 8). The Manhattan plot showed no noticeable difference at the ALDH2 position (Fig. 9).

Effect of Selection Pressure
In the first and second principal components of the PCAs, the Mahalanobis distances, which indicate differences between human groups, were considerably smaller for the first half of chromosome 6 than for the other chromosomes (Table 1). This region contains HLA, the core of the human immune system, where natural selection pressure is strong. In other regions, such as the second half of chromosome 6, the Mahalanobis distance was not very different from that of chromosome 1. These values suggest that natural selection pressure affected the second half of chromosome 6 less than it affected the first half. In general, the Mahalanobis distances from JPT (Tokyo, Japan) were also consistent with the physical distances to KOR (Korea), CHB (Beijing, China), and CHS (southern China), in that order.
From the results above, it can be inferred that natural selection by infectious diseases has a "negative" rather than a "positive" effect on specific genes. In other words, genes change in the direction of greater diversity. This implies that individuals whose genes make them more susceptible to infectious diseases are more likely to die out, because their numbers decline rapidly and they cannot pass their genes on to the next generation. The large genetic variation probably means that diversification of the immune system, rather than "survival of the fittest," is advantageous for the survival of the species. The situation is similar to that of resistant bacteria or of the virus that causes COVID-19, which is still mutating to escape vaccines.
Japan, being an island nation, is estimated to have had a relatively small population influx from outside compared with continental nations such as China and Korea. In light of the above, however, asking how much of the ancient Japanese (Jomon) genome has been inherited by modern Japanese people may not always be a meaningful comparison. Japan has been separated from the Asian continent for about 20,000 years, and from that point until about 3,000 years ago (the Jomon period), it was not an agricultural society. Later (the Yayoi period), with the arrival of paddy rice cultivation, the society became agricultural, the staple food changed to rice, severe infectious diseases such as tuberculosis became prevalent, and the environment changed drastically [21]. It is therefore assumed that those Jomon individuals who could not cope with such changes died without leaving children. Consequently, the genes of modern Japanese people, who carry genes for coping with infectious diseases, are highly likely to differ considerably from those of the Jomon people, who did not live in an agricultural society until 3,000 years ago.

Influence of Paddy Rice Cultivation
The "low tolerance for alcohol" gene, which is said to have originated in the Yangtze River region, may be another example [20]. More than 6,000 years ago, many people began to gather and live near the Yangtze River floodplain, which was suitable for rice cultivation. At that time, because of a poor sanitary environment, food was often contaminated with harmful microbes and other substances that could cause infectious diseases. At such a time, alcoholic beverages made from rice were thought to be useful.
When people with a low tolerance for alcohol, that is, with a weakly functioning variant of the acetaldehyde-decomposing gene (ALDH2), drank alcoholic beverages, acetaldehyde, a highly poisonous substance that they could not decompose, would accumulate in their bodies. However, this poison might also have served as a drug that attacked harmful microbes. People without this variant, on the other hand, had less acetaldehyde in their bodies and could not suppress those harmful microbes. Thus, people with the "low tolerance for alcohol" variant were more likely to survive and overcome infectious diseases.
In other words, people in paddy rice farming areas may have been under selection pressure toward a low tolerance for alcohol, which protected them from infectious diseases.
This "low tolerance for alcohol" variant of ALDH2 was eventually introduced to the Japanese archipelago along with rice culture; over 40% of the present Japanese population carries it. It is also thought that many of the ancient Japanese who lived before paddy rice cultivation (the Jomon people) did not carry it.
Tuberculosis (TB) is another infectious disease thought to have arrived in Japan along with paddy rice cultivation [21]. People with blood types B and AB have the same type B antigen as Mycobacterium tuberculosis, which makes it difficult for their immune systems to respond and makes them susceptible to infection [23]. On the other hand, people with blood types A and O, who do not carry the type B antigen, are less susceptible to TB infection. In East Asia, paddy rice cultivation is prevalent in Japan, southern China, and Korea, where types A and O tend to be more common than types B and AB [24]. However, because sanitary conditions in today's East Asia have improved, it would be difficult to substantiate the above hypotheses.

Interpretation of PCA and Manhattan Plot
According to the PCA and the Manhattan plot for each chromosome conducted in this study, the results seem to vary considerably depending on the selection pressure, human group selection methods, and sample sizes.
For example, for chromosome 6, which contains the HLA genes, the PCA gave Mahalanobis distances among the Japanese, Chinese, and Korean groups that were small compared with those for the other chromosomes (Table 1). The Manhattan plot, however, showed the exact opposite: the differences between Japanese and Chinese SNPs, and between Japanese and Korean SNPs, were larger than on the other chromosomes (Fig. 5).
In addition, the "low tolerance for alcohol" variant of ALDH2, which is common in Japan, southern China, and Korea, is said to distinguish these populations from other human groups [15], but no significant differences were found in this study (Figs. 8-9). This may be because the variant is not rare in Japan, southern China, or Korea.
In light of the above, when making comparisons among human groups, sufficient attention should be paid not only to sample selection methods and sample sizes, but also to the linear algebraic nature of PCA and Manhattan plot. A comprehensive perspective will also be required when interpreting the results.
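One possible way to see how the PCA and the Manhattan plot can point in opposite directions is sketched below. PCA works on the total genotype variance summed over all SNPs, whereas a Manhattan plot tests each SNP separately. The code assumes two equal-sized groups, additive 0/1/2 genotype coding, and Hardy-Weinberg proportions; the function name and the allele frequencies are hypothetical and are not taken from the data of this study.

```python
def between_group_share(freq_pairs):
    """Share of total genotype variance due to group mean differences.

    freq_pairs holds (p_a, p_b): the allele frequency of each SNP in
    group A and group B (equal group sizes assumed).
    """
    between = 0.0
    within = 0.0
    for p_a, p_b in freq_pairs:
        # Group means are 2*p_a and 2*p_b, so the between-group variance
        # for two equal-sized groups is (p_a - p_b)^2
        between += (p_a - p_b) ** 2
        # Average within-group variance: mean of 2p(1-p) over the two groups
        within += p_a * (1.0 - p_a) + p_b * (1.0 - p_b)
    return between / (between + within)

# One strongly differentiated SNP among 999 SNPs with no group difference
one_peak = [(0.1, 0.9)] + [(0.5, 0.5)] * 999
share_with_background = between_group_share(one_peak)
share_single_snp = between_group_share([(0.1, 0.9)])
print(share_with_background, share_single_snp)
```

Under these assumptions, the single differentiated SNP would stand out clearly in a per-SNP test, yet it accounts for only a small fraction of the variance that the principal components summarize. This is offered only as an illustration of the linear algebraic point, not as an explanation confirmed by the data.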

Conclusion
GWAS is useful for comparing characteristics of human groups. However, genomes may change rapidly over time when selection pressures, such as environmental changes, are strong. In particular, the immune system seems to be very sensitive to environmental changes. Therefore, there will be a need to perform GWAS not only on whole genomes, but also at the level of individual chromosomes when necessary.
Japanese, Chinese, and Korean people form distinctly different groups genetically.
However, the sample size for this study is small (403 individuals), the targeted samples are limited to the East Asian population, and only a few basic methods were used for statistical analysis. Studies with a larger global dataset and methodological innovations will be needed to get one step closer to the truth.