## Abstract

We consider the network of 5 416 537 articles of English Wikipedia extracted in 2017. Using the recent reduced Google matrix (REGOMAX) method we construct the reduced network of 230 articles (nodes) of infectious diseases and 195 articles of world countries. This method generates the reduced directed network between all 425 nodes taking into account all direct and indirect links with pathways via the huge global network. PageRank and CheiRank algorithms are used to determine the most influential diseases with the top PageRank diseases being Tuberculosis, HIV/AIDS and Malaria. From the reduced Google matrix we determine the sensitivity of world countries to specific diseases integrating their influence over all their history including the times of ancient Egyptian mummies. The obtained results are compared with the World Health Organization (WHO) data demonstrating that the Wikipedia network analysis provides reliable results with up to about 80 percent overlap between WHO and REGOMAX analyses.

## Introduction

*Infectious diseases account for about 1 in 4 deaths worldwide, including approximately two-thirds of all deaths among children younger than age 5* (*NIH, 2018*). Understanding the worldwide influence of infectious diseases is thus an important challenge. Here we apply mathematical and statistical methods originating from computer and network sciences, using the PageRank and other Google matrix algorithms developed in the early days of search engines (*Brin and Page, 1998*; *Langville and Meyer, 2012*). These methods are applied to the English Wikipedia edition, which is considered as a directed network generated by hyperlinks (citations) between articles (nodes). Nowadays, this free online encyclopedia supersedes older ones such as *Encyclopaedia Britannica (2018)* in volume and in quality of articles devoted to scientific topics (*Giles, 2005*). For instance, Wikipedia articles devoted to biomolecules are actively maintained by scholars of the domain (*Butler, 2008*; *Callaway, 2010*). The academic analysis of information contained in Wikipedia finds more and more applications, as reviewed in *Reagle Jr. (2012)*; *Nielsen (2012)*.

The Google matrix analysis, associated with the PageRank algorithm initially invented by Brin and Page to efficiently rank pages of the World Wide Web (*Brin and Page, 1998*), makes it possible to probe the network of Wikipedia articles in order to measure the influence of every article. The efficiency of this approach for Wikipedia networks has been demonstrated by ranking historical figures on a scale of 35 centuries of human history and by ranking world universities (*Zhirov et al., 2010*; *Eom et al., 2015*; *Lages et al., 2016*; *Katz and Rokach, 2017*; *Coquidé et al., 2018*). This approach also produced reliable results for the world trade during the last 50 years reported by the UN COMTRADE database and for other directed networks (*Ermann et al., 2015*).

Recently, the reduced Google matrix method (REGOMAX) has been proposed using parallels with quantum scattering in nuclear physics, mesoscopic physics, and quantum chaos (*Frahm and Shepelyansky, 2016*; *Frahm et al., 2016*). This method makes it possible to infer hidden interactions between a set of $n_r$ nodes selected from a huge network, taking into account all indirect pathways between these $n_r$ nodes via the huge remaining part of the network. The efficiency of REGOMAX has been demonstrated for the analysis of world terror networks and geopolitical relations between countries from Wikipedia networks (*El Zant et al., 2018a,b*). Efficient applications of this approach to global biological molecular networks and their signaling pathways are demonstrated in (*Lages et al., 2018*).

In this work we use the REGOMAX method to investigate the world influence and importance of infectious diseases, constructing the reduced Google matrix from the English Wikipedia network with all infectious diseases and world countries listed there.

The paper is organized as follows: the data sets and methods are described in Section II, results are presented in Section III and a discussion is given in Section IV; the Appendix contains Tables 1, 2, 3, 4, 5; additional data are presented at *Wiki4InfectiousDiseases (2018)*.

## Description of data sets and methods

### English Wikipedia Edition network

We consider the English language edition of Wikipedia collected in May 2017 (*Frahm and Shepelyansky, 2017*) containing $N = 5\,416\,537$ articles (nodes) connected through $n_l = 122\,232\,932$ hyperlinks between articles. From this data set we extract the $n_d = 230$ articles devoted to infectious diseases (see Tab. 1, Tab. 2) and the $n_c = 195$ articles devoted to countries (sovereign states, see Tab. 3). The list of infectious diseases is taken from *Wikipedia (2018d)* and the list of sovereign states of 2017 is taken from *Wikipedia (2017)*. Thus the size of the reduced Google matrix is $n_r = n_d + n_c = 425$. This subset of $n_r$ articles is embedded in the global Wikipedia network with $N$ nodes. All data sets are available at *Wiki4InfectiousDiseases (2018)*.

### Google matrix construction

The construction of the Google matrix $G$ is described in detail in *Brin and Page (1998)*; *Langville and Meyer (2012)*; *Ermann et al. (2015)*. In short, the Google matrix $G$ is constructed from the adjacency matrix $A_{ij}$ with elements 1 if article (node) $j$ points to article (node) $i$ and zero otherwise. The Google matrix elements take the standard form $G_{ij} = \alpha S_{ij} + (1 - \alpha)/N$ (*Brin and Page, 1998*; *Langville and Meyer, 2012*; *Ermann et al., 2015*), where $S$ is the matrix of Markov transitions with elements $S_{ij} = A_{ij}/k_{out}(j)$. Here $k_{out}(j) = \sum_{i=1}^{N} A_{ij}$ is the out-degree of node $j$ (number of outgoing links) and $S_{ij} = 1/N$ if $j$ has no outgoing links (dangling node). The parameter $0 < \alpha < 1$ is the damping factor. For a random surfer, jumping from one node to another, it determines the probability $(1 - \alpha)$ to jump to any node; below we use the standard value $\alpha = 0.85$ (*Langville and Meyer, 2012*).

The right eigenvector of $G$ satisfies the equation $GP = \lambda P$ with the unit eigenvalue $\lambda = 1$. It gives the PageRank probabilities $P(j)$ to find a random surfer on a node $j$ and has positive elements ($\sum_j P(j) = 1$). All nodes can be ordered by decreasing probability $P$, numbered by the PageRank index $K = 1, 2, \ldots, N$ with a maximal probability at $K = 1$ and minimal at $K = N$. The numerical computation of $P(j)$ is efficiently done with the PageRank algorithm described in *Brin and Page (1998)*; *Langville and Meyer (2012)*.

It is also useful to consider the network with inverted direction of links. After inversion the Google matrix $G^*$ is constructed by the same procedure, with $G^* P^* = P^*$. This matrix has its own PageRank vector $P^*(j)$, called CheiRank (*Chepelianskii, 2010*) (see also *Zhirov et al. (2010)*; *Ermann et al. (2015)*). Its probability values can again be ordered in decreasing order with the CheiRank index $K^*$, with the highest $P^*$ at $K^* = 1$ and the smallest at $K^* = N$. On average, high values of $P$ ($P^*$) correspond to nodes with many ingoing (outgoing) links (*Ermann et al., 2015*).

### Reduced Google matrix analysis

The reduced Google matrix is constructed for a selected subset of nodes (articles) following the method described in *Frahm and Shepelyansky (2016)*; *Frahm et al. (2016)*; *Lages et al. (2018)*. It is based on concepts of scattering theory used in different fields including mesoscopic and nuclear physics, and quantum chaos (see Refs. in *Frahm and Shepelyansky (2016)*). It captures in an $n_r$-by-$n_r$ Perron-Frobenius matrix the full contribution of direct and indirect interactions happening in the full Google matrix between the $n_r$ nodes of interest. Also, the PageRank probabilities of the selected $n_r$ nodes are the same as for the global network with $N$ nodes, up to a constant multiplicative factor taking into account that the sum of PageRank probabilities over the $n_r$ nodes is unity. The elements of the reduced matrix $G_R(i, j)$ can be interpreted as the probability for a random surfer starting at web-page $j$ to arrive at web-page $i$ using direct and indirect interactions. Indirect interactions refer to paths composed in part of web-pages different from the $n_r$ ones of interest. The intermediate computation steps of $G_R$ offer a decomposition of $G_R$ into matrices that clearly distinguish direct from indirect interactions: $G_R = G_{rr} + G_{pr} + G_{qr}$ (*Frahm et al., 2016*). Here $G_{rr}$ is given by the direct links between the selected $n_r$ nodes in the global $G$ matrix with $N$ nodes. The component $G_{pr}$ is rather close to the matrix in which each column is given by the PageRank vector $P_r$, ensuring that the PageRank probabilities of $G_R$ are the same as for $G$ (up to a constant multiplier). Thus $G_{pr}$ does not provide much information about direct and indirect links between selected nodes. The component playing an interesting role is $G_{qr}$, which takes into account all indirect links between selected nodes appearing due to multiple paths via the $N$ nodes of the global network (see *Frahm and Shepelyansky (2016)*; *Frahm et al. (2016)*). The matrix $G_{qr} = G_{qrd} + G_{qrnd}$ has diagonal ($G_{qrd}$) and non-diagonal ($G_{qrnd}$) parts. Thus $G_{qrnd}$ describes indirect interactions between nodes. The explicit formulas as well as the mathematical and numerical computation methods of all three components of $G_R$ are given in *Frahm and Shepelyansky (2016)*; *Frahm et al. (2016)*; *Lages et al. (2018)*.

After obtaining the matrix $G_R$ and its components we can analyze the PageRank sensitivity with respect to specific links between the $n_r$ nodes. To measure the sensitivity of a country $c$ to a disease $d$ we change the matrix element $G_R(d \to c)$ by a factor $(1 + \delta)$ with $\delta \ll 1$, we renormalize to unity the sum of the column elements associated with disease $d$, and we compute the logarithmic derivative of the PageRank probability $P(c)$ associated with country $c$: $D(d \to c, c) = d \ln P(c)/d\delta$ (diagonal sensitivity). It is also possible to consider the nondiagonal (or indirect) sensitivity $D(d \to c, c') = d \ln P(c')/d\delta$, when the variation is done for the link from $d$ to $c$ and the derivative of the PageRank probability is computed for another country $c'$. This approach was already used in *El Zant et al. (2018a,b)*, showing its efficiency.

## Results

### Network of direct links

For the reduced Google matrix analysis we have $n_r = 425$ selected nodes of countries (195) and infectious diseases (230). The diseases are attributed to 7 groups corresponding to the standard disease types, as given in Tab. 1, Tab. 2. These $n_r$ nodes constitute a subnetwork embedded in the huge global English Wikipedia network with more than 5 million nodes. This subnetwork is shown in Fig. 1, which has been generated with the Cytoscape software (*Shannon et al., 2003*). In Fig. 1 black arrow links represent the nonzero elements of the adjacency matrix between the selected $n_r$ nodes. The image of this adjacency matrix is shown in Fig. 2, where white pixels depict a link between two nodes. In this picture, nodes are ordered with respect to the PageRank order in each subgroup: American countries, European countries, Asian countries, African countries, Oceanian countries, bacterial diseases, viral diseases, parasitic diseases, fungal diseases, multiple origin diseases, prionic diseases and diseases of other origins. There are visibly more links inside subgroups but links between groups are also significant. Fig. 1 gives us the global view of the network of direct links, shown in Fig. 2, corresponding to the component $G_{rr}$ of the reduced Google matrix. We see that countries are located in the central part of the network of Fig. 1 since they have many ingoing links. While it is useful to have such a global view, it is clear that it does not take into account the indirect links appearing between the $n_r$ nodes due to pathways via the complementary network part with a huge number of nodes $N - n_r \simeq N$. The indirect links emerging between the $n_r$ nodes from these indirect pathways are analyzed in the frame of the REGOMAX method below.

### PageRank and CheiRank of the reduced network nodes

At first we compute the PageRank and CheiRank probabilities for the global network with $N$ nodes, attributing to each node the PageRank and CheiRank indexes $K$ and $K^*$. For the selected $n_r$ nodes the results of PageRank are shown in Tab. 2. As usual (see *Zhirov et al. (2010)*; *Ermann et al. (2015)*), the countries take the top PageRank positions with US, France, Germany, etc. at $K = 1, 2, 3$, etc. as shown in Tab. 3. In the list of $n_r$ nodes the infectious diseases start to appear from $K = 106$ with Tuberculosis (Tab. 2). If we consider only infectious diseases ordered by their disease PageRank index $K_d$ then we obtain at the top Tuberculosis, HIV/AIDS, Malaria, Pneumonia, Smallpox at the first positions with $K_d = 1, 2, 3, 4, 5$ (see Tab. 2). It is clear that the PageRank order places at the top positions severe infectious diseases which are (were for Smallpox) very broadly spread worldwide.

In Fig. 3 we show the location of the selected $n_r$ nodes on the global $(K, K^*)$ plane of density of Wikipedia articles (see details of this representation in *Zhirov et al. (2010)*; *Ermann et al. (2015)*). Here the positions of countries are shown by white circles and diseases by color circles. The countries take the top positions since they have many ingoing links from a variety of other articles. The infectious diseases are located at higher values of $K, K^*$, even if some diseases overlap with the end of the list of countries (see Tab. 2).

All $n_r = 425$ selected articles can be ordered by their local PageRank and CheiRank indexes $K_r$ and $K_r^*$, which range from 1 to $n_r = 425$. Their distribution in the local PageRank-CheiRank plane is shown in Fig. 4. As discussed previously, countries are at the top $K_r$, $K_r^*$ positions. The names of the top PageRank diseases are marked on the figure. The most communicative articles of infectious diseases are those with top $K_r^*$ positions. Thus the top CheiRank disease is Burkholderia due to the many outgoing links present in this article. The next ones are Malaria and HIV/AIDS.
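The ranking procedure used in this section can be illustrated on a small example. The sketch below builds $G$ from an adjacency matrix with the damping factor $\alpha = 0.85$, obtains PageRank by power iteration, and obtains CheiRank by repeating the computation on the network with inverted links. The 5-node network and the function names `google_matrix`, `pagerank` and `rank_index` are hypothetical and chosen for illustration only; the dense-matrix approach is an assumption that would not scale to the full Wikipedia network.

```python
import numpy as np

def google_matrix(adjacency, alpha=0.85):
    """Column-stochastic Google matrix G = alpha*S + (1 - alpha)/N.

    adjacency[i, j] = 1 if node j points to node i, and 0 otherwise.
    """
    A = np.asarray(adjacency, dtype=float)
    N = A.shape[0]
    k_out = A.sum(axis=0)                  # out-degree of each node j
    S = np.full((N, N), 1.0 / N)           # dangling nodes keep S_ij = 1/N
    linked = k_out > 0
    S[:, linked] = A[:, linked] / k_out[linked]
    return alpha * S + (1.0 - alpha) / N

def pagerank(G, tol=1e-12, max_iter=10_000):
    """Power iteration for G P = P with sum(P) = 1."""
    P = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(max_iter):
        P_next = G @ P
        P_next /= P_next.sum()
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

def rank_index(P):
    """Index K = 1 for the largest probability, K = N for the smallest."""
    order = np.argsort(-P, kind="stable")
    K = np.empty(len(P), dtype=int)
    K[order] = np.arange(1, len(P) + 1)
    return K

# Hypothetical toy network (not Wikipedia data): A[i, j] = 1 if j -> i.
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])
G = google_matrix(A)
G_star = google_matrix(A.T)    # network with inverted link directions
P = pagerank(G)                # PageRank probabilities
P_star = pagerank(G_star)      # CheiRank probabilities
K, K_star = rank_index(P), rank_index(P_star)
print("PageRank index K :", K)
print("CheiRank index K*:", K_star)
```

In this toy network node 0 receives the most ingoing links and node 4 has neither ingoing nor outgoing links, so it is treated as a dangling node in both $G$ and $G^*$.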

### Reduced Google matrix

To study further the selected subset of 425 nodes we use the reduced Google matrix approach and compute numerically $G_R$ and its three components $G_{pr}$, $G_{rr}$, $G_{qr}$. It is convenient to characterize each component by its weight, defined as the sum of all elements divided by the matrix size $n_r$. By definition we have the weight $W_R = 1$ for $G_R$, and we obtain the weights $W_{pr} = 0.91021$, $W_{rr} = 0.04715$, $W_{qr} = 0.04264$ (with nondiagonal weight $W_{qrnd} = 0.02667$) respectively for $G_{pr}$, $G_{rr}$, $G_{qr}$ ($G_{qrnd}$). The weight of $G_{pr}$ is significantly larger than the others, but this matrix is close to the matrix composed of equal columns where each column is the PageRank vector (see also discussions in *Frahm et al. (2016)*; *El Zant et al. (2018a)*; *Lages et al. (2018)*). For this reason the components $G_{rr}$ and $G_{qr}$ provide important information about the interactions of nodes. Since the weights of these two components are approximately equal, we see that the direct and indirect (hidden) links have a comparable contribution.
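The decomposition and its weights can be sketched numerically on a small random network. The explicit formulas are not reproduced in this paper, so the sketch assumes the form given in the cited REGOMAX references: with $G$ split into blocks for the selected nodes ($r$) and the remaining nodes ($s$), $G_R = G_{rr} + G_{rs}(1 - G_{ss})^{-1} G_{sr}$, and $G_{pr}$ is the contribution of the leading eigenvector of $G_{ss}$. The 30-node random network, the choice of selected nodes and the function names are hypothetical, and a dense linear solve is used here only because the example is tiny.

```python
import numpy as np

def reduced_google_matrix(G, selected):
    """Return G_R and the components G_rr, G_pr, G_qr for the selected nodes."""
    N = G.shape[0]
    r = np.asarray(selected)
    s = np.setdiff1d(np.arange(N), r)            # complementary nodes
    G_rr, G_rs = G[np.ix_(r, r)], G[np.ix_(r, s)]
    G_sr, G_ss = G[np.ix_(s, r)], G[np.ix_(s, s)]

    # Assumed REGOMAX form: G_R = G_rr + G_rs (1 - G_ss)^{-1} G_sr
    G_R = G_rr + G_rs @ np.linalg.solve(np.eye(len(s)) - G_ss, G_sr)

    # Leading eigenvalue of G_ss with its right and left eigenvectors
    vals, right = np.linalg.eig(G_ss)
    k = np.argmax(vals.real)
    lam = vals[k].real
    psi_right = right[:, k].real
    vals_left, left = np.linalg.eig(G_ss.T)
    psi_left = left[:, np.argmax(vals_left.real)].real
    projector = np.outer(psi_right, psi_left) / (psi_left @ psi_right)

    G_pr = G_rs @ projector @ G_sr / (1.0 - lam)  # projector (PageRank-like) part
    G_qr = G_R - G_rr - G_pr                      # remaining indirect part
    return G_R, G_rr, G_pr, G_qr

def weight(M):
    """Sum of all matrix elements divided by the matrix size n_r."""
    return M.sum() / M.shape[0]

# Hypothetical random directed network (not Wikipedia data).
rng = np.random.default_rng(0)
N, alpha = 30, 0.85
A = (rng.random((N, N)) < 0.15).astype(float)
np.fill_diagonal(A, 0.0)
k_out = A.sum(axis=0)
S = np.where(k_out > 0, A / np.where(k_out > 0, k_out, 1.0), 1.0 / N)
G = alpha * S + (1.0 - alpha) / N

selected = [0, 3, 7, 11]                          # hypothetical nodes of interest
G_R, G_rr, G_pr, G_qr = reduced_google_matrix(G, selected)
print("W_pr, W_rr, W_qr:", weight(G_pr), weight(G_rr), weight(G_qr))
print("column sums of G_R:", G_R.sum(axis=0))
```

Two properties stated in the text can be checked on such an example: the columns of $G_R$ sum to unity, so that $W_R = 1$, and the PageRank vector of $G_R$ coincides with the global PageRank restricted to the selected nodes after normalization.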

As an illustration we show in Fig. 5 a close-up on the African countries and viral diseases sectors of the full $n_r \times n_r$ reduced Google matrix $G_R$ (there are 55 African countries and 60 viral diseases shown in Fig. 5). Detailed presentations of the $G_R$ matrix components for the complete subset of $n_r = 425$ countries and infectious diseases are given in *Wiki4InfectiousDiseases (2018)*. In Fig. 5, $G_R$ and its components are composed of diagonal blocks corresponding to country → country and disease → disease effective links, and off-diagonal blocks corresponding to disease → country (upper off-diagonal block) and country → disease (lower off-diagonal block) effective links.

### Friendship network of nodes

We use the matrix of direct and indirect transitions $G_{rr} + G_{qr}$ to determine the proximity relations between the 230 nodes corresponding to diseases and between all $n_r = 425$ nodes of diseases and countries. We call the resulting networks friendship networks; they are shown in Fig. 6. For each of the 7 disease groups (see Tab. 2) we take as group leader the disease with the highest PageRank probability inside the group. Then at each step (level) we take the 2 best friends, defined as those nodes to which a leader has the two highest transition matrix elements of $G_{rr} + G_{qr}$. This gives us the second level of nodes below the 7 leaders. After that we generate the third level, keeping again the two best friends of the nodes of the second level (those with the highest transition probabilities). This algorithm is repeated until no new friends are found. In this way we obtain the network of 17 infectious diseases shown in the top panel of Fig. 6. The full arrows show the proximity links between disease nodes. The red arrows mark links with a dominant contribution of $G_{qr}$ indirect transitions while the black ones mark the links with a dominance of $G_{rr}$ direct transitions. Full arrows are for transitions from group leaders to nodes of the second level, etc. (see Fig. 6 caption for details). The obtained network is drawn with the Cytoscape software (*Shannon et al., 2003*).

In the top panel of Fig. 6, four of the leader diseases are well connected; the nodes corresponding to Tuberculosis (bacterial disease), HIV/AIDS (viral disease), Malaria (parasitic disease), Pneumonia (multiple origin disease) have degree 6 or more. The nodes of Creutzfeldt-Jakob disease (prion disease) and of Desmodesmus (other origin disease) are more isolated. Focusing on first level friendship links (solid arrows), we retrieve several well known interactions between infectious diseases such as:

- the Tuberculosis–HIV/AIDS syndemic (see e.g. *Pawlowski et al., 2012*; *Kwan and Ernst, 2011*), which is here represented by a closed loop between the two diseases,
- the interaction between HIV/AIDS and Syphilis (see e.g. *Karp et al., 2009*), which is a typical example of a syndemic between AIDS and sexually transmitted diseases,
- the Candidiasis interaction with HIV/AIDS, since the former is a very common opportunistic fungal infection for patients with HIV/AIDS (see e.g. *Armstrong-James et al., 2014*), and the Candidiasis interaction with Sepsis, since invasive Candidiasis can in some rare cases lead to fulminant sepsis with an associated mortality exceeding 70% (see e.g. *Pappas et al., 2018*),
- the Pneumonia to Sepsis interaction and the Malaria to Sepsis interaction; the first reflects the fact that Sepsis is one of the possible complications of Pneumonia, the second reflects that the symptoms of Malaria resemble those of Sepsis (*Auma et al. (2013)*),
- the closed loop interaction between Tuberculosis and Leprosy, reflecting that these two diseases are caused by two different species of mycobacteria (see e.g. *Wikipedia, 2018e*),
- the relation between Pneumonia and Tuberculosis, two severe pulmonary diseases (see e.g. *WHO, 2018*),
- the Creutzfeldt–Jakob disease pointing to Pneumonia, since patients infected by this prion disease develop a fatal Pneumonia due to impaired coughing reflexes (see e.g. *Al Balushi et al., 2016*),
- the closed loop interaction between Kuru and Creutzfeldt–Jakob diseases, since these two diseases are representatives of transmissible spongiform encephalopathies (see e.g. *Sikorska and Liberski, 2012*).

Taking into account also the other 2nd to 4th friendship levels, peculiar features appear such as:

- the cluster of bacterial diseases Tuberculosis–Leprosy–Syphilis, since Leprosy and Syphilis were confused in diagnosis before the XXth century (see e.g. *Leprosy and Syphilis, 1890*; *Syphilis and Leprosy, 1899*), and false positives can occur with Tuberculosis for patients with Syphilis (see e.g. *Shane, 2006*),
- the mosquito diseases cluster grouping Malaria, Yellow fever, Dengue fever and West Nile fever,
- the Meningitis–Sepsis closed loop, since Sepsis usually develops at an early stage in patients with Meningitis (see e.g. *CDC, 2018*).

Red arrows in Fig. 6 indicate purely indirect links between infectious diseases: Desmodesmus and Malaria are both waterborne diseases (see e.g. *Wikipedia, 2018f*), Desmodesmus and HIV/AIDS are related by a Wikipedia page devoted to immunocompetence (see e.g. *Wikipedia, 2018b*), and Kuru in the Foré language of Papua New Guinea possibly means to shiver from cold (see e.g. *Liberski and Ironside, 2015*).

From the above analysis we observe that the wiring between infectious diseases is meaningful, supporting the view that the information encoded in the reduced Google matrix $G_R$, and more precisely in its $G_{rr} + G_{qr}$ component, is reliable and can be used to infer possible relations between infectious diseases and any other subjects contained in Wikipedia, such as countries, drugs, proteins, etc.

In the bottom panel of Fig. 6 we analyze the proximity between the diseases of the top panel and the world countries. We add the two best “friend” countries, being those to which a given disease has the strongest matrix elements in $G_{rr} + G_{qr}$ (there are no further iterations for country nodes). The friend countries (or proximity countries) are Egypt and Swaziland for Tuberculosis; Cameroon and Cote d’Ivoire for HIV/AIDS; Peru and Thailand for Malaria; Liberia and Uganda for Pneumonia; United Kingdom (UK) and Papua New Guinea for Creutzfeldt-Jakob disease, and others. These strong links from an infectious disease to a given country correspond well to known events involving a disease and a country, e.g. UK and Creutzfeldt-Jakob disease. We will see this in a more direct way using the sensitivity analysis presented in the next subsection.
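The construction of the friendship network described in this subsection can be sketched as a level-by-level expansion. The function below assumes a matrix `M` playing the role of $G_{rr} + G_{qr}$, with `M[i][j]` the transition weight from node `j` to node `i`; the 5-node matrix, its values and the function name are hypothetical and serve only to illustrate the procedure.

```python
def friendship_network(M, leaders, candidates, n_friends=2):
    """Expand from the leaders, keeping the n_friends strongest links per node.

    M[i][j] is the transition weight from node j to node i.
    Returns the list of directed edges (j, i) and the set of reached nodes.
    """
    edges = []
    seen = set(leaders)
    level = list(leaders)
    while level:                               # stop when no new friends appear
        next_level = []
        for j in level:
            ranked = sorted(
                (i for i in candidates if i != j),
                key=lambda i: M[i][j],
                reverse=True,
            )
            for i in ranked[:n_friends]:
                edges.append((j, i))
                if i not in seen:              # only new nodes form the next level
                    seen.add(i)
                    next_level.append(i)
        level = next_level
    return edges, seen

# Hypothetical 5-node transition matrix (columns are source nodes).
M = [
    [0.00, 0.60, 0.40, 0.50, 0.70],
    [0.50, 0.00, 0.15, 0.10, 0.05],
    [0.30, 0.20, 0.00, 0.10, 0.05],
    [0.10, 0.10, 0.35, 0.00, 0.20],
    [0.10, 0.10, 0.10, 0.30, 0.00],
]
edges, nodes = friendship_network(M, leaders=[0], candidates=range(5))
print(edges)
print(sorted(nodes))
```

To color an edge as in Fig. 6, one would additionally compare the $G_{qr}$ and $G_{rr}$ contributions to the corresponding matrix element; that step is omitted here.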

### World country sensitivity to infectious diseases

We also perform an analysis of the sensitivity $D(d \to c, c')$ of a country node $c'$ to the variation of the link $d \to c$, where $d$ denotes a disease node and $c$ a country node. The diagonal sensitivities $D(d \to c, c)$ of world countries to Tuberculosis and HIV/AIDS are shown in Fig. 7. The countries most sensitive to Tuberculosis are Swaziland (SZ), Egypt (EG) and New Zealand (NZ). Indeed, in 2007 SZ had the highest estimated incidence rate of Tuberculosis, as described in the corresponding Wikipedia article. Egypt also appears in this article since tubercular decay has been found in the spine of an Egyptian mummy kept in the British Museum. NZ is present in this article since this country had a relatively successful effort to eradicate bovine tuberculosis. Thus Tuberculosis has direct links to these three countries (in agreement with the two close country friends shown in the network of Fig. 6), which results in their high sensitivity to this disease. Of course, the origins of this sensitivity are different for SZ, EG, NZ. Thus the Wikipedia network integrates all historical events related to Tuberculosis, including the ancient Egyptian mummy and the recent years of high incidence rate in SZ. One can debate how important these rather different types of links between a disease and countries are. Of course, a simplified network view cannot take into account all the richness of historical events and describe them by a small number of links. However, this approach provides a reliable global view of the interactions and dependencies between a disease and world countries.

The sensitivity of countries to HIV/AIDS is shown in the top panel of Fig. 7. The most sensitive countries are Botswana (BW), Senegal (SN) and Cote d’Ivoire (CI). This happens since the HIV/AIDS article directly states that the estimated life expectancy in BW dropped from 65 to 35 years in 2006; SN and CI appear since the closest relative of HIV-2 exists in a monkey living in coastal West Africa from SN to CI. The friendship network in Fig. 6 indeed marks CI and Cameroon (CM) as the countries close to HIV/AIDS. The sensitivity map in Fig. 7 also shows that CM has a high sensitivity to HIV/AIDS, since HIV-1 appears to have originated in southern Cameroon.

The case of the two diseases considered in Fig. 7 demonstrates that the REGOMAX approach is able to reliably determine the sensitivity of world countries to infectious diseases, taking into account their relations on a scale of about 3 thousand years.

A part of the sensitivity relations between diseases and countries is visible in the friendship network, as in the bottom panel of Fig. 6. However, the REGOMAX approach can also handle the indirect sensitivity $D(d \to c, c')$, which is rather hard to extract directly from the friendship network. Examples of indirect sensitivity are shown in Fig. 8. The variation of the link from HIV/AIDS to Cameroon (CM) (Fig. 8 bottom panel) mainly affects Equatorial Guinea (GQ), Central African Republic (CF) and Chad (TD). The variation of the link from HIV/AIDS to USA (Fig. 8 top panel) produces the strongest sensitivity for the Federated States of Micronesia (FM), Marshall Islands (MH) and Rwanda (RW). These countries are not present in the Wikipedia article HIV/AIDS, and the obtained sensitivity emerges from the complex network interconnections between HIV/AIDS, USA (or Cameroon) and these countries. Thus the REGOMAX analysis makes it possible to recover the full network complexity of direct and indirect interactions between nodes.
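The sensitivity $D(d \to c, c')$ can be estimated numerically by a finite difference, following the steps described in the methods: multiply the element for the link $d \to c$ by $(1 + \delta)$, renormalize the column of $d$ to unity, recompute PageRank and take the change of $\ln P$ divided by $\delta$. In the sketch below a small random column-stochastic matrix stands in for $G_R$; this matrix, the value $\delta = 10^{-4}$ and the function names are assumptions made for illustration.

```python
import numpy as np

def pagerank(G, tol=1e-13, max_iter=100_000):
    """Power iteration for G P = P with sum(P) = 1."""
    P = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(max_iter):
        P_next = G @ P
        P_next /= P_next.sum()
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

def sensitivity(G_R, d, c, delta=1e-4):
    """Finite-difference estimate of D(d -> c, c') = d ln P(c') / d delta.

    Returns one value per node c'; entry c is the diagonal sensitivity.
    """
    P0 = pagerank(G_R)
    G_mod = G_R.copy()
    G_mod[c, d] *= 1.0 + delta            # enhance the link d -> c
    G_mod[:, d] /= G_mod[:, d].sum()      # renormalize column d to unity
    P1 = pagerank(G_mod)
    return (np.log(P1) - np.log(P0)) / delta

# Hypothetical 6-node column-stochastic matrix standing in for G_R.
rng = np.random.default_rng(1)
M = rng.random((6, 6)) + 0.05
M /= M.sum(axis=0)

D = sensitivity(M, d=0, c=3)
print("diagonal sensitivity D(0 -> 3, 3):", D[3])
print("indirect sensitivities:", np.delete(D, 3))
```

Since the total PageRank probability stays normalized, an increase of $P(c)$ must be compensated by decreases elsewhere, which is why the indirect sensitivities of other nodes can be negative.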

### Comparison of REGOMAX and WHO results

It is important to compare the results of the REGOMAX analysis with those of the World Health Organization (WHO) or other sources on the number of infected people. With this aim we extract from WHO reports (*WHO, 2008, 2018*) the number of Tuberculosis incidences per 100 000 population of a given country, and the number of new HIV/AIDS infections per 1000 uninfected population. The ranking of countries by the number of incidences is presented in Tab. 4 for Tuberculosis and the ranking of countries by new infections is presented in Tab. 5 for HIV/AIDS. We also analyze WHO 2004 data (*WHO, 2008*) for the number of deaths caused by each disease in 2004, complemented by the data reported in *Wikipedia (2018a)*. The resulting ranking of diseases is given in Tab. 6.

We compare these official WHO ranking results with those obtained from the REGOMAX analysis. Thus we determine the ranking of countries by their sensitivity to Tuberculosis and to HIV/AIDS for the top 100 countries (these ranking lists are given in *Wiki4InfectiousDiseases (2018)*). The overlap $\eta(j)$ of these REGOMAX rankings with those of WHO from Tab. 4 and Tab. 5 is shown in Fig. 9. We obtain an overlap of 50% for Tuberculosis and 79% for HIV/AIDS for the top 100 countries. These numbers are comparable with the overlaps obtained for the top 100 historical figures found from Wikipedia and historical analysis (see *Eom et al. (2015)*) and for the top 100 world universities determined by Wikipedia and the Shanghai ranking (see *Lages et al. (2016)*; *Coquidé et al. (2018)*).

Another comparison is presented in Fig. 10. Here we take the infectious diseases ordered by their PageRank index and compare them with the ranking list of diseases ordered by the number of deaths caused by them in 2004 (see Tab. 6). The obtained overlap is shown in Fig. 10, with an overlap of 100% for the top 4 deadliest diseases (which from Tab. 6 are 1 Pneumonia, 2 HIV/AIDS, 3 Tuberculosis, 4 Malaria) and 54% for the whole list of 31 considered diseases. In addition, the REGOMAX analysis makes it possible to determine the world map of countries with the highest sensitivity to the top list of 7 diseases. Such maps are shown in Fig. 11. We see the world dominance of Tuberculosis (top panel), while after it (or in its absence) we find the world dominance of HIV/AIDS (bottom panel). The geographical influence of Malaria and Diarrheal diseases is also well visible.

Thus the comparison with WHO data shows that the Google matrix analysis of the Wikipedia network provides reliable information about the world importance and influence of infectious diseases.
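The overlap $\eta(j)$ is not defined explicitly in this section. A common choice, assumed in the sketch below, is the fraction of items shared by the top $j$ entries of the two rankings. The two lists use only the disease orderings quoted in the text (the top PageRank diseases and the four deadliest diseases of Tab. 6); the function name is hypothetical.

```python
def overlap(ranking_a, ranking_b, j):
    """Fraction of items common to the top-j entries of two rankings."""
    top_a, top_b = set(ranking_a[:j]), set(ranking_b[:j])
    return len(top_a & top_b) / j

# Orderings quoted in the text.
pagerank_order = ["Tuberculosis", "HIV/AIDS", "Malaria", "Pneumonia", "Smallpox"]
who_deaths_order = ["Pneumonia", "HIV/AIDS", "Tuberculosis", "Malaria"]

for j in range(1, 5):
    print(j, overlap(pagerank_order, who_deaths_order, j))
```

With this definition the two lists share all of their top 4 entries, consistent with the 100% overlap reported for the top 4 deadliest diseases, even though the order within the top 4 differs.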

## Discussion

In this work we presented the reduced Google matrix (REGOMAX) analysis of the world influence of infectious diseases from the English Wikipedia network of 2017. This method takes into account all direct and indirect links between the selected nodes of countries and diseases. The importance of diseases is determined by their PageRank probabilities. The REGOMAX analysis establishes the network of proximity (friendship) relations between the diseases and countries. The sensitivity of world countries to a specific disease is determined, as well as the influence of a link variation between a disease and a country on other countries. The comparison with the WHO data confirms the reliability of the REGOMAX results applied to the Wikipedia network.