Factors that influence scientific productivity from different countries: A causal approach through multiple regression using panel data

The main purpose of national expenditure on research and development is to achieve higher levels of scientific findings within research ecosystems, which in turn could generate better living standards for society. Therefore, the body of scientific production constitutes a faithful image of the capacity, trajectory and scientific depth attributable to each country. The intention of this article is to contribute to the understanding of the factors that influence scientific production and how it could be improved. To this end, we selected a sample of 19 countries considered partners in science and technology. On the one hand, we downloaded social and economic variables (gross domestic expenditure on R&D (GERD) as a percentage of gross domestic product (GDP) and researchers in full-time equivalent (FTE)); on the other hand, we downloaded variables related to scientific results (total scientific production, scientific production by subject area and by institution, and citations received as a measure of impact), all within a 17-year time window. Through a causal model with multiple linear regression using panel data, the experiment confirms that two of the five selected independent (or explanatory) variables explain the amount of scientific production by 98% for the countries analyzed. An important conclusion that we highlight is the importance of checking compliance with statistical assumptions when using multiple regression in research studies. As a result, we built a reliable predictive model to analyze scenarios in which an increase in any of the independent variables has a positive effect on scientific production. This model allows decision makers to make comparisons among countries and helps in the formulation of future plans for national scientific policies.

Nevertheless, to date, no single criterion has been able to identify the influencing determinants because scientific production could be altered by other imperceptible external factors. To contribute to the body of studies on the behaviour of scientific production, we have developed a regression model using panel data to examine the scientific production of strategic countries through several influential variables simultaneously.

We selected a sample of 19 countries considered partners in science and technology, with data from a 17-year time window. The originality and main advantage of this study is that it applies multiple regression using panel data to a diverse set of countries considered strategic partners in science and technology. Through panel data we can capture unobservable heterogeneity, either between economic agents or over time, which cannot be detected by time-series or cross-sectional studies alone. Multiple regression with panel data enables more dynamic analysis by incorporating the temporal dimension, which enriches the study, particularly in periods of significant and multiple changes. Panel data models are frequently used in statistical and econometric studies. Multiple regression with panel data enables analysis of two important components of unobservable heterogeneity, i.e., individual-specific effects and temporal effects. This technique also gives researchers a greater number of observations, improving the quality and efficiency of the information: by increasing the sample size, we obtain more information about the population and, consequently, the degrees of freedom increase [37].
To choose the explanatory variables, we followed an empirical procedure based on previous statistical experiments with the scientific production variable. We initially selected a large number of variables that we considered to influence scientific production; however, they were ultimately reduced by statistical procedures. Note that even without an empirical procedure we could have affirmed, by research logic and observation of this phenomenon, that some external variables influence scientific production.

The variables chosen are the following. The most important is expenditure on research; without this economic input, nothing could be done in science. The second is human labor ('researchers'), which is also reasonable, since researchers represent the workforce that carries out research production. The third is each country's research preference, measured by scientific production in disciplines. The fourth is higher education and research institutions, because they host the scientific processes and play an enormous role in production. Finally, citations were selected as another exogenous variable. Citations motivate researchers to continue producing, collaborating, or developing a specific productive research line: when a group of researchers collaborate and these collaborations are successful with respect to impact and citations received, this motivates them to continue collaborating. Consequently, citations are a variable that encourages collaboration, and with collaboration scientific production increases.

bioRxiv preprint first posted online Feb. 22, 2019; doi: http://dx.doi.org/10.1101/558254. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Therefore, we have chosen the variables that we consider the most relevant for our study. We believe that the results could lead governments to reconsider their scientific policies when allocating economic resources.

A number of research questions are formulated: Could the selected variables together explain the behaviour of scientific production? If governments increase investment in science by 1%, would scientific production increase? Would increasing the number of researchers improve scientific production? Is a small group of institutions responsible for increasing scientific production? Is a concentrated number of disciplines responsible for increased scientific production? Does the total number of citations motivate researchers to continue publishing?

Our working hypothesis is as follows: variations in scientific production can be explained by the previously identified descriptive variables (economic expenditure, researchers, research preference, academic and research institutions, citations received), considering that these variables show dynamic behaviour over time. In other words, we contend that a causal relationship between scientific production and the exogenous variables can be captured by regression analysis using panel data.

We are aware that investment in research is not allocated equally across all scientific areas; however, we have analyzed all countries with the same variables to allow comparison. In a future study we could break down the economic investment allocated to the different areas and also account for the resulting production in those areas.
Similarly, we would like to point out that investment in research is channeled to different sectors; however, we consider that production can be seen as a way of materializing that economic injection.
[...] its coverage is statistically balanced in terms of subjects, countries, languages, and publishers. We have also used SciVal to extract the total number of citations received.

UNESCO takes the data from the OECD database. However, it is worth pointing out that many of the strategic countries do not belong to the OECD; therefore, for consistency across sources, we decided to use only the data from the UNESCO statistical database instead of the OECD's data.

GERD as a percentage of GDP: Gross domestic spending on R&D is defined as the total expenditure (current and capital) on R&D by all resident companies, research institutes, university and government laboratories, etc., in a country. It includes R&D funded from abroad but excludes domestic funds for R&D performed outside the domestic economy. This indicator is measured in million USD and as a percentage of GDP [50]. Note that in this context, GDP is defined as the sum of the gross value added by all resident producers in the economy, including distributive trades and transport. Here GDP includes product taxes but excludes subsidies not included in the value of the products. This definition is given to aid understanding of GERD/GDP; GDP itself has not been taken as an independent variable [50].
Multiple linear regression at a significance level of 5% (α = 0.05) was used to determine causality between scientific production and the predictor variables. Here the dependent variable (Y) represents scientific production, and the explanatory or independent variables (X) are GERD/GDP, researchers, A&RI, subject areas, and citations. Since our sample combines temporal and cross-sectional dimensions, the most adequate model to explain causality can be obtained using panel data. This allows us to analyze the general effect and to observe the individual outcome for each country, taking into account the influence of the explanatory variables on the dependent variable.
To measure this effect, we also created binary dummy variables to quantify the effect of each country on the scientific production variable. Each dummy takes a value of 1 when the observation belongs to the analyzed country and 0 otherwise.
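The dummy coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the country list is inferred from the countries named in the results, Mexico is taken as the reference category as in the results, the year range 1996-2012 is an assumed 17-year window, and the names `COUNTRIES`, `YEARS`, `country_dummies`, and `panel` are hypothetical.

```python
# Assumption: the 12 countries named in the results; Mexico is the reference.
COUNTRIES = [
    "Mexico", "Argentina", "Canada", "China", "France", "Germany",
    "Japan", "Russia", "South Korea", "Spain", "United Kingdom",
    "United States",
]
REFERENCE = "Mexico"
YEARS = list(range(1996, 2013))  # assumption: a 17-year window


def country_dummies(country, countries, reference):
    """Return one 0/1 indicator per non-reference country."""
    return {
        f"D_{name}": int(name == country)
        for name in countries
        if name != reference
    }


# One row per (country, year) pair: a balanced panel.
panel = [
    {"country": c, "year": y, **country_dummies(c, COUNTRIES, REFERENCE)}
    for c in COUNTRIES
    for y in YEARS
]

print(len(panel))  # 12 countries x 17 years
```

Leaving the reference country without its own indicator avoids perfect collinearity with the intercept; each remaining dummy is then read as a shift relative to the reference country.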
When extracting data from the UNESCO database, we detected incomplete data for some countries (i.e., Israel, South Africa, India, Colombia, Turkey, Brazil, and Chile); therefore, to maintain accuracy, we decided not to include these countries in the present study. The factors influencing their scientific production could be analyzed with other methods in future work.

As a result, a total of 12 countries with 204 observations for the 17-year period were considered in this study.

2.1 Statistical assumptions

We consider that checking the validity of the statistical assumptions required by multiple regression using panel data is a particular strength of this study. We deliberately decided to report in this paper the variables that did not comply with the statistical assumptions instead of removing them or choosing others. We noticed that in some published studies where regression analysis is used, statistical assumptions are not mentioned, are omitted, or are validated only with respect to multicollinearity. This could lead to unreliable and imprecise results, which we avoid in this study.

We consciously retain the variables that did not meet the assumptions to alert the scientific community that an incomplete use of regression in bibliometrics could damage scientific results.
We performed our statistical analyses using the SPSS (version 24) statistical package. We tested the statistical assumptions of multiple linear regression modeling because, to draw inferences about Y from the sample data, it is necessary to establish assumptions about the behavior of the error ε, which defines the random behavior of Y, and to perform experiments according to these assumptions.

The assumptions of multiple regression modeling that we evaluated are as follows.
Multicollinearity assumption: In addition to linearity, another principle of regression modeling is that the explanatory variables should not be correlated with each other. When two explanatory variables are strongly correlated, a collinearity problem exists, and when more than two are correlated, we have a multicollinearity problem. We used the following to identify whether such problems were present: a matrix of correlations between explanatory variables, the variance inflation factor (VIF), multicollinearity diagnostics, and the proportion of variance. In the correlation matrix, if two or more variables have a correlation coefficient greater than or equal to 0.9, there is a collinearity or multicollinearity problem. If the VIF is greater than or equal to 10, there are collinearity or multicollinearity problems. With the multicollinearity diagnostics, we can check the condition index, which measures the association between independent variables. Its value is the square root of the ratio of the largest to the smallest eigenvalue. If its value is greater than or equal to 30, there are strong multicollinearity problems, as long as this value is attributed to the explanatory variables. The proportion of variance measures the origin of multicollinearity. It represents the proportion of variance associated with each eigenvalue for each explanatory variable.

[...] Failure to comply with this assumption invalidates the statistical tests performed on the regression coefficients and the future values of Y. In our case, this assumption is the easiest to validate given that we have a large sample (n≥30), and in practice, according to the Central Limit Theorem for large samples, we conclude that our data meet the normality assumption.
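As an illustration of the multicollinearity thresholds given in this section (correlation ≥ 0.9, VIF ≥ 10, condition index ≥ 30), the following Python sketch checks two explanatory variables. It is not the SPSS procedure used in the study: it assumes exactly two predictors, so that the VIF reduces to 1/(1 − r²), and it computes the condition index from the 2×2 correlation matrix, whose eigenvalues are 1 + |r| and 1 − |r|. Statistical packages may scale the data differently, so their values need not match. All data values and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def collinearity_report(x1, x2):
    """Two-predictor collinearity diagnostics with the paper's thresholds."""
    r = pearson_r(x1, x2)
    if abs(r) >= 1.0:  # perfectly collinear predictors
        vif = cond = math.inf
    else:
        vif = 1.0 / (1.0 - r ** 2)  # with two predictors, R_j^2 = r^2
        # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|.
        cond = math.sqrt((1.0 + abs(r)) / (1.0 - abs(r)))
    return {
        "r": r,
        "vif": vif,
        "condition_index": cond,
        "r_flag": abs(r) >= 0.9,
        "vif_flag": vif >= 10,
        "cond_flag": cond >= 30,
    }


# Hypothetical data: a moderately correlated pair and a nearly collinear pair.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
moderate = collinearity_report(x1, [2.0, 1.0, 4.0, 3.0, 5.0])
severe = collinearity_report(x1, [1.1, 2.0, 3.1, 3.9, 5.0])
print(moderate)
print(severe)
```

With more than two predictors, each VIF would instead come from regressing one predictor on all the others, which this sketch does not attempt.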
Extreme and influential observations assumption: An observation that is distant from the rest of the data is considered an outlier. Both extreme and influential observations affect estimation because they considerably modify the estimates, producing, for example, high standard errors of the coefficients, low coefficients of determination, and coefficients with signs or magnitudes that are significantly different from their true values. We evaluated this using the Cook's distance criterion. An observation can be considered influential if its Cook's distance is greater than or equal to 1.
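To make the Cook's distance criterion concrete, here is a minimal Python sketch for the simplest case of one predictor plus an intercept. It uses the standard formula D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)², where e_i is the residual, h_ii the leverage, p the number of estimated coefficients, and s² the residual mean square. The data are hypothetical and the function name `cooks_distances` is an assumption; the study itself used SPSS with several predictors.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)  # residual mean square
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_mean) ** 2 / sxx
        distances.append((e * e / (p * mse)) * leverage / (1.0 - leverage) ** 2)
    return distances


# Hypothetical data: the last point is far from the rest in x and off-trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 2.0, 2.9, 4.1, 5.0, 6.1, 6.9, 8.0, 9.1, 5.0]

distances = cooks_distances(x, y)
influential = [i for i, d in enumerate(distances) if d >= 1.0]
print(influential)
```

Only observations whose distance reaches the threshold of 1 are listed, matching the rule stated above.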
[...] sections. One of the most important advantages of panel data with respect to other types of data is that they allow us to control for unobservable differences.

We work with a random effects model, in which ε_it is assumed to vary stochastically over i or t, requiring special treatment of the error variance matrix.

Table 2. Non-standardized estimates of the general model with variables GERD/GDP and A&RI without country particularity.

As we can observe, the GERD/GDP and A&RI variables explain the dependent variable. Here the reported significance (p-value) is 0.000, which is below α = 0.05; therefore, both coefficients are significant at the 95% confidence level.

Table 3
Table 3 shows that the adjusted R² value is 0.73, which means that the two predictor variables explain 73% of the variation in the scientific production variable, which is quite acceptable. The predictive value of the model with its two independent variables is high, as shown by the F values and statistical significance.

When the GERD/GDP and A&RI variables tend to 0 in the analyzed countries, on average, the scientific production is 1628 (e^7.395). The effects are seen when the percentage changes. For example, when the GERD/GDP variable increases by 1%, the effect on scientific production will be 8.118%, provided that the value of the A&RI variable remains constant. Likewise, if the number of institutions increases by 1% and GERD/GDP is constant, scientific production increases by 8.564%. In general, we observe that both variables have a positive influence on scientific productivity.

Table 4 shows the estimates of the model. This allows us to analyze the particularity of each country by applying the dummy variables.
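The baseline figure of 1628 quoted above is the exponential of the reported intercept, 7.395. The short Python check below reproduces that arithmetic; it assumes only that the intercept is expressed on a natural-log scale, as the exponential in the text indicates, and the variable names are hypothetical.

```python
import math

LOG_INTERCEPT = 7.395  # intercept value quoted in the text (natural-log scale)

# Back-transform the intercept to the original scale of scientific production.
baseline_production = math.exp(LOG_INTERCEPT)

print(round(baseline_production))  # 1628
```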
As can be seen in Table 6, the United States dummy variable is significant with respect to scientific production. However, the dummy variable for Germany's A&RI is not significant, and Germany will therefore have the effect of the reference country, i.e., Mexico.

Fig. 1 Effect on scientific production of different countries when variables GERD/GDP and A&RI tend to 0

When the economic investment and the presence of institutions tend to zero, the scientific production of Mexico is 7.068%. The countries whose dummy variables are not significant, and which therefore behave the same as Mexico, are Japan, Russia, Spain, and China. The effect of these countries is the same as that of Mexico, i.e., when the GERD/GDP and A&RI variables tend to zero, the scientific production of these countries is 7.068%. When these variables tend to zero, the countries that surpass Mexico in percentage of scientific production are Argentina with 14.285%, France with 21.033%, and the United Kingdom with 22.123%. The maximum value comes from Canada with 25.936%.

As mentioned in the introduction, scientific production could be increased through collaboration. In the following [...]

However, countries that behave like Mexico have a smaller percentage of documents in international collaboration and more in institutional collaboration. These countries are Japan, the Russian Federation, Spain, and China. Note that Mexico's international collaboration (37.9%) and institutional collaboration (36%) are nearly the same.

The only country with a lower value than the intercept is Germany, which has an effect of 4.756%, although it has a high percentage of international collaboration (40%).
South Korea has a value 2.087% lower than Mexico and has lower international collaboration (25%) and more institutional collaboration (44%). International collaboration explains the highest values in production with respect to Mexico because these countries have a greater percentage of collaboration than Mexico.

Figure 2 shows the effect of the different countries on scientific production relative to a 1% increase in GERD/GDP.

Fig. 2 Effect on scientific production when GERD/GDP increases by 1% in all countries

If we increase the GERD/GDP of all countries by 1%, scientific production increases in all of them except Argentina and France, whose scientific production diminishes, showing a negative effect. Argentina had a fluctuating and low investment in GERD/GDP throughout 1996-2008. Despite the government's low investment in science, scientific production continued to increase. In 2009, there was a boom in research investment, which was the maximum in the country's history to date. Since then, investment has continued to rise and has stabilized. We show this fact more clearly in Figure 3. The data from 2009-2012 (GERD/GDP) paired with 2012-2015 (scientific production) make the relationship significant in Argentina; however, as most of the 1996-2008 GERD/GDP data have a negative relationship with the 1999-2011 scientific production data, the regression interprets the relationship as negative because these data are the majority. If this same study were performed 10 years later and investment in Argentina continued to rise (as well as production), the sign would change and the relationship would be positive.
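The year ranges quoted for Argentina (GERD/GDP 1996-2008 with production 1999-2011, and GERD/GDP 2009-2012 with production 2012-2015) suggest that each investment year is matched with the production observed three years later. The Python sketch below builds that pairing. The fixed three-year lag is an inference from those ranges, not a specification stated in this section, and the names `LAG_YEARS` and `year_pairs` are hypothetical.

```python
LAG_YEARS = 3  # assumption inferred from the quoted year ranges

gerd_years = range(1996, 2013)  # GERD/GDP years 1996-2012

# Pair each GERD/GDP year with the production year LAG_YEARS later.
year_pairs = [(year, year + LAG_YEARS) for year in gerd_years]

print(year_pairs[0], year_pairs[-1])  # (1996, 1999) (2012, 2015)
```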
In contrast, the United States shows an effect that is 0.523 less than Mexico. Note that the United States maintains the number of institutions as constant over the years; thus, the variable was excluded from regression. The countries [...]