Evaluating institutional open access performance: Methodology, challenges and assessment

Open access to research outputs is rapidly becoming more important to the global research community and society. Changes are driven by funder mandates, institutional policy, grass-roots advocacy and culture change. It has been challenging to provide a robust, transparent and updateable analysis of progress towards open access that can inform these interventions, particularly at the institutional level. Here we propose a minimum reporting standard and present a large-scale analysis of open access progress across 1,207 institutions worldwide that shows substantial progress being made. The analysis detects responses that coincide with policy and funding interventions. Among the striking results are the high performance of Latin American and African universities, particularly for gold open access, whereas overall open access levels in Europe and North America are driven by repository-mediated access. We present a top 100 of global universities, with the world's leading institutions achieving around 80% open access for 2017 publications.


Introduction
Open access is a policy aspiration for research funders, organisations, and communities globally. While there is substantial disagreement on the best route to achieve open access, the idea that wider availability of research outputs should be a goal is broadly shared. Over the past decade, there has been a massive increase in the volume of publications available open access. Piwowar et al. (2018) showed that the global proportion of open access articles was about 45% for those published in 2015, compared to around 5% before 1990. A more recent projection suggests that 44% of all outputs ever published will be freely accessible in 2025 (Piwowar et al., 2019).
This massive increase has been driven in large part by policy initiatives. Medical research funders such as the Wellcome Trust and the Medical Research Council in the UK and the National Institutes of Health in the US led a wide range of funder policy interventions. Universities such as Harvard, Liège, Southampton and others developed local policies and infrastructures that became more widely adopted. Plan S, announced in 2018 by a coalition of funders, has as its goal the complete conversion of scholarly publishing to immediate open access. It is the most ambitious, and therefore the most controversial, policy initiative to date, with questions raised about the approach (Rabesandratana, 2019). Despite the scale and success (at least in some areas) of these interventions, there is limited comparative and quantitative research on which policy interventions have been most successful. In part this is due to a historical lack of high-quality data on open access, the heterogeneous nature of the global scholarly publishing endeavour, and the consequent lack of any baseline against which to make comparisons.
A recent report by Larivière and Sugimoto (2018) showed a link between the monitoring of policy and its effectiveness, describing strong performance for articles supported by funders that had implemented monitoring and compliance checks for their policies. By comparison, open access to works funded by Canadian funders, which did not monitor compliance, was shown to lag substantially even when disciplinary effects were taken into account.
There is also a need for critical and inclusive evaluation of open access performance that can address regional and political differences. For example, the SciELO project has successfully implemented an electronic publishing model for journals, resulting in a surge of journal-mediated open access (Packer, 2009; Wang et al., 2018). Recent work by Iyandemye and Thomas (2019) showed that, for biomedical research, there was a greater level of open access for articles published from countries with a lower GDP, particularly for those in sub-Saharan Africa. This provides evidence of national or regional effects on publication cultures that lead to open access. Meanwhile, Siler et al. (2018) showed that, for the field of Global Health, lower-ranked institutions are more likely to publish in closed outlets. They suggest this is due to the cost of article processing charges, highlighting the importance of considering institutional context when examining open access performance.

Change at the institutional level
We have argued (Montgomery et al., 2018) that the key to understanding and guiding the cultural changes that underpin a transition to openness is analysis at the level of research institutions. While funders, national governments, and research communities create the environments in which researchers operate, it is within their professional spaces that choices around communication, and their links to career progression and job security, are strongest. Analysis of how external policy leads to change at the level of universities is therefore critical. However, providing accurate and reliable data on open access at the university level is a challenge.
The most comprehensive work on open access at the university level currently available is that included in the CWTS Leiden Ranking (Robinson-Garcia et al., 2019). This utilises an internal Web of Science database and data from Unpaywall to provide estimates of open access over a range of timeframes. These data have highlighted the broad effects of funder policies (notably the performance of UK universities in response to national policies) while also providing standout examples from less expected regions (for instance Bilkent University in Turkey).
A concern in any university evaluation is the existing disciplinary bias in the large bibliographic sources used to support rankings. For example, the coverage of Web of Science and Scopus has been shown to be biased towards the sciences and the English language (Mongeon and Paul-Hus, 2016). We and others have also shown how evaluation frameworks based on single sources of output data can provide misleading results (Huang et al., 2020a). In a companion white paper to this article we provide more details of these issues with a sensitivity analysis of the data presented here (Huang et al., 2020b). If we are to make valid comparisons of universities across countries, regions and funders to examine the effectiveness of open access policy implementation, there is a critical need for evaluation frameworks that provide fair, inclusive, and relevant measurement of open access performance.

Challenges in evaluating institutions
Building a robust open access evaluation framework at the institutional level comes with a number of challenges. Alongside coverage of data sources are issues of scope (which institutions, what set of objects), metrics (absolute numbers or proportions) and data completeness. Our pragmatic assessment is that any evaluation framework should be tied to explicit policy goals and be shaped to deliver on them. Following from our work on open knowledge institutions (Montgomery et al., 2018), our goals in conducting an evaluation exercise and developing the framework are as follows:

1. Maximising the amount of research content that is accessible to the widest range of users, in the first instance focusing on existing formal research content for which metadata quality is sufficiently high to enable analysis
2. Developing an evaluation framework that elevates open access and open science to strategic issues for all research-intensive universities
3. Developing a framework that is sensitive to, and can support, universities taking a diversity of approaches and routes towards delivering on those goals

To deliver pragmatically on these goals, we therefore intend to:

1. Focus on research-intensive institutions, using existing rankings as a sample set
2. Seek to maximise the set of objects which we can collect and track while connecting them to institutions (i.e., favour recall over precision)
3. Focus on proportions of open access as a performance indicator rather than absolute numbers
4. Publicly report on the details of performance for high performing institutions (and provide strategic data on request to others)
5. Report on the diversity of paths being taken to deliver overall access by a diverse group of universities
6. Develop methodology that is capable of identifying which policy interventions have made a difference to outcome measures, and any 'signature' of those effects

Results

A reproducible workflow to evaluate institutional open access performance
We developed a reproducible workflow capable of quantifying a wide range of open access characteristics at the institutional level. The overall workflow is summarised diagrammatically in Figure 1. As we have noted previously (Huang et al., 2020a), there is a sensitivity associated with the choice of bibliographic data sources when they are used to create a ranking. For this analysis we therefore chose to combine all three datasets (i.e., Microsoft Academic, Web of Science and Scopus). In the companion white paper (Huang et al., 2020b) we provide a comprehensive sensitivity analysis of the use of these different datasets, the use of different versions of Unpaywall, and the relationship between confidence levels and sample size.
Briefly, it is our view that to provide a robust assessment of open access performance the following criteria must be met:

1. The set of outputs included in each category (here, institutions), and a traceable description of how they were collected, must be transparently described.

Top 100 global universities in terms of total open access, gold open access and green open access
In Figure 2 we present these top 100 lists. This is, to our knowledge, the first set of university rankings that provides a confidence interval on the quantitative variable being ranked and compensates for the multiple-comparisons effect. Across this top 100, the statistical difference between universities at the 95% confidence level shows that a simple numerical ranking cannot be justified. The high performance of a number of Latin American and African universities, together with a number of Indonesian universities, particularly with respect to gold open access, is striking. For Latin America this is sensitive to our use of Microsoft Academic as a data source (Huang et al., 2020b), showing the importance of an inclusive approach. The outcomes for Indonesian universities are also consistent with the latest country-level analysis (Van Noorden, 2019). These results suggest that the narrative of Europe and the USA driving a publishing-dominated approach to open access misses a substantial part of the full global picture.

The global picture and its evolution
To examine the global picture for the 1,207 universities in our dataset and to interrogate different paths to open access, we plot the overall level of repository-mediated ("green") and publisher-mediated ("gold") open access for each university over time, coloured by region as previously. Figure 3 presents the results for 2017 (with changes over time shown in the animated version; see also Supplementary Figure 9 for individual years).

Overall, universities in Oceania (Australia and New Zealand) and North America (Canada and the US) lag behind comparators in Europe (on repository-mediated open access) and Latin America (on open access publishing). Asian universities are highly diverse. As seen in Figure 2 there are some high performers in the top 100s, particularly for open access publishing, but many others lag behind. Africa is also highly diverse but with a skew towards high performance, with an emphasis on open access publishing. This may reflect our sampling, which is skewed towards institutions with the largest (formally recorded) publishing volumes, many of which receive significant portions of their funding from international donors with strong open access requirements. Latin American institutions show high levels of open access publishing throughout the period illustrated. This is due to substantial infrastructure investments in systems like SciELO starting in the 1990s.

Different institutional paths towards open access
In Figures 3 and 4 we see evidence of different paths towards open access, depending on context and resources. The idea of mapping these paths is shown explicitly for a subset of universities in Figure 5, which shows the paths taken by two sets of UK universities and a selection of Latin American universities.

Implications for evaluating open access and limitations
Previous work has mostly been limited to one-off evaluations and has provided a limited basis for longitudinal analysis. Our analysis process includes automated approaches for collecting the outputs related to specific universities and for analysing those outputs. Currently the addition of new universities and the updating of large data sources are partly manual, but we expect to automate these steps in the near future. Along with the format of this article, this will provide an updatable report and a longitudinal dataset that can serve as a consistent and growing evidence source for open access policy and implementation analysis.
While it is clear (Huang et al., 2020b) that our analysis has limitations in its capacity to provide comparable estimates of open access status across all universities, our approach does provide a reproducible and transparent view of overall global performance. There are challenges still to be addressed with respect to small universities and research organisations, and we have taken a necessarily subjective view of which institutions to include (see the Supplementary Methodology). Our approach systematically leaves out most universities with very small numbers of outputs (i.e., fewer than 100) and universities with extreme open access proportions, as these are the universities for which we have the least statistical confidence in the results. This is also in line with our intended focus on research-intensive universities. These small institutions are of significant interest but will require a different analysis approach. We include the full set of institutions in our dataset in Supplementary Figures 3 to 8.
We have used multiple sources of bibliographic information with the goal of gaining a more inclusive view of research outputs. Despite this, there are still limitations in the coverage of these data sources, and a likely bias towards STEM disciplines. In addition, the focus of Unpaywall on outputs with Crossref DOIs means that we are missing outputs in disciplines (e.g., the humanities) and output types (e.g., books) where the use of DOIs is less common. Furthermore, due to the nature of this work and to limitations on the use of the Web of Science and Scopus APIs, we collected data from these two sources over an extended period, during which the underlying databases may have changed. Although we expect such changes to be small, those effects are not explicitly represented in our data. For the other data sources we are able to precisely define the data dump used for our analysis, supporting reproducibility as well as modified analyses.

Requirements and approaches for improving open access evaluation
There have been many differing assessments of open access performance over the past 10-15 years. Many of the differences between them have been driven by details of the approach taken. This, combined with limited attention to reproducibility, has led to confusion and a lack of clarity on the rate and degree of progress towards open access (Green, 2019). As noted above, we believe that a minimum standard should be set for assessments of open access to support evidence-based policy making and implementation (see Section 2.1).
With such a minimum standard in hand, we can clearly identify areas in which to improve open access performance assessment. There is significant opportunity to improve the data sources on sets of outputs and how they can be grouped (e.g., by person, discipline, organisation or country). Improvements to institutional identifier systems such as the Research Organization Registry, increased completeness of metadata records (particularly the affiliation, ORCID and funder metadata provided by publishers via Crossref), and enhanced coverage of open access status data (for instance by incorporating data from CORE and BASE) will all improve coverage. There are also opportunities to expand coverage by incorporating a wider range of bibliographic data sources.

Conclusion
The evidence base for open access policy development and implementation has been hampered by a lack of consistency in analysis results and of clarity on how those results were obtained. In particular, it has been challenging to provide longitudinal and transparent results to monitor the effects of policy and support interventions. While not all readers will agree with the choices we have made in implementing an analysis process, we have aimed to provide sufficient transparency and reproducibility to allow for replication, critique and alternative approaches to this analysis. This can underpin a higher quality of debate and policy development globally, and aid in learning from successes in other regions. The value of analysis at the level of universities is that we gain a picture of open access performance across a diverse research ecosystem. We see differences across countries and regions, and differences between universities within countries. Overall, we see that there are multiple paths towards improving access, and that different paths may be more or less appropriate in different contexts. Most importantly, while further research is needed to unpick the details of the differences in open access provision, we hope this work provides a framework that enables this longitudinal analysis to be taken forward and used wherever it is needed.

Technical infrastructure, reproducibility and provenance
Our technical infrastructure is built with the aim of making both the data and the analysis code as openly available as possible. The data infrastructure is currently based in the Google Cloud Platform, mainly utilising Google Cloud Storage, Google Cloud Functions, and Google BigQuery. Cloud Functions are used to extract data from the data sources' APIs. The raw data is then stored in Cloud Storage for further processing. BigQuery is then used to merge and manipulate the raw data into the derived data formats we require for further analysis.
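As a minimal sketch of this pattern (bucket, dataset and table names here are illustrative, not the project's actual configuration), the two steps might look as follows in Python:

    from google.cloud import bigquery, storage

    def store_raw(payload: bytes, bucket_name: str, blob_path: str) -> str:
        """Persist a raw API response to Google Cloud Storage and return its URI."""
        storage.Client().bucket(bucket_name).blob(blob_path).upload_from_string(payload)
        return f"gs://{bucket_name}/{blob_path}"

    def load_to_bigquery(gcs_uri: str, table_id: str) -> None:
        """Load newline-delimited JSON from Storage into BigQuery for downstream merging."""
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON)
        client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()

A Cloud Function wrapping the first step for each data source API, with BigQuery tables and views layered over the loaded data, reproduces the division of labour described above.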
Derived data used for analysis can be found at Zenodo (Huang et al., 2020b). Updated datasets will also be provided and will be found at the same location. Raw data is not provided to preserve the anonymity of institutions and respect the terms of service of data providers. The SQL queries and code used to generate the derived datasets are described below and available via Zenodo.

The main article, the Supplementary Figures and this Supplementary Methodology were prepared as Jupyter notebooks to provide all analysis and visualisation code and to maximise reproducibility. These notebooks are available on GitHub and Zenodo (Huang et al., 2020c). All manipulation of derived data after import is explicitly conducted in the notebooks. The notebooks utilise a library for generating visualisations, which is provided with them. The only data manipulations performed by the visualisation library are to filter and re-shape the data for graphing.
Where possible we use a publicly available or defined data dump for our sources. In this article data from Crossref, Unpaywall, GRID and Microsoft Academic were available as data dumps.

Data sources
We integrate a variety of data sources in our data workflow to generate open access scores for a large set of universities. These sources include the following:

Web of Science
Web of Science is a large pay-walled online scientific citation indexing service, maintained by Clarivate Analytics, covering more than 90 million records and 1.4 billion cited references. It is often harvested by universities to build up their internal research information databases and, not without criticism, is also used by various university rankings to evaluate performance (e.g., the Academic Ranking of World Universities, CWTS Leiden Ranking and U-Multirank). It is an important tool for many stakeholders in academia. Web of Science includes a number of databases with varying levels of accessibility and information. For this study, we use the "organization-enhanced" search functionality to extract the publication metadata (and hence the corresponding DOIs) for each institution of interest (via our local access) from the Web of Science Core databases. Our access to Web of Science Core is restricted by our institutional subscription contract, which provides access to the following:

Scopus
Scopus is an abstract and citation database launched by Elsevier in 2004. It provides subscription access and also produces a range of quality measures such as the h-index, CiteScore and SCImago Journal Rank. For the purposes of our current work, we match each institution to its Scopus Affiliation ID and subsequently access the metadata of all publications related to each institution (again, via local access). The DOI, if present, is extracted from each publication's metadata.
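As an illustrative sketch of this lookup (endpoint and parameter names follow Elsevier's public Scopus Search API documentation as we understand it; pagination and error handling are omitted, and the helper name is ours):

    import requests

    SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

    def scopus_dois_first_page(affiliation_id: str, api_key: str) -> list:
        """Fetch the first page of DOIs for one Scopus Affiliation ID."""
        params = {"query": f"AF-ID({affiliation_id})", "apiKey": api_key, "count": 25}
        results = requests.get(SCOPUS_SEARCH_URL, params=params, timeout=30).json()
        entries = results["search-results"]["entry"]
        return [e["prism:doi"] for e in entries if "prism:doi" in e]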

Microsoft Academic
Microsoft Academic, re-launched in 2016, replaced the phased-out Microsoft Academic Search. It is a free public search engine for academic literature and uses semantic search technology developed by Microsoft Research. The database provides Affiliation Entity IDs for institutions. We use a snapshot of the Microsoft Academic database to extract publication metadata related to each institution.
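A hedged sketch of this extraction, assuming the standard MAG snapshot schema (Affiliations.GridId, PaperAuthorAffiliations and Papers.Doi) loaded into BigQuery under illustrative dataset paths:

    def mag_doi_query(grid_id: str) -> str:
        """Build a query returning the DOIs of all papers affiliated with one GRID ID."""
        return f"""
        SELECT DISTINCT LOWER(papers.Doi) AS doi
        FROM `project.mag.Affiliations` AS aff
        JOIN `project.mag.PaperAuthorAffiliations` AS paa
          ON paa.AffiliationId = aff.AffiliationId
        JOIN `project.mag.Papers` AS papers
          ON papers.PaperId = paa.PaperId
        WHERE aff.GridId = '{grid_id}' AND papers.Doi IS NOT NULL
        """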

Times Higher Education World University Rankings
This is an annual ranking produced by the Times Higher Education (THE) magazine. It is one of the most closely followed university rankings, together with the Academic Ranking of World Universities and the Quacquarelli Symonds (QS) World University Rankings. The major components of the THE Ranking include its reputation survey and citation data from Web of Science. As a means of comparison, we selected the top 1000 universities in the 2019 THE Ranking as our primary sample for calculating OA scores. This is supplemented with additional universities for countries with limited coverage in the primary sample. Subsets of universities are selected for longitudinal studies.

Unpaywall
Unpaywall is a browser extension and database for finding free, legal versions of paywalled research publications. It currently covers more than twenty-two million free scholarly articles and provides a wide range of OA-related metadata, such as journal OA status (via DOAJ) and open license information. It has recently been integrated into the Web of Science and Scopus databases. For this study, each DOI of interest is matched against its Unpaywall metadata to determine its various OA statuses. Snapshots of the Unpaywall database are collected as part of the data processing for this project.

Crossref
Crossref is a not-for-profit official DOI registration agency of the International DOI Foundation, and the largest such agency in the world in terms of the number of DOIs assigned. It also provides JSON-structured metadata for each of its DOIs, such as various related issue dates and links between distributed content hosted at other sites. Our data collection process has resulted in several snapshots of the Crossref database. We primarily view Crossref DOIs as the basis for coverage of all global outputs in our study, and we use the issued.date element in Crossref as the standardised indicator of publication year for each output in our data.
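For illustration, the issued year for a single DOI can be read from Crossref's public REST API as follows (our workflow reads the same element from a Crossref data dump rather than the live API; the helper name is ours):

    import requests

    def crossref_issued_year(doi: str):
        """Return the year component of Crossref's 'issued' date for a DOI, or None."""
        message = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30).json()["message"]
        # 'date-parts' is [[year, month, day]], with month and day optional
        return message.get("issued", {}).get("date-parts", [[None]])[0][0]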

Global Research Identifier Database (GRID)
This is an open access database of educational and research institutions worldwide. It assigns a unique GRID ID to each institution and, where applicable, to each level of the institutional hierarchy. The metadata includes information such as geo-coordinates, websites and name variants. These identifiers are adopted in our study to link the various bibliographic data sources and to unify the identification system.

Description of data workflow and selection criteria
As discussed in the main article, our pragmatic approach is to include the widest coverage of outputs for each of the universities under consideration. This implies defining a target population for all potential research outputs, which is no trivial task. For this study, we choose to consider the set of all research outputs with Crossref DOIs as this target population. This is identified as the most practical approach that allows tracking and disambiguation of research objects using persistent identifiers. At the same time, it provides processes for both the standardisation of publication dates and the use of Unpaywall's OA information.
We use universities listed in the top 1000 of the Times Higher Education World University Rankings as an initial sample for which to collect data. We then supplemented this with additional institutions, focussing on the United Kingdom and the United States. Finally we added additional universities in countries where our original sample had one or two universities. For each of these countries we added a small number of additional universities, prioritising those with the largest number of research outputs recorded in Microsoft Academic.
Given that Microsoft Academic, Web of Science and Scopus have different internal institutional identifier systems, the next step is to map these identifiers. We first map each university to its unique ID in the Global Research Identifier Database (GRID). Subsequently, these universities' internal identifiers for Microsoft Academic, Web of Science and Scopus are matched against the corresponding GRID IDs. This is trivial in the case of a Microsoft Academic database snapshot as each institution in its database is already matched against the corresponding GRID ID. For Web of Science and Scopus, manual website searches are required to retrieve Web of Science Organisation-Enhanced names and Scopus Affiliation IDs, respectively. Universities not identifiable in at least one of the three bibliographic data sources are not processed further.
Queries were run via the respective APIs against Web of Science (via Organisation-Enhanced name search) and Scopus (via Affiliation ID search) to extract metadata for all outputs affiliated with each university for the time frame 2000 to 2018. These are matched against outputs from a Microsoft Academic snapshot to produce a comprehensive set of outputs for each university, which is then filtered down to include only objects with Crossref DOIs (see the sketch below). The set of universities was later expanded to include additional universities from countries with low representation in the initial sample; these went through the same data collection process.
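A minimal sketch of this aggregation step, with illustrative variable names; the union of the three sources implements the favour-recall-over-precision choice described in the main article:

    def combined_crossref_dois(wos: set, scopus: set, mag: set, crossref: set) -> set:
        """Union the per-source DOI sets, then keep only DOIs known to Crossref."""
        normalise = lambda dois: {d.lower() for d in dois if d}
        return (normalise(wos) | normalise(scopus) | normalise(mag)) & normalise(crossref)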
All collected Crossref DOIs are matched against an Unpaywall database snapshot for their open access information. This allows us to calculate totals for the various modes of open access (e.g., the number of Gold OA publications) for each university across different timeframes (using the "year" component of the Crossref "issued date" field). The Unpaywall information used to determine the various open access modes is displayed in Figure 1 of the main text. Crossref DOIs not found in Unpaywall default to not open access, and Crossref DOIs that do not have an "issued date" are removed from the process.
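A minimal sketch of this matching step, assuming an Unpaywall snapshot in its newline-delimited JSON format (the helper names are ours; field names follow Unpaywall's public data format):

    import json

    def index_unpaywall_snapshot(path: str) -> dict:
        """Build a DOI-keyed index over an Unpaywall JSONL snapshot."""
        index = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                index[record["doi"].lower()] = record
        return index

    def oa_metadata(doi: str, index: dict) -> dict:
        # DOIs absent from the snapshot default to not open access, as described above
        return index.get(doi.lower(), {"doi": doi.lower(), "is_oa": False})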
A comprehensive sensitivity analysis of the use of different sources for gathering research outputs, the use of different Unpaywall versions, and the relationship between confidence levels and sample size is provided in the companion white paper (Huang et al., 2020b). Results change depending on which Unpaywall snapshot is used. This is partly due to real changes (e.g., the release of works from repositories after embargo) and partly due to changes within the Unpaywall data system (for example, changes in upstream data sources such as journal inclusion or exclusion in the Directory of Open Access Journals (DOAJ), and internal changes such as improved repository calling or wider journal coverage). As these are the product of the gradually improving systems underpinning Unpaywall, we use the most recent available snapshot to provide the most up-to-date data in a reproducible and identifiable form in the main article.
To make comparable and fair analyses across universities, we have taken a necessarily subjective view on which universities to include in each part of the analysis (see the selection criteria above).

Overview of the data workflow and main tables
Within our Google Cloud Platform environment we maintain two sets of tables. The latest snapshot of each part of the dataflow is distinguished by the suffix _latest, and specific snapshots are named for their date of production. We maintain a set of views within Google BigQuery that are set up to query the latest available data. For this article we have created a specific snapshot, and that data is shared. Future versions of the shared dataset will provide updated snapshots.
The overall flow of data and the main tables and queries are described in Figures SM1 and SM2. The key table in our workflow is the institutions table. Figure SM1 details the main elements of the workflow that generates the institutions table. Figure SM2 details the downstream processing used to generate the derived datasets used in our analyses.

Modes of open access
As summarised in the main article, we query each of the three bibliographic data sources (Web of Science, Scopus and Microsoft Academic) for the list of research outputs affiliated with a given university for the years 2000 to 2018. This list is then filtered down to all objects with Crossref DOIs (by mapping against a Crossref data snapshot) and matched against Unpaywall metadata. We use the aggregated sets of DOIs for each year of publication (as per the "issued date" in Crossref) to compute the counts for the various OA modes using data from Unpaywall. The details of how the different OA characteristics are calculated are shown in Figure 1 in the Results section of the main article. The SQL query used to categorise OA status can be found below.
While there is a large body of literature on OA, definitions of OA vary considerably in their details. Policy makers and researchers may use OA terminology in different ways. Common discrepancies include the treatment of journals without a formal reuse license and of articles accessible only via academic social media or pirate sites. We use the following definitions for the modes of OA determined as part of our data workflow:

• Total OA: A research output that is free to read online, either via the publisher website or in an OA repository.
• Gold: A research output that is either published in a journal listed by the Directory of Open Access Journals (DOAJ) or, if the journal is not in DOAJ, is free to read via the publisher with any license.
• Gold DOAJ: A research output that is published in a journal listed by DOAJ.
• Hybrid: A research output that is published in a journal not listed by DOAJ, but is free to read from the publisher with any license.
• Green: A research output that is available in an OA repository (repository-mediated OA).
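To illustrate how these definitions combine, the following sketch classifies a single Unpaywall-style record using the public Unpaywall fields is_oa, journal_is_in_doaj and the host_type of each entry in oa_locations; the authoritative logic is the SQL query referenced above and Figure 1 of the main article:

    def classify_oa(record: dict) -> set:
        """Assign OA modes to one output, mirroring the definitions listed above."""
        modes = set()
        if not record.get("is_oa"):
            return modes  # closed access, including DOIs missing from Unpaywall
        modes.add("total_oa")
        locations = record.get("oa_locations", [])
        via_publisher = any(loc.get("host_type") == "publisher" for loc in locations)
        via_repository = any(loc.get("host_type") == "repository" for loc in locations)
        if record.get("journal_is_in_doaj"):
            modes.update({"gold", "gold_doaj"})
        elif via_publisher:
            modes.update({"gold", "hybrid"})  # free via publisher, journal not in DOAJ
        if via_repository:
            modes.add("green")
        return modes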

Identification of GRIDs in scope
Our full dataset includes all those institutions for which a GRID is recorded in Microsoft Academic Graph. In this article we have focussed on a set of institutions seeded from the top 1000 institutions in the THE World University Ranking supplemented for greater geographical coverage and deeper coverage of specific countries. We identify those GRIDs in scope for this article by identifying GRIDs for which we have additionally collected data from Scopus and Web of Science. As there are a small number of non-university research institutions in this set we explicitly exclude them.
The query that generates the grids_in_scope table is provided in the Data and Queries package available at Zenodo.

Identification of named universities in the top 100s
The dataset of named universities contains all those which fall, for any year from 2013 to 2018 inclusive, into the top 100 for:

1. Overall percentage of OA (i.e., Total OA)
2. Percentage of Green OA
3. Percentage of Gold OA

This group is identified using the following query and code, which generate the table named_institutions_in_scope, also provided in the shared dataset. We select the top 110 from each category to allow for the downstream filtering of smaller institutions. The table is generated by a small Python script along the following lines (the full query template is abbreviated here; the table path shown is an assumed reconstruction based on the institutions table described above):

    import pandas as pd
    import pandas_gbq

    # NOTE: the original query template is abbreviated; this table path is assumed
    template = """
    SELECT id, name
    FROM `academic-observatory.institution.institutions_latest`
    """
    named_institutions = pandas_gbq.read_gbq(template)

Generation of the derived datasets
The four main datasets that are publicly shared are generated directly from the institutions table using either the grids_in_scope or named_grids_in_scope tables to provide a filter for the set of institutions. The queries have some minor differences to provide the data of interest in each case. All four queries are provided in the Data and Queries package available at Zenodo. Here we show the query for the generation of the full_paper_dataset. We use a salt to generate the anonymised IDs for each university.
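One common way to implement such salted anonymised identifiers is sketched below; the authors' exact query is part of the Zenodo package, so this Python equivalent is illustrative only:

    import hashlib

    def anonymised_id(grid_id: str, salt: str) -> str:
        """Derive a stable, non-reversible institution ID. Keeping the salt private
        prevents re-identification by hashing the known, public list of GRID IDs."""
        return hashlib.sha256((salt + grid_id).encode("utf-8")).hexdigest()[:16]

Because the same salt is used for every university, the derived IDs remain stable across datasets while the mapping back to GRID IDs stays private.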

This section provides additional supplementary figures that largely parallel those presented in the main article, but for different publication years. Supplementary Figures 1