Abstract
Here, I present the fCite web service (fcite.org), a tool for the in-depth analysis of an individual’s scientific research output. While multiple existing tools (e.g., Google Scholar, iCite, Microsoft Academic) focus on the total number of citations and the H-index, I propose the analysis of the research output by considering multiple metrics to provide greater insight into a scientist’s multifaceted profile. The most distinguishing feature of fCite is its ability to calculate fractional scores for most of the metrics currently in use. Thanks to the division of citations (and RCR scores) by the number of authors, the tool provides a more detailed analysis of a scholar’s portfolio. fCite is based on PUBMED data (∼18 million publications), and the statistics are calculated with respect to ORCID data (∼600,000 user profiles).
Introduction
At present, the impact of the work of a scientist can be estimated by a number of bibliometric metrics, but there is a strong bias towards the number of articles written by an author, the total number of citations of those articles, the impact factors of the journals in which they appeared 1 and, finally, the H-index 2. In contrast to this approach, an increasing number of people are opposed to a bibliometric, mechanical modus operandi and in favour of expert assessment (e.g., see the DORA declaration). Although the expert approach is a compelling idea, in real life a fair assessment of a scientific portfolio comprising multiple publications (for instance, over 100 items) spread across multiple journals (e.g., in 2019 PUBMED alone indexed 48,601 journals) may not be possible in a reasonable amount of time. Even if the expert is familiar with the quality of the journals in which the publications appeared, and even if he/she has read the most important publications from the portfolio, he/she still needs to understand and judge the author’s contribution to the given work(s). With multiple-author papers, this can be very difficult to achieve. Acknowledgement statements (if any) are usually very sparse, and it is often impossible to say which part of the work was done by which author. Moreover, if one also recognizes that publications are frequently interdisciplinary, the proper assessment of the influence of an average-sized portfolio is beyond the capacity of a single person and, ultimately, very subjective.

On the other hand, in many fields the order of the author list can be considered a rough approximation of the contribution, where the first author is the scientist who performed most of the experiments (e.g., a PhD student), the middle authors are those who helped with multiple specialized parts of the work and/or the analysis of the data, and the last/corresponding author is the principal investigator who conceived the project, obtained the funding and supervised all steps (frequently, this does not exclude involvement in the experiments or the analysis). Such a model is termed the first-last-author-emphasis (FLAE) model 3, and depending on how much emphasis is placed on a particular author, the FLAE model can have multiple flavours (here, three models named FLAE, FLAE2 and FLAE3 are used; for details, see the supplementary methods). However, given a sufficient number of items in the portfolio, they yield very similar results 4. In contrast, if the order of authors is random or alphabetical, we can always use the equal contribution (EC) model, in which each author has the same weight (Figure 1, Figures S1-S4, Tables S1-S5).

At this point, an open question is how to assess the influence of a publication. The most accessible (and frequently used) metric is the number of citations it has received over time. Usually, metrics such as total citation counts or the H-index are calculated using global scores (regardless of the number of authors, each author obtains all of the citations of the publication), but applying the FLAE or EC models provides a straightforward way to quantify the author’s contribution in a more precise manner. The division of the contribution is a much-needed (and overlooked 5) feature because the number of authors has increased steadily over time, exceeding an average of six authors per research publication in 2015 (Figure 2, Table S6).
The trend of increasing numbers of authors is also clear when we analyse the mode of the number of authors in publications over the last 25 years (Figure 3, Tables S7-S31). Currently, publications with hundreds of authors are not rare, and some items have several thousand authors. Concomitantly, shared first or last authorship has become a common practice, and it is not difficult to find publications with three or more shared first authors and a few corresponding/last authors.
Rewarding all authors, regardless of their number, is an obvious shortcoming of current bibliometric tools and contradicts common sense. Consider a hypothetical situation in which you are a member of a grant or fellowship committee and have two applicants. Each of them has published a single publication in the same journal (for the simplicity of the example), but the first publication has two authors (the applicant and his/her supervisor), while in the second case the applicant is a middle author of a consortium paper (such papers usually have a few hundred authors). A few years after publication, you see that the first publication has received a few dozen citations, while the second has a few hundred. Which candidate would you prefer? In the presented example, most experienced assessors would prefer the first candidate. If you were to do some back-of-the-envelope calculations, you would conclude that the first item has roughly an order of magnitude more citations per author than the second. Moreover, it is almost trivial to assess the contribution to the first publication, whereas when there are a few hundred authors, it is literally impossible to say who did what, and most likely those hundreds of citations are self-citations and/or courtesy citations (for instance, when the publication describes an important resource used by the whole field). The presented example is highly simplified; usually there are more items in a portfolio, which significantly complicates the analysis, and typically portfolios are more diverse in the number of items and the journals in which they were published. The present situation is so abysmal that the current state of the art is to ask the applicant to select the 5-10 most important publications and offer his/her opinion on why those works represent outstanding science (e.g., European Research Council grants). The fCite web service presented here should fill the gap between the (overly) simple bibliometric and expert-based approaches, facilitating a fairer assessment of scientific output.
Results
The primary result of the work presented here is a web service (fcite.org) in which an author’s contribution can be calculated using fractional models, in addition to a plethora of statistics related to a given portfolio (Figure 4). The fCite service operates on a list of PMID (PubMed IDentifier) and/or ORCID (Open Researcher & Contributor ID) identifiers, accompanied by all combinations of the names and surnames of a given author. The analysis can be performed for all items (research articles and non-research items such as editorials, reviews, and others) or for the research items only. The fCite service relies on citations and so-called RCR (relative citation ratio) scores, which come from the iCite web service and are calculated based on the PUBMED database (6). As of June 2019, fCite comprises over 17 million publications and counting (the fCite database grows by ∼100,000 items each month). The reference for the scores comes from the analysis of ORCID data (572,910 profiles with 7,008,012 unique publications in total). The ORCID profiles provide the names of the authors together with the lists of co-authored publications. Thus, by using string metrics such as the Levenshtein and Jaro–Winkler distances (7), it is possible to identify an author’s position in the author list of each publication. This provides a unique opportunity to study authorship patterns depending on whether the person is the first, middle, last or single author. These data provide a solid foundation for assessing the importance of the author’s position with respect to portfolio size and time. As shown in Figure 5 (and Tables S32-S33), at the beginning of a scientist’s career (small portfolios with fewer than 10 items), a substantial number of first author papers is expected. As the researcher progresses (and the number of publications increases), last author publications begin to take the place of first author publications. Surprisingly, the middle and single author fractions are roughly stable regardless of portfolio size (single author papers are extremely rare, and middle author papers constitute over half of the items). Moreover, calculating the main scores provided by fCite (e.g., FLAERCR, FLAE2RCR, FLAEcit, FLAE2cit, ECRCR and ECcit; for detailed explanations, see the “Materials and Methods” section) for the ORCID users provides solid ground for assessing the importance of the obtained numbers in terms of percentiles (Table 1).
Discussion
Over the past fifty years, bibliometrics have become an inherent part of the assessment of scientific progress. Such measures, with all of their pros and cons, will be used regardless of whether we support this approach. This development began with the impact factor, defined by Eugene Garfield in the 1950s, and peaked with the creation of the H-index in 2005. Despite their multiple shortcomings, bibliometrics can offer an instantaneous and relatively fair assessment of scientific impact. As many scholars have noted previously, one cannot focus on a single number because it cannot embrace the complexity of a researcher’s work (see also Goodhart’s adage). Therefore, it is not proper to focus, for instance, only on the total number of citations or the H-index (which is itself highly correlated with the total number of citations 8). As those two metrics have been frequently used by funding bodies (consciously/openly or not), researchers have optimized their behaviour, which has led to citation cartels 9, salami slicing 10, and continual increases in the number of authors per publication and the self-citation rate 11. A so-called “publish-or-perish” culture has emerged. The purported quality of a work is inherited directly from the impact factor of the journal immediately after publication. The number of authors of a paper is irrelevant because all of them receive full credit. Even if someone were to argue that the first or the last/corresponding authors are more important, the community has already found an easy “fix” by adding multiple first and last authors. This may sound pessimistic, but some fields have already adapted to this new reality very well; no one is surprised when a high-energy physics paper has a few thousand authors. Similar approaches are emerging in other fields. In medicine, which already has one of the highest numbers of authors per publication, many believe that the data provider should be listed as an author of all subsequent publications even without making any other contribution 12. Recently, there has been a growing trend towards establishing consortia or groups containing multiple labs/consortia. While this has the advantage of fostering collaborations that can have synergistic effects, it also has disadvantages. Usually, such initiatives are based on multi-million-dollar grants, and as the results appear, the whole group (usually a few hundred people) is listed as the authors of almost every paper produced by the consortium. The fCite service proposed here stymies such malicious behaviour by simply dividing the citations (or RCRs) among the authors while taking into account their number and positions. Obviously, the FLAE models used here are far from perfect, and they cannot replace an experienced assessor who reads the publications and is familiar with all the intricacies of his/her own field, but they are a good starting point and certainly a better solution than using the total number of citations or the H-index.
All of this being said, one should be aware of the multiple limitations of fCite. First, as fCite is based on PUBMED, it is not appropriate for fields that are not well represented in that database (for instance, computer science or the social sciences). Second, fCite currently does not filter out self-citations (this feature is under development). Moreover, it has thus far been impossible to tag and fix all consortium/group papers. Frequently, the authorship of such items appears as “John Smith, Jan Kowalski, BRAINS Group”, where the first two are the leaders and the group consists of a hundred or more people whose names appear in the supplementary material. The FLAE models will count such cases as papers with three authors (papers likely to have hundreds of citations attributed to only those three authors, elevating the scores of the first two). Finally, fCite accords all citations the same weight, which is a massive simplification. This aspect of bibliometrics is well studied and can be considered from many angles. First, not all citations are equal, as a citation can be positive or negative (where the citing authors disagree with the original hypothesis). Even if a citation is positive, it may have different meanings depending on the section of the article in which it is made (the introduction, the methods or the discussion). Moreover, some citations are more important because they are cornerstones for subsequent research, while others are merely brief review mentions in the introduction. Another aspect is that the most important citations are frequently missing due to journal restrictions (for instance, the entire methods section, pivotal for any research, may be placed in the supplementary material, which has its own reference list that is not indexed by most databases). Many such cases can be handled by semantic methods 13, but this approach remains in its infancy. Another characteristic that is frequently used (but not implemented in fCite) is the importance of the journal from which the citation comes (e.g., the SJR indicators developed by SCImago 14). Nevertheless, fCite is the first large-scale method that takes into account the number of authors and their positions (only one-time analyses of specific journals, fields, or nations have been done in the past 15, 16). Additionally, fCite uses RCR scores taken from iCite (based on PUBMED). Note that these scores differ from citations in many respects. First, the RCR is intended to capture field relevance (it is normalized with respect to the field’s citation levels). Next, in contrast to citations, which can only accumulate, RCRs can decrease over time. This aspect, although criticized by some 17, is a very useful and much-demanded feature of a bibliometric metric (one that is also lacking in the H-index), as the RCR can decline when a work becomes outdated. Another feature of the RCR that should not be overlooked is that this metric gives more weight to newer articles (for instance, ten citations for a ten-year-old article and for a two-year-old article will result in dramatically different RCR scores).
A highly illustrative example is an analysis of top researchers in comparison to the scores provided by Google Scholar and other sources. Table S34 reports a selection of statistics for some successful scientists (many of whom are listed among the Highly Cited Researchers (HCR) identified by Clarivate Analytics). It is clear that the H-index or total citation count can often be misleading, and a more comprehensive analysis using multiple bibliometric metrics can help. For instance, while the HCR list is most likely filtered against easy-to-spot cases such as Scientists B and D, it still frequently includes cases such as Scientists A and F. On the other hand, due to its selection criterion (requiring >15 so-called “Highly Cited Papers”), some outstanding scientists are overlooked (e.g., Scientist C). Therefore, fCite can be a very useful tool for the deep profiling of even very similar portfolios (with respect to the H-index or total citation count), with surprising discriminatory power (e.g., compare Scientists M and N).
In summary, fCite (available free of charge at fcite.org) is a bibliometric tool that provides versatile metrics that can take into account the number of authors and their position on the authorship list. Hopefully, it will facilitate unbiased comparisons of researchers’ importance when they are competing for limited funding and, consequently, enhance scientific development.
Funding
L.P.K. (Solid Scientometrics).
Author contributions
L.P.K. conceived the project, acquired and analysed the data, developed the web service, and prepared the manuscript.
Competing interests
The author declares no competing interests. The Solid Scientometrics company, owned by L.P.K., was established to separate and protect the intellectual property of this project. Since November 2018, L.P.K. has been employed at the Institute of Informatics, University of Warsaw, where his research focuses on computational biology (proteomics). The project and the Solid Scientometrics company were conceived before he was employed at the University of Warsaw.
Data and materials availability
The fCite web service (fcite.org) is available free of charge. All the data needed to evaluate the conclusions in the paper are presented in the Supplementary Materials and at the fCite web site. The raw data come from NIH (iCite) and ORCID databases and are available as stated in the Supplementary Materials.
Supplementary Materials
Materials and Methods
Data sets
PUBMED data set – contains PUBMED publications (17,787,016 publications; 14,444,982 research and 3,342,034 non-research items) obtained via the icite.od.nih.gov portal (Data S1 and S2)
ORCID data set – contains 5,380,983 user profiles (Data S3)
Fractional contribution models of the authorship
Four different models were used to assess the author’s contribution:
The FLAE (first-last-author-emphasis) model is based on the Tscharntke et al. 2007 definition with slight modifications (3). The contribution of individual authors can be described briefly as follows: “the first author gets 100, the last author 50, and all others 100/(number of authors); the scores are then normalized to sum to 1”. This model gives the strongest weights to the first and last authors, penalizing middle authorship (Table S1).
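For illustration, a minimal Python sketch of this rule (the exact weights used by fCite are those of Table S1 and may differ in detail, e.g., in the handling of two-author papers):

def flae_weights(n_authors):
    # FLAE raw scores: first author 100, last author 50,
    # each middle author 100/n_authors; normalized to sum to 1
    if n_authors == 1:
        return [1.0]
    raw = [100.0] + [100.0 / n_authors] * (n_authors - 2) + [50.0]
    total = sum(raw)
    return [r / total for r in raw]

print(flae_weights(5))   # [0.476, 0.095, 0.095, 0.095, 0.238] (rounded)

For a five-author paper, the first author thus receives roughly half of the credit and the last author roughly a quarter, with the remainder split among the middle authors.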
The FLAE2 model is based on Corrêa Jr. et al. 2017 (4). This is an empirical model based on the authorship contributions reported for the mega-journal PLoS ONE (∼65,000 publications). On average, this model is more benign for middle authors. As the data presented by Corrêa Jr. and co-authors cover only up to ten authors, for longer author lists the contribution was modelled by curve fitting with some noise, using the initial matrix (with up to ten authors) and thresholds of 0.06 for 30 authors and 0.07 for 100 authors (for more details, see the scipy.optimize.curve_fit documentation and http://www.fcite.org/FLAE2.txt) (Table S2). Additionally, this model is asymmetric, i.e., the middle-author weight depends on the position: the closer to the first author, the higher the weight (up to the 10th author).
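A hedged sketch of this kind of extrapolation, assuming (purely for illustration) an exponential decay of the middle-author weight with position; the weight values below are placeholders, not the empirical PLoS ONE matrix, which is available at the URL above:

import numpy as np
from scipy.optimize import curve_fit

positions = np.arange(2, 10)               # middle-author positions 2..9
weights = np.array([0.14, 0.12, 0.11, 0.10, 0.09, 0.085, 0.08, 0.075])  # placeholder data

def decay(x, a, b, c):
    # exponential decay towards a floor c
    return a * np.exp(-b * x) + c

params, _ = curve_fit(decay, positions, weights, p0=(0.2, 0.3, 0.07))
print(decay(np.arange(2, 30), *params))    # extrapolated weights for a 30-author paper

The extrapolated weights would then be renormalized together with the first- and last-author weights so that each paper’s weights sum to 1.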
The FLAE3 model is a simple variation of the FLAE model in which the contributions of individual authors are more equal. It can be briefly described as follows: “the first author always gets at least three times more than the co-authors, and the last author at least two times more than the other co-authors” (Table S3).
The EC (equal contribution) model assumes that each author contributed equally to a given work (Table S4).
The first three models assume that, for a given field, the order of the authors is not random: the first author is the one who contributed the most, while the last is a senior author who conceived the project (frequently the corresponding author). This assumption holds for many sub-fields of the biomedical sciences. In many other sub-fields, however, the order of the authors can be alphabetical, ordered from the most significant to the least significant author, or completely random or irregular. In such cases, the EC model should be used. For simplicity, all models assume that there is only one first and one last author, which is not necessarily true: with strong pressure to publish in top journals and ever more authors per work, many papers nowadays have multiple first and senior authors. All four models are metric agnostic; thus, they can be applied to citations, RCR scores and/or the H-index.
Additional metrics used in fCite
In order to analyse a portfolio, the user is asked to provide all combinations of the author’s names and surnames and the list of PMID identifiers. As a result, he/she obtains:
a) the size of the portfolio with the time span of the publishing period,
b) the number (and percentage) of single, first, last and middle author papers,
c) the H-index (2),
d) the M-index (the H-index divided by the number of years since the first publication),
e) the fH-index and fM-index (the citations are divided according to the author’s contribution to each paper using the FLAE model; see the sketch after this list),
f) the average number of papers per year,
g) the total and fractional citation and RCR scores based on the FLAE, FLAE2, FLAE3 and EC models (Citations, total RCR, FLAERCR, FLAE2RCR, FLAE3RCR, ECRCR, FLAEcit, FLAE2cit, FLAE3cit and ECcit, respectively),
h) the average number of authors,
i) FLAE per year (RCR),
j) the average FLAE article score (RCR),
k) the average article impact per year (RCR),
l) the ratio between FLAERCR and the total RCR, together with the expected value of this ratio,
m) a sortable table of individual publications with PMID, year, title, authors, article type, journal, and FLAERCR, FLAE2RCR, FLAE3RCR, ECRCR, FLAEcit, FLAE2cit, FLAE3cit, ECcit, Citations and total RCR scores.
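A self-contained sketch of items c)-e), assuming each paper is described by a (citations, number of authors, author position, year) tuple; the “+ 1” in the M-index denominator, which guards against division by zero for portfolios started in the current year, is an assumption, as fCite’s exact convention is not specified here:

from datetime import date

def flae_weight(n, pos):
    # FLAE weight of the author at 1-based position pos among n authors
    if n == 1:
        return 1.0
    raw = [100.0] + [100.0 / n] * (n - 2) + [50.0]
    return raw[pos - 1] / sum(raw)

def h_index(citations):
    # largest h such that h items have at least h citations each
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

papers = [(120, 5, 1, 2010), (40, 3, 3, 2014), (15, 2, 2, 2018)]   # toy portfolio

h = h_index([c for c, n, p, y in papers])                          # H-index
fh = h_index([c * flae_weight(n, p) for c, n, p, y in papers])     # fH-index
m = h / (date.today().year - min(y for c, n, p, y in papers) + 1)  # M-index
print(h, fh, round(m, 2))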
Initial data cleaning
In order to analyse the authorship patterns, the PUBMED data set (over 17 million publications) was mapped onto the ORCID portfolios (over 5 million users). The ORCID data set provides the author’s name and surname along with a list of publications. First, empty records (profiles without public data) were discarded (4,217,452 out of 5,380,983 records). Next, portfolios with at least one publication bearing a DOI, PMID or PMC identifier were retained. This gave 1,154,443 portfolios (with 19,516,285 non-unique articles in total). As 19,097,891 (97.85%) of the items had only a DOI identifier, an additional step was required, namely, mapping DOIs to PMIDs. All PUBMED records (27,414,004 publications) were searched for DOIs using the summary XML files and the eutils tools provided by the National Institutes of Health (NIH). As a result, 599,468 non-empty portfolios (7,813,971 articles) with at least one PMID were obtained.
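The bulk mapping relied on downloaded summary XML files; for a handful of DOIs, a per-item lookup through the NCBI E-utilities esearch endpoint is a workable sketch (the [doi] search tag is assumed here; older PubMed interfaces used [aid] for article identifiers):

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def doi_to_pmid(doi):
    # search PUBMED for the record whose DOI field matches
    params = {"db": "pubmed", "term": f"{doi}[doi]", "retmode": "json"}
    ids = requests.get(EUTILS, params=params, timeout=30).json()["esearchresult"]["idlist"]
    return ids[0] if ids else None

print(doi_to_pmid("10.1002/cncr.27976"))   # DOI from the example PUBMED record below

Note that for millions of items this per-request route is impractical, which is why the bulk XML approach was used.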
Example record from ORCID (csv format)
ORCID,surname,name,list_of_PMIDS
0000-0002-2518-5940,Liebovitz,David,23550982||23646091||19468082||22034582||19267397||17219478||19647184||28527507||17219519
Example record from PUBMED (json format)
{ "pmid": 23456789, "doi": "10.1002/cncr.27976", "authors": "Arun Sharma, Stephen M Schwartz, Eduardo M\u00e9ndez", "citation_count": 26, "citations_per_year": 4.333333, "expected_citations_per_year": 2.538138, "field_citation_rate": 4.872565, "is_research_article": true, "journal": "Cancer", "nih_percentile": 69.700000, "relative_citation_ratio": 1.707288, "title": "Hospital volume is associated with survival but not multimodality therapy in Medicare patients with advanced head and neck cancer.", "year": 2013 }
fCite uses the following fields: authors, citation_count, relative_citation_ratio, is_research_article, year and pmid.
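A minimal sketch of extracting those fields from such a record (the file name is hypothetical; splitting the authors string on commas assumes that no author name itself contains a comma):

import json

with open("pubmed_record.json") as fh:      # hypothetical file holding one record
    record = json.load(fh)

keep = ("pmid", "authors", "citation_count",
        "relative_citation_ratio", "is_research_article", "year")
fields = {k: record[k] for k in keep}

# the number of authors is recovered by splitting the comma-separated string
authors = [a.strip() for a in fields["authors"].split(",")]
print(fields["pmid"], len(authors), fields["citation_count"])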
Data analysis
One of the first steps of the analysis was to clean the names and surnames provided by the ORCID database. The data in ORCID are in UNICODE (UTF-8) format, which means that they can contain non-English letters. Thus, at this step, all surnames and names were transliterated to their English-letter equivalents (e.g., Kozłowski Łukasz to Kozlowski Lukasz, or names in non-Latin scripts to transliterations such as Wu Feng). Then, given the list of PMIDs in the portfolio, all publication records were retrieved from PUBMED. In order to identify the author’s position in the authorship list, the Levenshtein and Jaro–Winkler distances were applied in the following way. First, a set of possible surname and name combinations was prepared (name surname, surname name, initial surname, name initial surname, etc.; for instance, given John Smith, the set contained john smith, smith john, j smith, john x smith, etc.). This step was required because the order of the name, surname and initials, as well as the letter case, frequently differ between databases and/or particular publication records. Next, for each author in a publication’s authorship list, the Jaro–Winkler similarity to this set was calculated, and the author with the highest similarity was selected (a similarity threshold of 0.7 was used to filter out spurious hits). Thereafter, for each publication in the portfolio, the position of the author is available and can be used to classify the publications into single, first, last and middle author items. Knowing the positions of the authors allows the fractional models (FLAE, FLAE2, FLAE3 and EC) to be used to calculate fractional scores for the citation and RCR metrics.
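A minimal sketch of this matching step, assuming the third-party unidecode and jellyfish packages (jellyfish exposes jaro_winkler_similarity in recent releases; the Levenshtein check used by fCite is omitted here for brevity):

from unidecode import unidecode              # transliteration to ASCII
from jellyfish import jaro_winkler_similarity

def variants(name, surname):
    # a few common orderings of a name, lower-cased and transliterated
    n, s = unidecode(name).lower(), unidecode(surname).lower()
    return {f"{n} {s}", f"{s} {n}", f"{n[0]} {s}", f"{s} {n[0]}"}

def author_position(author_list, name, surname, threshold=0.7):
    # 1-based position of the best-matching author, or None below threshold
    best_pos, best_sim = None, threshold
    for pos, author in enumerate(author_list, start=1):
        a = unidecode(author).lower()
        sim = max(jaro_winkler_similarity(a, v) for v in variants(name, surname))
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos

authors = ["Arun Sharma", "Stephen M Schwartz", "Eduardo Méndez"]
print(author_position(authors, "Eduardo", "Méndez"))   # -> 3 (a last author paper)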
The ORCID data were also used to quantify the significance of the obtained scores. For instance, it is not enough to say that FLAERCR (or any other score) is equal to 10. Obviously, the bigger the number, the better, but it is useful to compare it to some reference. For this purpose, the percentiles of the scores were calculated with respect to all ORCID portfolios (a percentile is the value below which a given percentage of observations in a group of observations falls). Note that the percentiles presented by fCite are calculated separately for each individual metric, including the division into research and non-research portfolios (Table 1). Additionally, as the distributions of bibliometric metrics are practically never normal (Gaussian), the percentiles are presented with additional precision (this allows similar portfolios, especially top ones, to be distinguished).
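A sketch of the percentile computation; the reference distribution below is a synthetic log-normal stand-in for the real ORCID-derived score distributions, which are heavily right-skewed:

import numpy as np
from scipy.stats import percentileofscore

rng = np.random.default_rng(0)
reference = rng.lognormal(mean=0.5, sigma=1.2, size=100_000)   # placeholder scores

score = 10.0
print(round(percentileofscore(reference, score), 2))   # percentile of a score of 10

The extra decimal places matter mostly near the top of the distribution, where many portfolios would otherwise share the same integer percentile.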
Auxiliary information
Motivation
The number of authors has steadily increased over the last 25 years (Table S5). Moreover, when we divide the publications into research items (describing original work) and non-research items (e.g., reviews, editorials), we can clearly see that, on average, research publications require more authors. Given that publications with dozens or even hundreds of authors are currently common, it is desirable to modify the bibliometric metrics (e.g., citations, the H-index) to account for the number of authors of each paper.
The patterns of authorship versus portfolio size
The ratio between the fractional and total metrics declines as the portfolio size increases (Fig. S1). Regardless of the main metric used (RCR or citations), the trend is stable, and the data show that small portfolios contain more first author publications; as the portfolio size increases, more last author items appear. Depending on the fractional model used, the author’s contribution is 20-25% for small portfolios, falling below 18% for bigger portfolios. It is interesting to note that, regardless of the fractional model (FLAE vs. FLAE2 vs. FLAE3 vs. EC), portfolios with around 40-60 items score virtually identically (which means that for a mature scientist with multiple publications, the choice of fractional model is not important). One should also take into account the spread of the scores, which is huge for small portfolios and decreases with the number of items but always remains significant, comprising at least roughly 20% (Fig. S2).
Data S1.
PUBMED data set containing publications with PMID numbers from 7 million to 19 million (json format).
Data S2.
PUBMED data set containing publications with PMID numbers from 20 million to 32 million (json format).
Data S3.
ORCID Public Data File (xml format).
https://figshare.com/articles/ORCID_Public_Data_File_2018/7234028
Acknowledgments
The author acknowledges the NCBI, NLM, ORCID and iCite teams.