Is searching full text more effective than searching abstracts?

Jimmy Lin

doi:10.1186/1471-2105-10-46

BMC Bioinformatics. 2009 Feb 3;10:46. doi: 10.1186/1471-2105-10-46.

Author

Jimmy Lin¹

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, USA. jimmylin@umd.edu

Abstract

Background: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature can now go beyond searching bibliographic records (title, abstract, metadata) and directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles, using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: BM25 and the ranking algorithm implemented in the open-source Lucene search engine.
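To make the BM25 model named above concrete, here is a minimal sketch of BM25 scoring over tokenized documents. It is an illustration only, not the implementation used in the study: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75` (common defaults), and the non-negative IDF variant are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumed; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The same function can be applied to abstracts, whole articles, or paragraph-sized spans; only the choice of indexing unit passed in as `docs` changes.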

Results: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness than abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles.
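One simple way to combine evidence from spans and full articles is a weighted sum of the article-level score and the best span score within that article. The sketch below illustrates that idea under stated assumptions: the abstract does not specify the combination method, so the linear interpolation, the function name `combine_span_and_article`, and the `weight` parameter are all hypothetical. It also assumes the two sets of scores are on comparable scales.

```python
def combine_span_and_article(article_scores, span_scores, weight=0.5):
    """Rank articles by a weighted sum of article score and best span score.

    article_scores: {article_id: score}
    span_scores:    {article_id: [score of each span in that article]}
    weight:         share given to the article-level score (assumed value).
    """
    combined = {}
    for article_id in set(article_scores) | set(span_scores):
        # Use the highest-scoring span as the span-level evidence.
        best_span = max(span_scores.get(article_id, []), default=0.0)
        article = article_scores.get(article_id, 0.0)
        combined[article_id] = weight * article + (1 - weight) * best_span
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

With `weight=1.0` this reduces to article-only ranking, and with `weight=0.0` to span-only ranking, so the two baselines can be compared against the mixture by varying one parameter.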

Conclusion: Users searching full text are more likely to find relevant articles than users searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work on algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
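As an illustration of how MapReduce can organize a retrieval computation, the sketch below builds an inverted index with explicit map, shuffle, and reduce steps. It runs in a single process and uses no MapReduce framework; the function names and the whitespace tokenization are assumptions made for this example, not details from the study.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = defaultdict(int)
    for term in text.lower().split():  # naive whitespace tokenization (assumed)
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def reduce_phase(term, postings):
    """Reduce: collect all postings for one term into a sorted list."""
    return term, sorted(postings)


def build_index(docs):
    """Simulate the shuffle step by grouping mapper output by term."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_phase(term, postings) for term, postings in grouped.items())
```

In a real cluster, the map calls would run in parallel over partitions of the collection and the framework would perform the grouping, which is what makes the model attractive for long full-text documents.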

Publication types

Comparative Study
Research Support, N.I.H., Intramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Abstracting and Indexing
Algorithms*
Information Storage and Retrieval / methods*
MEDLINE
Terminology as Topic

Grants and funding

Intramural NIH HHS/United States