Sci Data. 2023 Jun 7;10(1):366.

doi: 10.1038/s41597-023-02208-w.

CORE: A Global Aggregation Service for Open Access Papers

Petr Knoth¹, Drahomira Herrmannova²,³, Matteo Cancellieri², Lucas Anastasiou², Nancy Pontika², Samuel Pearce², Bikash Gyawali², David Pride²

Affiliations

¹ Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, UK. petr.knoth@open.ac.uk.
² Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, UK.
³ Oak Ridge National Laboratory, Oak Ridge, TN, USA.

PMID: 37286585
PMCID: PMC10247729
DOI: 10.1038/s41597-023-02208-w

Petr Knoth et al. Sci Data. 2023.

Abstract

This paper introduces CORE, a widely used scholarly service, which provides access to the world's largest collection of open access research publications, acquired from a global network of repositories and journals. CORE was created with the goal of enabling text and data mining of scientific literature and thus supporting scientific discovery, but it is now used in a wide range of use cases within higher education, industry, not-for-profit organisations, as well as by the general public. Through the provided services, CORE powers innovative use cases, such as plagiarism detection, in market-leading third-party organisations. CORE has played a pivotal role in the global move towards universal open access by making scientific knowledge more easily and freely discoverable. In this paper, we describe CORE's continuously growing dataset and the motivation behind its creation, present the challenges associated with systematically gathering research papers from thousands of data providers worldwide at scale, and introduce the novel solutions that were developed to overcome these challenges. The paper then provides an in-depth discussion of the services and tools built on top of the aggregated data and finally examines several use cases that have leveraged the CORE dataset and services.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Example use cases of text and data mining (TDM) of scientific literature. Depending on data needs, TDM uses can be categorised into a) a priori defined sample use cases, and b) undefined sample use cases. Furthermore, TDM use cases can broadly be categorised into 1) indirect applications, which aim to improve access to and organisation of literature, and 2) direct applications, which focus on answering specific questions or gaining insights.

**Fig. 2**
Example illustration of the data collection process. The figure depicts the typical minimum steps necessary to produce a dataset for TDM of scientific literature. Depending on the use case, tens or hundreds of different data sources may need to be accessed, each potentially requiring a different process, for example accessing a different set of API methods or following a different procedure for downloading publication full text. Furthermore, depending on the use case, additional steps may be needed, such as extraction of references, identification of duplicate items or detection of the publication’s language. In the context of CORE, we provide the details of this process in the Methods section.

**Fig. 3**
Growth of records in CORE per month since February 2012. “Full text growth” represents growth of records containing full text, while “Metadata growth” represents growth of records without full text, i.e. the two numbers do not overlap. The two area plots are stacked on top of each other, so their sum represents the total number of records in CORE.

**Fig. 4**
Age of publications in CORE. As in Fig. 3, the “Metadata” and “Full text” record bars are stacked on top of each other.

**Fig. 5**
Top ten languages and top ten provider locations in CORE.

**Fig. 6**
Distribution of document types.

**Fig. 7**
Subject distribution of a sample of 20,758,666 CORE publications.

**Fig. 8**
CORE Harvesting Pipeline. Each task’s output provides the input for the following task. In some cases the input is considered as a whole, for example all the content harvested from a data provider, while in other cases the output is split into multiple small tasks performed at the record level.
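
The chaining described in the Fig. 8 caption can be illustrated with a minimal sketch. It is not CORE's actual implementation: the function names (`harvest`, `clean_title`, `add_title_length`, `run_pipeline`), the record fields and the sample data are all hypothetical, chosen only to show one provider-level task feeding a sequence of record-level tasks.

```python
from typing import Callable

Record = dict  # hypothetical record shape, for illustration only


def harvest(provider: str) -> list[Record]:
    """Provider-level task: stand-in for fetching all content from one data provider."""
    return [
        {"id": f"{provider}:1", "title": "  Open Access at Scale "},
        {"id": f"{provider}:2", "title": "Text Mining"},
    ]


def clean_title(record: Record) -> Record:
    """Record-level task: strip surrounding whitespace from the title."""
    return {**record, "title": record["title"].strip()}


def add_title_length(record: Record) -> Record:
    """Record-level task: relies on the cleaned title produced by the previous task."""
    return {**record, "title_length": len(record["title"])}


def run_pipeline(
    provider: str, record_tasks: list[Callable[[Record], Record]]
) -> list[Record]:
    """Run the provider-level harvest, then apply each record-level task in order."""
    records = harvest(provider)  # input treated as a whole
    for task in record_tasks:
        # The output of one task becomes the input of the next, record by record.
        records = [task(r) for r in records]
    return records


result = run_pipeline("example-repo", [clean_title, add_title_length])
```

Because every task takes a record and returns a record, tasks can be reordered or extended without changing `run_pipeline`; a real harvesting system would additionally need scheduling, error handling and persistence, none of which is modelled here.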
