arXiv Bulk Access Dataset

The arXiv Bulk Access Dataset provides metadata and full-text access to over 2 million scientific preprints hosted on arXiv.org, the open-access preprint repository founded in 1991. The dataset spans physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, and electrical engineering. It is widely used for bibliometric analysis, scientometric research, natural language processing, and studies of scientific collaboration networks. Because it offers both structured metadata and source documents, it supports research on patterns of knowledge production, citation dynamics, and the evolution of scientific disciplines over more than three decades.

Platform: Kaggle (Cornell-University/arxiv)

The dataset is particularly relevant to software engineering, the social sciences, citizen science, and education.

The arXiv Bulk Access Dataset supports open-source, distributed collaboration and is suited to community-scale initiatives in remote settings.

The arXiv Bulk Access Dataset is classified as well documented, reflecting its broad adoption and the documentation available for it. The dataset is hosted on Kaggle and is distributed as XML metadata together with PDF and LaTeX source files.
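
As a rough illustration of how XML metadata might feed a study of collaboration networks, the sketch below parses a single record and lists its co-author pairs. The record layout, the element names (`record`, `id`, `title`, `categories`, `authors`, `author`), and the helper functions `parse_record` and `coauthor_pairs` are all hypothetical and chosen for illustration only; the dataset's actual schema may differ and should be checked against its documentation before adapting this.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical, simplified record. The element names and values here are
# illustrative assumptions, not the dataset's actual schema or real data.
SAMPLE_RECORD = """
<record>
  <id>0000.00000</id>
  <title>An Example Preprint Title</title>
  <categories>cs.CL stat.ML</categories>
  <authors>
    <author>A. Author</author>
    <author>B. Author</author>
    <author>C. Author</author>
  </authors>
</record>
"""


def parse_record(xml_text):
    """Extract id, title, categories, and author names from one record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.findtext("id", default="").strip(),
        "title": root.findtext("title", default="").strip(),
        # Categories are assumed to be a whitespace-separated list.
        "categories": root.findtext("categories", default="").split(),
        "authors": [a.text.strip() for a in root.iter("author") if a.text],
    }


def coauthor_pairs(authors):
    """Return every unordered pair of co-authors (edges of a collaboration graph)."""
    return list(combinations(sorted(authors), 2))


record = parse_record(SAMPLE_RECORD)
pairs = coauthor_pairs(record["authors"])
print(record["id"], record["categories"])
for left, right in pairs:
    print(left, "<->", right)
```

Accumulating these pairs over many records would yield an edge list for a co-authorship graph. Author names would likely need disambiguation first, which this sketch does not attempt.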