GH Archive is a project that records the public GitHub timeline and makes it available for further analysis. It captures public GitHub events (including commits, pull requests, issues, forks, stars, and comments) and archives them as compressed JSON files. The dataset is widely used in empirical software engineering research to study collaboration patterns, contributor dynamics, project health, and the evolution of the open-source ecosystem. For large-scale analysis, it can be queried through Google BigQuery. It is particularly relevant to Software Engineering and Citizen Science.
GH Archive supports work on open source and distributed collaboration, and it is suited to community-scale initiatives and multi-organization networks whose participants work remotely.
GH Archive is an established dataset with a solid track record of use across multiple contexts. The dataset is hosted on Google BigQuery and is distributed as gzip-compressed JSON.
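
Because the archives are gzip-compressed JSON, a first pass over the data often amounts to decompressing a file and tallying events by kind. The following is a minimal sketch in Python, not an official client. It assumes each archive holds one JSON object per line and that every event carries a "type" field. The function name count_event_types and the sample records are hypothetical and stand in for a real downloaded archive.

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_bytes: bytes) -> Counter:
    """Tally events by their "type" field.

    Assumes the payload is gzip-compressed, newline-delimited JSON
    (one event object per line); this layout is an assumption here.
    """
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts


# Hypothetical sample records standing in for a downloaded archive file.
sample_events = [
    {"type": "PushEvent", "repo": {"name": "example/repo"}},
    {"type": "WatchEvent", "repo": {"name": "example/repo"}},
    {"type": "PushEvent", "repo": {"name": "example/other"}},
]
sample_bytes = gzip.compress(
    "\n".join(json.dumps(e) for e in sample_events).encode("utf-8")
)

event_counts = count_event_types(sample_bytes)
```

Reading line by line keeps memory use flat, which matters when a single archive file contains many events. For analyses that span long periods, querying the hosted copy is likely to be more practical than downloading files.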