GH Archive

GH Archive is a project that records the public GitHub timeline and makes it available for further analysis. It captures all public GitHub events — including commits, pull requests, issues, forks, stars, and comments — and archives them as compressed JSON files. The dataset is widely used in empirical software engineering research to study collaboration patterns, contributor dynamics, project health, and open-source ecosystem evolution. It is available via Google BigQuery for large-scale analysis.


GitHub Repository Stats

Stars: 2982
Forks: 222
Issues: 31
Updated: 2025-05-25

Updated: 2026-03-15T20:07:50.020543+00:00

Google BigQuery: bigquery-public-data.github_repos

GH Archive is particularly relevant to software engineering and citizen science.

GH Archive supports open-source and distributed collaboration, and it is suited to community-scale initiatives and multi-organization networks working in remote settings.

GH Archive is an established dataset with a solid track record of use across multiple contexts. The dataset is hosted on Google BigQuery, in gzip-compressed JSON format.
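As a rough illustration of working with the gzip-compressed JSON format, the sketch below counts events by type in a single archive. It assumes each archive holds one JSON object per line and that every event carries a top-level "type" field such as "PushEvent"; both are assumptions to check against the actual files. The function names and the sample events are hypothetical and exist only for this example, which builds a tiny in-memory archive rather than downloading real data.

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in one gzip-compressed archive.

    Assumption: the archive is newline-delimited JSON, one event per line.
    """
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts


def make_sample_archive() -> bytes:
    """Build a tiny in-memory archive from made-up events (illustration only)."""
    events = [
        {"type": "PushEvent", "repo": {"name": "example/a"}},
        {"type": "WatchEvent", "repo": {"name": "example/a"}},
        {"type": "PushEvent", "repo": {"name": "example/b"}},
    ]
    payload = "\n".join(json.dumps(e) for e in events) + "\n"
    return gzip.compress(payload.encode("utf-8"))


if __name__ == "__main__":
    print(count_event_types(make_sample_archive()).most_common())
```

To run the same count on real data, read a downloaded archive file as bytes and pass it to count_event_types in place of the sample. For analysis across many archives, the BigQuery tables are likely the more practical route.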