Two of the richest publicly available resources for studying computer-supported cooperative work sit largely unconnected: the GH Archive — a continuous record of every public GitHub event since 2011 — and the ACM CSCW Proceedings Corpus — three decades of peer-reviewed research on collaborative systems. Combining them opens up compelling mixed-methods research designs.
The datasets at a glance
The GH Archive captures structured event data: push events, pull request reviews, issue comments, fork actions. At scale it reveals the rhythms of distributed collaboration — who reviews whom, how quickly patches are merged, how contributor networks evolve over time.
The ACM CSCW Proceedings Corpus provides the conceptual vocabulary: what researchers have said about collaboration, which constructs have gained traction, and how the field’s attention has shifted from synchronous groupware to asynchronous platforms.
A combined research design
Step 1 — Select a phenomenon
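Choose a collaboration pattern visible in both datasets, for example code review norms in large open-source projects. The Linux Kernel Development case study is a tempting anchor because it is extensively documented in CSCW literature. Note, however, that most kernel patch review happens on mailing lists rather than through GitHub pull requests, so GH Archive mainly captures activity around the GitHub mirror; a project that conducts its reviews on GitHub itself will be more richly represented.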
Step 2 — Quantitative layer (GH Archive)
Use BigQuery to extract pull request review events for a set of target repositories over a defined time window. Key metrics:
- Review turnaround time
- Reviewer diversity (number of unique reviewers per contributor)
- Comment-to-merge ratio
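One way to pin down these three metrics is to compute them over a flattened event table. The sketch below uses only the standard library and a handful of made-up rows; the row layout (`pr`, `kind`, `actor`, `at`) is a hypothetical simplification, not the GH Archive schema, and the definitions chosen here (time to first review, distinct reviewers per pull request author, comments divided by merged pull requests) are assumptions that a real study would need to justify.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical, hand-written rows for illustration only (not real data).
EVENTS = [
    {"pr": 1, "kind": "opened", "actor": "alice", "at": "2023-01-01T00:00:00"},
    {"pr": 1, "kind": "review", "actor": "bob", "at": "2023-01-01T06:00:00"},
    {"pr": 1, "kind": "comment", "actor": "bob", "at": "2023-01-01T07:00:00"},
    {"pr": 1, "kind": "comment", "actor": "alice", "at": "2023-01-01T08:00:00"},
    {"pr": 1, "kind": "merged", "actor": "carol", "at": "2023-01-02T00:00:00"},
    {"pr": 2, "kind": "opened", "actor": "alice", "at": "2023-01-03T00:00:00"},
    {"pr": 2, "kind": "review", "actor": "carol", "at": "2023-01-03T02:00:00"},
    {"pr": 2, "kind": "comment", "actor": "carol", "at": "2023-01-03T03:00:00"},
]


def review_metrics(events):
    """Compute the three metrics under the assumed definitions above."""
    opened, author, first_review = {}, {}, {}
    reviewers = defaultdict(set)  # PR author -> set of distinct reviewers
    comments = merged = 0

    # ISO 8601 strings in one format sort chronologically as plain text.
    for event in sorted(events, key=lambda e: e["at"]):
        pr, kind = event["pr"], event["kind"]
        ts = datetime.fromisoformat(event["at"])
        if kind == "opened":
            opened[pr], author[pr] = ts, event["actor"]
        elif kind == "review":
            first_review.setdefault(pr, ts)  # keep only the earliest review
            if pr in author:
                reviewers[author[pr]].add(event["actor"])
        elif kind == "comment":
            comments += 1
        elif kind == "merged":
            merged += 1

    hours = [
        (first_review[pr] - opened[pr]).total_seconds() / 3600
        for pr in first_review
        if pr in opened
    ]
    return {
        "median_turnaround_hours": median(hours) if hours else None,
        "reviewers_per_contributor": {a: len(r) for a, r in reviewers.items()},
        "comment_to_merge_ratio": comments / merged if merged else None,
    }


metrics = review_metrics(EVENTS)
```

Returning `None` when there are no reviews or no merges keeps empty repositories from raising errors; whether such repositories should be dropped or reported is a design choice for the study.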
Jupyter Notebook is a convenient environment for this analysis: BigQuery can be queried from Python through the google-cloud-bigquery client library, and the results can be visualised inline with matplotlib or altair.
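As a rough sketch of the extraction step, the snippet below builds a parameterised query against the daily GH Archive tables and runs it with google-cloud-bigquery. The JSON paths inside `payload`, the choice of `PullRequestReviewEvent`, and the helper names are assumptions for illustration; check them against the current table schema before relying on the output. Running the query requires Google Cloud credentials and is billed by bytes scanned, so keep the date window narrow.

```python
from datetime import date


def table_suffix(day):
    """Suffix matched by the wildcard `githubarchive.day.20*`, e.g. '230101'."""
    return day.strftime("%y%m%d")


def build_review_query(table_prefix="githubarchive.day.20*"):
    """Return SQL for review events; parameters are bound at run time."""
    return f"""
SELECT
  repo.name AS repo,
  actor.login AS reviewer,
  -- Assumed payload paths; verify them against the event payloads you get.
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') AS pr_number,
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.user.login') AS author,
  created_at
FROM `{table_prefix}`
WHERE _TABLE_SUFFIX BETWEEN @start_suffix AND @end_suffix
  AND type = 'PullRequestReviewEvent'
  AND repo.name IN UNNEST(@repos)
"""


def run_review_query(repos, start, end):
    """Execute the query and return a DataFrame (needs pandas installed)."""
    # Imported lazily so the query text can be inspected without the client.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("repos", "STRING", list(repos)),
            bigquery.ScalarQueryParameter(
                "start_suffix", "STRING", table_suffix(start)
            ),
            bigquery.ScalarQueryParameter(
                "end_suffix", "STRING", table_suffix(end)
            ),
        ]
    )
    return client.query(build_review_query(), job_config=config).to_dataframe()


sql = build_review_query()
```

In a notebook, a call such as `run_review_query(["example-org/example-repo"], date(2023, 1, 1), date(2023, 1, 7))` would return one row per review event; the repository name here is a placeholder.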
Step 3 — Qualitative layer (CSCW Corpus)
Run a topic model (LDA or BERTopic) over CSCW abstracts filtered to papers about code review or patch submission. Map the evolution of dominant topics across conference years to contextualise the quantitative trends.
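Once each abstract has a topic label, mapping the evolution across years is a small aggregation. The sketch below assumes the labels have already been produced by whichever topic model was chosen; the `(year, topic)` pairs and the topic names are invented placeholders, not findings from the corpus.

```python
from collections import Counter, defaultdict

# Placeholder (year, topic label) pairs standing in for topic model output.
PAPERS = [
    (2012, "tool support"),
    (2012, "tool support"),
    (2012, "social norms"),
    (2018, "social norms"),
    (2018, "social norms"),
    (2018, "tool support"),
    (2018, "automation"),
]


def topic_share_by_year(papers):
    """Return {year: {topic: fraction of that year's papers}}."""
    counts = defaultdict(Counter)
    for year, topic in papers:
        counts[year][topic] += 1

    shares = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        shares[year] = {t: n / total for t, n in counts[year].items()}
    return shares


def dominant_topic(shares):
    """Pick the topic with the largest share in each year."""
    return {
        year: max(by_topic, key=by_topic.get)
        for year, by_topic in shares.items()
    }


shares = topic_share_by_year(PAPERS)
dominant = dominant_topic(shares)
```

Working with shares rather than raw counts keeps years with more accepted papers from dominating the trend.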
Step 4 — Triangulation
Overlay the quantitative timeline with the qualitative topic shifts. Did changes in review behaviour precede or follow changes in how researchers framed the phenomenon?
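One simple way to probe the precede-or-follow question is a lagged correlation between two yearly series, for instance a review metric and a topic share. The sketch below is a descriptive aid under strong assumptions (equal-length, evenly spaced series; toy numbers invented for the example), not a causal test, and with only a few conference years the estimates will be noisy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlations(behaviour, framing, max_lag=2):
    """Correlate behaviour[t] with framing[t + lag] for each lag.

    A positive lag means the behaviour series is compared with later
    values of the framing series, i.e. behaviour would be leading.
    """
    if len(behaviour) != len(framing):
        raise ValueError("series must have the same length")
    n = len(behaviour)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = behaviour[: n - lag], framing[lag:]
        else:
            xs, ys = behaviour[-lag:], framing[: n + lag]
        if len(xs) >= 3:  # too few overlapping points is meaningless
            result[lag] = pearson(xs, ys)
    return result


# Toy series: framing repeats behaviour one step later.
behaviour = [1, 3, 2, 5, 4, 6, 8, 7]
framing = [0, 1, 3, 2, 5, 4, 6, 8]
correlations = lagged_correlations(behaviour, framing)
best_lag = max(correlations, key=correlations.get)
```

In the toy data the strongest correlation sits at a lag of one step, which is how "behaviour changed first" would show up; real series should also be inspected visually before reading anything into the peak.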
Reproducibility considerations
- Pin your BigQuery SQL queries and export schemas in the notebook
- Archive the CSCW corpus snapshot used (the corpus is updated annually)
- Use papermill to parameterise the notebook for different repository sets
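The notebook parameterisation could look roughly like the sketch below. The notebook path, output directory, repository names, and parameter names are all hypothetical; the only library call used is `papermill.execute_notebook`, which is imported lazily so the run plan can be checked without papermill installed.

```python
from pathlib import Path

# Hypothetical repository sets; replace with the projects under study.
REPO_SETS = {
    "set_a": ["example-org/project-one"],
    "set_b": ["example-org/project-two", "example-org/project-three"],
}


def plan_runs(template, out_dir, repo_sets, start, end):
    """Build one papermill run description per repository set."""
    runs = []
    for name, repos in sorted(repo_sets.items()):
        runs.append(
            {
                "input_path": template,
                "output_path": str(Path(out_dir) / f"{name}.ipynb"),
                # These must match names in the notebook's "parameters" cell.
                "parameters": {"repos": repos, "start": start, "end": end},
            }
        )
    return runs


def execute(runs):
    """Run each planned notebook; requires papermill and a Jupyter kernel."""
    import papermill as pm

    for run in runs:
        pm.execute_notebook(
            run["input_path"],
            run["output_path"],
            parameters=run["parameters"],
        )


runs = plan_runs("analysis.ipynb", "runs", REPO_SETS, "2023-01-01", "2023-01-31")
```

Each executed notebook keeps its injected parameters alongside its outputs, which makes it easier to tell later which repository set and time window produced a given figure.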
External resources
The GH Archive data is accessible at gharchive.org and mirrored on Google BigQuery as githubarchive.day.*. The CSCW corpus is available through the ACM Digital Library; institutional access is required for full-text.
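For small samples, the raw archives can be read without BigQuery. The sketch below assumes the hourly file naming documented on gharchive.org (gzipped, one JSON event per line, hour not zero-padded); the helper names are made up for this example, and the download function is not exercised here.

```python
import gzip
import json
from datetime import datetime
from urllib.request import urlopen


def archive_url(hour):
    """URL of the hourly archive, e.g. .../2015-01-01-15.json.gz."""
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def parse_events(raw_gzip_bytes, wanted_types):
    """Decode a gzipped JSON-lines archive and keep the wanted event types."""
    events = []
    for line in gzip.decompress(raw_gzip_bytes).splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") in wanted_types:
            events.append(event)
    return events


def fetch_events(hour, wanted_types):
    """Download one hour of events (network access required)."""
    with urlopen(archive_url(hour)) as response:
        return parse_events(response.read(), wanted_types)


# Offline check with a tiny hand-made archive instead of a real download.
sample = gzip.compress(
    b'{"type": "PushEvent", "id": "1"}\n'
    b'{"type": "PullRequestReviewEvent", "id": "2"}\n'
)
reviews = parse_events(sample, {"PullRequestReviewEvent"})
url = archive_url(datetime(2015, 1, 1, 15))
```

Hourly files for busy periods are large, so this route suits spot checks and pilot samples more than multi-year extractions.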