2nd Summer Institute in Statistics for Big Data

Module 3: Reproducible Research for Biomedical Big Data

Week 2, Session 3: Mon Jul 18, 8:30 AM to Wed Jul 20, 12:00 PM
Instructor(s):

The validity of conclusions from scientific investigations is typically strengthened when independent researchers replicate the results. Full replication of a study’s results using independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. In many fields, however, there are investigations that cannot be fully replicated, often because of a lack of time or resources. Such situations call for a minimum standard that can serve as an intermediate step between full replication and no verification at all. This minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternate analyses. The standard is especially important in biomedical research, where results may determine patient care. Unfortunately, reviews of the current literature suggest that this “standard” is anything but standard practice. Examples of non-reproducible research resulting in improper treatment of patients have driven journals, funding agencies, and regulatory agencies to press for a higher standard of reproducibility.

In this module, we will present examples of systemic breakdowns that demonstrate the need for reproducible research, and we will introduce tools for conducting it. Topics covered will include the types of breakdowns most commonly seen, current regulatory requests, literate statistical programming techniques, reproducible statistical computation, and techniques for making large-scale data analyses reproducible. We will focus on the R statistical computing language and will also discuss other tools that can be used to produce reproducible documents. We will assume some familiarity with R.

Recommended Reading: Gandrud (2015), Reproducible Research with R and RStudio (2nd ed.).