Module 1: Accessing Biomedical Big Data
Instructors: Raphael Gottardo and Jeffrey Leek
Module description: In this module, we will introduce some of the most popular public data repositories (e.g. GEO, SRA) and demonstrate how to access them using tools in R and Bioconductor (e.g. the GEOquery package). We will focus on data retrieval, manipulation, and formatting. Participants will learn how to get datasets and biological annotations ready for visualization and statistical analysis. Our approach will center on the concept of “tidy data”: data organized so that each variable forms a column and each observation forms a row, yielding files that are readable and easy to share. We will use hands-on examples from published studies. Principles will be illustrated using data from microarray and next-generation sequencing technologies. We will assume some familiarity with R. Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.
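The tidy-data idea above can be sketched in a few lines of base R: a wide expression matrix (genes in rows, samples in columns) is reshaped so that each row records one gene/sample/value observation. The matrix here is simulated for illustration; in the module, a real data set would come from a repository via a tool such as GEOquery's getGEO().

```r
# Minimal sketch: reshape a wide expression matrix (genes x samples)
# into tidy long format -- one row per (gene, sample, value).
# Data are simulated; gene and sample names are hypothetical.

set.seed(1)
expr <- matrix(rnorm(6), nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("sample1", "sample2", "sample3")))

# Base-R reshape: expand the row and column names against the values
# (as.vector() walks the matrix column by column)
tidy <- data.frame(
  gene   = rep(rownames(expr), times = ncol(expr)),
  sample = rep(colnames(expr), each  = nrow(expr)),
  value  = as.vector(expr)
)

head(tidy)
```

In this long format, every variable (gene, sample, value) is its own column, which is the shape most plotting and modeling functions expect.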
Module 2: Visualization of Biomedical Big Data
Instructors: Hadley Wickham and Dianne Cook
Module description: In this module, we will present general-purpose techniques for visualizing large data sets of any kind, as well as specific techniques for visualizing common types of biological data sets. Often the challenge of visualizing Big Data is aggregating it down to a suitable level; understanding Big Data involves an iterative cycle of visualization and modeling, which we will illustrate with several case studies during the workshop. The first segment of this module will focus on structured development of static graphics using the ggplot2 package in R, which enables building plots from grammatically defined elements and producing templates for reuse with multiple data sets. We will show how to extend these principles to genomic data using the ggplot2-based ggbio package. The second segment will focus on interactive graphics for rapid exploration of Big Data. We will demonstrate interactive techniques for high-performance local display using cranvas, and for easily creating interactive web graphics with ggvis. In addition, we will explain how to create simple web GUIs for managing complex summaries of biological data using the shiny package. We will use a hands-on teaching methodology that combines short lectures with longer practice sessions, so that as students learn about new techniques they can put them into practice and receive feedback from experts. We will teach using R and RStudio. We will assume some familiarity with R. Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.
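The aggregate-before-plotting step described above can be sketched with base R alone (the module itself would express the plot in ggplot2; the condition labels and summary choices below are hypothetical):

```r
# Minimal sketch: collapse a large data set to a plottable summary.
# 100,000 simulated measurements across three hypothetical conditions;
# in the module, the summary would then feed a ggplot2 layer, e.g.
# ggplot(summary_df, aes(condition, value.mean)) + geom_col().

set.seed(42)
big <- data.frame(
  condition = sample(c("control", "treatA", "treatB"), 1e5, replace = TRUE),
  value     = rnorm(1e5)
)

# Aggregate 100,000 rows down to one mean and sd per condition
summary_df <- aggregate(value ~ condition, data = big,
                        FUN = function(v) c(mean = mean(v), sd = sd(v)))

summary_df
```

Summarizing first keeps the plotting step fast and the figure legible, which is the point made above about aggregating Big Data to a suitable level.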
Module 3: Supervised Methods for Statistical Machine Learning
Instructors: Noah Simon and Daniela Witten
Module description: In this module, we will present a number of supervised learning techniques for the analysis of Biomedical Big Data. These techniques include penalized approaches for performing regression, classification, and survival analysis with Big Data. Support vector machines, decision trees, and random forests will also be covered. The main emphasis will be on the analysis of “high-dimensional” data sets from genomics, transcriptomics, metabolomics, proteomics, and other fields. These data are typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). We will also consider electronic health record data sets, which often contain many missing measurements. Throughout the course, we will focus on common pitfalls in the supervised analysis of Biomedical Big Data, and how to avoid them. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language. Recommended Reading: James et al. (2013) Introduction to Statistical Learning. Springer Series in Statistics. Available for free download at www.statlearning.com.
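The high-dimensional setting described above (many more features than samples) is exactly where penalized regression earns its keep, and the core idea fits in a few lines of base R. This is an illustrative sketch with simulated data; the module's real analyses would use dedicated packages, and the dimensions and penalty value below are arbitrary choices.

```r
# Minimal sketch of ridge (penalized) regression when p > n.
# Simulated data: 20 samples, 100 features, only 5 of which matter.

set.seed(7)
n <- 20; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 5), rep(0, p - 5))
y <- X %*% beta_true + rnorm(n)

# Ordinary least squares fails here: t(X) %*% X is singular when p > n.
# Adding the ridge penalty lambda * I makes the system solvable:
lambda <- 1
beta_hat <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

# Features with the largest estimated coefficients (ideally the 5 real ones)
order(abs(beta_hat), decreasing = TRUE)[1:5]
```

The penalty trades a little bias for a solvable, stabilized fit; choosing lambda by cross-validation, rather than fixing it as above, is one of the pitfalls-and-practices topics the module covers.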
Module 4: Unsupervised Methods for Statistical Machine Learning
Instructors: Genevera Allen and Yufeng Liu
Module description: In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Biomedical Big Data. These include dimension reduction techniques such as principal components analysis and non-negative matrix factorization, cluster analysis, and network analysis with graphical models. We will also discuss large-scale inference issues, such as multiple testing, that arise when mining for associations in Biomedical Big Data. As in Module 3 on supervised learning, the main emphasis will be on the analysis of real high-dimensional data sets from various scientific fields, including genomics and biomedical imaging. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language. Recommended Reading: James et al. (2013) Introduction to Statistical Learning. Springer Series in Statistics. Available for free download at www.statlearning.com.
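Principal components analysis, the first dimension-reduction technique named above, is available in base R via prcomp(). A minimal sketch on simulated data (two hypothetical sample groups, arbitrary dimensions; in the module, rows would be samples and columns molecular measurements):

```r
# Minimal sketch: PCA on simulated data containing two sample groups.

set.seed(3)
group1 <- matrix(rnorm(50 * 20, mean = 0), 50, 20)
group2 <- matrix(rnorm(50 * 20, mean = 2), 50, 20)
X <- rbind(group1, group2)   # 100 samples x 20 features

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Variance explained by the first few components; with a strong group
# effect, the first component captures most of the structure
summary(pc)$importance["Proportion of Variance", 1:3]

# Plotting pc$x[, 1:2] would reveal the two groups without any labels --
# the defining trick of unsupervised analysis
```

No group labels were used in the fit; the structure emerges from the data alone, which is what distinguishes this module's methods from the supervised techniques of Module 3.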
Module 5: Reproducible Research for Biomedical Big Data
Instructors: Keith Baggerly and Roger Peng
Module description: The validity of conclusions from scientific investigations is typically strengthened by the replication of results by independent researchers. Full replication of a study’s results using independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. However, in many fields of study, there are examples of scientific investigations which cannot be fully replicated, often because of a lack of time or resources. In such situations, there is a need for a minimum standard which can serve as an intermediate step between full replication and nothing. This minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternative analyses. This standard is especially important in the context of biomedical research, where the results may determine patient care. Unfortunately, reviews of the current literature suggest that this “standard” is anything but. Examples of non-reproducible research resulting in improper treatment of patients have driven journals, funding agencies, and regulatory agencies to press for a higher standard of reproducibility. In this module, we will provide examples of systemic breakdowns demonstrating the need for reproducible research, and an introduction to tools for conducting reproducible research. Topics covered will include the types of breakdowns most commonly seen, current regulatory requests, literate statistical programming techniques, reproducible statistical computation, and techniques for making large-scale data analyses reproducible. We will focus on the R statistical computing language, and will discuss other tools that can be used for producing reproducible documents. We will assume some familiarity with R. Recommended Reading: Gandrud (2013) Reproducible Research with R and RStudio.
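Two of the smallest habits underpinning reproducible computation can be shown in a few lines of base R: fixing the random seed so stochastic results can be re-created exactly, and recording the computing environment alongside the analysis. The seed value below is arbitrary; in a knitr/R Markdown document these lines would run automatically as part of the write-up.

```r
# Minimal sketch of two reproducibility habits.

# 1. Fix the seed: the same seed yields the identical "random" result,
#    so a reader re-running the code gets exactly the published number.
set.seed(2013)
result1 <- mean(rnorm(1000))

set.seed(2013)
result2 <- mean(rnorm(1000))

identical(result1, result2)   # TRUE

# 2. Record the environment: R version and loaded packages, so the
#    analysis can be re-run under matching conditions later.
si <- sessionInfo()
si$R.version$version.string
```

These habits do not replace full replication, but they meet the minimum standard described above: another analyst, given the data and this code, can verify the published numbers.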