2018 SISBID Modules

Registration opens February 1, 2018. Please note that scholarships will not be available for SISBID 2018.

Module 1: Data Wrangling with R

Week 1, Session 2: Wednesday, July 11, 1:30 p.m.-5 p.m.; Thursday, July 12, 8:30 a.m.-5 p.m.; Friday, July 13, 8:30 a.m.-5 p.m.

Instructor(s): Andrew Jaffe, Johns Hopkins University; Jeffrey Leek, Johns Hopkins University

Participants will learn how to get data and process it for visualization and statistical analysis. Our approach focuses on the concept of creating “tidy data”, i.e. data that is organized into readable and distributable files. In this module, we will:

  • Use hands-on examples from published studies and cover concepts on data retrieval, manipulation, and formatting.
  • Touch on reproducible research using R Markdown and collaborative code sharing using GitHub.
  • Briefly introduce some of the most popular public data repositories in genomics (e.g. GEO, SRA), and demonstrate how to access these repositories using tools in R and Bioconductor (e.g. using the recount and GEOquery packages).

Principles will be illustrated using data from microarray and next generation sequencing technologies.

Module assumes some familiarity with R.

Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.

Module 2: Reproducible Research for Biomedical Big Data

Week 2, Session 3: Monday, July 16, 8:30 a.m.-5 p.m.; Tuesday, July 17, 8:30 a.m.-5 p.m.; Wednesday, July 18, 8:30 a.m.-Noon

Instructors: Keith Baggerly, University of Texas MD Anderson Cancer Research Center; Karl Broman, University of Wisconsin-Madison

The validity of conclusions from scientific investigations is typically strengthened by the replication of results by independent researchers. Full replication of a study’s results using independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. However, in many fields of study, there are examples of scientific investigations which cannot be fully replicated, often because of a lack of time or resources.

In such situations, there is a need for a minimum standard which can serve as an intermediate step between full replication and nothing. This minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternate analyses. This standard is especially important in the context of biomedical research, where the results may determine patient care.

Unfortunately, reviews of the current literature suggest that this “standard” is anything but. Examples of non-reproducible research resulting in improper treatment of patients have driven journals, funding agencies, and regulatory agencies to press for a greater standard of reproducibility.

In this module, we will provide examples of systemic breakdowns demonstrating the need for reproducible research, and an introduction to tools for conducting reproducible research. Topics covered will include the types of breakdowns most commonly seen, current regulatory requests, literate statistical programming techniques, reproducible statistical computation, and techniques for making large-scale data analyses reproducible.

We will focus on the R statistical computing language, and will discuss other tools that can be used for producing reproducible documents. Module assumes some familiarity with R.

Recommended Reading: Gandrud (2015) Reproducible Research with R and RStudio (2e).

Module 3: Supervised Methods for Statistical Machine Learning

Week 2, Session 4: Wednesday, July 18, 1:30-5 p.m.; Thursday, July 19, 8:30 a.m.-5 p.m.; Friday, July 20, 8:30 a.m.-5 p.m. 

Instructor(s): Noah Simon, University of Washington; Ali Shojaie, University of Washington

In this module, we will present a number of supervised learning techniques for the analysis of Biomedical Big Data. These techniques include penalized approaches for performing regression, classification, and survival analysis with Big Data. Support vector machines, decision trees, and random forests will also be covered.

The main emphasis will be on the analysis of “high-dimensional” data sets from genomics, transcriptomics, metabolomics, proteomics, and other fields. These data are typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). We will also consider electronic health record data sets, which often contain many missing measurements.

Throughout the course, we will focus on common pitfalls in the supervised analysis of Biomedical Big Data and how to avoid them. The techniques discussed will be demonstrated in R.

This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language.

Recommended Reading: James et al. (2013) An Introduction to Statistical Learning. Springer Texts in Statistics. Available for free download at www.statlearning.com.

Module 4: Unsupervised Methods for Statistical Machine Learning

Week 3, Session 5: Monday, July 23, 8:30 a.m.-5 p.m.; Tuesday, July 24, 8:30 a.m.-5 p.m.; Wednesday, July 25, 8:30 a.m.-Noon 

Instructor(s): Genevera Allen, Rice University; Yufeng Liu, University of North Carolina

In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Biomedical Big Data. These include dimension reduction techniques such as principal components analysis and non-negative matrix factorization, clustering analysis, and network analysis with graphical models.

We will also discuss large-scale inference issues, such as multiple testing, that arise when mining for associations in Biomedical Big Data. As in Module 3 on supervised learning, the main emphasis will be on the analysis of real high-dimensional data sets from various scientific fields, including genomics and biomedical imaging. The techniques discussed will be demonstrated in R.

This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language.

Recommended Reading: James et al. (2013) An Introduction to Statistical Learning. Springer Texts in Statistics. Available for free download at www.statlearning.com.

Module 5: Visualization of Biomedical Big Data

Week 3, Session 6: Wednesday, July 25, 1:30-5 p.m.; Thursday, July 26, 8:30 a.m.-5 p.m.; Friday, July 27, 8:30 a.m.-Noon 

Instructors: Dianne Cook, Monash University; Heike Hofmann, Iowa State University

We will present general-purpose techniques for visualizing any sort of large data sets, as well as specific techniques for visualizing common types of biological data sets. Often the challenge of visualizing Big Data is to aggregate it down to a suitable level. Understanding Big Data involves an iterative cycle of visualization and modeling. We will illustrate this with several case studies during the workshop.

The first segment of this module will focus on structured development of graphics using static graphics. This will use the ggplot2 package in R. It enables building plots using grammatically defined elements, and producing templates for use with multiple data sets. We will show how to extend these principles for genomic data using the ggplot2-based ggbio package.

The second segment will focus on interactive graphics for rapid exploration of Big Data. We will demonstrate interactive techniques for high-performance local display using cranvas, and for easily creating interactive web graphics with ggvis. In addition, we will explain how to create simple web GUIs for managing complex summaries of biological data using the shiny package.

We will use a hands-on teaching methodology that combines short lectures with longer practice sessions. As students learn about new techniques, they will also be able to put them into practice and receive feedback from experts. We will teach using R and RStudio.

Module assumes some familiarity with R.

Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com.