Module 1: Probability and Statistical Inference
Instructors: Jim Hughes and David Yanez
Module description: This module covers the laws of probability and the binomial, multinomial, and normal distributions. It covers descriptive statistics and methods of inference, including maximum likelihood, confidence intervals and simple Bayes methods. Classical hypothesis testing topics are also covered, including type I and II errors, two-sample tests, chi-square tests and contingency table analysis, and exact and permutation tests. Resampling methods, such as the bootstrap and jackknife, are covered as well. This module serves as a foundation for almost all of the later modules. Co-taught with the Summer Institute in Statistics and Modeling in Infectious Diseases (SISMID 2015).
Module 2: Molecular Genetics and Genomics
Instructors: Greg Gibson and Christine Queitsch
Module description: The molecular genetics and genomics module covers the theory and practice of modern genetics. It is designed to provide biologists with the foundations upon which statistical genetics is built, and/or an introduction to the concepts of classical and contemporary genetics for statisticians and informaticians. We start with the key concepts of quantitative and Mendelian genetics and then illustrate how these have been reconciled with molecular biology. Two half days are then spent on the basics of genome-wide association mapping as well as exome and whole genome sequencing; and on gene expression profiling and integrative genomics leading to systems biology. On the final afternoon the instructors provide their perspectives on the future of personalized medicine, and on evolutionary genetics.
Module 3: Introduction to R
Instructors: Ken Rice and Tim Thornton
Module description: This module introduces the R statistical environment, assuming no prior knowledge. It provides a foundation for the use of R for computation in later modules. In addition to discussing basic data management tasks in R, such as reading in data and producing summaries through R scripts, we will also introduce R’s graphics functions, its powerful package system, and simple methods of looping. Examples and exercises will use data drawn from biological and medical applications, including infectious diseases and genetics. Hands-on use of R is a major component of this module; users require a laptop and will use it in all sessions. Co-taught with the Summer Institute in Statistics and Modeling in Infectious Diseases (SISMID 2015).
Module 4 (SISBID Module 1): Accessing Biomedical Big Data
Instructors: Raphael Gottardo and Jeffrey Leek
Module description: In this module, we will introduce some of the most popular public data repositories (e.g. GEO, SRA), and will demonstrate how to access these repositories using tools in R and Bioconductor (e.g. using the GEOquery package). We will focus on data retrieval, manipulation, and formatting. Participants will learn how to get datasets and biological annotations ready for visualization and statistical analysis. Our approach will focus on the concept of “tidy data”: data that is organized into readable and distributable files. We will use hands-on examples from published studies. Principles will be illustrated using data from microarray and next generation sequencing technologies. We will assume some familiarity with R. Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com. Co-listed with Summer Institute in Statistics for Big Data (SISBID 2015).
Module 5: Regression and Analysis of Variance
Instructors: Rebecca Hubbard and Lurdes Inoue
Module description: This module is designed as a foundation for the quantitative genetics and QTL modules as well as for the association mapping modules. It covers linear regression and analysis of variance, as well as the basic commands in R. This module includes both lectures and interactive data analysis using R. Specific topics discussed are: simple linear regression; multiple linear regression; residual analysis; transformations; one-way ANOVA; two-way ANOVA; analysis of covariance; multiple comparisons. Assumes the material in Module 1, Probability and Statistical Inference. Module 3, Introduction to R, would be helpful.
Module 6: Gene Expression Profiling
Instructors: Greg Gibson and Jeffrey Leek
Module description: The gene expression module will cover the theory and application of transcriptomics, including both microarray and RNA-Seq methodologies. The focus of the module is on the statistical basis of hypothesis testing, covering the central role of normalization strategies with the opportunity for students to work examples using the SNM package in R, as well as the fundamentals of ANOVA and the False Discovery Rate procedure. In addition, we will discuss options for downstream processing by clustering and module detection, finishing with expression QTL analysis and integrative genomics to infer pathway structure.
Module 7: Elements of R for Genetics and Bioinformatics
Instructors: Thomas Lumley and Ken Rice
Module description: This module introduces programming skills required for analysis of genetic data, in the R statistical environment. We will briefly review how R scripts are built up from interactive commands, and then discuss how to turn these into R programs based on user-defined functions. We will cover R’s debugging system, its tools for enhancing efficiency of code, and its methods for handling errors and warnings. In many genomic applications, R packages already exist to perform specialized statistical and bioinformatic analyses, and we will introduce packages for handling large datasets, and the Bioconductor repository of genomic packages. Assumes some prior familiarity with R, to the level covered in Module 3. Users require a laptop and will use it in all sessions.
Module 8 (SISBID Module 2): Visualization of Biomedical Big Data
Instructors: Hadley Wickham and Dianne Cook
Module description: In this module, we will present general-purpose techniques for visualizing any sort of large data sets, as well as specific techniques for visualizing common types of biological data sets. Often the challenge of visualizing Big Data is to aggregate it down to a suitable level. Understanding Big Data involves an iterative cycle of visualization and modeling. We will illustrate this with several case studies during the workshop. The first segment of this module will focus on structured development of graphics using static graphics. This will use the ggplot2 package in R. It enables building plots using grammatically defined elements, and producing templates for use with multiple data sets. We will show how to extend these principles for genomic data using the ggplot2-based ggbio package. The second segment will focus on interactive graphics for rapid exploration of Big Data. We will also demonstrate interactive techniques for high-performance local display using cranvas, and for easily creating interactive web graphics with ggvis. In addition we will explain how to create simple web GUIs for managing complex summaries of biological data using the shiny package. We will use a hands-on teaching methodology that combines short lectures with longer practice sessions. As students learn about new techniques, they will also be able to put them into practice and receive feedback from experts. We will teach using R and RStudio. We will assume some familiarity with R. Recommended Reading: Cookbook for R, by Winston Chang, available at www.cookbook-r.com. Co-listed with Summer Institute in Statistics for Big Data (SISBID 2015).
Module 9: Population Genetic Data Analysis
Instructors: Jérôme Goudet and Bruce Weir
Module description: This module serves as a foundation for many of the later modules. Estimates and sample variances of allele frequencies, Hardy-Weinberg and linkage disequilibrium, characterization of population structure with F-statistics. Relationship estimation. Statistical genetic aspects of forensic science and association mapping. Concepts illustrated with R exercises. Assumes some familiarity with R, to the level covered in Module 3. Background reading: Holsinger, K. and Weir, B.S., 2009. Genetics in geographically structured populations: defining, estimating, and interpreting FST. Nature Reviews Genetics 10:639–650. Weir, B.S. and C.C. Laurie. 2011. Statistical genetics in the genome era. Genetics Research 92:461–470.
Module 10: Quantitative Genetics
Instructors: Bill Muir and Bruce Walsh
Module description: Provides a foundation for Modules 14, 15 and 23. Quantitative Genetics is the analysis of complex characters where both genetic and environmental factors contribute to trait variation. Since this includes most traits of interest, such as disease susceptibility, crop yield, and all microarray data, a working knowledge of quantitative genetics is critical in diverse fields from plant and animal breeding, human genetics, genomics, to ecology and evolutionary biology. The course will cover the basics of quantitative genetics including: Fisher's variance decomposition, covariance between relatives, heritability, inbreeding and crossbreeding, and response to selection. Also an introduction to advanced topics such as: Mixed Models, BLUP, QTL mapping; correlated characters; and the multivariate response to selection. Assumes the material in Module 1, Probability and Statistical Inference, and Module 5, Regression and Analysis of Variance. Background reading: Lynch, M. and Walsh, B., 1998. Genetics and analysis of quantitative traits. Sinauer Associates.
Module 11: Genetic Epidemiology
Instructors: Karen Edwards and Carolyn Hutter
Module description: This module will provide an overview of genetic epidemiology, with a focus on design, analysis and interpretation in studies of complex disease. The focus is on methods for discovering how genetic factors influence health and disease. Topics covered will include twin studies, family studies, segregation analysis, linkage analysis, population-based association studies, and gene-environment interactions. The course will include practical examples using on-line resources and statistical analysis programs. Assumes the material in Module 1, Probability and Statistical Inference. Background reading: M.A. Austin (editor). 2013. Genetic Epidemiology: methods and applications.
Module 12 (SISBID Module 3): Supervised Methods for Statistical Machine Learning
Instructors: Noah Simon and Daniela Witten
Module description: In this module, we will present a number of supervised learning techniques for the analysis of Biomedical Big Data. These techniques include penalized approaches for performing regression, classification, and survival analysis with Big Data. Support vector machines, decision trees, and random forests will also be covered. The main emphasis will be on the analysis of “high-dimensional” data sets from genomics, transcriptomics, metabolomics, proteomics, and other fields. These data are typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). We will also consider electronic health record data sets, which often contain many missing measurements. Throughout the course, we will focus on common pitfalls in the supervised analysis of Biomedical Big Data, and how to avoid them. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language. Recommended Reading: James et al. (2013) Introduction to Statistical Learning. Springer Series in Statistics. Available for free download at www.statlearning.com. Co-listed with Summer Institute in Statistics for Big Data (SISBID 2015).
Module 13: Association Mapping: GWAS and Sequencing Data
Instructors: Timothy Thornton and Michael Wu
Module description: This module serves as the foundation for many later modules. Topics covered include: basic probability and Mendelian genetics; Hardy-Weinberg equilibrium; inbreeding coefficients; population structure; recombination and genetic linkage; linkage disequilibrium; measures of relatedness; haplotype frequency estimation with unphased genotypes; genetic association testing; association testing in the presence of population structure and/or relatedness. Many methods are illustrated with implementation in R, and the module is most useful for students with basic familiarity with R. Other public domain software will be introduced, such as HAPLOVIEW, LOCUSZOOM, and PLINK. There is substantial overlap with Module 9, Population Genetic Data Analysis. Assumes the material in Module 1, Probability and Statistical Inference, and some familiarity with R, to the level covered in Module 3. Background reading: Weir, B.S. 1996. Genetic Data Analysis II. Sinauer Associates; Thornton and McPeek. 2010. ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure. American Journal of Human Genetics 86:172–184.
Module 14: QTL Mapping
Instructors: Rebecca Doerge and Zhao-Bang Zeng
Module description: Material in Module 8 would be helpful. This module will systematically introduce statistical methods for mapping quantitative trait loci (QTL) in experimental cross populations. Topics include experimental designs, linkage map construction, single-marker analyses, interval mapping, composite interval mapping and multiple interval mapping. Significance thresholds for genome scan and model selection will also be discussed. Uses public domain software Windows QTL-Cartographer for computer lab exercises. Emphasis is on procedures for QTL mapping data analysis and appropriate interpretation of mapping results rather than on formulas. Assumes the material in Module 1, Probability and Statistical Inference, Module 5, Regression and Analysis of Variance, and Module 10, Quantitative Genetics. Users require a laptop and will use it in all sessions.
Module 15: Mixed Models in Quantitative Genetics
Instructors: Bill Muir and Bruce Walsh
Module description: The analysis of linear models containing both fixed and random effects. Topics to be discussed include a basic matrix algebra review, the general linear model, derivation of the mixed model, BLUP and REML estimation, estimation and design issues, Bayesian formulations. Applications to be discussed include estimation of breeding values and genetic variances in general pedigrees, association mapping, genomic selection, direct and associative effects models of general group and kin selection, genotype by environment interaction models. Assumes the material in Module 1, Probability and Statistical Inference, and Module 5, Regression and Analysis of Variance. Background reading: Lynch, M. and B. Walsh. 1998. Genetics and analysis of quantitative traits. Sinauer Associates.
Module 16 (SISBID Module 4): Unsupervised Methods for Statistical Machine Learning
Instructors: Genevera Allen and Yufeng Liu
Module description: In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Biomedical Big Data. These include dimension reduction techniques such as principal components analysis and non-negative matrix factorization, clustering analysis, and network analysis with graphical models. We will also discuss large-scale inference issues, such as multiple testing, that arise when mining for associations in Biomedical Big Data. As in SISBID Module 3 (Module 12) on supervised learning, the main emphasis will be on the analysis of real high-dimensional data sets from various scientific fields, including genomics and biomedical imaging. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language. Co-listed with Summer Institute in Statistics for Big Data (SISBID 2015).
Module 17: Advanced R Programming for Bioinformatics
Instructors: Thomas Lumley and Ken Rice
Module description: This module covers techniques needed for software development in R. The module includes advanced graphics, object-oriented programming, SQL database use, some of the Bioconductor infrastructure, writing packages, and calling C code from R. This module is aimed at people who have either substantial R experience or programming experience in another language. The material in Module 7, Elements of R for Genetics and Bioinformatics (if taken this year), would not be sufficient preparation without some additional experience. Users require a laptop and will use it in all sessions. Background reading: Gentleman, R. (2008) R Programming in Bioinformatics. Taylor & Francis.
SISMID Module 11: Introduction to Metagenomic Data Analysis
Instructors: Alexander Alekseyenko and Paul J. McMurdie
Module description: This course is concerned with analysis of microbial community data generated by next-generation sequencing technologies. These high-throughput methods allow for deep surveying of microorganisms inhabiting their biological hosts. We will cover the steps for preprocessing Roche 454 sequencing data, necessary to produce abundance tables. We will then examine methodology for associating microbial abundance data with experimental factors and outcomes. Programming will be done in R, and an overview of available programs for preprocessing the data will be provided. Pre-requisites: Module 1: Probability and Statistical Inference. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases (SISMID 2015).
SISMID Module 12: Evolutionary Dynamics and Molecular Epidemiology of Viruses
Instructors: Philippe Lemey and Marc A. Suchard
Module description: This module covers the use of phylogenetic and bioinformatic tools to analyze pathogen genetic variation and to gain insight into the processes that shape their diversity. The module focuses on phylogenies and how these relate to population genetic processes in infectious diseases. In particular, the module will cover Bayesian Evolutionary Analysis by Sampling Trees (BEAST). This software will be used in class exercises that are mainly focused on estimating epidemic time scales, reconstructing changes in viral population sizes through time and inferring the spatial diffusion of viruses. Evolutionary processes including recombination and selection will also be considered. Assumes the material in Module 1. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases (SISMID 2015).
Module 18: MCMC for Genetics
Instructors: Eric Anderson and Matthew Stephens
Module description: This module examines the use of Bayesian Statistics and Markov chain Monte Carlo methods in modern analyses of genetic data. It assumes a solid foundation in basic statistics and the concept of likelihood as well as some population genetics. A basic familiarity with the R statistical package, or other computing language, will be helpful. The first day includes an introduction to Bayesian statistics, Monte Carlo, and MCMC. Mathematical concepts covered include expectation, laws of large numbers, and ergodic and time-reversible Markov chains. Algorithms include the Metropolis-Hastings algorithm and Gibbs sampling. Some mathematical detail is given; however, there is considerable emphasis on concepts and practical issues arising in applications. Mathematical ideas are illustrated with simple examples and reinforced with a computer practical using the R statistical language. With that background, two applications of MCMC are investigated in detail: inference of population structure (using the program STRUCTURE) and haplotype inference (using the program PHASE). Computer practicals using both programs are included. Further topics include the use of MCMC in model evaluation and model checking, strategies for assessing MCMC convergence and diagnosing MCMC mixing problems, importance sampling, and Metropolis-coupled MCMC. Software used: R, STRUCTURE, PHASE. Assumes a solid foundation in basic statistics and the concept of likelihood, as well as some population genetics, at least to the level of the material in Module 1, Probability and Statistical Inference and Module 9, Population Genetic Data Analysis. Module 3, Introduction to R, would be helpful. Background reading: Shoemaker, J.S., I.S. Painter and B.S. Weir. (1999). Bayesian statistics in genetics. Trends in Genetics 15:354–358. Beaumont, M.A. and B. Rannala. (2004). The Bayesian revolution in genetics. Nature Reviews Genetics 5:251–261. Gilks, W.R., S. Richardson and D.J. Spiegelhalter. (1996). “Markov Chain Monte Carlo in Practice.” Chapman and Hall.
Module 19: Statistical & Quantitative Genetics of Disease
Instructors: John Witte and Naomi Wray
Module description: This module focuses on the analysis of genetic data for disease and its interpretation. We will consider in detail the quantitative genetics of binary disease with emphasis on the equivalences and relationships between different models. We will contrast and synthesize the traditional viewpoints of quantitative geneticists and epidemiologists. Topics will include: risk models on different scales, namely the observed (or disease) scale, the liability threshold scale, and the log risk scale; estimation of heritability from familial risk ratios; estimation of the contribution of individual and multiple risk loci to disease; estimation of variance attributable to genome-wide SNPs individually and together; and risk profile scoring. Some of the underlying methods can be traced to classic papers; however, whole-genome methods applied to SNP data have been developed recently. Assumes the material in Module 7, Elements of R for Genetics and Bioinformatics, and Module 9, Population Genetic Data Analysis, or Module 13, Association Mapping: GWAS and Sequencing Data.
Module 20 (SISBID Module 5): Reproducible Research for Biomedical Big Data
Instructors: Keith Baggerly and Roger Peng
Module description: The validity of conclusions from scientific investigations is typically strengthened by the replication of results by independent researchers. Full replication of a study’s results using independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. However, in many fields of study, there are examples of scientific investigations which cannot be fully replicated, often because of a lack of time or resources. In such situations, there is a need for a minimum standard which can serve as an intermediate step between full replication and nothing. This minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternate analyses. This standard is especially important in the context of biomedical research, where the results may determine patient care. Unfortunately, reviews of the current literature suggest that this “standard” is anything but. Examples of non-reproducible research resulting in improper treatment of patients have driven journals, funding agencies, and regulatory agencies to press for a greater standard of reproducibility. In this module, we will provide examples of systemic breakdowns demonstrating the need for reproducible research, and an introduction to tools for conducting reproducible research. Topics covered will include the types of breakdowns most commonly seen, current regulatory requests, literate statistical programming techniques, reproducible statistical computation, and techniques for making large-scale data analyses reproducible. We will focus on the R statistical computing language, and will discuss other tools that can be used for producing reproducible documents. Recommended Reading: Gandrud (2013) Reproducible Research with R & R Studio. Co-listed with Summer Institute in Statistics for Big Data (SISBID 2015).
Module 21: Forensic Genetics
Instructors: John Buckleton, Simone Gittelson and Bruce Weir
Module description: Although the use of genetic profiles is now routine, there are still issues for which care must be taken when attaching statistics to matching profiles. These include the effects of relatives and population structure, the use of lineage markers and the interpretation of mixtures. An introductory account will be given of the use of likelihood ratios, familial searching, the “birthday problem” and Y-STR profiles. A discussion of the use of peak heights and allelic dropout will be given.
Module 22: Bayesian Statistics for Genetics
Instructors: Peter Hoff and Jon Wakefield
Module description: The use of Bayesian methods in genetics has a long history. In this introductory module we will begin by discussing introductory probability. We will then describe Bayesian approaches to binomial proportions, multinomial proportions, two-sample comparisons (binomial, Poisson, normal), the linear model, and Monte Carlo methods of summarization. Advanced topics will be touched on, including hierarchical models, generalized linear models, and missing data. Illustrative applications will include: Hardy-Weinberg testing and estimation, detection of allele-specific expression, QTL mapping, testing in genome-wide association studies, mixture models, multiple testing in high throughput genomics. Background Reading: P.D. Hoff (2009). A First Course in Bayesian Statistical Methods. Springer-Verlag.
Module 23: Advanced Quantitative Genetics
Instructors: Mike Goddard and Peter Visscher
Module description: This module assumes the material in Modules 1 and 6. Material in Modules 9, 11 and 15 would be helpful. This module focuses on the genetics and analysis of quantitative traits in human populations, with emphasis on estimation and prediction analysis using genetic markers. It is a good match with Module 19, which deals with similar topics but with a focus on disease (binary) outcomes. Topics include: the resemblance between relatives; estimation of genetic variance associated with genome-wide identity by descent; GWAS for quantitative traits; the use of GWAS data to estimate and partition genetic variation; principles and pitfalls of prediction analyses using genetic markers. For computer exercises, we will be using R, the Merlin suite of software, PLINK and GCTA. Background reading: Vinkhuyzen et al. 2013. Estimation and partition of heritability in human populations using whole-genome analysis methods. Annu Rev Genet. 47:75-95. doi: 10.1146/annurev-genet-111212-133258. Yang, J. et al. 2010. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565-569. Wray NR et al. 2013. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics 14:507-15. doi: 10.1038/nrg3457.
Module 24: Pathway & Network Analysis for Omics Data
Instructors: Alison Motsinger-Reif and Ali Shojaie
Module description: Networks represent the interactions among components of biological systems. In the context of high dimensional omics data, relevant networks include gene regulatory networks, protein-protein interaction networks, and metabolic networks. These networks provide a window into biological systems as well as complex diseases, and can be used to understand how biological functions are implemented and how homeostasis is maintained. On the other hand, pathway-based analyses can be used to leverage biological knowledge available from literature, gene ontologies or previous experiments in order to identify the pathways associated with disease or an outcome of interest. In this module, various statistical learning methods for reconstruction and analysis of networks from omics data are discussed, as well as methods of pathway enrichment analysis. Particular attention will be paid to omics datasets with a large number of variables, e.g. genes, and a small number of samples, e.g. patients. The techniques discussed will be demonstrated in R. This course assumes a previous course in regression and familiarity with R or other command line programming languages. Users require a laptop and will use it in all sessions.
Module 25: Molecular Phylogenetics
Instructors: Mary Kuhner and Michael Miyamoto
Module description: Overview of methods for analysis of interspecific DNA and protein sequence data. Coverage will include parsimony, maximum likelihood, distance-based, and Bayesian methods for phylogenetic estimation. Probabilistic models for sequence change will be emphasized. Related topics that will be presented are the comparative method, divergence time estimation, phylogenetic hypothesis testing, and detection of positive selection. Statistical methodology will be a focus and some related computational algorithms will be outlined. Brief introductions will be made to software packages such as PAUP*, BEAST, PHYLIP, and MrBayes. Assumes the material in Module 1, Probability and Statistical Inference, and Module 9, Population Genetic Data Analysis. Background readings: Felsenstein, J. (2004) “Inferring Phylogenies.” Sinauer Associates, or Yang, Z. (2006) “Computational Molecular Evolution (Oxford Series in Ecology and Evolution).” Oxford University Press.