Module 1: Probability and Statistical Inference
Instructors: Jim Hughes and David Yanez
This module covers the laws of probability and the binomial, multinomial, and normal distributions. It covers descriptive statistics and methods of inference, including maximum likelihood, confidence intervals and simple Bayes methods. It also covers classical hypothesis testing topics, including type I and II errors, two-sample tests, chi-square tests and contingency table analysis, and exact and permutation tests. Resampling methods, such as the bootstrap and jackknife, are covered as well. This module serves as a foundation for almost all of the later modules. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases.
Module 2: Molecular Genetics and Genomics
Instructors: Josh Akey and Greg Gibson
The molecular genetics and genomics module covers the theory and practice of modern genetics. It is designed to provide biologists with the foundations upon which statistical genetics is built, and to introduce statisticians and informaticians to the concepts of classical and contemporary genetics. We start with the key concepts of quantitative and Mendelian genetics and then illustrate how these have been reconciled with molecular biology. Two half days are then spent on the basics of genome-wide association mapping and of exome and whole genome sequencing, and on gene expression profiling and integrative genomics leading to systems biology. On the final afternoon the instructors provide their perspectives on the future of personalized medicine and on evolutionary genetics.
Module 3: Introduction to R
Instructors: Ken Rice and Tim Thornton
This module introduces the R statistical environment, assuming no prior knowledge. It provides a foundation for the use of R for computation in later modules. In addition to discussing basic data management tasks in R, such as reading in data and producing summaries through R scripts, we will also introduce R’s graphics functions, its powerful package system, and simple methods of looping. Examples and exercises will use data drawn from biological and medical applications, including infectious diseases and genetics. Hands-on use of R is a major component of this module; users require a laptop and will use it in all sessions. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases.
Module 4: Bayesian Statistics for Genetics
Instructors: Peter Hoff and Jon Wakefield
The use of Bayesian methods in genetics has a long history. In this introductory module we will begin by discussing basic probability. We will then describe Bayesian approaches to binomial proportions, multinomial proportions, two-sample comparisons (binomial, Poisson, normal), the linear model, and Monte Carlo methods of summarization. Advanced topics will be touched on, including hierarchical models, generalized linear models, and missing data. Illustrative applications will include: Hardy-Weinberg testing and estimation, detection of allele-specific expression, QTL mapping, testing in genome-wide association studies, mixture models, and multiple testing in high throughput genomics. Background reading: P.D. Hoff (2009). A First Course in Bayesian Statistical Methods. Springer-Verlag.
Module 5: Regression and Analysis of Variance
Instructors: Rebecca Hubbard and Lurdes Inoue
This module is designed as a foundation for the quantitative genetics and QTL modules as well as for the association mapping modules. It assumes the material in Module 1 and it will cover the basic commands in R. It covers linear regression and analysis of variance. This module includes both lectures and interactive data analysis using R. Specific topics discussed are: simple linear regression; multiple linear regression; residual analysis; transformations; one-way ANOVA; two-way ANOVA; analysis of covariance; multiple comparisons.
Module 6: Population Genetic Data Analysis
Instructors: Jérôme Goudet and Bruce Weir
This module serves as a foundation for many of the later modules. Topics include estimates and sample variances of allele frequencies, Hardy-Weinberg and linkage disequilibrium, characterization of population structure with F-statistics, relationship estimation, and statistical genetic aspects of forensic science and association mapping. Concepts are illustrated with R exercises. Background reading: Holsinger, K. and Weir, B.S. 2009. Genetics in geographically structured populations: defining, estimating, and interpreting Fst. Nature Reviews Genetics 10:639–650. Weir, B.S. and Laurie, C.C. 2011. Statistical genetics in the genome era. Genetics Research 92:461–470.
Module 7: Quantitative Genetics
Instructors: Bill Muir and Bruce Walsh
Quantitative Genetics is the analysis of complex characters where both genetic and environmental factors contribute to trait variation. Since this includes most traits of interest, such as disease susceptibility, crop yield, and all microarray data, a working knowledge of quantitative genetics is critical in diverse fields, from plant and animal breeding, human genetics, and genomics to ecology and evolutionary biology. The course will cover the basics of quantitative genetics including: Fisher's variance decomposition, covariance between relatives, heritability, inbreeding and crossbreeding, and response to selection. It also gives an introduction to advanced topics such as: mixed models, BLUP, and QTL mapping; correlated characters; and the multivariate response to selection. Background reading: Lynch, M. and Walsh, B. 1998. Genetics and analysis of quantitative traits. Sinauer Associates. Assumes the material in Modules 1, 5 and 6. Provides a foundation for Modules 11, 12 and 17.
Module 8: Population Genetics and Association Mapping
Instructors: Katie Kerr and Tim Thornton
Topics covered include: basic probability and Mendelian genetics; Hardy-Weinberg equilibrium; inbreeding coefficients; population structure; recombination and genetic linkage; linkage disequilibrium; measures of relatedness; haplotype frequency estimation with unphased genotypes; genetic association testing; association testing in the presence of population structure and/or relatedness. Many concepts are illustrated with public domain software such as R and HAPLOVIEW. Background reading: Weir, B.S. (1996). “Genetic Data Analysis II.” Sinauer Associates; Thornton and McPeek (2010) “ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure.” American Journal of Human Genetics 86:172–184. Assumes the material in Module 1 and serves as the foundation for many later modules. There is an overlap between Modules 6 and 8.
Module 9: Gene Expression Profiling
Instructors: Greg Gibson and Michael Inouye
The gene expression module will cover the theory and application of transcriptomics, including both microarray and RNA-Seq methodologies. The focus of the module is on the statistical basis of hypothesis testing, covering the central role of normalization strategies with the opportunity for students to work examples using the SNM module in R, as well as the fundamentals of ANOVA and the False Discovery Rate procedure. In addition, we will discuss options for downstream processing by clustering and module detection, finishing with expression QTL analysis and integrative genomics to infer pathway structure.
Module 10: MCMC for Genetics
Instructors: Eric Anderson and Matthew Stephens
This module examines the use of Bayesian Statistics and Markov chain Monte Carlo methods in modern analyses of genetic data. It assumes a solid foundation in basic statistics and the concept of likelihood as well as some population genetics. A basic familiarity with the R statistical package, or other computing language, will be helpful. The first day includes an introduction to Bayesian statistics, Monte Carlo, and MCMC. Mathematical concepts covered include expectation, laws of large numbers, and ergodic and time-reversible Markov chains. Algorithms include the Metropolis-Hastings algorithm and Gibbs sampling. Some mathematical detail is given; however, there is considerable emphasis on concepts and practical issues arising in applications. Mathematical ideas are illustrated with simple examples and reinforced with a computer practical using the R statistical language. With that background, two applications of MCMC are investigated in detail: inference of population structure (using the program STRUCTURE) and haplotype inference (using the program PHASE). Computer practicals using both programs are included. Further topics include the use of MCMC in model evaluation and model checking, strategies for assessing MCMC convergence and diagnosing MCMC mixing problems, importance sampling, and Metropolis-coupled MCMC. Software used: R, STRUCTURE, PHASE. Background reading: Shoemaker, J.S., Painter, I.S. and Weir, B.S. (1999). Bayesian statistics in genetics. Trends in Genetics 15:354–358. Beaumont, M.A. and Rannala, B. (2004). The Bayesian revolution in genetics. Nature Reviews Genetics 5:251–261. Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (1996). “Markov Chain Monte Carlo in Practice.” Chapman and Hall.
Module 11: Introduction to QTL Mapping
Instructors: Rebecca Doerge and Zhao-Bang Zeng
This module will systematically introduce statistical methods for mapping quantitative trait loci (QTL) in experimental cross populations. Topics include experimental designs, linkage map construction, single-marker analyses, interval mapping, composite interval mapping and multiple interval mapping. Significance thresholds for genome scans and model selection will also be discussed. Uses the public domain software Windows QTL-Cartographer for computer lab exercises. Emphasis is on procedures for QTL mapping data analysis and appropriate interpretation of mapping results rather than on formulas. Assumes the material in Modules 1, 5 and 7. Material in Modules 6 or 8 would be helpful.
Module 12: Mixed Models in Quantitative Genetics
Instructors: Bill Muir and Bruce Walsh
This module covers the analysis of linear models containing both fixed and random effects. Topics to be discussed include a basic matrix algebra review, the general linear model, derivation of the mixed model, BLUP and REML estimation, estimation and design issues, and Bayesian formulations. Applications to be discussed include estimation of breeding values and genetic variances in general pedigrees, association mapping, genomic selection, direct and associative effects models of general group and kin selection, and genotype by environment interaction models. Background reading: Lynch, M. and B. Walsh. 1998. Genetics and analysis of quantitative traits. Sinauer Associates.
SISMID Module 12: Evolutionary Dynamics and Molecular Epidemiology of Viruses
Instructors: Philippe Lemey and Marc A. Suchard
This module covers the use of phylogenetic and bioinformatic tools to analyze pathogen genetic variation and to gain insight into the processes that shape their diversity. The module focuses on phylogenies and how these relate to population genetic processes in infectious diseases. In particular, the module will cover Bayesian Evolutionary Analysis by Sampling Trees (BEAST). This software will be used in class exercises that are mainly focused on estimating epidemic time scales, reconstructing changes in viral population sizes through time, and inferring the spatial diffusion of viruses. Evolutionary processes including recombination and selection will also be considered. Assumes the material in Module 1. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases.
Module 13: Molecular Phylogenetics
Instructors: Joe Felsenstein, Mark Holder and Jeff Thorne
Overview of methods for analysis of interspecific DNA and protein sequence data. Coverage will include parsimony, maximum likelihood, distance-based, and Bayesian methods for phylogenetic estimation. Probabilistic models for sequence change will be emphasized. Related topics that will be presented are the comparative method, divergence time estimation, phylogenetic hypothesis testing, and detection of positive selection. Statistical methodology will be a focus and some related computational algorithms will be outlined. Brief introductions will be made to software packages such as PAUP*, BEAST, PHYLIP, and MrBayes. Background readings: Felsenstein, J. (2004) “Inferring Phylogenies.” Sinauer Associates, or Yang, Z. (2006) “Computational Molecular Evolution (Oxford Series in Ecology and Evolution).” Oxford University Press. Assumes the material in Modules 1 and 2. Material in Modules 6 or 8 would be helpful.
Module 14: Elements of R for Genetics and Bioinformatics
Instructors: Thomas Lumley and Ken Rice
This module introduces programming skills required for analysis of genetic data, in the R statistical environment. The module assumes prior knowledge of R at the level described in Module 3. We will briefly review how R scripts are built up from interactive commands, and then discuss how to turn these into R programs based on user-defined functions. We will cover R’s debugging system, its tools for enhancing efficiency of code, and its methods for handling errors and warnings. In many genomic applications, R packages already exist to perform specialized statistical and bioinformatic analyses, and we will introduce packages for handling large datasets, and the Bioconductor repository of genomic packages. Users require a laptop and will use it in all sessions.
SISMID Module 14: Introduction to Metagenomic Data Analysis
Instructors: Alexander Alekseyenko and Susan Holmes
This course is concerned with the analysis of microbial community data generated by next-generation sequencing technologies. These high-throughput methods allow for deep surveying of microorganisms inhabiting their biological hosts. We will cover the steps for preprocessing Roche 454 sequencing data that are necessary to produce abundance tables. We will then examine methodology for associating microbial abundance data with experimental factors and outcomes. Programming will be done in R, and an overview of available programs for preprocessing the data will be provided. Prerequisites: Module 1: Probability and Statistical Inference. Co-listed with the Summer Institute in Statistics and Modeling in Infectious Diseases.
Module 15: Coalescent Theory
Instructors: Philip Awadalla and Mary Kuhner
This module is an introduction to the coalescent and its applications to modern population genetics and genomics. Assumes material in Modules 1 and 5. Material in Module 6 or 8 and 13 would be helpful. Derivation and properties of basic coalescent model and extension to include factors such as recombination, geographic structure and natural selection. Use of the coalescent in analyzing data for disease gene mapping, recombination rate estimation, and detection of recent adaptive evolution. Use of coalescent methodologies in large-scale surveys of genetic variation. Applications to standard or next-generation sequencing data for inferences from natural populations and disease cohorts. Use of public domain software.
Module 16: High-Dimensional Omics Data
Instructors: Ali Shojaie and Daniela Witten
In this course, we will present a number of statistical machine learning methods for the analysis of high-dimensional biological data, often referred to as ‘omics.’ Examples include genomic, transcriptomic, metabolomic, proteomic, and other large-scale data sets, typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). In the first part of the course, we will cover supervised learning methods that are useful in the analysis of omics data. These include penalized approaches for performing regression, classification, and survival analysis in the high-dimensional setting. In the second part of the course, we will discuss unsupervised approaches for the analysis of omics data, such as clustering and principal components analysis. Throughout the course, we will highlight the effects of high dimensionality and focus on common pitfalls in the analysis of omics data, and how to avoid them. The techniques discussed will be demonstrated in R. This course assumes a previous course in regression and statistical hypothesis testing, and some familiarity with R or other command line programming languages.
Module 17: Human Complex Traits
Instructors: Mike Goddard and Peter Visscher
This module focuses on the genetics and analysis of quantitative traits in human populations, with emphasis on estimation and prediction analysis using genetic markers. Topics include: the resemblance between relatives for quantitative traits; principles of linkage analysis; estimation of genetic variance associated with genome-wide identity by descent; GWAS for quantitative traits; the use of GWAS data to estimate and partition genetic variation; principles and pitfalls of prediction analyses for quantitative traits using genetic markers. Background reading: Visscher, P.M. et al. (2007). Genome partitioning of genetic variation for height from 11,214 sibling pairs. American Journal of Human Genetics 81:1104–1110; Yang, J. et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42:565–569. This module assumes the material in Modules 1 and 6 or 8. Material in Module 7 would be helpful.
Module 18: Advanced R Programming for Bioinformatics
Instructors: Thomas Lumley and Ken Rice
This module covers techniques needed for software development in R. The module includes advanced graphics, object-oriented programming, SQL database use, some of the Bioconductor infrastructure, writing packages, and calling C code from R. The module is aimed at people who have either substantial R experience or programming experience in other languages; Module 14 alone would not be sufficient preparation. Background reading: Gentleman, R. (2008) R Programming for Bioinformatics. Taylor & Francis. Users require a laptop and will use it in all sessions.
Module 19: Network and Pathway Analyses of Omics Data
Instructors: Alison Motsinger-Reif and Ali Shojaie
Networks represent the interactions among components of biological systems. In the context of high dimensional omics data, relevant networks include gene regulatory networks, protein-protein interaction networks, and metabolic networks. These networks provide a window into biological systems as well as complex diseases, and can be used to understand how biological functions are implemented and how homeostasis is maintained. On the other hand, pathway-based analyses can be used to leverage biological knowledge available from literature, gene ontologies or previous experiments in order to identify the pathways associated with disease or an outcome of interest. In this module, various statistical learning methods for reconstruction and analysis of networks from omics data are discussed, as well as methods of pathway enrichment analysis. Particular attention will be paid to omics datasets with a large number of variables, e.g. genes, and a small number of samples, e.g. patients. The techniques discussed will be demonstrated in R. This course assumes a previous course in regression, previous exposure to the material covered in Module 16, and familiarity with R or other command line programming languages.
Module 20: Animal Genetic Data Analysis
Instructors: Mike Goddard and Michel Georges
Topics include theory of linkage disequilibrium and mapping, population and family-based association techniques for discrete and continuous traits, methods for detecting and accounting for population structure, multiple testing issues, genotyping strategies, and genomic prediction.
Module 21: Statistical Genetics for Forensic Science
Instructor: Bruce Weir
Although the use of genetic profiles is now routine, there are still issues for which care must be taken when attaching statistics to matching profiles. These include the effects of relatives and population structure, the use of lineage markers and the interpretation of mixtures. An introductory account will be given of the use of likelihood ratios, familial searching, the “birthday problem” and Y-STR profiles. A discussion of the use of peak heights and allelic dropout will be given.
Module 22: Beginning Scripting for Biologists
Instructors: Dahlia Nielsen, Christopher Smith and Stephanie Gogarten
This course is designed for people with little or no previous programming experience. Topics include an introduction to the Unix (Linux) command line environment, basic programming logic, and the essentials of the Python scripting language. Examples of tasks that are covered include reading data from an input source, performing queries and mathematical manipulations on these data, and writing results to output files. The course is taught using a number of hands-on exercises. Users require a laptop and will use it in all sessions.