Speaker: Katherine Tan, Graduate Student, UW Biostatistics
Abstract: In leveraging data derived from large-scale Electronic Medical Record (EMR) systems for research, an important first step is the accurate identification of key clinical outcomes. Some outcomes are recorded in structured data, while others must be derived or predicted from both structured and unstructured data. Statistical classification of clinical outcomes derived from unstructured data requires a training set: a sample in which the actual binary outcomes are abstracted and labeled by human medical experts. When the outcome is rare, simple random sampling (SRS) for abstraction yields very few cases for classifier development, yet additional abstraction is often expensive and time-consuming. In this talk, I discuss sampling designs for outcome label collection and subsequent machine learning that target the rare-outcome scenario. The proposed designs result in samples that are amenable to valid analysis and are more resource-efficient, requiring a smaller sample size than conventional SRS to meet the same modeling goals. I first introduce surrogate-guided sampling (SGS) designs, a stratified sampling procedure based on the values of enrichment surrogates, which are summaries of keywords or International Classification of Diseases (ICD) codes related to the clinical outcome of interest. Next, motivated by radiology reports with multiple co-occurring findings, I discuss extensions to the multi-label setting. Finally, for the scenario where a previously developed “source” model is to be validated and adapted to a new setting, I discuss sampling designs for collecting such “new” outcome labels.
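
As a rough illustration of the stratified idea behind SGS, the sketch below splits records by a binary enrichment surrogate and draws a fixed number of records from each stratum for expert labeling. It is not the speaker's implementation: the function name, the binary form of the surrogate, the stratum sample sizes, and the use of inverse sampling-probability weights for later analysis are all assumptions made for this example.

```python
import random


def surrogate_guided_sample(surrogate, n_pos, n_neg, seed=0):
    """Sample records for labeling, stratified by a binary surrogate.

    surrogate: list of 0/1 flags, one per record (hypothetical: 1 if the
               record contains an outcome-related keyword or ICD code).
    n_pos, n_neg: number of records to draw from each stratum.
    Returns (indices, weights), where each weight is the inverse of the
    record's sampling probability (an assumed analysis choice).
    """
    rng = random.Random(seed)
    pos = [i for i, s in enumerate(surrogate) if s == 1]
    neg = [i for i, s in enumerate(surrogate) if s == 0]

    # Never request more records than a stratum contains.
    n_pos = min(n_pos, len(pos))
    n_neg = min(n_neg, len(neg))

    indices, weights = [], []
    for stratum, n in ((pos, n_pos), (neg, n_neg)):
        if n == 0:
            continue  # empty stratum or nothing requested from it
        indices.extend(rng.sample(stratum, n))  # SRS within the stratum
        weights.extend([len(stratum) / n] * n)  # stratum size / sample size
    return indices, weights


# Simulated cohort: 500 of 10,000 records are surrogate-positive.
surrogate = [1] * 500 + [0] * 9500
idx, w = surrogate_guided_sample(surrogate, n_pos=100, n_neg=100)
```

In this simulated cohort, half of the 200 records sent for abstraction are surrogate-positive, whereas SRS of the same size would be expected to include about 10. If the surrogate is associated with the outcome, the sample is enriched for cases, and the weights allow the oversampling to be accounted for in analysis.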