Speaker: Katherine Tan, Graduate Student, UW Biostatistics
Abstract: Statistical classification of medical outcomes using unstructured free-text data requires the collection of a training set, a sample for which the true outcome statuses (case: Y=1; control: Y=0) are hand-labeled by human medical experts. The process of label collection, also known as annotation, is often expensive and time-consuming, and frequently becomes the bottleneck in supervised prediction applications. In this talk, we present surrogate-guided sampling designs, a sampling framework that reduces the annotation burden when labeling training sets for rare outcomes. Our method is based on stratified sampling on variables we call enrichment surrogates. Enrichment surrogates are readily available data from Electronic Medical Record (EMR) databases, such as keyword string matches or aggregates of International Classification of Diseases (ICD) codes related to the target outcome. We demonstrate that training sets collected with surrogate-guided sampling designs allow for valid model development and yield better model prediction than training sets created with simple random sampling at the same annotation cost. We discuss considerations for surrogate selection, and show an application using free-text radiology reports.
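To make the idea of stratifying on an enrichment surrogate concrete, here is a minimal Python sketch on simulated data. It is an illustration only and not the speaker's actual design: the prevalence, the surrogate's sensitivity and specificity, the 50/50 allocation between strata, and all function names are hypothetical choices made for this example. It compares how many cases an annotator would encounter under simple random sampling versus sampling stratified on a binary surrogate (for instance, a keyword match), and attaches inverse-probability weights so that population quantities can still be estimated from the enriched sample.

```python
import random


def simulate_population(n, prevalence, sensitivity, specificity, rng):
    """Return a list of (y, s): true outcome y and binary surrogate s.

    All parameter values used below are hypothetical, chosen for illustration.
    """
    population = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        if y == 1:
            s = 1 if rng.random() < sensitivity else 0
        else:
            s = 0 if rng.random() < specificity else 1
        population.append((y, s))
    return population


def simple_random_sample(population, budget, rng):
    """Draw `budget` records uniformly at random, without replacement."""
    return rng.sample(population, budget)


def surrogate_guided_sample(population, budget, frac_positive, rng):
    """Stratify on the surrogate and oversample the surrogate-positive stratum.

    Returns (y, s, weight) triples, where weight is the inverse of the
    within-stratum sampling probability.
    """
    pos = [rec for rec in population if rec[1] == 1]
    neg = [rec for rec in population if rec[1] == 0]
    n_pos = min(len(pos), round(budget * frac_positive))
    n_neg = min(len(neg), budget - n_pos)
    sample = []
    for stratum, k in ((pos, n_pos), (neg, n_neg)):
        if k == 0:
            continue
        weight = len(stratum) / k
        for y, s in rng.sample(stratum, k):
            sample.append((y, s, weight))
    return sample


def weighted_prevalence(sample):
    """Inverse-probability-weighted estimate of the outcome prevalence."""
    total_weight = sum(w for _, _, w in sample)
    return sum(y * w for y, _, w in sample) / total_weight


rng = random.Random(0)
population = simulate_population(
    n=20000, prevalence=0.03, sensitivity=0.8, specificity=0.9, rng=rng
)
budget = 400  # number of records the experts can afford to label

srs = simple_random_sample(population, budget, rng)
sgs = surrogate_guided_sample(population, budget, frac_positive=0.5, rng=rng)

cases_srs = sum(y for y, _ in srs)
cases_sgs = sum(y for y, _, _ in sgs)
true_prev = sum(y for y, _ in population) / len(population)
est_prev = weighted_prevalence(sgs)

print("cases labeled under simple random sampling:", cases_srs)
print("cases labeled under surrogate-guided sampling:", cases_sgs)
print("true prevalence:", round(true_prev, 4))
print("weighted prevalence estimate from the enriched sample:", round(est_prev, 4))
```

With a rare outcome and a reasonably informative surrogate, the enriched sample should contain noticeably more cases for the same labeling budget. The weights are one common way to correct for the deliberate oversampling; whether the talk uses this particular correction is not stated in the abstract.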