Presentation: Sampling Designs for Resource Efficient Collection of Labeled Data from Electronic Medical Records
Candidate: Wei Ling Katherine Tan, Graduate Student, UW Biostatistics
Committee Members: Patrick Heagerty (Chair), Ruth Etzioni, Jennifer Nelson, Noah Simon, Robert Penfold (GSR)
Abstract: Electronic Medical Records (EMRs) are large databases whose scale has facilitated biomedical research, for example modeling to determine diagnosis, to inform prognosis, and to define subgroups for randomized controlled trials. However, much EMR data, for example free-text radiology reports, is unstructured and therefore not easily accessible for analysis. Collecting labels for unstructured data can increase its value for clinical research. Yet the process of label collection, called annotation or abstraction, is often expensive and time-consuming, and frequently becomes the bottleneck for supervised learning applications. In this talk, we discuss targeted designs for collecting abstracted labels in the context of classification model development, recalibration, and revision. We first discuss a framework for label collection in the single-outcome case, and demonstrate the validity and resource efficiency of such designs. Then, we discuss extensions to the multivariate binary outcome case, motivated by radiology report processing. Finally, we discuss resource-efficient designs for collecting labels for model recalibration and revision when a model is transferred externally to a new setting.