Speaker: Brian Williamson, Graduate Student, UW Biostatistics
Abstract: Assessing the relative contribution of subsets of features toward predicting the response is often of interest in predictive modeling applications. Simple population models are commonly used because the associated variable importance measure is easy to interpret; however, the resulting estimates may be misleading if the model is overly simplistic. To improve prediction performance, complex prediction algorithms are used instead, but in these cases variable importance is typically defined as a function of the algorithm rather than as a summary of the population, making importance less interpretable and harder to compare across algorithms. There is thus a natural tension between prediction performance and interpretability when defining variable importance. To resolve this tension, it is useful to distinguish between the population mechanism that gave rise to the data and the algorithm that makes predictions based on the data. In this dissertation, we study variable importance measures that may be used with any prediction technique and whose interpretation is agnostic to the technique used. Specifically, we define variable importance as the contrast between the predictiveness of the best possible prediction function based on all available features and that of the best possible prediction function based on all features except those under consideration. We discuss general conditions under which a simple estimator of this importance is nonparametric efficient and yields valid confidence intervals, even when machine learning techniques are used in its construction. We also propose a valid strategy for hypothesis testing. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
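The core idea, contrasting the predictiveness of the best prediction function with and without the features of interest, can be sketched in a few lines. The sketch below is a naive plug-in estimate on simulated data, using scikit-learn's gradient boosting as a stand-in for an arbitrary machine learning technique and test-set R-squared as the predictiveness measure; it omits the efficient correction, confidence intervals, and hypothesis test that the abstract describes, and all data and function names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Simulated data: y depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def predictiveness(cols):
    """Test-set R^2 of a flexible regression fit on the given feature columns."""
    model = GradientBoostingRegressor(random_state=0).fit(X_tr[:, cols], y_tr)
    return r2_score(y_te, model.predict(X_te[:, cols]))

# Importance of feature j = predictiveness with all features
# minus predictiveness with all features except j.
full = predictiveness([0, 1, 2])
importance = {j: full - predictiveness([k for k in range(3) if k != j])
              for j in range(3)}
for j, vi in importance.items():
    print(f"importance of feature {j}: {vi:.3f}")
```

A plug-in difference like this inherits the bias of the fitted algorithm, which is exactly why the dissertation studies conditions under which a corrected estimator is nonparametric efficient and supports valid inference.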