# Amit Moscovich: The perils of data preprocessing prior to cross-validation

**Time:**
Tue 2023-04-11 14.00 - 15.00

**Location:**
KTH, 3721 (Lindstedtsvägen 25)

**Participating:**
Amit Moscovich, Department of Statistics and OR, School of Mathematical Sciences, Tel Aviv University

### Abstract

Regression and classification methods are typically evaluated by cross-validation: repeatedly splitting the data into a training set and a validation set, learning a predictive model on the training set, then averaging its loss on the validation set. Under the i.i.d. assumption, this gives an unbiased and consistent estimator of the risk of the trained model. In practice, however, many data sets go through various stages of preprocessing, such as rescaling, dimensionality reduction, and outlier removal. Such "unsupervised" preprocessing procedures, which do not involve the responses or class labels, are often considered harmless. Yet they introduce a subtle leakage of information from the validation set to the trained model. This breaks the assumptions of cross-validation, potentially leading to biased risk estimates and sub-optimal model selection. In this talk, we make the case that this subtle error deserves more attention: it is prevalent in scientific research, potentially harmful, and typically easy to fix. We will explain where the bias comes from, how to eliminate it, and walk through several toy examples demonstrating an intricate dependency between the parameters of the problem and the resulting bias due to preprocessing.
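As a minimal sketch of the pitfall the abstract describes, the snippet below contrasts a "leaky" pipeline, where a scaler is fit on the full data before cross-validation, with the standard fix of refitting the preprocessing step inside each training fold. The use of scikit-learn, the synthetic data, and the choice of `StandardScaler` and logistic regression are illustrative assumptions, not the specific examples from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data (not from the talk).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Leaky: the scaler is fit on ALL rows, so every training fold has
# already "seen" the mean/variance of its validation fold.
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Correct: the scaler is refit from scratch on each training fold only,
# because the whole pipeline is cross-validated as one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky CV accuracy:", leaky_scores.mean())
print("clean CV accuracy:", clean_scores.mean())
```

For simple rescaling on well-behaved data the two estimates may differ only slightly; the talk's point is that the size and direction of the bias depend intricately on the problem's parameters, so the pipeline form is the safe default.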