An Abstract Interpretation-Based Data Leakage Static Analysis

Abstract

Data leakage is a well-known problem in machine learning that occurs when the training and testing datasets are not independent. This phenomenon leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications. In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
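For illustration only (this sketch is not taken from the paper), the following Python snippet shows the kind of notebook pattern such an analysis targets: preprocessing fit on the full dataset before splitting lets test-set statistics flow into the training data, so the two sets are no longer independent.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for a Kaggle dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler is fit on ALL rows, so test-set statistics
# (mean, variance) contaminate the training data.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Leak-free: split first, then fit preprocessing on the training
# split only and apply the fitted transform to the test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```

A static analysis for leakage must distinguish these two dataflows: in the first, a value derived from the test rows reaches the training pipeline; in the second, it does not.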

Publication
CoRR abs/2211.16073
Date
November 2022