Static Analysis for Data Scientists

Abstract

Big data analytics has revolutionized the world of software development in the past decade. Every day, data scientists write computer programs to clean, manipulate, and visualize data, in order to help us make data-driven decisions. As we rely more and more on data analytics software, we become increasingly vulnerable to programming or technical mistakes. Mistakes that do not cause software failures can have serious consequences, since they give no indication that something went wrong. A simple technical mistake made during data processing caused nearly 16,000 cases of Covid-19 between September 25th and October 2nd, 2020 to go unreported from official figures in the UK. As a consequence, Public Health England was unable to send out the relevant contact-tracing alerts. Mistakes in safety-critical applications can be deadly. In this talk, I will present ongoing work to develop an abstract interpretation-based static analysis framework for data scientists. In particular, I will focus on an analysis that infers necessary conditions on the structure and values of the data read by a data analytics program. The analysis builds on a family of underlying abstract domains, extended to indirectly reason about the input data rather than simply reasoning about the program variables. The choice of these abstract domains is a parameter of the analysis. We describe various instances built from existing abstract domains. We then demonstrate the potential of the approach on a number of representative examples and discuss ongoing efforts to target data analytics using Jupyter notebooks.

Date
Location
🇬🇧 Cambridge, UK (remote)

Lyra