What Programs Want: Automatic Inference of Input Data Specifications


Thanks to advances in machine learning and the availability of massive amounts of data, computer software plays an increasingly important role in assisting with, or even autonomously making, decisions with far-reaching societal impact. Such data-driven decision-making software is generally developed with certain assumptions about its input data (e.g., its format or range of values). When these assumptions remain implicit, they harm software quality and limit reuse: with unclear or missing documentation, software can be used incorrectly and crash after many hours of computation or, worse, produce a plausible but incorrect result. To address this issue, we propose a new static program analysis that automatically infers input data specifications. These specifications represent necessary program preconditions: input data that does not comply with the inferred specification will definitely cause the program to behave incorrectly. We present the challenges we faced in designing and developing the analysis, and we show encouraging preliminary results. We also discuss further applications, such as the automatic synthesis of data cleaning code and the automatic checking and repairing of input data.
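
To make the notion of a necessary precondition concrete, the following is a minimal sketch, not the analysis itself. It shows a hypothetical data-processing function, `average_grade`, whose assumptions about its input are implicit, next to a hand-written checker, `spec_violation`, of the kind of specification one might hope to infer automatically. Both function names, the input layout (a count followed by that many grades), and the range [0, 100] are assumptions made only for this illustration.

```python
def average_grade(lines):
    """Hypothetical program: the assumptions about `lines` are never stated."""
    count = int(lines[0])                  # implicitly assumes an integer header
    total = 0.0
    for i in range(1, count + 1):
        grade = float(lines[i])            # implicitly assumes `count` numeric lines
        if grade < 0 or grade > 100:
            raise ValueError("grade out of range")
        total += grade
    return total / count                   # implicitly assumes count != 0


def spec_violation(lines):
    """Hand-written stand-in for an inferred input specification.

    Returns a reason string when the input is certain to make
    `average_grade` fail, and None otherwise. The check is necessary,
    not sufficient: passing it does not guarantee a meaningful result.
    """
    if not lines:
        return "missing header line"
    try:
        count = int(lines[0])
    except ValueError:
        return "line 1 is not an integer"
    if count == 0:
        return "line 1 must be non-zero"
    if len(lines) < count + 1:
        return f"expected {count} data lines, found {len(lines) - 1}"
    for i in range(1, count + 1):
        try:
            grade = float(lines[i])
        except ValueError:
            return f"line {i + 1} is not a number"
        if grade < 0 or grade > 100:
            return f"line {i + 1} is outside [0, 100]"
    return None
```

Running such a check before the computation reports a malformed input immediately, rather than after the program has already processed part of the data. Note that a negative count passes the check and yields a meaningless average, which illustrates why a necessary precondition alone does not rule out plausible but incorrect results.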

Gran Sasso Science Institute (GSSI), Italy