A new tool makes it easier for database users to perform complex statistical analyses of tabular data without needing to know what is happening behind the scenes.
GenSQL, a generative AI system for databases, can help users make predictions, detect anomalies, infer missing values, correct errors, and generate synthetic data with just a few keystrokes.
For example, if the system were used to analyze medical data from a patient who has always had high blood pressure, it could flag a blood pressure reading that is low for that particular patient but would otherwise fall within the normal range.
GenSQL automatically integrates a tabular dataset with a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.
Additionally, GenSQL can be used to generate and analyze synthetic data that mimics real data in a database. This can be especially useful in situations where sensitive data, such as patient health records, cannot be shared or where real data is scarce.
This new tool was built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.
“Historically, SQL taught the business world what computers could do. They didn’t have to write custom programs; they could just ask questions of the database in a high-level language. We think that as we transition from simply querying data to asking questions of models and data, we’re going to need a similar language that teaches people a consistent set of questions they can ask computers that have probabilistic models of the data,” says Vikash Mansinghka, senior author of the paper introducing GenSQL and a principal research scientist and leader of the Probabilistic Computing Project in MIT’s Department of Brain and Cognitive Sciences.
When researchers compared GenSQL to popular AI-based approaches for data analysis, they found that GenSQL was not only faster, but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, meaning they can be read and edited by users.
“If you look at your data and try to find meaningful patterns using just a few simple statistical rules, you can miss important interactions. Your model needs to capture correlations and dependencies of variables, which can be very complex. With GenSQL, we want to enable many users to query the data and models without needing to know all the details,” adds lead author Mathieu Huot, a research scientist in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.
The paper was co-authored by MIT graduate students Matin Ghavami and Alexander Lew; research scientist Cameron Freer; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, a professor of electrical engineering and computer science and a member of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.
Combining models and databases
SQL stands for Structured Query Language, a programming language for storing and manipulating information in databases. In SQL, people can ask questions about data using keywords, for example to sum, filter, or group database records.
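As a minimal sketch of that kind of question, the snippet below runs an ordinary SQL query that filters, groups, and sums records. It uses Python's built-in sqlite3 module rather than GenSQL, and the table, column names, and salary figures are invented for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table of developers (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (city TEXT, language TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO developers VALUES (?, ?, ?)",
    [
        ("Seattle", "Rust", 150),
        ("Seattle", "Python", 130),
        ("Boston", "Rust", 140),
        ("Boston", "Python", 120),
        ("Seattle", "Rust", 160),
    ],
)

# Filter (WHERE), group (GROUP BY), and sum (SUM) in a single declarative query.
rows = conn.execute(
    """
    SELECT city, SUM(salary)
    FROM developers
    WHERE language = 'Rust'
    GROUP BY city
    ORDER BY city
    """
).fetchall()

for city, total_salary in rows:
    print(city, total_salary)
```

A query like this describes only the records already stored in the table.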
However, querying a model can provide deeper insights, because a model can capture what the data imply for an individual. For example, a female developer who wonders whether she is underpaid is likely to be more interested in what salary data mean for her personally than in trends across database records.
The researchers found that SQL provides no effective way to incorporate probabilistic AI models, while approaches that use probabilistic models to make inferences do not support complex database queries.
They built GenSQL to fill this gap, allowing you to query both datasets and probabilistic models using a simple, yet powerful, formal programming language.
GenSQL users upload their data and a probabilistic model, and the system automatically integrates them. They can then run queries on the data that also draw on the probabilistic model running behind the scenes. This allows for more complex queries and more accurate answers.
For example, a query in GenSQL might be something like, “How likely is it that a developer in Seattle knows the programming language Rust?” If you only look at correlations between columns in your database, you might miss subtle dependencies. Incorporating probabilistic models can capture more complex interactions.
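To make the question concrete, here is a rough sketch that answers it with a very simple stand-in model: a smoothed frequency estimate computed from a toy table. This is not GenSQL syntax and not the kind of model GenSQL uses; the function name, column names, and records are all hypothetical.

```python
from fractions import Fraction

# Hypothetical toy table; the column names and values are made up.
developers = [
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": True},
    {"city": "Seattle", "knows_rust": False},
    {"city": "Boston", "knows_rust": False},
    {"city": "Boston", "knows_rust": True},
]


def conditional_probability(rows, target, given, alpha=1):
    """Estimate P(target | given) for a yes/no target with add-alpha smoothing."""
    # Keep only the rows that satisfy the condition, e.g. city == "Seattle".
    matching = [r for r in rows if all(r[k] == v for k, v in given.items())]
    # Count how many of those rows also satisfy the target.
    hits = sum(1 for r in matching if all(r[k] == v for k, v in target.items()))
    # Two possible outcomes (target holds or not), hence 2 * alpha below.
    return Fraction(hits + alpha, len(matching) + 2 * alpha)


p_rust_in_seattle = conditional_probability(
    developers, target={"knows_rust": True}, given={"city": "Seattle"}
)
print(p_rust_in_seattle, float(p_rust_in_seattle))
```

An estimate like this conditions on a single column. A richer probabilistic model could also account for interactions among many columns at once, which is the gap the article describes.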
Moreover, the probabilistic models that GenSQL utilizes are auditable, so people can see what data the models use to make decisions. They also provide a measure of calibrated uncertainty along with each answer.
For example, with this calibrated uncertainty, if you query the model for the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL will tell you that it is uncertain, and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
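The intuition that fewer records should mean a less certain answer can be sketched with a textbook Beta-Binomial model. This illustrates the general idea only, under the assumption of a uniform prior; it is not the method GenSQL uses, and the response counts are invented.

```python
import math


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return the mean and standard deviation of a Beta posterior over a rate."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Hypothetical treatment-response counts for two patient groups.
large_mean, large_sd = beta_posterior_summary(successes=120, trials=200)
small_mean, small_sd = beta_posterior_summary(successes=3, trials=5)

print(f"well-represented group: {large_mean:.2f} +/- {large_sd:.2f}")
print(f"underrepresented group: {small_mean:.2f} +/- {small_sd:.2f}")
```

The two groups have similar estimated response rates, but the spread for the five-patient group is several times wider, so an answer based on it should be reported as far less certain.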
Faster and more accurate results
To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was 1.7 to 6.8 times faster than these approaches, executing most queries in milliseconds while providing more accurate results.
They also applied GenSQL to two case studies: one where the system identified mislabeled clinical trial data, and another where it generated accurate synthetic data that captured the complex relationships in genomics.
Next, the researchers want to apply GenSQL more broadly to conduct large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw inferences about things like health and income while controlling what information is used in the analysis.
They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long term, the researchers want to enable users to make natural language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that grounds its answers in GenSQL queries.
This research was supported in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Endowment.