To train more powerful large language models, researchers use massive dataset collections that blend diverse data from thousands of web sources.
However, as these datasets are combined and recombined into new collections, important information about their origins and about restrictions on how they can be used is often lost or obscured.
This not only raises legal and ethical issues, but can also hurt a model’s performance. For example, if a dataset is mislabeled, someone training a machine learning model for a particular task may unknowingly use data that was not designed for that task.
Moreover, data from unknown sources may contain biases that lead to unfair predictions when the model is deployed.
To improve data transparency, a multidisciplinary team of researchers from MIT and elsewhere began a systematic audit of more than 1,800 text datasets from popular hosting sites. They found that more than 70 percent of these datasets were missing some licensing information, and about 50 percent contained licensing information with errors.
Based on these insights, they developed a user-friendly tool called Data Provenance Explorer, which automatically generates an easy-to-read summary of a dataset’s creator, source, license, and permitted uses.
“These types of tools can help regulators and practitioners make informed decisions about AI deployment and promote responsible development of AI,” says Alex “Sandy” Pentland, an MIT professor, leader of the Human Dynamics Group at the MIT Media Lab, and co-author of a new open-access paper on the project.
Data Provenance Explorer can help AI practitioners build more effective models by allowing them to select training data sets that are appropriate for the model’s intended purpose. In the long run, this can improve the accuracy of AI models in real-world situations, such as assessing loan applications or responding to customer inquiries.
“One of the best ways to understand the capabilities and limitations of an AI model is to understand what data it was trained on. When there is misattribution and confusion about data provenance, serious transparency issues arise,” says Robert Mahari, a graduate student in MIT’s Human Dynamics Group, a JD candidate at Harvard Law School, and co-lead author of the paper.
Mahari and Pentland worked on the paper with Media Lab graduate student and co-author Shayne Longpre, Sara Hooker, who leads the AI lab Cohere, and other researchers from MIT, the University of California, Irvine, the University of Lille in France, the University of Colorado, Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research was published today in Nature Machine Intelligence.
Focus on fine-tuning
Researchers often use a technique called fine-tuning to improve the performance of large language models deployed for a specific task, such as question answering. In fine-tuning, they carefully build a curated dataset designed to improve the model’s performance on that one task.
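To make the idea concrete, here is a minimal fine-tuning sketch (not taken from the paper) using the Hugging Face transformers and datasets libraries; the model name, dataset, and hyperparameters are placeholders chosen purely for illustration, standing in for whichever task-specific, properly licensed data a practitioner would actually curate.

```python
# Minimal fine-tuning sketch: adapt a small pretrained model to one downstream
# task using a curated dataset. All choices below are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # small pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A curated, task-specific dataset; its licensing metadata should be checked first.
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # Convert raw text into model inputs.
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
```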
The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic institutions, or companies and then licensed for specific uses.
When crowdsourcing platforms aggregate these datasets into larger collections that practitioners can use for fine-tuning, some of the original licensing information is often left behind.
“These licenses ought to matter, and they should be enforceable,” Mahari says.
For example, if a dataset’s licensing terms are incorrect or missing, someone could spend a great deal of time and money developing a model, only to be forced to take it down later because some of the training data contained private information.
Longpre adds that “people can end up training models without even understanding the capabilities, concerns, or risks of those models, which ultimately stem from the data.”
To begin this study, the researchers formally defined data provenance as a combination of the sourcing, creation, and licensing heritage of a dataset and its characteristics. From there, they developed a structured audit procedure to track the data provenance of a collection of over 1,800 text datasets from popular online repositories.
After discovering that more than 70 percent of these datasets carried “unspecified” licenses that omitted a great deal of information, the researchers worked backward to fill in the gaps. This effort reduced the share of datasets with “unspecified” licenses to about 30 percent.
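As a rough illustration of what such an audit involves, the sketch below shows one way a dataset’s provenance record could be represented and summarized in Python; the fields and example entries are hypothetical, not the schema used in the paper.

```python
# A minimal sketch of provenance records and a license audit summary.
# Field names and example entries are hypothetical, not the paper's schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProvenanceRecord:
    name: str               # dataset identifier
    source: str             # where the text originally came from
    creator: str            # who built the dataset
    license: Optional[str]  # license stated by the original creators, if known

records = [
    ProvenanceRecord("example-qa", "web forums", "university lab", "CC BY 4.0"),
    ProvenanceRecord("example-chat", "crowd workers", "startup", None),
]

# Share of datasets whose license could not be determined ("unspecified").
unspecified = sum(1 for r in records if r.license is None)
print(f"Unspecified licenses: {unspecified / len(records):.0%}")
```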
Their research also showed that the correct license is often more restrictive than the license assigned by the repository.
They also found that dataset creators were concentrated almost entirely in the Global North, which could limit a model’s capabilities when it is trained for deployment in another region. For example, a Turkish-language dataset created primarily by people in the U.S. and China might not capture culturally significant aspects, Mahari explains.
“We’re almost fooling ourselves into thinking that the dataset is more diverse than it really is,” he says.
Interestingly, researchers saw a sharp increase in restrictions on datasets created in 2023 and 2024, which may be due to concerns in the academic community that datasets could be used for unintended commercial purposes.
An easy-to-use tool
To help others obtain this information without a manual audit, the researchers built the Data Provenance Explorer. In addition to letting users sort and filter datasets by specific criteria, the tool allows them to download a data provenance card that provides a concise, structured overview of a dataset’s characteristics.
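The snippet below is an illustrative sketch of the kind of filtering and summary such a tool performs; it is not the Data Provenance Explorer’s actual interface, and the metadata fields and criteria are invented for the example.

```python
# Illustrative only: filter hypothetical dataset metadata and print a short
# provenance card. This is not the Data Provenance Explorer's real API.
def filter_records(records, commercial_use=False, languages=None):
    """Keep datasets whose metadata matches the practitioner's criteria."""
    kept = []
    for r in records:
        if commercial_use and r["license_use"] == "non-commercial":
            continue  # drop datasets that forbid commercial use
        if languages and not set(languages) & set(r["languages"]):
            continue  # drop datasets with no overlap in language coverage
        kept.append(r)
    return kept

def provenance_card(r):
    """Render a concise, structured summary of one dataset's provenance."""
    return (f"Dataset:   {r['name']}\n"
            f"Creator:   {r['creator']}\n"
            f"Source:    {r['source']}\n"
            f"License:   {r['license']} ({r['license_use']})\n"
            f"Languages: {', '.join(r['languages'])}")

records = [{
    "name": "example-instructions", "creator": "university lab",
    "source": "crowd workers", "license": "CC BY-NC 4.0",
    "license_use": "non-commercial", "languages": ["en", "tr"],
}]

for r in filter_records(records, commercial_use=False, languages=["tr"]):
    print(provenance_card(r))
```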
“We hope this is not only a step toward understanding the landscape, but also helps people make more informed choices about what data to train on going forward,” says Mahari.
In the future, the researchers plan to expand their analysis to investigate data provenance for multimodal data, including video and audio. They also want to study how the terms of service of the websites that serve as data sources are reflected in datasets.
As they expand the scope of their research, they are also reaching out to regulators to discuss their findings and the unique copyright implications of fine-tuning data.
“When people create these datasets and make them public, there needs to be data provenance and transparency from the beginning, so that others can more easily access these insights,” Longpre says.
“Many proposed policy interventions assume that data can be properly assigned and identified as having licenses associated with it, but this work first shows that this is not the case, and then significantly improves the available provenance information,” says Stella Biderman, executive director of EleutherAI, who was not involved in the work. “Section 3 also includes a relevant legal discussion, which will be invaluable to machine learning practitioners outside of companies large enough to have dedicated legal teams. Many people who want to build AI systems for the public good are currently quietly struggling to figure out how to handle data licensing, because the internet was not designed in a way that makes it easy to discover data provenance.”