logo
welcome
MIT News

MIT News

Study: Transparency is often lacking in datasets used to train large language models

MIT News
Summary
Nutrition label

87% Informative

MIT researchers trace the data provenance of more than 1,800 text datasets.

They found that 70 percent of these datasets omitted some licensing information, while about 50 percent had errors.

The Data Provenance Explorer could help AI practitioners build more effective models by enabling them to select training datasets that fit their model’s intended purpose.

Researchers saw a spike in restrictions placed on datasets created in 2023 and 2024 .

This might be driven by concerns that their datasets could be used for unintended commercial purposes.

To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer .

The tool allows users to download a data provenance card that provides a structured overview of dataset characteristics.

VR Score

93

Informative language

96

Neutral language

51

Article tone

informal

Language

English

Language complexity

70

Offensive language

not offensive

Hate speech

not hateful

Attention-grabbing headline

not detected

Known propaganda techniques

not detected

Time-value

long-living

Affiliate links

no affiliate links