This Science News Wire page contains a press release issued by an organization and is provided to you "as is" with little or no review from Science X staff.

Theoretical Framework for LLM Data Markets Addresses Current Ethical, Societal Challenges

February 23rd, 2026

Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing, sourcing data from marginalized groups with little pay or recognition. In a new study, researchers introduce a theoretical framework for LLM data markets with a pricing mechanism that quantifies the contribution of each set of data.

The study, by researchers at Carnegie Mellon University, is published in the proceedings of NeurIPS 2025, the 39th Conference on Neural Information Processing Systems.

"By addressing the ethical and societal challenges of the current data market, our framework offers a concrete path toward fair, transparent, and economically sustainable data markets for LLMs," says Beibei Li, professor of technology and management at Carnegie Mellon's Heinz College, who coauthored the study. "Using data valuation, economics of market, and game theory, our framework serves as a guide for designing sustainable data procurement practices in public and private sectors."

High-quality training data is foundational to building effective and reliable LLMs. As LLMs take on increasingly complex tasks, including coding, reasoning, and AI4Science, they rely heavily on carefully curated data that is annotated by humans. As a result of this growing demand, major tech companies are racing to acquire training data, fueling the rise of a nascent AI data market. In this market, AI firms create networks of short-term contract workers to generate data labels, resembling an Uber-like gig economy for data.

However, the current AI data market operates with limited oversight and has been widely criticized for a lack of transparency and fairness in pricing, and for undervaluing the labor of human annotators and content creators. These harms are concentrated in low-wage labor markets, where annotators often experience overwork, underpayment, and exclusion from decision-making. This reflects a broader ethical concern known as AI parachuting, where developers extract data from marginalized communities.

Motivated by these issues, in this study, researchers developed a fair pricing framework for the LLM training data market to promote equitable and sustainable generative AI ecosystems. Guided by economic theory, which suggests that prices should reflect the value delivered to the buyer, they developed fairshare pricing based on established data-valuation techniques for LLMs, which quantify each dataset's contribution to model performance. Fairshare pricing offers clear advantages over existing methods, the study found:

  • Existing exploitative pricing leads to a lose-lose outcome for the data market: For data buyers, underpaying data sellers cuts costs in the short term but drives sellers away, shrinking the supply of high-quality training data, weakening the data pipeline, and limiting model improvement, even as investments grow. In contrast, fairshare pricing leads to a win-win outcome: Sellers maximize profit while remaining engaged and buyers secure long-term utility by maintaining access to high-quality data.
  • In simulations of buyer-seller interactions in data markets, under fairshare pricing, buyers achieved higher levels of model performance per dollar spent, benefiting those with limited budgets. In simulations of long-term market dynamics, fairshare pricing encouraged sustained seller participation, resulting in a stable and sufficient supply of training data over time compared to exploitative pricing.
  • In an assessment using a diverse set of data-valuation methods, fairshare pricing consistently delivered beneficial results for both buyers and sellers in the LLM data market, confirming that its performance was not tied to any specific data-valuation technique.

The proposed framework offers actionable insights for policymakers and regulators aiming to ensure fairness and transparency in LLM training data markets, say the authors. By fostering fair market access, it also empowers small businesses and startups, which can lead to more equitable technological advancements.

"A key strength of our approach is that buyers and sellers can flexibly choose valuation methods tailored to their downstream needs without compromising the incentive-aligned structure of the market, which demonstrates the broad applicability of our solution," explains Luyang Zhang, a Ph.D. student in statistics at Carnegie Mellon's Heinz College, who coauthored the study.

In addition to Li and Zhang, the study's other co-authors are also scholars at Carnegie Mellon: Cathy Jiao, a Ph.D. student, and Chenyan Xiong, associate professor, both at the Language Technologies Institute in Carnegie Mellon's School of Computer Science.

More information:
Summarized from an article in NeurIPS 2025, "Fairshare Data Pricing via Data Valuation for Large Language Models," by L. Zhang, C. Jiao, B. Li, and C. Xiong (all of Carnegie Mellon University). Copyright 2025. All rights reserved.

Provided by Carnegie Mellon University's Heinz College

Citation: Theoretical Framework for LLM Data Markets Addresses Current Ethical, Societal Challenges (2026, February 23) retrieved 23 February 2026 from https://sciencex.com/wire-news/533306007/theoretical-framework-for-llm-data-markets-addresses-current-eth.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.