
The value of your data is not token-worth


The fallacy of token-based data valuation in AI

In the context of generative AI, data contribution is reduced to a simplistic, mechanistic calculation based on the number of ‘tokens’ ingested by a model. The valuation is brutally simple: How many words? How many tokens? Nothing more.

Under this token-counting regime, all inputs are treated as functionally equivalent. The AI model doesn’t distinguish between scholarly depth and casual conversation, verified scientific research and unverified personal opinions, peer-reviewed content and crowd-sourced information, original intellectual work and derivative commentary. This means a meticulously researched PhD dissertation — representing years of rigorous academic work, original research, and deep intellectual contribution — is valued identically to a random collection of social media posts, internet comments, or viral tweets.
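To make the problem concrete, here is a minimal sketch of what such a token-counting valuation amounts to in code. The tokenizer and the per-token rate are hypothetical placeholders (real pipelines use subword tokenizers), but the logic is the point: the only input to the price is length.

```python
# A minimal sketch of pure token-count valuation.
# The whitespace tokenizer and the flat rate are hypothetical placeholders;
# production systems use subword tokenizers (e.g. BPE), but the point stands:
# the only signal that enters the valuation is the token count.

def count_tokens(text: str) -> int:
    """Naive whitespace tokenizer, standing in for a real one."""
    return len(text.split())

def token_based_value(text: str, rate_per_token: float = 0.0001) -> float:
    """Value = number of tokens x flat rate. Quality never enters."""
    return count_tokens(text) * rate_per_token

dissertation_excerpt = "Years of original research distilled into a rigorous, verifiable argument ..."
viral_tweet = "hot take: pineapple on pizza is actually elite, no debate ..."

# Both inputs are priced by length alone, so depth and originality are invisible.
print(token_based_value(dissertation_excerpt), token_based_value(viral_tweet))
```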

This approach not only devalues intellectual labor but actively disincentivizes the creation or contribution of high-quality, nuanced content. Why invest years in rigorous research when a few thousand viral tweets can contribute the same “value” to an AI model?

To create more advanced and sophisticated LLMs, we must move away from consuming massive quantities of undifferentiated text and instead focus on meaningful, high-quality contributions that genuinely advance knowledge and understanding. This requires AI model trainers to recognize that a single, ingeniously crafted dataset can be exponentially more valuable than massive volumes of generic information, and that unique perspectives, creative insights, and specialized knowledge deserve proportional recognition.

Beyond tokenization: The true value of data

The fundamental premise is straightforward yet profound: the value of your data should not be token-based. It’s not about how many units of information you provide, but the depth, originality, and transformative potential of that data.

While the law is lagging behind, technological solutions can and should create transparent mechanisms for attributing and rewarding these meaningful contributions. In particular, with AI-RM (AI Rights Management), the old paradigm of treating data as a quantitative commodity, where tokens or data points are simply counted, can give way to a more nuanced, qualitative approach. Under this approach, data is evaluated through a multidimensional lens, based on qualitative factors such as the originality of insight, the depth of analysis, the verifiability of information, the expertise of the source, and the unique perspective contributed.
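One possible way to operationalize such a multidimensional lens is a weighted quality score that scales the payout. The sketch below is only an illustration of that idea, assuming the factors listed above are scored in [0, 1] (by reviewers, provenance checks, or automated heuristics); the factor names and weights are assumptions, not part of any published AI-RM specification.

```python
# Illustrative multidimensional valuation: quality scales the payout,
# so volume alone no longer determines value. Weights are assumptions.

QUALITY_WEIGHTS = {
    "originality": 0.30,
    "depth": 0.25,
    "verifiability": 0.20,
    "source_expertise": 0.15,
    "unique_perspective": 0.10,
}

def quality_score(factors: dict[str, float]) -> float:
    """Weighted sum of qualitative factors, each expected in [0, 1]."""
    return sum(QUALITY_WEIGHTS[name] * factors.get(name, 0.0) for name in QUALITY_WEIGHTS)

def contribution_value(token_count: int, factors: dict[str, float],
                       base_rate: float = 0.0001) -> float:
    """Token count still matters, but quality multiplies it."""
    return token_count * base_rate * quality_score(factors)

dissertation = contribution_value(80_000, {"originality": 0.9, "depth": 0.95,
                                           "verifiability": 0.9, "source_expertise": 0.9,
                                           "unique_perspective": 0.8})
tweets = contribution_value(80_000, {"originality": 0.2, "depth": 0.05,
                                     "verifiability": 0.1, "source_expertise": 0.1,
                                     "unique_perspective": 0.3})
print(dissertation, tweets)  # same token count, very different value
```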

AI-RM represents a holistic approach to managing rights in the AI ecosystem. In addition to providing legal and technological tools to protect intellectual property, AI-RM also provides automated attribution systems, with transparent tracking of data provenance, as well as dynamic reward mechanisms that go beyond simple volume-based compensation.
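As a rough illustration, an attribution system of this kind would need to record, at ingestion time, who contributed what, under which terms, and with which quality assessment. The schema below is a hypothetical sketch of such a provenance record, not an AI-RM standard.

```python
# Hypothetical sketch of a provenance-tracked attribution record created
# before a contribution enters training. The field names are illustrative
# assumptions; AI-RM does not prescribe a specific data format here.

from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

@dataclass
class ProvenanceRecord:
    contributor_id: str
    content_hash: str                 # fingerprint of the contributed work
    license_terms: str                # e.g. "attribution-required"
    ingested_at: str                  # UTC timestamp of ingestion
    quality_factors: dict = field(default_factory=dict)

def register_contribution(contributor_id: str, content: bytes, license_terms: str,
                          quality_factors: dict) -> ProvenanceRecord:
    """Create a tamper-evident record of who contributed what, and under which terms."""
    return ProvenanceRecord(
        contributor_id=contributor_id,
        content_hash=hashlib.sha256(content).hexdigest(),
        license_terms=license_terms,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        quality_factors=quality_factors,
    )

record = register_contribution("author-001", b"full text of the contributed work",
                               "attribution-required", {"originality": 0.9})
print(record.content_hash[:16], record.ingested_at)
```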

Microsoft’s attempts at data attribution

Microsoft is pioneering a research project based on the concept of “training-time provenance”, which could constitute an important breakthrough in the way we acknowledge the human creativity behind artificial intelligence.

At the heart of Microsoft’s initiative is a revolutionary concept: tracking how specific pieces of data — whether photographs, written works, or artistic creations — directly influence the final output of an AI system. This goes far beyond simple token counting. As detailed in a LinkedIn job listing for a research intern, the goal is to develop a sophisticated method of estimating the precise contributions of various data sources.
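Microsoft has not published the details of this method, so the sketch below is only a toy illustration of the general idea: however per-source influence is estimated (influence functions, data Shapley values, or ablation studies are common candidates), the scores can be normalized into credit shares and used to split a reward pool. All names and numbers are hypothetical.

```python
# Toy illustration of contribution-share estimation from per-source
# influence scores. The scores themselves would come from an attribution
# method (influence functions, data Shapley, ablations); here they are made up.

def credit_shares(influence_scores: dict[str, float]) -> dict[str, float]:
    """Normalize non-negative influence scores into shares that sum to 1."""
    total = sum(max(score, 0.0) for score in influence_scores.values())
    if total == 0:
        return {source: 0.0 for source in influence_scores}
    return {source: max(score, 0.0) / total for source, score in influence_scores.items()}

def split_reward(pool: float, influence_scores: dict[str, float]) -> dict[str, float]:
    """Distribute a reward pool in proportion to estimated influence."""
    return {source: pool * share for source, share in credit_shares(influence_scores).items()}

# Hypothetical influence estimates for a single AI-generated image.
scores = {"photographer_A": 0.42, "illustrator_B": 0.35, "stock_archive_C": 0.08}
print(split_reward(100.0, scores))
```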

Key insights from the research project include:

    • Transparent Data Attribution: The research aims to shed light on the typically opaque process of how neural networks amalgamate diverse inputs. This means being able to trace exactly how a particular artist’s style, a specific writer’s narrative technique, or a unique photographic composition might influence an AI-generated output.
    • Inspiration from Data Dignity: The project draws heavily from the work of technologist Jaron Lanier, who advocates for “data dignity” — the principle that every digital artifact carries a human touch that deserves recognition. Imagine an AI generating an animated movie that draws inspiration from various artists — Microsoft wants to create a system that can precisely track and credit those inspirational sources.

The research is not just about providing proper recognition — it’s also about creating a framework for equitable compensation, changing the way we view creative contributions to generative AI. Indeed, as this article notes, this approach is not just a technical upgrade — it’s a cultural shift that recognizes the human element behind every line of code and every stroke of digital art. By developing methods to track and potentially compensate data contributors, the company is laying the groundwork for a more just and transparent AI ecosystem, marking a promising step toward reconciling technological advancement with the fundamental rights of content creators.

This initiative comes at a crucial time, with numerous legal battles surrounding AI training data. Companies like The New York Times have filed lawsuits arguing that their copyrighted content was used without permission. Microsoft’s approach offers a proactive solution — a transparent system that could potentially mitigate these legal challenges while creating a more equitable ecosystem for content creators.

by Primavera De Filippi