
Navigating the legal challenges of training datasets in GenAI models


In the previous discussions, we examined the nuances of copyright protection for AI-generated content and the intricate matters surrounding potential copyright infringement by generative AI (GenAI). This article delves into another core aspect of GenAI: the raw material, namely the training data, which empowers these systems to generate various creative outputs.

Training datasets form the foundation upon which GenAI models are constructed. These datasets typically encompass a broad spectrum of data, from publicly available information to proprietary and copyrighted material. The inclusion of copyrighted content in these training datasets presents significant intellectual property rights challenges.

The role of copyright in protecting works used for AI training purposes is currently a key point of debate across policy, academic, and legal domains. Parliamentary debates and litigation efforts seek to determine whether copyright protection should extend to the use of copyrighted content for AI training and, by extension, whether copyright holders should be entitled to claim royalties for the use of their works in these training processes.

Over the past two years, nearly 20 new copyright infringement lawsuits have been filed in the US against prominent AI companies, including OpenAI, Midjourney, Stability AI, Nvidia, and Microsoft. These cases typically involve artists, book authors, and major image repositories such as Getty Images, all of whom argue that their copyrighted material has been used without authorization. A notable case involved Midjourney, where a significant data breach exposed the unauthorized use of artworks from over 16,000 artists for AI training, leading to legal action.

Companies like OpenAI acknowledge using large amounts of copyrighted data to train their AI systems, contending that the transformative nature of AI training constitutes a legitimate use under the fair use doctrine outlined in section 107 of the US Copyright Act. This doctrine assesses whether a use is fair on a case-by-case basis, considering factors such as the purpose and character of the use, the nature of the copyrighted work, the amount used, and the impact on the work’s market value.

For instance, a September 2023 ruling by a US district court called for a jury trial to decide whether it was fair use for an AI company to copy case summaries from Westlaw for training purposes. Similarly, Getty Images’ lawsuit against Stability AI challenges the fair use defense, arguing that Stability AI’s actions adversely affect the market for the original works.

Fair use claims over the use of copyrighted works in data mining have already been addressed in earlier judicial precedents. Google was sued for digitally scanning millions of books without the authors’ consent in order to compile a searchable digital database. In 2015, the US Court of Appeals held that Google’s practice of digitizing books without authorization was a “fair use” given the “highly transformative purpose” of Google’s use of this material: Google was transforming many diverse copyright-protected books into a new, useful search mechanism that did not compete with the underlying works.

However, recent proceedings, such as the lawsuit against Nvidia over the use of copyrighted books to train its NeMo model, highlight ongoing disputes in this area. Likewise, a class action filed by developers against OpenAI, GitHub, and Microsoft challenged the training of the Copilot software on copyrighted computer code available on GitHub, arguing that such training constituted copyright infringement because Copilot’s output resembled code licensed by GitHub users. The Court nonetheless found this argument insufficient to establish harm: even if OpenAI had used GitHub repositories to train the model, there was no evidence that the plaintiffs’ specific code appeared in Copilot’s outputs.

Yet, despite an initial body of case law that is relatively supportive of training generative AI models, it is quite possible that, as GenAI continues to develop, courts and legislators will decide that using copyrighted material as training data without proper authorization or licensing agreements infringes the rights of content creators and copyright holders. In that regard, in February 2024, the Guangzhou Internet Court in China issued a landmark ruling, determining that reproducing copyrighted works to train an AI system without proper authorization constitutes copyright infringement, thus holding the responsible company liable for civil damages.

The discussion extends beyond copyright into privacy concerns, especially when training datasets contain personal data. Incidents in Italy, such as the temporary ban on ChatGPT and the investigation into Sora’s training practices, underscore the critical need for compliance with privacy standards like the GDPR.

As GenAI continues to evolve, it is likely that both courts and legislators will increasingly examine the use of copyrighted material in training datasets. This could lead to more rigorous requirements for authorization or licensing agreements, ensuring that the rights of content creators and copyright holders are respected. The upcoming articles will continue to delve deeper into the intricacies of the legal status of GenAI models, while proposing possible solutions to address these legal complexities.

by Primavera De Filippi