
Navigating the legal challenges of training datasets in GenAI models

In the previous discussions, we examined the nuances of copyright protection for AI-generated content and the intricate matters surrounding potential copyright infringement by generative AI (GenAI). This article delves into another core aspect of GenAI: the raw material, namely the training data, which empowers these systems to generate various creative outputs.
Training datasets form the foundation upon which GenAI models are constructed. These datasets typically encompass a broad spectrum of data, from publicly available information to proprietary and copyrighted material. The inclusion of copyrighted content in these training datasets presents significant intellectual property rights challenges.
The role of copyright in protecting works used for AI training purposes is currently a key point of debate across policy, academic, and legal domains. Parliamentary debates and litigation efforts seek to determine whether copyright protection should extend to the use of copyrighted content for AI training and, by extension, whether copyright holders should be entitled to claim royalties for the use of their works in these training processes.
Over the past two years, nearly 20 new copyright infringement lawsuits have been filed in the US against prominent AI companies, including OpenAI, Midjourney, Stability AI, Nvidia, and Microsoft. These cases typically involve artists, book authors, and major image repositories such as Getty Images who argue that their copyrighted material has been used without authorization. A notable case involved Midjourney, where a significant data breach exposed the unauthorized use of artworks from over 16,000 artists for AI training, leading to legal action.
Companies like OpenAI acknowledge using large amounts of copyrighted data to train their AI systems, contending that the transformative nature of AI training makes it a legitimate use under the fair use doctrine set out in section 107 of the US Copyright Act. Courts apply this doctrine case by case, weighing four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for or value of the work.
For instance, a September 2023 ruling by a US district court called for a jury trial to decide whether it was fair use for an AI company to copy case summaries from Westlaw for training purposes. Similarly, Getty Images’ lawsuit against Stability AI challenges the fair use defense, arguing that Stability AI’s actions adversely affect the market for the original works.
Fair use claims concerning the use of copyrighted works in data mining have already been addressed in earlier case law. In Authors Guild v. Google, decided in 2015, Google had been sued for digitally scanning millions of books without the authors’ consent in order to compile a searchable digital database. The US Court of Appeals for the Second Circuit held that Google’s unauthorized digitization of the books was a “fair use” given the “highly transformative purpose” of Google’s use of this material: Google was transforming many diverse copyright-protected books into a new, useful search mechanism that did not compete with the underlying works.
However, recent proceedings, such as the lawsuit against Nvidia over the use of copyrighted books to train its NeMo model, highlight ongoing disputes in this area. Additionally, developers filed a class action against OpenAI, GitHub, and Microsoft over the training of the Copilot software on their copyrighted computer code, which was available on GitHub. They argued that such training constituted copyright infringement since Copilot’s output resembled code licensed by GitHub users. Nonetheless, the court found this argument insufficient to prove harm: even if there was evidence that OpenAI had used GitHub repositories to train its model, there was no evidence that the plaintiffs’ specific code was included in Copilot’s outputs.
However, although the early case law is relatively supportive of the training of generative AI models, courts and legislators may well decide in the years to come, as GenAI continues to develop, that using copyrighted material in training datasets without proper authorization or licensing agreements infringes the rights of content creators and copyright holders. In that regard, in February 2024, the Internet Court in Guangzhou, China, issued a landmark ruling, determining that reproducing copyrighted works for training an AI system without proper authorization constitutes copyright infringement, and holding the responsible company liable for civil damages.
The discussion extends beyond copyright into privacy concerns, especially when training datasets contain personal data. Incidents in Italy, such as the temporary ban on ChatGPT and the investigation into Sora’s training practices, underscore the critical need for compliance with privacy standards like the GDPR.
As GenAI continues to evolve, it is likely that both courts and legislators will increasingly examine the use of copyrighted material in training datasets. This could lead to more rigorous requirements for authorization or licensing agreements, ensuring that the rights of content creators and copyright holders are respected. The upcoming articles will continue to delve deeper into the intricacies of the legal status of GenAI models, while proposing possible solutions to address these legal complexities.
— — — — —
- Example of a UK parliamentary debate transcript from February 2023 on AI and IP rights: https://hansard.parliament.uk/commons/2023-02-01/debates/7CD1D4F9-7805-4CF0-9698-E28ECEFB7177/ArtificialIntelligenceIntellectualPropertyRights
- See e.g. AI v copyright: how could public interest theory shift the discourse?, Journal of Intellectual Property Law & Practice, Volume 19, Issue 1, January 2024, Pages 55–63, 29th December 2023, available at: https://academic.oup.com/jiplp/article/19/1/55/7503818 (accessed on 04.03.2024); ChatGPT: A Case Study on Copyright Challenges for Generative Artificial Intelligence Systems, European Journal of Risk Regulation, Cambridge University Press, 29 August 2023, available at: https://www.cambridge.org/core/journals/european-journal-of-risk-regulation/article/chatgpt-a-case-study-on-copyright-challenges-for-generative-artificial-intelligence-systems/CEDCE34DED599CC4EB201289BB161965 (accessed on 04.03.2024). Both of these articles propose pragmatic measures to align copyright practices with the realities of AI training data usage, aiming to secure fair use rights and equitable compensation.
- See this summary article on generative AI-related lawsuits. Joe Panettieri (Sustainable Tech Partner, March 2024): https://sustainabletechpartner.com/topics/ai/generative-ai-lawsuit-timeline/
- See Map of 20 Copyright Lawsuits v. AI companies, 1st April 2024, available at: https://chatgptiseatingtheworld.com/2024/04/01/map-of-20-copyright-lawsuits-v-ai-companies/ (accessed on 04.02.2024)
- See Sarah Andersen et al v. Stability AI Ltd, MidJourney Inc, Deviantart, Inc, United States District Court Northern District of California San Francisco Division, available at: https://stablediffusionlitigation.com/pdf/00201/1-1-stable-diffusion-complaint.pdf
Other complaints were made on behalf of book authors challenging ChatGPT and LLaMA. See https://stablediffusionlitigation.com/ (accessed on 03/26/2024).
- See Data Leak: Midjourney’s Unauthorised Use of 16,000+ Artists’ Works Sparks Legal and Ethical Showdown, Medium, 9th January 2024, available at: https://medium.com/@calebpr/data-leak-midjourneys-unauthorised-use-of-16-000-artists-works-sparks-legal-and-ethical-56b862899e6f (accessed on 04.02.2024)
- OpenAI argues its purpose is transformative since the training process creates “a useful generative AI system”. This view is supported by legal precedents that recognize reproducing copyrighted materials for computational data analysis as legitimate. See Comment of OpenAI LP Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation Before the United States Patent and Trademark Office Department of Commerce, available at: https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf (accessed on 03.27.2024)
- See “Public Views on Artificial Intelligence and Intellectual Property”, United States Patents and Trademark Office, March 2020, available at: https://www.uspto.gov/sites/default/files/documents/USPTO_AI-Report_2020-10-07.pdf (accessed on 03.27.2024)
- The trial on the copyright issues will begin on August 26, 2024, at 9:00 a.m. in Wilmington, Delaware and will last five days. See Thomson Reuters Enterprise Centre GMBH and West Publishing Corp. v. Ross Intelligence Inc., United States District Court for the District of Delaware, filed on 25th September 2023, available at: https://storage.courtlistener.com/recap/gov.uscourts.ded.72109/gov.uscourts.ded.72109.547.0_3.pdf (accessed on 04.04.2024)
- See https://www.bakerlaw.com/getty-images-v-stability-ai/ (accessed on 03/26/2024) and Zirpoli, Christopher T., Generative Artificial Intelligence and Copyright Law, Congressional Research Service, 29th September 2023, available at: https://crsreports.congress.gov/product/pdf/LSB/LSB10922 (accessed on 03.27.2024)
- See Authors Guild v. Google, Inc., No. 13-4829, United States Court of Appeals for the Second Circuit, 16 October 2015, available at: https://law.justia.com/cases/federal/appellate-courts/ca2/13-4829/13-4829-2015-10-16.html (accessed on 03.28.2024)
- See Nazemian et al v. NVIDIA Corporation, 3:24-cv-01454, case filed 8 March 2024, available at: https://www.pacermonitor.com/public/case/52650939/Nazemian_et_al_v_NVIDIA_Corporation (accessed on 04.01.2024)
- See DOE 1 v. GitHub, Inc. (4:22-cv-06823), District Court, N.D. California, available at: https://www.courtlistener.com/docket/65669506/doe-1-v-github-inc/?filed_after=&filed_before=&entry_gte=&entry_lte=&order_by=desc (accessed on 04.01.2024)
- Yuying, Zhang, China issues world’s 1st legally binding verdict on copyright infringement of AI-generated images, Global Times, 27 February 2024, available at: https://www.globaltimes.cn/page/202402/1307805.shtml (accessed on 04.02.2024)
- Italy becomes first western nation to ‘ban’ ChatGPT, The Times, available at: https://www.thetimes.co.uk/article/italy-becomes-first-western-nation-to-ban-chatgpt-csvnt6w3w?gad_source=1&gclid=CjwKCAjw_LOwBhBFEiwAmSEQAaPwpEHxPC8NrI5N71hIcR_zO7dzG8rR_uvSOOsvazVqGOchC4JAyxoCfLoQAvD_BwE (accessed on 04.03.2024)
- Lindberg, Martina, Italian Data Protection Authority investigates new OpenAI model Sora, Grip, 14th March 2024, available at: https://www.grip.globalrelay.com/italian-data-protection-authority-investigates-new-openai-model-sora/ (accessed on 04.02.2024)


