DeepSeek and the AI copyright wars: The empire strikes back?


DeepSeek is a Chinese AI company that has recently gained attention for its advanced AI models, particularly DeepSeek R1, released on January 20th, 2025, just one day before OpenAI announced the Stargate Project, a new company that intends to invest $500 billion over the next four years to build new AI infrastructure for OpenAI in the United States. The release of R1 triggered a Silicon Valley panic attack. The reason is that R1 is not just another large language model (LLM) that merely predicts the next word based on patterns in text; it is a reasoning model specifically trained for structured problem-solving and logical deduction, making it more akin to human reasoning. As of today, R1 rivals top reasoning models from OpenAI (e.g. o1) and Google (e.g. Gemini Flash Thinking) at a fraction of the cost. Besides, it’s open source. Think ChatGPT with a sharper brain, a smaller waistline, and an MIT license stapled to its forehead.
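To make that distinction concrete, here is a purely illustrative Python sketch of what an R1-style reply looks like: the model emits an explicit reasoning trace (DeepSeek R1 wraps it in <think>…</think> tags) before giving its final answer, rather than simply continuing the text with the most likely next words. The raw output below is invented for the example.

```python
# Illustrative only: a reasoning model such as DeepSeek R1 produces an
# explicit chain of thought (wrapped in <think>...</think> tags) before
# its final answer. The raw_output string below is invented for the example.
raw_output = (
    "<think>The question asks for 17 * 24. "
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think> "
    "17 multiplied by 24 is 408."
)

# Separate the visible reasoning trace from the final answer.
reasoning, _, answer = raw_output.partition("</think>")
reasoning = reasoning.replace("<think>", "").strip()

print("Reasoning trace:", reasoning)
print("Final answer:", answer.strip())
```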

Importantly, DeepSeek’s release of R1 isn’t just a matter of geopolitics and market competition; it’s also a legal earthquake that could reshape AI regulation worldwide.

A copyright nightmare in the making

The first red flag is the training data. DeepSeek’s R1 research paper doesn’t spell out exactly what the model was trained on, which is not surprising: it is common practice in the generative AI landscape not to disclose the full extent of the training dataset. Hence, we can only try to infer R1’s training data by looking at the data used for previous generative models developed by DeepSeek. Its earlier LLMs were trained on Common Crawl, whereas its vision-language model (VLM) was trained, among other things, on a variety of repositories of unlicensed copyright works, such as LibGen, Sci-Hub and other sources of pirated papers and books. While there is no proof that R1 was trained on these ‘shadow libraries’, if it were, it could well ground a claim of copyright infringement.

Meta has already been accused of using LibGen as part of Llama’s training data, and OpenAI is already fighting multiple lawsuits over its training practices. But here’s the twist: DeepSeek is based in China. Suing the company for copyright infringement outside of China would be a challenge in and of itself. Besides, even if someone could build a case for copyright infringement, proving direct damage from AI training remains an uphill battle. If DeepSeek’s model doesn’t spit out copyrighted books verbatim, courts might struggle to see any harm done. In short, it’s the Wild West, except China gets to write the law.

Did DeepSeek “steal” from OpenAI?

OpenAI claims that DeepSeek engaged in model distillation, a process in which you prompt another generative AI model with many questions, record its answers, and use those answers as training data for your own model. Think of it as learning to replicate a Michelin-starred chef’s secret recipes by eating at their restaurant every night. If true, this means DeepSeek effectively piggy-backed on OpenAI’s $500 million R&D budget to launch a competing model at a fraction of the cost.
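To make the allegation concrete, here is a minimal, hypothetical sketch of what distillation-style data collection could look like, assuming the official OpenAI Python client; the teacher model, prompts and file name are placeholders for illustration, not a description of DeepSeek’s actual pipeline.

```python
# Hypothetical sketch of "distillation": query a teacher model, record its
# answers, and save prompt/answer pairs as training data for a student model.
# Illustrative only; not a claim about what DeepSeek actually did.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompts = [
    "Prove that the square root of 2 is irrational.",
    "Explain why the sky is blue in two sentences.",
]

with open("distillation_dataset.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        # Each (prompt, answer) pair becomes a supervised fine-tuning example.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```

At scale, millions of such pairs can transfer much of a teacher model’s behaviour to a far cheaper student model, which is why the practice is so contentious.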

But is this copyright infringement? Probably not. OpenAI might be able to argue a terms-of-service violation if DeepSeek scraped its API, but that’s a breach of contract, not a copyright claim. And enforcing terms of service across international borders is another legal headache. Even if OpenAI won a lawsuit, it’s unclear how it would stop DeepSeek from training another model using the same trick.

The future of AI copyright battles

DeepSeek’s emergence isn’t just another chapter in the AI copyright saga — it’s a plot twist. It forces us to rethink who controls AI training data, whether copyright law is enforceable at scale, and whether proprietary AI can survive against open-source alternatives.

1. Shadow libraries: pragmatism vs. copyright law

If DeepSeek R1’s strong performance is linked to training on shadow libraries like LibGen and Sci-Hub, this could set a dangerous precedent — or at least an unavoidable one. AI developers will take note: if pirated data improves AI reasoning, it’s only a matter of time before others follow suit, whether openly or in secret.

This creates a direct collision between copyright law and market pragmatism. AI companies aren’t driven by ethics; they’re driven by performance. If the best models emerge from unlicensed data, market forces will push developers toward that path — whether through plausible deniability, legal loopholes, or outright defiance.

Legally, enforcement is complicated. If DeepSeek operates primarily in China, Western copyright lawsuits won’t touch it. But what happens when open-source models trained on this data start circulating globally? Courts in the U.S. and EU will face a dilemma: either try to block AI models trained on unauthorized data (good luck policing that) or accept a new reality where AI is built on imperfect, legally gray foundations.

2. AI arms race: when geopolitics overrides copyright

The AI race isn’t just about technology — it’s about national power. The U.S., China, and the EU all see AI as a matter of economic and military dominance. In that context, copyright enforcement becomes a secondary concern at best.

The U.S. won’t let copyright lawsuits cripple domestic AI development if China is forging ahead. If AI models trained on copyrighted material prove significantly more capable, policymakers might push for legal carve-outs to shield AI companies from copyright liability, just as they’ve done for tech firms in the past (e.g., DMCA safe harbors, Section 230 for platforms).

The EU, meanwhile, has taken a stricter approach with its AI Act, at the risk of falling behind. China, for its part, has no intention of slowing down, which means global AI regulation may eventually bend toward competitiveness, not copyright purity.

3. Open Source vs. closed AI models: a new competitive pressure

DeepSeek’s open-source release of its reasoning model creates a strategic dilemma for proprietary AI companies like OpenAI, Google, and Anthropic. They’ve built their business models on closed, high-cost development, assuming their models would always be superior to open-source alternatives. But what if they aren’t?

If open-source AI can achieve comparable or better performance, the closed-model approach faces the existential threat of market disruption. Why pay OpenAI or Google for access when you can run a state-of-the-art model locally for free?
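To see what that means in practice, here is a minimal sketch of running an open-weight model locally with the Hugging Face transformers library; the checkpoint name is one of the publicly released R1 distillations, used purely as an example, and it assumes transformers and PyTorch are installed and the hardware can hold the weights.

```python
# Minimal sketch: run an open-weight reasoning model locally, with no API fees.
# The checkpoint below is one of the publicly released R1 distillations and is
# used only as an example; any open-weight chat model would work the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "How many prime numbers are there between 10 and 30?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

# Print the generated reply (reasoning followed by the final answer).
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```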

Besides, the entire premise of copyright lawsuits against proprietary models may collapse if the strongest AI models emerge from community-driven, open-source development. If the best AI models are open and free, copyright fights over training data may become a footnote in history.

Conclusion: A copyright reckoning is coming

DeepSeek’s release of R1 raises serious questions about copyright enforcement, data ownership, and whether copyright law is even relevant in an AI-dominated world. If copyright is supposed to incentivize human creativity, what happens when the most advanced “creators” are machines that remix and reassemble information faster than we can legislate?

DeepSeek R1 isn’t just another AI model — it’s a stress test for AI copyright law. If shadow libraries prove too valuable to ignore, if national AI policy overrides copyright concerns, and if open-source AI outpaces proprietary models, then the copyright battles we’re seeing today may soon feel outdated, futile, or both.

Ultimately, if DeepSeek proves anything, it is that the age of AI copyright wars is just beginning, and no one, neither the corporations nor the regulators, knows how this story ends. The fundamental question isn’t whether AI will change copyright; it’s whether copyright can withstand AI.

by Primavera De Filippi