If you’ve been wondering whether technology companies can use copyrighted books and other works to train large language models (LLMs), you’re not alone. Recent court decisions tell a nuanced story: training can be fair use in some circumstances, but how you acquire the books—and what you do with them—matters a lot.
How you source data can make or break you
In Bartz v. Anthropic PBC, 787 F. Supp. 3d 1007, 1025–26 (N.D. Cal. June 23, 2025), Judge William Alsup took a hard stance against downloading millions of pirated books to build a “central library,” rejecting the idea that you can excuse piracy just because some copies might later be used for training. The judge doubted that taking books from pirate sites when lawful copies were available could ever be “reasonably necessary” to any fair use and called such piracy “inherently, irredeemably infringing,” even if the copies were immediately used for a transformative purpose and then discarded.
The court drew a sharp distinction: using purchased books to train an LLM and converting purchased print copies to digital format were fair uses, but creating a library of pirated works was not excused as fair use, regardless of the reason for doing so. That issue will proceed to trial on liability and damages.
In Kadrey v. Meta Platforms, 788 F. Supp. 3d 1026 (N.D. Cal. June 25, 2025), Judge Vince Chhabria rejected the plaintiffs’ theory that they should win automatically because the books were sourced from online “shadow” libraries rather than lawfully purchased copies. Instead, the court found that it wasn’t proper to completely separate the act of downloading from the act of training: even though they’re different acts, the downloading must be considered in light of the ultimate, highly transformative purpose of training. Because Meta’s use of books to train its Llama model had a “further purpose” and “different character” from the books themselves, and was “highly transformative,” the downloading was too, regardless of where the books came from.
Now, Meta is claiming that any uploading of data that occurred during the torrenting of books from shadow libraries also is fair use as “part and parcel” of its training process. However, it remains to be seen whether that is a bridge too far. In response, the plaintiffs were permitted to add a claim for contributory infringement, though not without Judge Chhabria chastising them for not doing so sooner.
Ultimately, courts are looking past subjective intent and focusing on what the user actually does with the works. The same copy can be used one way, then another, with different outcomes in a fair use analysis. However, calling something “research” does not excuse building a central repository of pirated books as a substitute for paid copies. Whether Meta also succeeds in having its uploading excused as “fair use” remains to be seen.
Where the market-harm fight is headed
In Kadrey, the court explained that fair use is fact-specific, that a use does not win automatically just because it is transformative, and that market harm can outweigh even highly transformative uses. In LLM training cases, a plaintiff’s best shot may be to show that the model’s outputs will meaningfully substitute for the originals or otherwise significantly harm the market for those works. It also could matter whether downloading from online “shadow” libraries props up the pirate sites (e.g., through ad revenue), but the plaintiffs in Kadrey offered no evidence of that. They also did not assert a contributory infringement claim, which could prove fatal to their argument that Meta should be liable for its data uploads.
To establish that market harm has occurred, the court opined that a plaintiff could argue (a) regurgitation or substantially similar outputs, (b) a licensing market for training, and/or (c) indirect substitution where models generate genre-similar works that compete. In Kadrey, (a) and (b) failed on the record, and while (c) was theoretically stronger, it was not proven well enough to avoid summary judgment.
In the case of In re Mosaic LLM Litigation, 2025 WL 2294910, at *4 (N.D. Cal. Aug. 8, 2025), Magistrate Judge Lisa J. Cisneros noted that the courts of appeals haven’t yet weighed in on factor four for generative AI and suggested that, as AI proliferates, the right to license one’s work for AI training may become something copyright owners “rightly expect to control,” with evidence of a “ready market” (including big-tech licensing talks) being relevant to the analysis. Plaintiffs concerned about this use of their works should pay attention to whether such a market exists and, if not, how one might be created to establish a value for the works.
Practical takeaways for AI builders and rightsholders
No case has gone to trial or been resolved by any appellate court…yet. Not only is the jury still out, it hasn’t been selected. Even so, there are some practical takeaways to draw from these cases:
- Source responsibly: Courts have distinguished between fair-use training and the unfair creation of a pirated central library; acquiring from pirate sites when you can buy or license the works is a major risk, and retaining such copies can drive liability and willfulness exposure.
- Keep evidence on outputs: Plaintiffs who can show regurgitation, substantial similarity, or real substitution effects are better positioned to overcome a defendant’s “transformative” argument.
- Watch the licensing market: Demonstrating a functioning or “ready” market to pay for training uses can weigh against fair use; ongoing licensing negotiations may be discoverable and probative.
- Context matters for downloads: Courts may assess downloading in light of the end use; absent proof that downloads supported piracy operations, sourcing from shadow libraries was not automatically disqualifying in one case, though another court treated mass piracy for a central library as infringing irrespective of the intent to train an LLM.
Bottom line
These court decisions signal that training LLMs on copyrighted works can qualify as fair use, but it’s not a blank check. The strongest path for users of copyrighted material appears to be lawful sourcing, narrow retention aligned with the training purpose, and rigorous controls to avoid output substitution.
For rightsholders, the battleground is market harm: build a record on output substitution and the existence of a real licensing market.
The law is still developing, so prudence and careful planning are the order of the day, for both LLM developers and rightsholders.
Disclaimer
The foregoing is for informational purposes only, is not legal advice, and is not intended to solicit or to create an attorney-client relationship. Readers should not rely or act upon this information but should seek legal counsel regarding the facts and circumstances of their particular situation.