How to Train LLMs using Copyrighted Books

If a ‘product’ such as an LLM (Large Language Model) is trained using copyrighted books, what is the harm in it? Isn’t that what search engines do too, in order to provide better results to the user? If you want to ban LLMs from reading books, then by the same logic the listing of books on search engines like Google should also be banned, because in a way a search engine is also a ‘digital brain’ that processes all the data that comes to it in myriad ways.

Surprisingly, the recent case of Anthropic is one of the very few cases where a corporate giant actually took responsibility for its actions. Nowhere in the judgment delivered by the District Court for the Northern District of California is it reflected that Anthropic made any evasive or misleading attempts to extract a favourable outcome. This is laudable. In the face of adversity, Anthropic chose to stay with the truth.

Whether training LLMs violates copyright law is a long-standing issue, and the court’s reasoning in Andrea Bartz & Others v. Anthropic PBC categorically demonstrates that unless the LLM blatantly divulges the exact contents of a copyrighted work to the user, the training will not be considered an infringement.

To me, it makes a lot of sense that if somebody has legitimately purchased books and trains its LLM on them, that must not be prohibited. Prohibiting it would be akin to saying that reading books is a crime, because in this new reality of AI we must acknowledge that LLMs are digital brains. They may not be conscious, but they are surely capable of understanding the context of words just as a biological brain does. These digital brains are merely products that different AI companies offer to their users.

Regarding pirated copies of copyrighted works, the court has left the question open, observing that such use may not qualify as ‘fair use’. My own view is that pirated copies also call for a nuanced approach. If the LLM company is able to demonstrate that it obtained the pirated copies from a publicly available source, it should be given the benefit of the doubt and may be asked to compensate the copyright owner prospectively. However, if the LLM company itself becomes a pirate and steals the copyrighted work firsthand, by hacking or by any other means, then it should not be given the benefit of the doubt. This is a very debatable issue and must be deliberated upon at length.

Intellectual property laws have a tendency to restrict the flow of knowledge. I am not saying that blatant stealing of works without paying a dime to the copyright owner should be allowed. However, if a person has purchased a copyrighted work, he should be allowed to do anything he pleases with it within the broad framework of ‘fair use’.

I also concur with the court’s observation in the Anthropic ruling that ‘fair use’ analysis cannot take into account speculative or contemplated losses that might occur later on. The copyright owner cannot seek relief against the competition it faces.

Suppose I write a book today. It becomes famous, and later on, other people start writing on the same ideas that I had proposed in my work, and end up writing better than me. Can I claim protection against such competition? Never!

Similarly, if an LLM reads a book and is able to explain the ideas in it to the users in a better manner than the book itself, can the LLM be blamed for the same? I guess not. If such a restrictive approach is adopted, it would mean stifling all intellectual growth of both biological brains and digital brains.

We must also acknowledge that digital brains are collective manifestations of biological brains, albeit in an extremely limited manner. What AI researchers are trying to accomplish is to consolidate the sum total of human knowledge into digital brains in such a manner that they can sift through it easily and provide fast but accurate responses.

People criticizing the Anthropic ruling mostly have a myopic view of the nature of copyright law. Copyright law is not an end in itself. It is merely a means to an end, the end being the dissemination of human knowledge to further new ideas and growth. The purpose of any ‘law’ is to promote a level playing field and nothing more. Any law that creates monopolies is a law that deserves to be repealed.

As lawyers, in the subject of jurisprudence, we study that there is something called ‘higher law’. In ancient times, religious edicts were considered to be the higher law. In modern times, principles of natural justice, principles relating to equity, reasonableness and good conscience etc. are considered to be the higher law. Law must not be allowed to be used as a tool to subserve vested interests. A good law is one that merely balances the conflicting interests prevalent in the society and leaves the rest of the things to the good wisdom of the people.

I think this is exactly what the District Court for the Northern District of California has done in the Anthropic case. It has adopted a common-sense approach and has refused to adopt hyper-technical interpretations.

Let’s see how the court interprets the liability of LLM companies with respect to pirated copies of works.

