The important parts:
> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
Here is how individuals are treated for massive copyright infringement:
https://investors.autodesk.com/news-releases/news-release-de...
Apparently it's a common business practice. Spotify (though I can't find definitive proof) seems to have built its software and business on pirated music. There is more on this in this article [0].
https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
Anthropic's cofounder, Ben Mann, downloaded millions of books from Library Genesis in 2021, fully aware that the material was pirated.
Stealing is stealing. Let's stop with the double standards.
Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?
Sayi "they have the money" is not an argument. It's about the amount of effort that is needed to individually buy, scan, process millions of pages. If that's done for you, why re-do it all?
These are the people shaping the future of AI? What happened to all the ethical values they love to preach about?
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies, because the rules around copyright are changing just to target them. I don't owe royalties to every book I read just because I may subconsciously incorporate its ideas into my future work.
It’s easy to point fingers at others. Meanwhile the top comment in this thread links to stolen content from Business Insider.
Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use, but the legal industry disagrees.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
The most likely option is (c), as we've seen this pattern many times before.
I'm not seeing how this is fair use in either case.
Someone correct me if I'm wrong, but aren't these works being digitized and transformed in order to profit off the information they contain?
It would be one thing for an individual to make personal use of one or more books, but you have to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model goes clearly against what copyright stands for.
By the way, I wonder if the recent advancements in protecting YouTube videos from downloaders like yt-d*p are driven by an unwillingness to help rival AI companies gather datasets.
If the AI movement manages to undermine Imaginary Property, it will redeem its externalities threefold.
The buried lede here is that Anthropic will need to attempt to explain to a judge that it is impossible to de-train 7M books from their models.
Anyone read the 2006 sci-fi book Rainbows End, which has this? It was set in 2025.
actual title:
"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Anyone else think destroying books for any reason is wrong?
Or is it perhaps not a universal cultural/moral value?
I guess people in Europe, for example, could be more sensitive to it.
Order on Fair Use
https://ia800101.us.archive.org/15/items/gov.uscourts.cand.4...
Based on the fact that people went to jail for downloading some music or movies, this guy will face a lifetime in prison for 7 million books that he then used for commercial profit, right?
Right, guys? We don't have "rules for thee but not for me" in the land of the free?
Meta did the same, and probably other big companies too. People who praise AGI are very short-sighted. It will ruin the world with our current morals and ethics. It's like a nuclear weapon in the hands of barbarians (shit, we have that too, actually).
So, using the standard industry metrics for calculating the financial impact of piracy, this would equate to something like trillions of dollars in damages to the book publishing industry?
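For what it's worth, the math gets there; a back-of-the-envelope sketch, assuming (my assumption, not anything from the ruling) that every one of the 7 million works gets the $150,000 statutory cap for willful infringement under 17 U.S.C. § 504(c)(2):

    # Back-of-the-envelope only: applies the willful-infringement statutory
    # cap of $150,000 per work (17 U.S.C. § 504(c)(2)) to all 7M works.
    works = 7_000_000
    willful_cap = 150_000  # USD per work, statutory maximum
    print(f"${works * willful_cap:,}")  # prints $1,050,000,000,000, about $1.05T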
If AI companies are allowed to use pirated material to create their products, does it mean that everyone can use pirated software to create products? Where is the line?
Also, please don't use the word "learning"; say "creating software using copyrighted materials" instead.
Also, let's think together about how we can prevent AI companies from using our work through technical measures, if the law doesn't work.
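For example, here's a minimal sketch in Python that writes a robots.txt asking known AI crawlers to stay away. The user-agent names (GPTBot, ClaudeBot, CCBot, Google-Extended) are the ones the vendors publish, but verify them yourself, and remember that robots.txt is purely voluntary; a crawler can simply ignore it:

    # Minimal sketch: generate a robots.txt that asks known AI crawlers to keep out.
    # Compliance is voluntary, so pair this with server-side user-agent blocking
    # if you want actual enforcement. Crawler names change over time; verify them.
    AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

    rules = "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in AI_CRAWLERS)
    with open("robots.txt", "w") as f:
        f.write(rules + "\n")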
They've all done that, it should be obvious by now. Training on just freely available data only gets you so far.
Let’s say my AI company is training an AI on woodworking books and at the end, it will describe in text and wireframe drawings (but not the original or identical photos) how to do a particular task.
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
Something I've been trying to reconcile: I buy a cheap used book on biblio and I'm morally OK, even though the writer doesn't get paid. But if I pirate the book, then I'm wrong, because the writer doesn't get paid?
The article doesn't say who is suing them. Is it a class action? How many of these 7M pirated books have they written? Is it publishing houses? How many of these books are relevant in this judgement?
Hang on, it is OK under copyright law to scan a book I bought second hand, destroy the hard copy and keep the scan in my online library? That doesn't seem to chime with the copyright notices I have read in books.
> "Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different," he wrote.
But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that. And LLM developers are not aspiring to become writers and do not learn anything; they just want to profit by making software using copyrighted material. Also, people do not read millions of books to become writers.
As far as I understand, while training on books is clearly not fair use (as the result will likely hurt the livelihood of authors, especially those who aren't "best of the best" authors),
as long as you buy the book it should still be legal; that is, if you actually buy the book and not a "read-only" eBook.
But the 7,000,000 pirated books are a huge issue, and one we have a lot of reason to believe isn't specific to just Anthropic.
It’s marginally better than Meta torrenting z-lib.
Buying, scanning, and discarding was part of my proposal for training under copyright restrictions.
You are often allowed to make a digital copy of a physical work you bought. There are tons of used physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improve book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work, or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers, e.g. people who own a physical copy or a license to one. Obviously, the implementation could get complex, but we wouldn't have to destroy books very often.
1980s: Johnny 5 need input!
2020s: (Steals a bunch of books to profit off the acquired knowledge.)
It is shocking how courts have been ruling in favor of AI companies despite the obvious problem of allowing automated plagiarism.
The title is clearly meant to generate outrage, but what is wrong with cutting up a book that you own?
When Aaron Swartz did it, he ended up dead.
Two of the top AI companies flouted ethics with regard to training data. In OpenAI's case, the whistleblower probably got whacked for exposing it.
Can anyone make a compelling argument that any of these AI companies have the public's best interest in mind (alignment/superalignment)?
> Alsup detailed Anthropic's training process with books: The OpenAI rival spent "many millions of dollars" buying used print books, which the company or its vendors then stripped of their bindings, cut the pages, and scanned into digital files.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
If ingesting books into an AI makes Anthropic criminals, then Google et al. are criminals too, for making search indexes of the Internet. Anything published online is equally copyrighted.
Under the DMCA, the minimum penalty for an illegally downloaded file is $750 (https://copyrightresource.uw.edu/copyright-law/dmca/)
"Anthropic had no entitlement to use pirated copies for its central library...Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy." --- the ruling
If they committed piracy 7 million times and the minimum fine for each instance is $750, then the law says Anthropic is liable for at least $5.25 billion. I just want it out there that they definitely broke the law and the penalty is a minimum of $5.25 billion in fines according to the law, so that when none of this actually happens, we at least can't pretend we didn't know.
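A quick sanity check on that arithmetic, applying the $750 statutory minimum uniformly (an assumption; courts set the per-work figure case by case):

    # Sanity check: 7M infringed works at the $750 statutory minimum each.
    works = 7_000_000
    min_per_work = 750  # USD, statutory minimum per work
    print(f"${works * min_per_work:,}")  # prints $5,250,000,000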
Should have listened to those NordVPN ads on YouTube
Hopefully they were all good books at least.
So, how should we as a society handle this?
Ensure the models are open source, so everyone can use them, as everyone's data is in there?
Close those companies and force them to delete the models, as they used copyrighted material?
I'm curious - do the people here who think copyright shouldn't exist also think trademark shouldn't exist?
Two-week-old news.
Some previous discussions:
https://news.ycombinator.com/item?id=44367850
If Anthropic is funded by Amazon, they should have just asked Amazon for unlimited downloads of EVERY book in the Amazon bookstore, and all audiobooks as well. It certainly would be faster than buying one copy of each and tearing it apart.
The farce of treating a corporation as an individual precludes the common-sense legal procedure of investigating the people responsible for criminal action taken by the company. It's obviously premeditated and in all ways an illicit act knowingly perpetrated by persons. The only discourse should be about upending this penthouse legalism.
I've begun to wonder if this is why some large torrent sites haven't been taken down. They are essentially able to crowdsource all the work. There are some users who spend ungodly amounts of time and money on these sites that I suspect are rich industry benefactors.
Seems like the "mis" is missing from the name.
No one said anything when Google did it, way before LLMs were a thing.
So if you incorporate you can do whatever you want without criminal charges?
The solution has always been: show us the training data.
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to make claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size. But not knowing the data, it is outlandish. Especially because the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM [0]. Analyzing that amount of data is just intractable, and we currently do not have the mathematical tools to do so. But this is a much harder problem to crack when we're just conjecturing, and ultimately this makes interpretability more difficult.
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, and OpenAI (Microsoft), it is just wild. This isn't even following the highly controversial advice of Eric Schmidt: "Steal everything, then if you get big, let the lawyers figure it out." [1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years. It was only 2018 when Apple became the first trillion-dollar company, 2020 when it became the first two-trillion-dollar company, and 2022 when it became the first three-trillion-dollar company [2]. Now we have 10 companies north of the trillion-dollar mark! [3] (5 above $2T and 3 above $3T.) These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better. To at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement), and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There are always moral gray areas, but is this really one of them? I even say this as someone who has torrented books myself! [4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
Most of the comments missed the point. It's not that they trained on books, it's that they pirated the books.
From Vinge's "Rainbows End":
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazines flew down the tunnel like leaves in a tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
Good, this is what Aaron Swartz was fighting for.
Against companies like Elsevier locking up the world's knowledge.
Authors are no different from scientists: many had government funding at one point, and it's the publishing companies that got most of the sales.
You can disagree and think Aaron Swartz was evil, but you can't have it both ways.
You can take what Anthropic has shown you is possible and do this yourself now.
isohunt: freedom of information
Maybe to give something back to the pirates, Anthropic could upload all the books they have digitized to the archive? /s
I will never feel bad again for learning from copied books /S
Everybody that wants to train an LLM, should buy every single book, every single issue of a magazine or a newspaper, and personally ask every person that ever left a comment on social media. /s
If I were China, I would buy every lawyer to drown Western AI companies in lawsuits, because it's an easy way to win the AI race.
Amazon has been doing this since the 2000s. Fun fact: this is how AWS came about, as Amazon needed to scale its "LOOK INSIDE!" feature for all the books it was hoovering in an attempt to kill the last benefit the bookstore had over them.
I.e., this is not a big deal. The only difference now is that people are quick to froth with outrage at the mere sniff of new tech on the horizon. Overton window in effect.
https://archive.md/YLyPg