A giant online book collection Meta used to train its AI is gone over copyright issues

2023-08-18 23:21

AI as we know it basically exists to eat up the internet and spit it back out at you. The problem is that huge parts of the internet are protected by copyright law.

That's one of the major takeaways from the gigantic Books3 database getting taken down following a DMCA request by the Danish anti-piracy group Rights Alliance, as originally reported by TorrentFreak. Books3 contained a little more than 196,000 books in plain-text format for AI models to chew on for training purposes, but aside from a few alternate links floating around the internet, it's no longer publicly accessible. The old link to it goes to a 404 page.

Books3 existed as part of a larger collection of AI training content called The Pile, organized by the research group EleutherAI. As noted by a Gizmodo report on the subject, Meta has referenced using The Pile for training its in-house AI model before. Meta's model wouldn't be the first big tech AI model to potentially be trained on illegally disseminated material, either: a class-action lawsuit filed in July accused Google of doing the same thing.

This stuff gets tricky fast in the legal sense, but also in the ethical sense. For instance, a person who might be in favor of piracy in general for historical archival purposes could also vehemently oppose AI models being trained on copyrighted material (I feel like I know several people who think this way). It's also easy to understand why authors would oppose their work being used this way, as the makers of these AI models could theoretically profit from other people's work in the future.

The only certainty is that these battles are going to get messier from here.