Issue 113: More on Copyright and Foundational AI Models
Two years ago this month, I wrote a DLTJ Thursday Threads article on the copyright implications of foundational AI models. A lot has happened in those 24 months. This issue mostly focuses on lawsuits, plus an announcement of a service offering image generation from licensed content. These articles highlight the growing tension between content creators and technology companies as AI technologies increasingly rely on large datasets that include licensed and, in some instances, pirated content.
- From late 2023, the New York Times sues OpenAI and Microsoft for alleged copyright infringement in AI training (with late-breaking update).
- U.S. judge partially favors OpenAI while permitting unfair competition claim in authors' copyright lawsuit in this ruling from early 2024.
- Last month Thomson Reuters wins landmark U.S. AI copyright case, potentially establishing a legal precedent.
- Microsoft guarantees legal protection for Copilot users from copyright lawsuits.
- Meta's training of its AI with pirated LibGen books sparks legal and ethical debate.
- Nvidia denies copyright infringement in the use of shadow libraries for AI training.
- Getty Images launched an AI image generator using its licensed library in 2023.
- This Week I Learned: "But where is everybody?!?" — the origins of Fermi's Paradox
- This week's cat
Also on DLTJ this past week:
- In OCLC v Anna's Archive, New/Novel Issues Sent to State Court: The case of OCLC against Anna's Archive, accused of “data scraping” from OCLC's WorldCat, takes a turn as the U.S. District Court for the Southern District of Ohio decides to certify several “novel and unsettled” legal questions to the Supreme Court of Ohio.
- My protest signage improved at this week's #TeslaTakedown: My improved sign said "Our GOVERNMENT was fine. Now it is MUSKed UP! FIRE ELON!" Read the post for instructions on printing your own copy of this protest sign.
Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.
New York Times sues OpenAI and Microsoft for alleged copyright infringement in AI training
The New York Times sued OpenAI and Microsoft on Wednesday over the tech companies’ use of its copyrighted articles to train their artificial intelligence technology, joining a growing wave of opposition to the tech industry’s using creative work without paying for it or getting permission. OpenAI and Microsoft used “millions” of Times articles to help build their tech, which is now extremely lucrative and directly competes with the Times’s own services, the newspaper’s lawyers wrote in a complaint filed in federal court in Manhattan.
We're starting with the lawsuits, and this is one of the bigger ones. At the time the lawsuit was filed, OpenAI had announced deals with content providers to use their backfiles of content, but the New York Times was a holdout. The lawsuit claims that OpenAI and Microsoft used millions of Times articles to build technology that directly competes with the newspaper's services. While OpenAI maintained that it respects content creators' rights and believes its practices fall under fair use, the lawsuit cites instances of AI reproducing Times articles verbatim. This case has had many twists and turns, including a report last year that OpenAI trashed the research of the Times' lawyers. You can follow along with the court case in the Southern District of New York.
LATE BREAKING NEWS: As I was finishing the edits on this issue, I saw the judge issued a brief ruling on the defendant's motion to dismiss. In short, the lawsuit continues, but the portions on "common law unfair competition by misappropriation claims" are dismissed. The full version of the opinion hasn't been released yet, but should be coming soon!
US judge favors OpenAI, permits unfair competition claim in authors' copyright lawsuit
A US district judge in California has largely sided with OpenAI, dismissing the majority of claims raised by authors alleging that large language models powering ChatGPT were illegally trained on pirated copies of their books without their permission. By allegedly repackaging original works as ChatGPT outputs, authors alleged, OpenAI's most popular chatbot was just a high-tech "grift" that seemingly violated copyright laws, as well as state laws preventing unfair business practices and unjust enrichment. According to judge Araceli Martínez-Olguín, authors behind three separate lawsuits—including Sarah Silverman, Michael Chabon, and Paul Tremblay—have failed to provide evidence supporting any of their claims except for direct copyright infringement.
A US judge has largely sided with OpenAI in a lawsuit brought by authors alleging that ChatGPT was trained using pirated copies of their books. The judge dismissed most claims except for direct copyright infringement. While the authors failed to show that ChatGPT outputs were substantially similar to their works, one unfair competition claim was allowed to proceed based on OpenAI allegedly using copyrighted works without permission. This case has been quiet for a while, I think because the remaining claims were consolidated into the Tremblay v. OpenAI, Inc. case being overseen by the same judge.
Thomson Reuters wins landmark U.S. AI copyright case, potentially establishing legal precedent
Thomson Reuters has won the first major AI copyright case in the United States. In 2020, the media and technology conglomerate filed an unprecedented AI copyright lawsuit against the legal AI startup Ross Intelligence. In the complaint, Thomson Reuters claimed the AI firm reproduced materials from its legal research firm Westlaw. Today, a judge ruled in Thomson Reuters’ favor, finding that the company’s copyright was indeed infringed by Ross Intelligence’s actions.
Last month, Thomson Reuters achieved a significant legal victory by winning the first major AI copyright case in the United States. Notably, the court rejected the argument that using the content to train an AI model qualified as fair use. This case sets a precedent in the ongoing discussions surrounding copyright laws and artificial intelligence, and its outcome may influence how AI-generated content is treated under copyright law.
Microsoft guarantees legal protection for Copilot users from copyright lawsuits
Some customers are concerned about the risk of IP infringement claims if they use the output produced by generative AI. This is understandable, given recent public inquiries by authors and artists regarding how their own work is being used in conjunction with AI models and services. To address this customer concern, Microsoft is announcing our new Copilot Copyright Commitment. As customers ask whether they can use Microsoft’s Copilot services and the output they generate without worrying about copyright claims, we are providing a straightforward answer: yes, you can, and if you are challenged on copyright grounds, we will assume responsibility for the potential legal risks involved.
In mid-2023, Microsoft announced a "Copilot Copyright Commitment" to address customer concerns regarding potential copyright infringement when using its AI-powered tools. The commitment includes indemnity in cases where customers are sued for copyright infringement, provided they have implemented the necessary guardrails and content filters. The company acknowledges the need to respect authors' rights and aims to balance innovation with protecting creative works. This says something either about Microsoft having trained its foundational models entirely on copyright-free and licensed content, or about Microsoft believing its lawyers are better than everyone else's.
Meta's training of its AI with pirated LibGen books sparks legal and ethical debate
Court documents released last night show that the senior manager felt it was “really important for [Meta] to get books ASAP,” as “books are actually more important than web data.” Meta employees turned their attention to Library Genesis, or LibGen, one of the largest of the pirated libraries that circulate online. It currently contains more than 7.5 million books and 81 million research papers. Eventually, the team at Meta got permission from “MZ”—an apparent reference to Meta CEO Mark Zuckerberg—to download and use the data set.
The article discusses the ethical and legal implications of Meta's use of pirated books from Library Genesis (LibGen) to train its AI model, called Llama 3. Faced with high costs and slow licensing processes for acquiring legal texts, Meta employees opted to access LibGen, which contains over 7.5 million books and 81 million research papers. Internal communications revealed that Meta acknowledged the medium-high legal risks of this strategy and discussed methods to mask its activities, including avoiding the citation of copyrighted materials. The communications were part of a motion for partial summary judgment in a lawsuit against Meta. And Meta is not going quietly—in response to that motion, it has filed dozens of documents on the court docket.
Nvidia denies copyright infringement in use of shadow libraries for AI training
Nvidia seemed to defend the shadow libraries as a valid source of information online when responding to a lawsuit from book authors over the list of data repositories that were scraped to create the Books3 dataset used to train Nvidia's AI platform NeMo. That list includes some of the most "notorious" shadow libraries—Bibliotik, Z-Library (Z-Lib), Libgen, Sci-Hub, and Anna's Archive, authors argued. However, Nvidia hopes to invalidate authors' copyright claims partly by denying that any of these controversial websites should even be considered shadow libraries.
Nvidia is the company making the news for creating the GPUs that are so popular with companies training foundational models. In creating its own model, Nvidia says that using "shadow libraries" like Z-Library and Library Genesis does not necessarily violate copyright law, and that its AI training process is a "highly transformative" fair use of the content. On the other hand, the authors have argued that the AI models are derived from the protected expression in the training dataset without their consent or compensation. Nvidia's position seems pretty gutsy: admitting that it is using copyrighted content while arguing that such use is okay. A ruling against the company would take a big bite out of its sky-high stock market valuation. The case is currently in the discovery phase.
Getty Images launches AI image generator using its licensed library
Generative AI by Getty Images (yes, it’s an unwieldy name) is trained only on the vast Getty Images library, including premium content, giving users full copyright indemnification. This means anyone using the tool and publishing the image it created commercially will be legally protected, promises Getty. Getty worked with Nvidia to use its Edify model, available on Nvidia’s generative AI model library Picasso.
In 2023, Getty Images launched an AI image generation tool that uses its vast library of licensed images. The company says users of its output have full copyright indemnification for commercial use. Developed in partnership with Nvidia (yes—the same Nvidia mentioned in the article above) and leveraging the Edify model, this tool allows users to create images while being protected legally. Getty plans to compensate creators whose images are used to train the AI model and will share revenues generated from the tool. Unlike traditional stock images, AI-generated photos will not be included in Getty's existing content libraries.
This Week I Learned: "But where is everybody?!?" — the origins of Fermi's Paradox
The eminent physicist Enrico Fermi was visiting his colleagues at Los Alamos National Laboratory in New Mexico that summer, and the mealtime conversation turned to the subject of UFOs. Very quickly, the assembled physicists realized that if UFOs were alien machines, that meant it was possible to travel faster than the speed of light. Otherwise, those alien craft would have never made it here. At first, Fermi boisterously participated in the conversation, offering his usual keen insights. But soon, he fell silent, withdrawing into his own ruminations. The conversation drifted to other subjects, but Fermi stayed quiet. Sometime later, long after the group had largely forgotten about the issue of UFOs, Fermi sat up and blurted out: “But where is everybody!?”
This retelling of the Fermi Paradox comes from a story about why, despite the vastness of the universe, we have yet to encounter evidence of extraterrestrial civilizations. Enrico Fermi famously posed the question, "Where is everybody?", suggesting a disconnect between the expectation of abundant intelligent life and the lack of observable evidence. The concept of the Great Filter is introduced, proposing that there may be significant barriers preventing intelligent life from becoming spacefaring. The article goes on to speculate where we are relative to the "Great Filter"—are we past it, or is it yet in front of us? In other words, have we survived the filter, or is our biggest challenge ahead of us?
What did you learn this week? Let me know on Mastodon or Bluesky.