Issue 99: Copyright for Generative Artificial Intelligence (ChatGPT, DALL·E 2, and the like)
This issue is offered in honor of Cecil Mae Thornburg Feather, my mother-in-law. Cecil Mae was a wonderful person. I only knew her a short time as I married into the Feather family, and that time was filled with love and joy. She enjoyed playing piano and teaching students how to play piano. My own two children spent summers in their Hickory, North Carolina, home and came back with new tunes on their fingers and new stories in their hearts. I remember her warm smile and even warmer hugs. She taught me that southern hospitality is not only a stereotype but a perspective to be admired and modeled wherever I am. If I may borrow from the Jewish tradition, may her memory be a blessing to all who knew her.
This week we look at the intersection of the hot topic of artificial intelligence (AI) and copyright law. Can works created by an AI algorithm be copyrighted? Do the creators of AI models have an obligation to respect the copyright of works they use in their algorithms?
The rush of new AI tools to the public has quickly inflamed these questions. There seems to be little doubt that the output of AI algorithms cannot be copyrighted. There is little clarity about the legality of AI algorithms using copyrighted material.
- Copyright is for humans
- Copyright Office rejects AI Art
- What is a "large language model" (LLM) artificial intelligence system?
- How do LLMs work with images?
- Models of unimaginable complexity
- Getty Images goes after Stable Diffusion
- Maybe it isn't so magical after all?
- Open source coders sue GitHub owner Microsoft and Microsoft's partner OpenAI
There is really much more to be said on the topic, but this will do for one Thursday Thread. Let me know if you have seen other angles that you think should be more broadly known.
Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ's Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.
Copyright is for humans
It might be useful to start here—copyright is recognized as a protected right for humans only. In this case, PETA argued on behalf of a selfie-snapping Indonesian monkey named Naruto that the monkey held the copyright to an image. (Slater, a photographer, left his equipment unattended, and Naruto snapped the picture.) The court held "that the monkey lacked statutory standing because the Copyright Act does not expressly authorize animals to file copyright infringement suits."
Discussion of whether the output of a generative AI system can itself be copyrighted hinges on Naruto v. Slater, and most everyone I've read says that the result of an algorithm similarly can't be copyrighted.
Copyright Office rejects AI Art
As expected, the U.S. Copyright Office rejected an application for a work from an algorithm. In fact, the Copyright Office has started the process of revoking a previously granted copyright for an AI-generated comic book.
I'm starting here because it is helpful to know whether the output of an AI system can be copyrighted when we later look at the use of copyrighted sources in AI.
What is a "large language model" (LLM) artificial intelligence system?
The type of artificial intelligence that has been of great interest recently is the "large language model". Simplistically: these systems analyze tremendous amounts of text (the entire contents of Wikipedia, all scanned books that can be found, archives of Reddit, old mailing lists, entire websites... basically, anything written on the internet) and derive a mathematical model for determining sequences of words. Then, when fed a string of words as a prompt, the model consults those statistics to predict what comes next. (Powerful stuff! I recommend reading Shanahan's 12-page arXiv paper to get a fuller sense of what LLMs are about.)
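To make the "what comes next" idea concrete, here is a minimal sketch using word-pair counts. This is an illustration only, and a deliberately crude one: a real LLM uses a neural network with billions of parameters trained on vastly more text, not a lookup table of word pairs.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "anything written on the internet".
corpus = "the cat sat on the mat and the cat slept on the mat and the cat purred".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_word(word):
    """Return the word most often observed after `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(next_word("the"))  # "cat" follows "the" more often than any other word
```

Repeatedly appending `next_word(...)` to a prompt is, in spirit, what text generation does, just with a statistical model unimaginably richer than this one.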
We see the output of that in text form with ChatGPT. But what about the image-generating systems?
How do LLMs work with images?
For images, the algorithm links text descriptions found on the internet (such as HTML "alt" attributes or the text surrounding an image in a catalog) to the images they describe, and it uses those text-to-image pairings to generate new images from a prompt.
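As a rough illustration of where such pairings could come from (my own sketch, not any vendor's actual data pipeline), this short script collects (image URL, alt text) pairs out of HTML using only the Python standard library:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from HTML img tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            # Keep only images that carry a textual description.
            if attrs.get("src") and attrs.get("alt"):
                self.pairs.append((attrs["src"], attrs["alt"]))

html = '<p>A photo: <img src="cat.jpg" alt="a tabby cat on a windowsill"></p>'
collector = AltTextCollector()
collector.feed(html)
print(collector.pairs)  # [('cat.jpg', 'a tabby cat on a windowsill')]
```

Run at web scale across billions of pages, pairs like these become the training data that teaches an image model what words like "tabby cat" look like.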
So, going back to Naruto v. Slater, we're pretty sure that the output of these algorithms and statistical models can't be copyrighted. But were the copyright holders' rights violated when their text and images were used to build the statistical models? That is the heart of the debate happening now.
Models of unimaginable complexity
Returning to Shanahan's paper (see, I told you it was worth reading), we learn that the algorithms are more than just copy-and-paste. That is what makes them seem so magical. Is that magic creating a new derivative work?
Most of the lawsuits probing this question seem to be happening with images and software code. For example, this one from Getty Images.
Getty Images goes after Stable Diffusion
A rich source of images and descriptions about images can be found in the Getty Images catalog.
The algorithm is so uncanny that it reproduces what looks like the Getty Images watermark in the derived image. Getty Images alleges three things.
- Removed/altered Getty Images' "copyright management information" (the AI-generated visible watermark resembles that of Getty Images, so these photos must have been taken from them)
- False copyright information (modification of the photographer's name)
- Infringing on trademark (a very similar watermark implies Getty Images affiliation)
The case is in front of the U.S. District Court in Delaware.
Maybe it isn't so magical after all?
This article summarizes the findings of researchers investigating whether it was possible to get the LLM algorithms to return known images from the training dataset. With a unique enough prompt and training data set: yes, that seems quite possible.
Open source coders sue GitHub owner Microsoft and Microsoft's partner OpenAI
Microsoft bought GitHub in 2018 and Microsoft is a major investor and user of OpenAI's LLM technology. Copilot is a new feature in GitHub that generates code snippets based on the open source code files uploaded to GitHub and a prompt from the user. (Sound familiar?) The software developers claim that Microsoft's use of the code files violates the terms of open source license agreements. This is a new case, and it is one to watch to see how copyright and license terms intersect with large language models.
Roaar?
A man and his cat. Is there any more to life?