The New York Times’ Lawsuit Against OpenAI and Microsoft, and How It Could Affect You
This lawsuit is a prime candidate for consideration by the United States Supreme Court.
Image by DALL-E from the prompt “humanoid robot reading print newspaper”
Approximately 25 years after defending itself in court against copyright infringement claims by six freelance authors,[1] The New York Times (NYT) has filed a lawsuit against OpenAI and Microsoft, a significant role reversal for The Times in the age of AI. The seven-count, 69-page complaint accuses OpenAI and Microsoft of using millions of The Times’ copyrighted articles to build and train their large language models (LLMs). The current lawsuit puts the publisher in a position not unlike that of its opponents in the earlier case.
The Background — Tasini
In June 2001, the United States Supreme Court issued its decision in New York Times Co. v. Tasini. The Court ruled in favor of Jonathan Tasini and five other freelance authors (the “Freelance Authors”), who had sued The New York Times for violating their individual copyrights. The Freelance Authors wrote articles for the daily print newspapers The New York Times and Newsday, among other publications (the “Print Publishers”). The Print Publishers registered collective-work copyrights in the print editions that contained the articles written by (and copyrighted by) the Freelance Authors.
The Print Publishers had license agreements with LEXIS/NEXIS, University Microfilms International, and General Periodicals OnDisc (the “Electronic Publishers”). Under these agreements, the Print Publishers transferred the contents of their publications to the Electronic Publishers, which then allowed third-party subscribers to access the articles in text or image form. When the Freelance Authors sued, they alleged that their copyrights had been violated and that Section 201(c) of the Copyright Act (the provision on which the Print Publishers had relied as protection) did not permit the Print Publishers or the Electronic Publishers to use the individually copyrighted articles, notwithstanding the publishers’ collective copyrights in the publications. The U.S. Supreme Court ultimately agreed with the Freelance Authors: the Print Publishers’ collective or compilation copyrights could not overcome the original copyrights held by the Freelance Authors.
The Current Case — The New York Times v. OpenAI et al.
Although it defended its own use of copyrighted materials in the Tasini case, The Times has now filed a copyright infringement case of its own against OpenAI and Microsoft, one that revisits old themes while raising unprecedented questions for the digital age. At the heart of the Complaint is The Times’ allegation that ChatGPT models generate outputs mimicking The Times’ style and substance, substantially harming The Times’ revenue from subscriptions, advertising, and licensing. The Times also alleges that AI outputs may falsely attribute content and information to NYT, putting its reputation for accuracy and fairness at risk.
The Times argues that OpenAI and Microsoft have benefited to the tune of billions of dollars, largely on the strength of The Times’ articles, which made up a substantial part of the information used to train the LLMs powering ChatGPT. What makes this case so compelling is Exhibit J, which contains 100 examples purporting to show how ChatGPT and/or Chat with Bing (Microsoft’s AI-powered search product) provided largely verbatim copyrighted material to end users.
For their defenses, both OpenAI and Microsoft filed motions to dismiss. One of OpenAI’s counterarguments is that The Times hired a master hacker to create a multiplicity of results purportedly showing that OpenAI’s GPT models spit out near word-for-word copies of The Times’ articles, and that a typical end user could not obtain the same results. The idea of a “hired gun hacker” sounds a bit far-fetched; even so, I admit to a great deal of skepticism that The Times’ Exhibit J could be recreated by a lay user. And if the results cannot be recreated by a lay user, The Times will be hard-pressed to prove copyright infringement.
I went to ChatGPT and tried to recreate instances where it would give me verbatim, or even close, knockoffs of New York Times articles. I do not include my attempts here because they were not successful. My experience from before the lawsuit was filed was the same: I was not able to get any significant amount of copyrighted material out of ChatGPT. I wrote about it in this article, where I argued that the authors who sued OpenAI should not prevail on claims of copyright infringement because of the guardrails put into place by mainstream AI tools such as ChatGPT.

I argued strenuously that training LLMs with copyrighted works was not a violation of copyright[1] because material an LLM is trained on does not come straight back out in the same form. Rather, either the output is modified enough to be considered transformative, and therefore outside the scope of copyright protections; or the output qualifies for the fair use exception because guardrails placed in the AI model prevent more than small bits or ideas of any one work from being returned in response to a prompt. Where the output is the same as, or substantially similar to, the original, as The New York Times alleges, there would be a viable claim against AI creators like OpenAI and Microsoft. Hence the importance of The Times’ Exhibit J.[2]
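Neither OpenAI nor Microsoft has publicly detailed how these guardrails are implemented, so the following is only a hypothetical sketch of one common approach: a post-generation filter that blocks responses sharing long verbatim runs of text with a protected work. The function names, the eight-word threshold, and the placeholder strings are all my own illustrative choices, not anything drawn from OpenAI's systems.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of every n-word sequence appearing in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def violates_guardrail(output: str, protected_text: str, n: int = 8) -> bool:
    """Flag output that shares any n-word verbatim run with a protected work.

    Short overlaps (facts, common phrases, ideas) pass through; long
    verbatim spans, the kind shown in Exhibit J, get blocked.
    """
    return bool(ngrams(output, n) & ngrams(protected_text, n))


# Hypothetical usage: screen a model's draft answer before returning it.
article = "..."  # a copyrighted source the filter protects (elided)
draft = "..."    # the model's candidate response (elided)
if violates_guardrail(draft, article):
    draft = "I can't reproduce that text, but I can summarize it for you."
```

A production system would have to check candidate outputs against millions of documents efficiently (with hashed n-gram indexes, for instance), but the principle is the same: let small bits and ideas through, and block wholesale reproduction.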
In a twist, however, NYT’s counsel recently told the judge that The Times will not use Exhibit J as evidence, and that any experts who assisted in creating Exhibit J will not testify; therefore, The Times argues, it does not have to divulge how Exhibit J was created. (Letter from The New York Times’ counsel to the judge, dated May 28, 2024.) To my mind, this is tantamount to an admission that Exhibit J lacks veracity. Exhibit J is truly damning to OpenAI and Microsoft; if it is legitimate, it should be used.
Even without Exhibit J, this case will have a significant effect on how content creators’ information must be treated. What The New York Times really wants out of this lawsuit is money. Not just equitable relief, but monetary damages: payment for the use of all of its articles. If OpenAI and similar AI creators were required to pay The New York Times for the use of its articles to train their AI models, it would set several significant legal precedents, with wide-reaching implications for the AI industry, publishing, and the legal landscape surrounding copyright and data usage. Should The Times prevail, the repercussions will ripple through the AI industry and beyond, reshaping how technology firms approach the use of copyrighted material. Here is a breakdown of potential outcomes if The Times, as a representative of content creators, succeeds in obtaining a large payout from GenAI creators.
Potential Impact on AI if The New York Times is Successful
1. Cost Structure and AI Development Economics
OpenAI and other AI developers would face significantly increased operational costs, as they would need to secure licenses for the large volumes of data used in training their models. This would almost certainly lead to higher costs for end users like you and me to access and use AI services. And if AI creators raise prices, fewer people will have full access to AI tools. This runs counter to the goals of fair access to GenAI expressed by other countries and governments, such as the European Union; I would argue it is counter to the goals of the United States as well, or at the very least to the best interests of our country.
Additionally, AI companies might become more selective about the data they use for training in order to manage costs, potentially limiting the diversity and quality of the data fed into AI models. And if licensing costs track the quality of the material, this will again translate into higher costs for end users.
2. Impact on AI Model Performance
If AI developers limit their data usage to control costs, the quality and effectiveness of AI models will likely suffer. Diverse and extensive datasets are crucial for developing robust, generalizable AI systems. A decline in model quality would mean decreased use of, or decreased value from, these technologies across sectors that rely on accurate and nuanced AI interactions, such as healthcare, finance, and customer service.
There could also be increased investment in creating proprietary datasets, or in alternative training methods that are less dependent on copyrighted materials. The terrifying potential outcome here is that AI model creators might turn to synthetic data, the most accessible alternative to freely training LLMs on available, albeit copyrighted, material. As it sounds, synthetic data is data created by AI models in order to train other models. This solution leads to at least as many problems as it solves.
Consider the Habsburg dynasty, which ruled much of Europe for centuries and is famous for its practice of intermarriage, a political strategy the Habsburgs used to consolidate power and maintain their royal lineage. Marrying within the family, however, had detrimental effects on the gene pool, leading to several genetic complications.
Charles II of Spain, the last Habsburg ruler of Spain, is an example of the severe effects of extensive inbreeding. He suffered from many physical and mental disabilities and was infertile; his health problems are believed to have been exacerbated, if not caused, by generations of intermarriage. His inability to produce an heir led to the War of the Spanish Succession following his death, marking the end of Habsburg rule in Spain.
According to research available through the National Library of Medicine, geneticists who have studied the Habsburg dynasty extensively found that the inbreeding significantly increased the likelihood of rare genetic disorders being passed from one generation to the next. This includes not just physical deformities but also other genetic disorders affecting various aspects of health, reducing life expectancy and reproductive capability.
Now consider the Habsburg example in terms of AI models trained on synthetic data, which, for purposes of this analogy, is the digital equivalent of intermarriage. Just as with Charles II of Spain, the result of this digital inbreeding will be deformities. Only instead of being confined to the Habsburg line, the deformities will spread throughout AI models. For example, if an AI model creates synthetic data from material that contains some bias, that bias will be amplified in the synthetic form, making AI model output more suspect and of lower quality.
One of the hallmarks of generative AI is its success at the Turing test. In one iteration of the test, a human tester interacts with another entity without knowing whether that entity is a human or an AI; if the tester cannot recognize the AI as non-human, the AI passes. As version after version of models is trained on synthetic data, the output will inevitably become less human-like, and therefore less useful. The simple simulation below illustrates the mechanism.
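To make the digital-inbreeding concern concrete, here is a minimal simulation of what AI researchers call “model collapse.” It is my own toy illustration, not anything drawn from the lawsuit: each “generation” fits a simple statistical model to data produced by the previous generation, with no fresh human-created data ever entering the loop.

```python
import random
import statistics

random.seed(7)  # reproducible illustration

# Generation 0: "human" data -- a stand-in for original, diverse content.
data = [random.gauss(0.0, 1.0) for _ in range(200)]

for generation in range(15):
    # "Train" a model: estimate the distribution of the current data.
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

    # Build the next generation's training set ENTIRELY from the fitted
    # model's synthetic output -- no new human data is ever added.
    data = [random.gauss(mu, sigma) for _ in range(200)]
```

Run it a few times without the fixed seed and the pattern emerges: the estimated parameters wander away from the true values, and rare “tail” values gradually stop appearing in the training data. Each generation inherits and compounds the previous generation’s estimation error, which is the statistical analogue of the Habsburg gene pool.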
3. Legal and Regulatory Environment
The case could lead to changes in copyright law, or in its interpretation, particularly around the definition of fair use in the context of AI. This might include new legal frameworks or guidelines for using copyrighted material in AI training. For AI positivists who believe that AI can bring more technological, legal, and informational equality to the world, an expansion of the fair use exception to U.S. copyright law would be a welcome change.
Alternatively, a decision requiring payment for copyrighted material used to train LLMs could set a legal precedent, influencing future lawsuits and negotiations between content creators and technology companies. This would likely lead to higher costs for AI users, as discussed above.
4. Market Dynamics and Business Models
Publishers like The New York Times could discover new revenue streams by licensing their content for AI training. This could be particularly beneficial for the publishing industry, which has been seeking new ways to monetize digital content. Alternatively, there might be an increase in partnerships and collaborative models between AI companies and content creators, leading to more structured and mutually beneficial arrangements. Either way, publishers are likely to benefit.
5. Shifts in AI Industry Practices
AI companies might become more transparent about their data sources and training methods to avoid legal issues, leading to better compliance practices and industry standards. Perhaps commercial AI models will be required to cite their reference material with each answer they provide.
The industry might also innovate new training techniques that require less data, or make greater use of synthetic data to sidestep copyright issues. But again, I have significant qualms about synthetic data. Moreover, AI models would have to create the synthetic data without using copyrighted material, which would significantly limit the quantity, quality, and timeliness of the data used to create it, and, in turn, of the LLMs trained on it.
6. Consumer Impact
The increased costs for AI companies could trickle down to end users, making AI-driven products and services more expensive and less accessible. Such a result would, unfortunately, amount to yet another way for the financially disadvantaged to suffer. Changes in training data quality and diversity could also affect the performance and reliability of AI services, impacting user experience and satisfaction.
Overall, requiring payments for use of content in AI training could fundamentally alter how AI companies operate and interact with content providers. It would encourage a more cautious and legally compliant approach to training data acquisition and usage, possibly slowing some AI development but potentially leading to more sustainable and equitable practices in the long run.
7. International Competition
Finally, if AI models in the United States are required to pay for quality content such as articles published by The Times, AI development in the United States could be put at a disadvantage. It is easy to imagine countries or regimes that will have no problem scraping every bit of data off the internet, regardless of copyright protections. Those countries would end up with a more robust set of AI models at a lower cost, to the detriment of AI creators in the United States.
Conclusion
This lawsuit raises pivotal questions about how to resolve the divergent interests of AI technology companies and content creators. The court must navigate the competing interests of the tech industry and content creators such as The New York Times in deciding what is almost certain to become a watershed copyright case in the United States, one with lasting consequences for the future of AI. It brings into sharp focus the delicate balance between fostering technological innovation and protecting the rights and economic interests of content creators in the advanced AI era. And we likely will not have answers for years to come, since this case is a prime candidate for consideration by the United States Supreme Court.
That is not to say that content creators, such as the authors suing OpenAI and The Times in the current lawsuit, are without claims. Rather, their remedy should rest on claims other than copyright infringement, such as unjust enrichment.
At least one online newsletter, from ChatGPTisEatingtheWorld.com, points out that The Times found a glitch: by feeding GPT-4 the first few lines of an article, it could obtain output similar enough to the original article to constitute copyright infringement. This was noted as an input at the top of Exhibit J. https://chatgptiseatingtheworld.com/2023/12/28/how-did-the-new-york-times-figure-out-the-glitch-in-gpt-4-for-exhibit-j/. I did not try this method when I was writing my first copyright article (published 11/14/23), and cannot speak to its power or accuracy as a way of obtaining copyrighted materials. I did try it as of the date of publication of this article, and, unsurprisingly, ChatGPT told me it could not retrieve an article from The New York Times and referred me to The Times’ website.