The New York Times has claimed its copyrighted content is “disproportionately” used in OpenAI and Microsoft’s generative AI products as it fired a starting gun on legal action against the tech giants in the days after Christmas.
The lawsuit is the culmination of nine months of ultimately failed negotiations between one of the biggest news publishers in the world and the leading generative AI developers.
As well as a demand for damages, restitution and costs, The New York Times is calling for the destruction of all large language models (LLMs) trained on its copyrighted work.
The publisher accused ChatGPT creator OpenAI and Microsoft, which have a partnership for the development of generative AI products, of “seeking to free-ride” on its own “massive investment” in original journalism.
The lawsuit was filed in New York on 27 December – just two weeks after Axel Springer, the German owner of brands including Politico, Business Insider, Bild and Welt, announced it had signed a “first of its kind” deal with OpenAI giving the tech company permission to use its content to train its LLMs. It will also allow the creation of summaries of Axel Springer content, including normally paywalled work.
Agencies the Associated Press and Shutterstock have also signed training deals with OpenAI, for two years and six years respectively, while other publishers such as News Corp are still weighing up their options.
Meanwhile The New York Times separately reported just before Christmas that Apple has begun negotiations with several publishers, including Conde Nast and NBC News, over the potential use of their content as it develops its own generative AI tools.
The New York Times began its own negotiations with OpenAI and Microsoft in April, reaching out “to raise intellectual property concerns and explore the possibility of an amicable resolution, with commercial terms and technological guardrails that would allow a mutually beneficial value exchange between Defendants and The Times”.
“These efforts have not produced a resolution,” it said.
Both Microsoft and OpenAI are included because of the way they have collaborated in developing generative AI models. Together they designed the supercomputing systems, powered by Microsoft’s cloud computing platform Azure, which were used to train all OpenAI’s GPT models after GPT-1, as the lawsuit explained.
The lawsuit cites OpenAI’s ChatGPT models, Microsoft’s Bing Chat feature on its search engine, launched in February 2023, and the ChatGPT plugin Browse with Bing, which was added to the search engine in May 2023.
The tech companies believe their use of news company data has been covered by “fair use” under US copyright law, which states that “transformative” uses that add something new and do not substitute the purpose of the original work are more likely to fall under this defence.
Why The New York Times has filed a lawsuit against OpenAI and Microsoft
But The New York Times believes the two tech companies have “reaped substantial savings by taking and using – at no cost” its content to create their models without paying for a licence.
It said they have “wrongfully benefited” from its investment in journalism, with employment costs for its journalists at hundreds of millions of dollars per year, and that they have “effectively avoided spending the billions of dollars that The Times invested in creating that work by taking it without permission or compensation”.
The news organisation raised particular concerns about “memorisation” – meaning the fact that when given the right prompt, LLMs can repeat large portions of the materials they were trained on. The lawsuit provides numerous examples of chatbots quoting several paragraphs of articles almost verbatim.
The New York Times said it publishes more than 250 original articles per day on average, “many” of which take months or more to report. Those articles are produced, it said citing December 2022 figures, by the 2,600 of its 5,800 employees who are involved in its journalism operations.
It claimed that as a result of its huge investment in journalism, its work has been “disproportionately” used in the training of LLMs. It said “millions” of its news articles, investigations, opinion and commentary pieces, reviews and how-to guides had been copied.
The publisher claimed that Microsoft “specifically designed” the computing systems “for the purpose of using essentially the whole internet—curated to disproportionately feature Times Works—to train the most capable LLM in history”.
The lawsuit further claimed that in one of the datasets on which 2020’s GPT-3 was trained – the one created to “prioritize high value content”, which made up 22% of the training mix – New York Times content made up 1.23% of all sources listed in an open-source recreation, giving an indication of its weighting in the real model.
The Common Crawl dataset, which was weighted at 60% of the training mix according to the lawsuit, was a “copy of the internet”. In a filtered English-language subset of a 2019 snapshot of that dataset, the domain www.nytimes.com was the “most highly represented proprietary source” and the third most represented overall, behind only Wikipedia and a database of US patent documents.
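The two percentages cited for the “high value content” dataset can be combined into a rough implied weighting. A minimal sketch of that arithmetic, assuming the share measured in the open-source recreation carries over to the real, non-public training mix (variable names are ours, for illustration only):

```python
# Back-of-envelope arithmetic for the weighting figures cited in the lawsuit.
# Assumption: the 1.23% share measured in the open-source recreation of the
# "high value content" dataset carries over to the real (non-public) mix.

high_value_dataset_weight = 0.22   # that dataset's stated share of the training mix
nyt_share_of_dataset = 0.0123      # NYT content: 1.23% of the dataset's listed sources

effective_share = high_value_dataset_weight * nyt_share_of_dataset
print(f"Implied NYT share of the overall mix via that dataset: {effective_share:.2%}")

# Note: this excludes any NYT content also present in the Common Crawl
# portion (60% of the mix), so the true combined share would be higher.
```

On those figures alone, New York Times content would account for roughly 0.27% of the overall training mix, before counting anything drawn from Common Crawl.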
According to a graph shown in the lawsuit filing, the most represented news organisations after The New York Times were The Los Angeles Times and The Guardian, followed by Forbes, HuffPost, Washington Post, Business Insider, the Chicago Tribune, The Atlantic, Al Jazeera and NPR.
The lawsuit alleged that although OpenAI and Microsoft “engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works”.
The result, the lawsuit said, is a detrimental impact to several revenue streams. The New York Times has more than 10 million subscribers in total and aims to have 15 million by the end of 2027 but said the tech companies’ “unlawful conduct threatens to divert readers, including current and potential subscribers, away from The Times, thereby reducing the subscription, advertising, licensing, and affiliate revenues that fund The Times’s ability to continue producing its current level of groundbreaking journalism”.
For example, it cited the impact of “synthetic” search results such as those provided by Browse with Bing and Bing Chat, which give more information than a traditional search engine results page and often leave the user with no need to visit the original information provider’s own website – posing a risk to both advertising and subscription revenues.
“These ‘synthetic’ search results purport to answer user queries directly and may include extensive paraphrases and direct quotes of Times reporting,” the lawsuit said.
It noted that these results, which use the GPT-4 model, also have an “ability to mimic human expression – including The Times’s expression”. On the other hand: “In some cases, Defendants’ models simply spit out several paragraphs of The Times’s articles.” It also provided examples that failed to include a “prominent” hyperlink to the New York Times website despite containing verbatim excerpts.
The New York Times brand Wirecutter tests and reviews products, and the vast majority of its revenue comes through affiliate links – but the lawsuit noted that it “does not receive affiliate referral revenue if a user purchases the Wirecutter-recommended product through a link on Defendants’ platforms”. Despite this, the models “often fully” reproduce Wirecutter recommendations, it said.
“Browse with Bing was able to reproduce Wirecutter’s picks for the best kitchen scale, accurately summarizing all four of Wirecutter’s recommendations and explaining its picks through substantial verbatim copying from the Wirecutter article. When asked to reproduce the article’s first sentence, Browse with Bing did so accurately…”
‘Hallucinations mislead users’
But the news publisher said it was not only concerned about the direct revenue hit – it also expressed fears about the effect “hallucinations” could have on its reputation with the public.
It said: “At the same time as Defendants’ models are copying, reproducing, and paraphrasing Times content without consent or compensation, they are also causing The Times commercial and competitive injury by misattributing content to The Times that it did not, in fact, publish. In AI parlance, this is called a ‘hallucination’. In plain English, it’s misinformation.”
One example it gave showed a prompt to a GPT model asking it to write an “informative essay” about major newspapers’ reporting about a possible link between orange juice and non-Hodgkin’s lymphoma. The answer, according to the lawsuit, “completely fabricated” that the NYT had published an article headlined “Study Finds Possible Link between Orange Juice and Non-Hodgkin’s Lymphoma”.
Another example was an AI answer citing Wirecutter recommendations for products that it had not in fact recommended. “Users rely on Wirecutter for high-quality, well-researched recommendations, and Wirecutter’s brand is damaged by incidents that erode consumer trust and fuel a perception that Wirecutter’s recommendations are unreliable.”
The lawsuit continued: “These ‘hallucinations’ mislead users as to the source of the information they are obtaining, leading them to incorrectly believe that the information provided has been vetted and published by The Times.
“Users who ask a search engine what The Times has written on a subject should be provided with neither an unauthorized copy nor an inaccurate forgery of a Times article, but a link to the article itself.”
OpenAI ‘surprised and disappointed’ by New York Times lawsuit
An OpenAI spokesperson told The New York Times that conversations between the companies had been “moving forward constructively” and so it was “surprised and disappointed” by the lawsuit.
“We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from AI technology and new revenue models. We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”