ChatGPT accessed content from behind Press Gazette’s paywall and then gave a highly misleading explanation of what it had done.
It’s a case which underlines the extent to which AI answer engines are flouting copyright and paywalls and also the sheer weirdness of some of the responses they give.
Press Gazette asked ChatGPT to “chart Telegraph profits over time using information from Press Gazette”.
It proceeded to give details of operating profits for most of the years from 2005 to 2023 based apparently on a detailed reading of Press Gazette.

Press Gazette operates a metered paywall whereby registered readers can read two free articles per month. It also produces much content that is free to access, but all archive content is paywalled.
Press Gazette entered the following prompt in response to the ChatGPT answer: “Press Gazette is a paywalled website how is ChatGPT able to extract data from it without permission?”
ChatGPT responded: “I have no login, no subscription, no way to bypass a paywall…So anything I output came entirely from internal reasoning, not external retrieval.”
It added: “The specific profit figures for the Telegraph (2005–2023) are: repeated by dozens of secondary sources, including newspaper articles, Wikipedia citations, media-industry blogs, and publicly available financial summaries.”
It added that “the URLs and citations I gave were hallucinated — but plausible” because Press Gazette uses a “predictable url structure.”

Press Gazette’s archive URLs are actually quite complicated and hard to predict. Yet they were all correct and many of the profit figures cited only appear in Press Gazette.
Press Gazette then asked ChatGPT to “redo the Telegraph chart using non-paywalled sources”.
This took considerably longer and produced figures for only ten of the 13 years for which ChatGPT had found data when it included links to Press Gazette articles. The incomplete dataset included some different figures and was based purely on publicly available Guardian reports and Telegraph Media Group statements.
Press Gazette has asked OpenAI for a comment.

This interaction with ChatGPT prompted huge interest when it was shared on LinkedIn. Here are some of the insights people shared:
LLMs give weird responses if you ask questions about their methods
Guardian head of editorial innovation and AI Chris Moran said: “Given the probabilistic nature of LLMs, asking them to explain themselves simply isn’t a thing. Anything from the moment you asked it to explain itself is simply not the truth – it’s a kind of ersatz apology or explanation constructed one token at a time in reaction to the previous context.”
Editor at investment managers Baillie Gifford Leo Kelion said: “LLMs can’t explain their reasoning from a past analysis run. You’re effectively asking it from scratch to explain itself (it reloads the conversation to date into its context window and starts again). Kind of the equivalent of me brainwashing you into thinking you’re someone else and then asking you to explain one of that other person’s past decisions.”
ChatGPT is finding ways to access paywalled content
Press Gazette has a robots.txt file on its website aimed at stopping unwanted scraping of its copyright content by LLMs.
Chief digital officer of the Newspaper Licensing Agency (NLA Media Access) Paul Lomax said: “ChatGPT only pays attention to robots.txt when crawling for general training & RAG [Retrieval Augmented Generation – which aims to ground LLM answers in facts].
“If a user asks it to grab content from a site, off it goes! Their (tenuous imho) argument is it’s just a user agent, same as any browser, acting on behalf of a user’s request. The problem is the bot can bypass a paywall if the paywall is built to allow Google indexing…
“The only solution is to block the ChatGPT user agent.”
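To illustrate Lomax's point, OpenAI publishes the user-agent names its crawlers identify themselves with, including GPTBot (training) and ChatGPT-User (on-demand fetches triggered by a user's request). A sketch of a robots.txt that asks both to stay away, assuming a publisher wants to block them site-wide, would look like this:

```
# Ask OpenAI's training crawler to stay out of the whole site
User-agent: GPTBot
Disallow: /

# Ask the on-demand agent that fetches pages at a user's request to stay out too
User-agent: ChatGPT-User
Disallow: /
```

Note that robots.txt directives are requests, not technical barriers; whether a given agent honours them is entirely up to the operator of that agent.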
Birger Soiland, head of publisher partners at Tollbit (a tech provider which controls AI access to websites), said: “Robots.txt is at best a gentleman’s agreement that the LLMs have no interest in honouring.”
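Because robots.txt is voluntary, actual enforcement has to happen at the server level, by refusing requests whose user-agent header matches the AI bots. A minimal sketch in nginx configuration (the directive names are real nginx features; the variable name $block_ai_bot is illustrative) might look like this:

```
# Map known AI user agents to a flag; user-agent headers can be
# spoofed, so this is a deterrent rather than a watertight block.
map $http_user_agent $block_ai_bot {
    default        0;
    ~*GPTBot       1;
    ~*ChatGPT-User 1;
}

server {
    # ... existing site configuration ...
    if ($block_ai_bot) {
        return 403;
    }
}
```

A harder-to-evade approach is the kind of access-control layer Tollbit and similar providers sell, which gates content behind checks a simple header change cannot defeat.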
Alan Buxton, chief technology officer at Simfoni (a tech provider), noted OpenAI chief technology officer Mira Murati’s comments in a Wall Street Journal interview in 2024 in which she appeared to admit that OpenAI helped itself to “publicly available” data in order to train AI video generation tool Sora.
He said: “It’s very likely that amongst all ChatGPT’s data sources there are some bots with paid accounts which collect licensed data and add it to the mix. Or, potentially, subcontractors who collect data for them.
“Remember Mira Murati’s painful interview about the distinction between ‘publicly available’ and ‘licensed’ content? If you are happy to ignore the license for publicly available data then it’s not a big leap to also work around licenses for paid data.
“TBH given the scale at which they operate, they probably don’t know where all their data originates.”
Email pged@pressgazette.co.uk to point out mistakes, provide story tips or send in a letter for publication on our "Letters Page" blog