Major news publishers block the bots as ChatGPT starts taking live news

Independent Publishers Alliance urges members to block GPTBot and Google Bard crawler ASAP.

ChatGPT blocked. Picture: Shutterstock/LanKS

ChatGPT’s threat to news publishers looms larger than ever as it prepares to start reading up-to-date news stories instead of relying on a database that has not been updated in two years.

The UK’s Independent Publishers Alliance is urging its members to block crawling access for OpenAI and Google as soon as possible while an AI strategist told Press Gazette it is a “tricky time” for publishers – especially if they are expected to opt out of each generative AI company separately.

Until now OpenAI’s ChatGPT was only able to use information up to September 2021, the cut-off date for its training database.

But paying ChatGPT Plus and Enterprise users can now get “current and authoritative information” in answers from the chatbot and this will be expanded to all users “soon”. OpenAI also promised to provide “direct links to sources”.

The change will mean users can ask ChatGPT questions relating to current affairs, with answers likely drawing on content from news publishers across the world, who will lose out on traffic if people find out what they want to know without ever having to go to the original source. It could prove an extension of the rise in “zero click searches”, in which search engine results pages give users the answers they want directly without them needing to click through to articles that may have originated the information.

The move comes as publishers continue to grapple with whether to block ChatGPT’s bot, and equivalent crawlers from the likes of Google and Bing, from using their content to train AI models.

OpenAI first told publishers how to opt out of scraping in August while in the past few weeks both Google and Bing have explained to publishers how to similarly opt out of trawling – but, crucially, not get blocked from their search results.

Google and Bing let publishers opt out of AI training without losing out in search

Bing came first on 22 September, telling publishers it had created new ways for them to “have greater control over how their content is used in the AI era”. Microsoft-owned search engine Bing has added AI bot Bing Chat into search results, making use of OpenAI’s technology under a multi-year, multi-billion-dollar investment. Bing Chat’s answers provide links to sources – many of which in Press Gazette’s queries so far appear to lead to Microsoft’s news aggregator MSN.

If publishers take no action, their content will continue to be used as sources for Bing Chat. Content tagged NOCACHE “may” be included in Bing Chat answers but only URLs, snippets and titles would be displayed and used in training the model. Content tagged NOARCHIVE will not be included, linked to or used for training purposes.

Bing added: “We also heard from publishers that they want to exercise these choices without impacting how Bing users can discover web content on Bing’s search results page. We can assure publishers that content with the NOCACHE tag or NOARCHIVE tag will still appear in our search results.”
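As a rough illustration of the three outcomes Bing describes, the Python sketch below reads a page's HTML and reports which treatment would apply. It assumes the tags are delivered as standard robots meta elements (for example `<meta name="robots" content="noarchive">`); the function name and return labels are hypothetical, and the case where both tags appear is reported as unspecified because the announcement as quoted here does not cover it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="bingbot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "bingbot"}:
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def bing_chat_treatment(html: str) -> str:
    """Map a page's robots meta directives to the outcomes described in the article.

    Return values are illustrative labels, not Bing terminology.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    has_nocache = "nocache" in parser.directives
    has_noarchive = "noarchive" in parser.directives
    if has_nocache and has_noarchive:
        return "unspecified"  # the article does not say which tag wins
    if has_noarchive:
        return "excluded"     # not included, linked to or used for training
    if has_nocache:
        return "limited"      # only URLs, snippets and titles may be used
    return "default"          # no action taken: content remains a Bing Chat source


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(bing_chat_treatment(page))
```

A publisher could run a check like this across its templates to confirm that the tag it intended to set is the one actually being served.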

Meanwhile Google conceded a week later that publishers had told it “that they want greater choice and control over how their content is used for emerging generative AI use cases”.

In response it has created Google-Extended, a tool for publishers to be able to control access to content on their sites and decide whether they “help improve” Google Bard, its AI-driven chat tool.

Google’s VP for trust Danielle Romain repeatedly emphasised the worth to the tech platform of publishers allowing their content to be used, writing: “By using Google-Extended to control access to content on a site, a website administrator can choose whether to help these AI models become more accurate and capable over time.”

Update: Search Engine Land has reported that Google-Extended does not stop Google’s new AI-generated answers in search results pages from using publishers’ content even if they believe they have opted out from all of its AI products.

‘Why let them take it for free?’

The UK’s Independent Publishers Alliance has recommended its members block ChatGPT from crawling their sites as soon as possible.

Reasons include cost – if the number of bot visits to a smaller publisher’s site increases significantly, they could be forced into a higher hosting bracket – and the prevention of plagiarism, which can result from the regurgitation of content via generative AI tools.

The alliance also believes there is more negotiating power for publishers to potentially get paid for their content if they opt out – and that allowing use for free could be used against them in any future legal action or licensing negotiations.

Chris Dicker, a board member for the Independent Publishers Alliance, told Press Gazette: “No one is generating any traffic from allowing it currently, so why let them take it for free with nothing in return?

“We believe the ability to block them (or a signal to say they shouldn’t use our content) will be used against publishers at a later time when regulators get involved – they may say why should we pay for that content when publishers are willing to give it to us for free? That argument is effectively ‘we gave everyone the option to say no to their content being used and publishers didn’t do anything’.”

Dicker, who is managing director at tech website Trusted Reviews, added: “Publishers have been here before with virtually all new big tech companies. They try and make it as attractive as possible to engage with them, normally direct deals with a select few larger publishers to allow them to do what they want, the others follow and then slowly but surely more and more gets taken away until it’s too late and you end up with a situation like the substantial decrease in traffic from Facebook this year or the increase in zero click searches in search.

“The strategy they all use is called ‘how to boil a frog’: you can’t just put a frog in boiling water as it will jump out, so you put it in nice warm water that it likes and then you turn the heat up until it’s too late.”

Dicker said now is a crucial time to make this decision: “If all sites chose to block OpenAI and/or Bard, then its knowledge would be stuck in 2021 and they would have to come to the table and negotiate with publishers for the use of their content. The fact ChatGPT is about to make its database up-to-date for all users makes now a critical time for publishers to force their hand on this.”

However, not all publishing bodies agree: Sajeeda Merali, chief executive of the Professional Publishers Association, which represents large and small specialist media businesses, told Press Gazette in August there are disadvantages to opting out.

“[If] ChatGPT is to continue to grow and become an entry point for digital information in the same way that Google search is at the moment then opt-out isn’t really a viable option,” she said.

“What we don’t want to do is create barriers to negotiating the right terms with ChatGPT and we certainly don’t want them to be able to say that ultimately publishers can choose to do what they want.”

‘Inevitability’ that content is crawled and learned from

The pros and cons of blocking were discussed at the Digiday Publishing Summit in Florida last month. One publishing exec said they had “jumped early” on opting out but were now unsure: “I got humbled and thought ‘I publish all of my content on eight different syndication apps and websites where this is also crawlable’… This is so discoverable in other places that aren’t on the page where I’ve deployed this blocker that I think it was kind of a wasted effort on my part. It’s an inevitability that this stuff is ingested and crawled and learned from.”

But they added that the decision could prove useful in future negotiations as a “starting point for the inevitable negotiations that we’ll have as publishers with OpenAI and other companies. We’ll be able to have that as a point of leverage and say, we’ll take it off if we can reach a deal or an agreement.”

Luke Budka, AI strategist at Definition, told Press Gazette it was a “tricky time for publishers when it comes to gen AI – lots of moving parts and easy ways to get things very wrong.

“For example, from a ‘crawling’ perspective, you can allow Googlebot to crawl your site but disallow Google-Extended, the bit of Googlebot used to scrape information to train Bard with. I’d bet £10 several publishers will block Googlebot by mistake and see their sites disappear from the search results altogether in an effort to prevent gen AI content harvesting.

“Likewise if you don’t want to help train Microsoft’s generative AI foundation models you now need to tag your content with NOARCHIVE. And separately to that you need to block OpenAI (GPTBot) and Anthropic separately.

“Some big names moved quickly to block OpenAI’s ChatGPT including the New York Times (hardly surprising as they’re suing them), Reuters, Bloomberg, CNBC and The Athletic; ABC blocked GPTBot and have already disallowed Google-Extended.”
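The mistake Budka describes can be caught before a robots.txt file goes live. The Python sketch below uses the standard library's `urllib.robotparser` to compare an intended file with a mistaken one. Both robots.txt bodies, the `example.com` URL and the helper names are hypothetical, and the parser only simulates how a rule-following crawler would match the user-agent tokens named in the article; it says nothing about how any vendor actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: blocks the AI tokens named in the article, leaves search alone.
INTENDED = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical slip: Googlebot is blocked instead of Google-Extended.
MISTAKEN = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/news/story") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def report(robots_txt: str,
           agents=("Googlebot", "Google-Extended", "GPTBot")) -> dict:
    """Summarise which of the listed tokens may crawl under robots_txt."""
    return {agent: allowed(robots_txt, agent) for agent in agents}


if __name__ == "__main__":
    print("intended:", report(INTENDED))
    print("mistaken:", report(MISTAKEN))
```

In the mistaken file the search crawler is shut out while Google-Extended is still allowed, which is the reverse of what the publisher wanted.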

Big news names block GPTBot

Since August, 44% of 1,123 news publishers monitored in a continual survey by open-source archiving system homepages.news have opted out of ChatGPT’s trawls using robots.txt – the file that tells crawlers what parts of a website they are allowed to access.

UK publishers reported by homepages.news to have blocked GPTBot include the Daily Mail, The Sun, The Guardian, Belfast Telegraph, Daily Herald, Newsquest’s Daily Echo, The Economist and The National.

The Daily Mirror, The Times, The Telegraph, The Spectator, Daily Record, BBC, Belfast News Letter, Bellingcat, Evening Standard, The Independent, The i, New Scientist, New Statesman, Reuters, The Scotsman and Unherd all still allow their sites to be trawled.

In the US, major players like ABC News, Axios, New York Times and sister title The Athletic, Bloomberg, Boston Globe, CBS News, CNBC, CNN, New York Daily News, Deadline, E!, ESPN, Gawker, the Hollywood Reporter, Los Angeles Times, NBC News, New Yorker, Semafor, Slate, USA Today, Wall Street Journal, Washington Post and many local titles have all blocked GPTBot.

A separate tracker of 4,919 domains, including more small UK sites, by the Independent Publishers Alliance shows that 595 of those sites (12%) were blocking ChatGPT via robots.txt as of Sunday 1 October. The latest additions were the FT, The Sun and the i, which all blocked the bot on Friday.

Of the 595 domains that have opted out of ChatGPT, one website is so far also blocking training for Google Bard: tech site VentureBeat.

Budka noted there is no definitive answer on whether it is better for publishers to block all AI training bots or not.

“Plenty of stakeholders have already discussed at length whether it’s more or less beneficial for news websites to allow their data to be scraped and used for training purposes,” he said.

“On the one hand, they feel they should be remunerated for providing training data (and several are brokering direct deals e.g. the AP) but on the other hand, much like with classic Google search, can they afford to be left out of AI-generated results as said results begin to power more of society?

“Either way, publishers are going to slip up if they have to add every single gen AI they want to block to their robots files – there needs to be a way to easily disallow all. But that requires governments to get their acts in gear – maybe we’ll see this in November’s UK AI ethics conference in Bletchley Park? I wouldn’t hold your breath.”
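Until a single opt-out exists, one way to reduce the risk of slip-ups is to generate the rules from a maintained list rather than editing the file by hand. The Python sketch below is a minimal example under stated assumptions: the token list contains only the two names given in this article (GPTBot and Google-Extended), tokens for other vendors would have to be taken from each company's own documentation, and the function names and guard list are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Tokens that should never be blanket-blocked by this helper, because doing so
# would remove a site from ordinary search results (illustrative guard list).
PROTECTED_TOKENS = {"*", "googlebot", "bingbot"}

# Only the tokens named in the article; extend from each vendor's documentation.
AI_TOKENS = ["GPTBot", "Google-Extended"]


def build_block_rules(tokens) -> str:
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = []
    for token in tokens:
        name = token.strip()
        if name.lower() in PROTECTED_TOKENS:
            raise ValueError(f"refusing to block search crawler token: {name}")
        groups.append(f"User-agent: {name}\nDisallow: /")
    return "\n\n".join(groups) + "\n"


def is_blocked(rules: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if rules stop user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(user_agent, url)


RULES = build_block_rules(AI_TOKENS)

if __name__ == "__main__":
    print(RULES)
    for agent in AI_TOKENS + ["Googlebot"]:
        print(agent, "blocked" if is_blocked(RULES, agent) else "allowed")
```

The generated rules still rely on each crawler choosing to honour robots.txt, which is the limitation Budka points to.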