February 27, 2024, updated 08 Mar 2024 11:27am

Revealed: Which of the top 100 UK and US news websites are blocking AI crawlers

Of 106 sites checked, 45 have no AI crawlers blocked at all.

By Bron Maher

More than four in ten of the top 100 news websites in the English language allow all AI web crawlers to scrape their content, Press Gazette analysis has found.

Among the 106 websites listed in Press Gazette’s top 50 rankings for the UK, US and world in December, more than half have OpenAI’s bot for ChatGPT blocked.

Read on for the full list of which news publishers block which AI bots.

[Read more: Major news publishers block the bots as ChatGPT starts taking live news]

What is a web crawler, and why do some news sites block them?

Web crawlers, also known as spiders or bots, are programs which travel across the internet with the goal of saving or indexing it one page at a time. Search engines use web crawlers to identify the websites that make up the internet, and artificial intelligence companies use them to fetch information which they then feed into the large language models that underpin their chatbots.

Most news websites block or permit web crawlers by editing a page on their site named robots.txt (for example, pressgazette.co.uk/robots.txt). A recent article on The Verge delved into the history of this page and its significance to the internet. Robots.txt pages are in effect only advisory: a bot’s creator can, if they choose, instruct it to ignore the robots.txt page.
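To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might read such a page. The robots.txt text, the `is_allowed` helper and the example.com address are all invented for illustration and are not taken from any publisher's real file. The sketch uses the standard library's `urllib.robotparser`, whose matching rules may differ in detail from those of any particular crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bars GPTBot from the whole site,
# leaves every other crawler free to fetch any page.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


story = "https://example.com/news/story"  # placeholder address
print(is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", story))
print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", story))
```

Nothing in this check is enforced by the website itself: the crawler has to choose to run it and respect the answer, which is why the page is only advisory.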

Web crawlers began receiving extra attention from the news industry last year after publishers realised their content had been used to help train large language models without their consent. Whereas some publishers like The New York Times and the BBC have responded by blocking AI web crawlers, others have left their websites open to the bots or brokered content licensing deals with AI companies.

[Read more: News publishers divided over whether to block ChatGPT]

Which news publishers have blocked which AI bots?

Robots.txt pages are publicly viewable, so Press Gazette was able to manually visit each of the 106 websites appearing on our three top 50 rankings and assess which AI company web crawlers, if any, those sites had blocked. The sites’ robots.txt files named a total of nine web crawlers associated with seven AI businesses, listed below:

  • GPTBot: the web crawler which feeds into ChatGPT, the OpenAI chatbot which kicked off the generative AI craze
  • Google-Extended: the crawler which feeds into Google’s AI chatbot Gemini (formerly named Bard)
  • Claude-Web, Claudebot and anthropic-ai: three crawlers which feed into Claude, the chatbot built by OpenAI rival Anthropic
  • Cohere-ai: the crawler for Cohere, an AI company which targets its chatbot at the business community
  • Perplexity-ai: the crawler for Perplexity, another ChatGPT competitor
  • Seekr: the crawler for Seekr, a company which builds large language models for a variety of purposes
  • Meltwater: the crawler for Meltwater, a media monitoring company incorporating some AI tools.
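Press Gazette carried out this check by hand. For readers who want to repeat it on a single robots.txt file, the following Python sketch is one rough automated approximation, not the method used for this analysis. It assumes the crawler names exactly as written in the list above, treats a crawler as blocked only if it is barred from the site's root, and uses an invented robots.txt and a placeholder example.com address. The `blocked_crawlers` helper and `AI_CRAWLERS` list are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Crawler names as written in the list above (assumed to be the exact tokens).
AI_CRAWLERS = [
    "GPTBot", "Google-Extended", "Claude-Web", "Claudebot", "anthropic-ai",
    "Cohere-ai", "Perplexity-ai", "Seekr", "Meltwater",
]


def blocked_crawlers(robots_txt: str, site_root: str = "https://example.com/") -> list[str]:
    """List the named AI crawlers that robots_txt bars from the site's root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, site_root)]


# Invented example: two AI crawlers barred outright, everyone else
# kept out of one directory only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_crawlers(EXAMPLE_ROBOTS_TXT))  # ['GPTBot', 'Google-Extended']
```

A site that bars all crawlers with a wildcard rule would show every name on the list as blocked under this approach, and a site that bars a crawler from only some sections would show it as not blocked.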

Of the 106 sites, 45 (42.5%) have no AI company bots blocked at all, versus 61 with at least one bot barred. There were 32 sites with two or more blocked, 16 with three or more, 11 with four or more and five with five crawlers blocked.

The only site on Press Gazette’s list with a blanket ban on almost all web crawlers regardless of their origin or purpose was news.google.com, the website of aggregator Google News. The only crawler allowed to scan the site is Googlebot, which indexes pages for Google search.

The site with the most named web crawlers blocked was another aggregator, MSN, which had six bots barred. The UK and US editions of the BBC website both had five crawlers blocked, as did News UK-owned the-sun.com, thesun.co.uk and thetimes.co.uk. Two more News Corp titles, the New York Post and Wall Street Journal, were not far behind with four crawlers blocked apiece.

ChatGPT’s GPTBot was by far the most commonly blocked web crawler, being disallowed by 60 websites (56.6% of the total). That finding is consistent with recent research from the Reuters Institute for the Study of Journalism, which found that, at the end of 2023, 48% of the most widely used news sites in ten countries were blocking the crawler.

The only website Press Gazette found that blocked some web crawlers but not GPTBot was Reuters, which blocked only the Google and Anthropic bots.

Approximately a quarter of the websites had Google-Extended blocked. Including Google News, only 17 websites (16%) had an AI crawler besides GPTBot or Google-Extended blocked.
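The percentages quoted in this section are simple shares of the 106 sites checked. As a quick sanity check, the Python sketch below recomputes them from the counts given in the text. The `share` helper is a hypothetical name, and rounding to one decimal place is an assumption about how the figures were presented.

```python
TOTAL_SITES = 106  # sites checked, as stated in the article


def share(count: int, total: int = TOTAL_SITES) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(count / total * 100, 1)


print(share(45))  # sites with no AI crawlers blocked
print(share(60))  # sites blocking GPTBot
print(share(17))  # sites blocking a crawler other than GPTBot or Google-Extended
print(TOTAL_SITES - 45)  # sites with at least one crawler blocked
```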

Claudebot was blocked only by The New York Times, and Seekr only by The Guardian. The next most niche exclusions were Perplexitybot, which was blocked by msn.com, CNBC and The Hill, and Meltwater, which was blocked by The Times and the UK and US editions of The Sun.

And which publishers aren’t blocking the bots at all?

While a modest majority of publishers have blocked some AI web crawlers from their sites, there are numerous major publishers who have not prohibited them.

Mirror, Express and Manchester Evening News publisher Reach, for example, allows all of its websites that Press Gazette checked to be crawled. The same is true of youth-focused websites Ladbible and Unilad and the Lebedev-owned Independent and Evening Standard.

Politico also does not block the bots, its parent company Axel Springer having struck a deal with OpenAI to feed its publications’ content into ChatGPT. Although that deal does not extend to the other AI companies, Politico’s SVP of product and design told Press Gazette last month that a new website redesign aims to make politico.eu, its European edition, as readable to web crawlers as possible. (Curiously, fellow Axel Springer title Business Insider blocks both GPTBot and Google-Extended.)

A more surprising appearance on this list of titles which don't block any AI bots is the IAC-owned Daily Beast. IAC’s chairman, Barry Diller, has repeatedly publicly called for AI companies to compensate publishers for their content. Three other IAC properties - People, Entertainment Weekly and Investopedia - block GPTBot but no other AI web crawlers.

Several websites on the political right decline to block AI web crawlers, among them GB News, Newsmax, Zero Hedge, Breitbart and, despite other Murdoch-owned titles all blocking the bots, Fox News. The Drudge Report also effectively allows its site to be crawled because it does not appear to have a robots.txt page at all.

[Read more: Politico embraces generative AI web crawlers with website redesigns]
