
Publishers have been warned that AI companies are relying on third-party content scrapers to steal publisher content, even when publishers block those AI firms’ own bots from visiting their sites.
Third-party companies, some of which openly boast about how they can get through paywalls, effectively steal content to order, allowing AI companies to answer ‘live’ news queries with stolen information from publishers.
A recent Digital Digging report found AI bots bypass paywalls by methods including reading non-paywalled versions of stories on other publisher sites, scraper sites and on internet archives.
Internet services provider Cloudflare announced in July it would block AI ‘scrapers’ accessing content by default, a move backed by major publishers – but experts have warned loopholes may remain.
AI ‘scrapers’ are sent out by large language model services such as OpenAI both to ‘train’ models, and to ‘ground’ information in real sources, as seen when ChatGPT, for example, offers AI-written answers with links to publisher sites – a process known as ‘retrieval augmented generation’ (RAG).
Cloudflare now blocks all AI scrapers that don’t have permission or provide compensation to publishers, with a million Cloudflare customers thought to already block requests from AI bots.
Cloudflare told Press Gazette that AI companies are not always transparent about what their ‘scrapers’ are doing, and that some rely on third-party companies to gather data for them – and has called for standards to ensure transparency.
The volume of such requests is staggering. Cyberfraud protection platform DataDome recorded 178.3 million scraping requests to sites it supports just from OpenAI, maker of ChatGPT, this January. It said that 36.7% of traffic to sites it works with now comes from sources other than internet browsers, including bots.
But experts in the field have warned that it’s not always clear how large language models gather the data that they use, and that ‘third-party’ scraper companies are emerging to provide AI companies with data in a ‘hands off’ fashion.
There are also concerns that it has become very easy to set up scrapers, even for amateurs – and that data from search engines (which may bypass bot protections) may be used to ‘ground’ large language models, making it harder to block bots.
Researchers at security firm Human set up ‘honey pot’ sites – deliberately weak ‘target’ sites used in cybersecurity to ‘lure in’ attackers – to catch large language models ‘scraping’ target sites without permission.
Bryan Becker of Human said: “We ran a lot of experiments on our side trying to force the large language models to scrape us. And they actually never came, but then they still got the data.”
Human has said this shows that large language models “most likely outsource” their data gathering.
A host of sites openly offer scraping services, and some boast how they can be used with AI products.
One company boasts that its scrapers are “used by leading AI-powered companies to gather public data while overcoming the strongest safety measures” and that it can “get data from any website”.
Becker said: “In our initial testing of AI scraping agents, we were unable to trigger an on-demand scrape against any of our honey pot sites.
“However, we were able to retrieve responses, indicating that the agent had access to the site content. Our logs identified many signatures of popular scraping services, leading us to believe that these products most likely outsource their data gathering. These are still preliminary results, and we will publish a more in-depth article when the research is complete.”
Becker said that the demand for web scraping has gone up – and there is a two-fold effect, because large language models need to ‘scrape’ sites and also make web scraping accessible to more people.
Statistics from Human suggest that the volume of scraping attacks on publishers has risen 56% in the past year.
Becker said: “Five years ago a marketer might buy a product to scrape their competitor sites in order to get some competitive intel, or to just scrape articles to feed the beast of content that they want to write about, but they’d have to buy a product – they definitely wouldn’t programme a scraper as that’s not in their skill set.
“Now you can go and quite easily come up with your own logic without actually needing to learn how to code. It’s driven the demand up.”
The problem is that current “defence mechanisms” against crawlers are not legally binding. The main one, ‘robots.txt’ (the Robots Exclusion Protocol), is a file that tells automated software, as opposed to human visitors, which parts of a site it may crawl. It is more akin to a ‘gentleman’s agreement’, and is routinely ignored.
Wired.com found that Perplexity was able to reconstruct an article from its site, despite the site’s robots.txt file disallowing it.
Perplexity has since amended its policies, and now clarifies that it respects robots.txt and only employs third-party companies who do so also.
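To illustrate what “respecting robots.txt” involves in practice, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt rules, the bot name `ExampleAIBot` and the `news.example` URLs are hypothetical, made up for illustration, and do not describe any real publisher or crawler. The check is purely voluntary: nothing stops a scraper from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /subscriber/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    # The named AI bot is disallowed everywhere.
    print(may_fetch(rules, "ExampleAIBot", "https://news.example/story"))  # False
    # Any other bot may read open pages but not the subscriber section.
    print(may_fetch(rules, "OtherBot", "https://news.example/story"))  # True
    print(may_fetch(rules, "OtherBot", "https://news.example/subscriber/story"))  # False
```

A well-behaved crawler would run a check like `may_fetch` before every request. A scraper that leaves it out, or that announces itself under a different user agent, meets no technical barrier from robots.txt alone.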
In off-the-record conversations with major newspaper publishers in the UK, experts confirmed that third-party scrapers are an increasing issue.
In tests, Press Gazette was able to access paywalled news sites such as the Financial Times through third-party crawler software.
Becker said: “Robots.txt has no enforcement mechanism. It’s a sign that says ‘please do not come in if you’re one of these things’ and there’s nothing there to stop you. It’s just always been a standard of the internet to respect it.
“Companies the size of Google, they respect it because they have the eyes of the world on them, but if you’re just building a scraper, it’d almost be more work for you to respect it than to ignore it, because you’d have to make extra code to check it. The paywalls are a different story, because that does have an enforcement mechanism, but it’s trivial to bypass, and I don’t think it actually slows down any scrapers in the real world at all.”
“For publishers, it’s potentially existential for them. Someone’s stealing all their content and giving it away for free, and they’re not getting any views. How are you going to survive?”
Becker describes the situation publishers find themselves in today as an arms race, and warns that efforts by third-party scrapers to bypass protection mechanisms will continue.
Becker said: “The arms race, I don’t think will ever go away, because the amount of money at stake here, the less scrupulous individuals in society will be tempted by that. I don’t think that’s ever going to go away.”
Becker believes that to survive, publishers will have to use bot mitigation technology, possibly paired with a “monetisation layer” where bots can be forced to pay, similar to the technologies offered by Cloudflare and Human.
Scraping has legitimate uses in many industries, for instance on price comparison sites or flight booking services.
But publishers’ business models depend on readers visiting their content and on the advertising those visits support. Even when links are provided within the AI-written text of a chatbot answer, the number of users who click through is extremely small relative to the number of crawls.
Cloudflare says that for some AI services the ratio is 70,900 AI crawls to one actual reader.
Toshit Panigrahi, CEO of Tollbit (a company which claims to control generative AI access to websites) said that 13 million crawlers from AI companies visited one leading sports site in a month, and this led to just 600 site visits.
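The two sets of figures can be put on the same footing with some simple arithmetic. The sketch below, which assumes the 13 million figure counts individual crawl requests and that the function name `crawls_per_visit` is purely illustrative, works out how many crawls the sports site received for each visit it gained, and compares that with the ratio quoted by Cloudflare.

```python
def crawls_per_visit(crawls: int, visits: int) -> float:
    """Number of AI crawls for every human visit referred back to the site."""
    if visits <= 0:
        raise ValueError("visits must be a positive number")
    return crawls / visits


if __name__ == "__main__":
    # Tollbit example from the article: 13 million crawls, 600 site visits.
    tollbit_ratio = crawls_per_visit(13_000_000, 600)
    print(f"Tollbit example: {tollbit_ratio:,.0f} crawls per visit")  # 21,667

    # The same figures expressed as a click-through rate.
    print(f"Click-through rate: {600 / 13_000_000:.4%}")  # 0.0046%

    # Cloudflare's quoted ratio for some AI services, for comparison.
    cloudflare_ratio = 70_900
    print(f"Cloudflare example: {cloudflare_ratio:,} crawls per reader")
```

On these numbers, the sports site saw roughly 21,667 crawls for each visit, which is the same order of magnitude as the 70,900-to-one ratio reported by Cloudflare.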
In a report submitted to the House of Lords, the Financial Times said: “Despite the high commercial value of RAG to AI developers, the vast majority of companies take the raw materials required to create summarised simulacrums without any form of remuneration, licensing arrangement, or traffic back to the source publisher website. This is contrary to the terms of service of many publishers, and is neither fair nor sustainable.”
Will Allen, VP of product at Cloudflare, told Press Gazette: “AI crawlers aren’t always transparent about who they are or what information they’re collecting, which creates real challenges for content creators, particularly when their work is scraped without permission, compensation or attribution.
“In this environment, what’s needed is a programmatic way for crawlers to reliably identify themselves and declare their intent, whether that’s for indexing, training, or other uses.
“Some AI companies rely on third-party crawlers to collect data, which blurs responsibility and makes it harder for content owners to know who’s actually accessing their work. Without robust authentication and clear disclosure, this kind of delegation undermines transparency and makes it easier to bypass content restrictions.
“In a landscape where AI development increasingly depends on vast amounts of content, it’s critical to establish technical standards that require crawlers to cryptographically identify themselves and declare their intent, whether they belong to a major AI lab or a third-party partner acting on their behalf. It’s the only way to ensure real accountability.”
Dominic Young, a news industry strategist and advocate for publishers’ rights, said that the problem of third-party scrapers is leading publishers to look for their own tech solutions.
Young said: “The move by Cloudflare to double down on their bot blocking tech reinforces how important it is to get a grip on this activity. While it might seem incredible that content theft has been industrialised, the creative industries have recognised the importance and urgency of neutralising the threat and creating a proper marketplace around legitimate and authorised activity.
“Tools like Cloudflare’s can help where the law does not and I think we’ll see a lot more initiatives in this space soon.”
“Ever since the advent of search, tech platforms have been copying and keeping publishers’ content in their systems. Publishers have generally been okay with this, because they have been using it in ways which are helpful, like sending traffic to websites. There have always been other crawlers, often detectable but not known, accessing content for reasons we either didn’t know or didn’t like. This has been much murkier, and hard to argue as being legal, but has often been ignored by publishers.
“This has led to the idea that scraping content is OK, regardless of what it is then used for, and the sense in some quarters that the law is in some way unclear or confusing – or even makes what they’re doing legal.”
Young has called for more clarity, particularly around bots which pretend to be human users, or which do not reveal which company they are crawling for – or how content which has been scraped might be used.
“Worse, it’s become very clear the results of scraping have been used for things publishers are definitely not okay with – for example, creating AI systems which substitute for readers needing to come to our products at all – directly siphoning off revenue and undermining the opportunity to build a direct relationship with audiences.
“This mass copying and disintermediation has become industrialised, and, when it comes to content scraping and AI, it seems that everyone is making a fast buck except the creators of the content without which none of it would exist. The existence of businesses which effectively provide ‘theft-as-a-service’ is incredible. This has to change.”