AI scraper violations and what we can do about them: New research reveals scale of problem

Paul Hood reveals data from new Miso.ai research into AI scrapers and suggests some tech solutions.

AI apps on a smartphone, including ChatGPT, Gemini, Perplexity, Claude and Copilot. Picture: Tada Images/Shutterstock

The urgent need for publishers to adopt innovative solutions to combat widespread AI content misuse was highlighted at a joint FIPP x PPA webinar on Wednesday.

Exclusive research by Miso.ai, which creates personalised AI search engines for websites, revealed some stark insights into the extent to which publisher content is being used without consent or permissions/licences.

The research inspected just over 2,700 publisher sites whose robots.txt files contained “Disallow” directives aimed at AI scrapers.

Miso.ai found that the collective set of publishers were targeting more than 1,300 unique bots with “disallow” statements – a significantly larger total list than was first suspected. But each publisher is on average targeting only around 15 bots.

And just 15% of the publishers are blocking, in robots.txt terms, Google-Extended – the tech giant’s product token that allows websites to control whether their content can be used to train Google’s generative AI models, such as those behind Gemini Apps and Vertex AI.
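As a rough illustration of the kind of check described above, the following Python sketch parses a robots.txt file and reports which AI user-agent tokens it blocks. The sample file and the short token list are assumptions made up for this example, not data from the Miso.ai research, and the function name `blocked_tokens` is hypothetical. Python's standard-library parser may also match user agents slightly differently from the way individual crawlers do.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt, not taken from any real publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

# A short example list of AI-related tokens; a real audit would need far more.
AI_TOKENS = ["GPTBot", "PerplexityBot", "Google-Extended", "ClaudeBot"]


def blocked_tokens(robots_txt: str, tokens: list[str], path: str = "/") -> list[str]:
    """Return the tokens that the given robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, path)]


if __name__ == "__main__":
    blocked = blocked_tokens(SAMPLE_ROBOTS_TXT, AI_TOKENS)
    print(f"Blocking {len(blocked)} of {len(AI_TOKENS)} tokens: {blocked}")
```

In this sample, a publisher that names only two bots leaves every other token, including Google-Extended, free to fetch the page. That is the gap the research points to when a typical publisher targets around 15 bots out of more than 1,300.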

Among publishers that had said “disallow” to Perplexity’s bot in their robots.txt files, effectively asking the service not to use their content, 15-20% still had content from their article pages appear in the Perplexity chatbot.

In addition, 65-70% of those publishers still had content from their homepages appear in Perplexity and 35-66% of those publishers still had their images appear in Perplexity.

(Perplexity says it respects robots.txt directives but that it “may still index the domain, headline, and a brief factual summary” of a page blocking its bot.)

The search for solutions – what role might tech vendors play?

Some innovative tech companies are working to provide solutions that respect rightsholders. For instance, Tollbit provides publishers with tools for managing and licensing digital content to LLMs and AI developers.

Similarly, Human Native AI offers a platform where publishers can license their content to AI developers in a controlled environment, ensuring fair compensation and intellectual property protection. MadeByHumans offers a similar service for book publishers.

Yet publishers seem reluctant to entrust their premium content to newly established tech vendors. These reservations are justifiable. Publishers have experienced a messy adtech vendor landscape for the past few years; it turns out that many of these vendors added negligible value – indeed they made their money through commissions that often only served to disintermediate publishers from advertiser spend.

The recent news about Meta’s secret revenue-sharing deal with Llama AI developers highlights that transparency isn’t necessarily baked into AI deals. AI content licensing is still nascent, and while early deals may offer new revenue streams, they also underscore the need for standardised, transparent frameworks that prioritise creators’ rights.

Blockchain: a path to transparency?

Blockchain technology offers a promising solution for enhancing transparency and accountability in AI content licensing. Here’s how blockchain can help ensure copyrighted content is traceable across the internet when used by AI developers:

Digital fingerprinting: A cryptographic hash of each piece of content gives it a unique digital fingerprint, which can then be recorded on the blockchain. This fingerprint is like a permanent, unalterable ID tag for the content.

Immutable record: Once content is registered on the blockchain, it creates an unchangeable record of ownership. This record includes details such as:

Who created the content
When it was created
Any subsequent changes or transfers of ownership

Transparent licensing: When AI developers license content, this transaction is recorded on the blockchain. This creates a clear trail of:

Who licensed the content
When it was licensed
The terms of the licence

Usage tracking: Blockchain can potentially track how and where the licensed content is used by AI systems. This allows creators to monitor the use of their work across different platforms and applications.

Automated payments: Smart contracts on the blockchain can automate royalty payments to content creators based on usage. This ensures fair and timely compensation.

Verification for AI companies: AI developers can easily verify the authenticity and licensing status of content they wish to use, reducing the risk of copyright infringement.
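The fingerprinting, record-keeping and verification steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not a real blockchain: there is no network, consensus or signing, and the names `fingerprint`, `Ledger`, `has_licence` and the record fields are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def fingerprint(content: str) -> str:
    """SHA-256 hex digest of the content, used as its digital fingerprint."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def _hash_body(record: dict, prev_hash: str) -> str:
    """Hash a record together with the hash of the block before it."""
    body = json.dumps({"record": record, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


@dataclass
class Ledger:
    """Append-only list of blocks, each chained to the previous block's hash."""
    blocks: list = field(default_factory=list)

    def append(self, record: dict) -> dict:
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS_HASH
        block = {"record": record, "prev_hash": prev_hash,
                 "hash": _hash_body(record, prev_hash)}
        self.blocks.append(block)
        return block

    def is_intact(self) -> bool:
        """Recompute every hash; any edited record breaks the chain."""
        prev_hash = GENESIS_HASH
        for block in self.blocks:
            if block["prev_hash"] != prev_hash:
                return False
            if block["hash"] != _hash_body(block["record"], prev_hash):
                return False
            prev_hash = block["hash"]
        return True

    def has_licence(self, content: str, licensee: str) -> bool:
        """Check whether a licence record exists for this content and licensee."""
        fp = fingerprint(content)
        return any(
            b["record"].get("type") == "licence"
            and b["record"].get("fingerprint") == fp
            and b["record"].get("licensee") == licensee
            for b in self.blocks
        )


if __name__ == "__main__":
    ledger = Ledger()
    article = "Example article text."
    ledger.append({"type": "registration", "fingerprint": fingerprint(article),
                   "creator": "Example Publisher", "created": "2025-01-01"})
    ledger.append({"type": "licence", "fingerprint": fingerprint(article),
                   "licensee": "Example AI Lab", "licensed": "2025-02-01",
                   "terms": "model training, 12 months"})
    print("Chain intact:", ledger.is_intact())
    print("Licensed:", ledger.has_licence(article, "Example AI Lab"))
```

One limitation is worth noting: an exact hash only matches byte-identical content, so a lightly edited or paraphrased copy would produce a different fingerprint. Tracking real-world usage would need more than this sketch shows.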

By implementing blockchain technology, the creative industry can create a more transparent, secure and fair ecosystem for content licensing in the age of AI. This system protects creators’ rights while providing AI companies with a clear framework for ethically sourcing and using copyrighted material.

Getting started

The need for robust, independent methods of verifying intellectual property ownership has never been more pressing. Platforms like Story Protocol are pioneering a new era in IP management by leveraging blockchain technology to provide a transparent and immutable record of ownership. This innovation adds a crucial layer of provenance to copyright ownership, ensuring that creators’ rights are protected before their content is licensed to AI developers.

By registering intellectual property on the blockchain, creators can tokenise their work as non-fungible tokens (NFTs), which serve as irrefutable proof of ownership.

This process not only secures creators’ rights but also streamlines licensing and monetisation processes through programmable smart contracts. These contracts automate the enforcement of licensing terms, ensuring that royalties are distributed accurately and without intermediaries.
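To make the idea of automated royalty distribution concrete, here is a minimal off-chain Python sketch of the kind of payout rule a smart contract might encode: a pool of money split in proportion to recorded usage. It is not Story Protocol's API or any real contract; the function name `split_royalties`, the usage counts and the pro-rata rule itself are assumptions for illustration.

```python
def split_royalties(pool_pence: int, usage: dict[str, int]) -> dict[str, int]:
    """Split a royalty pool (in whole pence) in proportion to usage counts.

    Leftover pence from rounding down go to the rights holders with the
    largest fractional remainders, so the payouts always sum to the pool.
    """
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("no usage recorded")

    # Integer division avoids floating-point rounding errors with money.
    shares = {name: pool_pence * count // total for name, count in usage.items()}
    leftover = pool_pence - sum(shares.values())

    # Rank rights holders by the fraction that was rounded away.
    by_fraction = sorted(
        usage, key=lambda name: (pool_pence * usage[name]) % total, reverse=True
    )
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical usage counts, e.g. number of times each title was retrieved.
    usage = {"publisher_a": 300, "publisher_b": 100}
    print(split_royalties(10_000, usage))
```

A production contract would run on-chain and take its usage figures from an agreed, auditable source. The sketch only shows that the split itself is simple arithmetic once those figures are trusted.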

The significance of Story Protocol lies in its ability to bridge the gap between traditional copyright laws and the digital age. This not only protects creators from unauthorised use but also offers AI developers a transparent and legally enforceable framework for licensing content.

In an era where AI-generated content is increasingly prevalent, Story Protocol’s approach aims to ensure that the creative economy remains fair, transparent, and sustainable.

By embracing such technologies, publishers can give themselves the best chance of safeguarding the future of premium content, whilst ensuring that innovation and creativity continue to thrive.


Paul Hood

Paul Hood is an independent management consultant for the publishing sector. He has more than two decades of experience driving innovation and revenue growth for leading UK publishers, including News UK, Reach and Bauer.