Since ChatGPT was unveiled to the world in November 2022, news executives – like the rest of the media and creative industries – have been up in arms about the unauthorised use of our intellectual property (IP) for the training of artificial intelligence (AI) systems.
Whilst these disputes about historic training matter on principle and for natural justice, commercially they are largely a distraction. Instead, in the news sector, we should be focusing our attention on the secondary use of our data. I’ll explain why.
First though, let me be clear: there is undoubtedly a moral case to answer around training. The core principle of copyright – to reward the efforts and investment of creators, and to prevent others using their works – has been undermined.
There is a legal case too, although there are complex issues for lawyers to unpick. There is the application of IP law to the precise technological process of large language model (LLM) training. And there are the jurisdictional questions of where these processes took place and where any economic damage to rights holders was incurred.
The government here in the UK seems intent on letting this play out in the courts rather than weighing in with an interpretation or a clarification of the law. Caught between the competing demands of attracting hypothetical AI investment and protecting the UK’s already world-class creative sector, they are prioritising the former. This is a shameful mistake.
Meta’s spokesperson, giving evidence to the House of Lords AI inquiry, suggested it would take a decade for legal precedents to be set. Much damage will be done in this time. As AI technology advances, the risk grows that user engagement will move from original to synthetic media. Monetisation opportunities will move with it, undermining business models.
All rights-holders should be concerned. For some segments though – image libraries and periodical or non-fiction publishers for example – the threat is near-term and profound. Practically all their IP assets of economic value have been ingested and are now at risk of being displaced as users turn to synthetic substitutes.
Value peaks and erodes quickly
News publishers are in a different, and arguably stronger, position. The training data cut-off date for any model is typically a year prior to its release. ChatGPT-4, for example, was released in March 2023 but was trained on data scraped in January 2022. That means it knows nothing of world events that occurred after that date.
The economic value of journalism is high at the time of publication and then erodes quickly. Typically, traffic to an article peaks within 24 hours of its publication (and the relationship between monetisation and traffic is reasonably direct). Archival content represents a low-single-digit percentage of overall engagement with news. In short, our most valuable IP at any point in time is not in these models.
Yes, the entire output of our newsrooms since we launched our websites has been ingested for the training of these systems. And yes, we ought to be pursuing developers to make us whole after this flagrant abuse of our IP. But, frankly, the economic damage to date has not been that great. Strategically, the industry should instead be focused on the secondary use of our data by trained models through the processes of ‘grounding’ or ‘retrieval-augmented generation’.
These secondary mechanisms entail directing an LLM at another source of information. Off the shelf, a model knows how words relate to each other statistically, based on its training corpus, but it does not understand the meaning of words or phrases. This can result in responses that are plausible and syntactically and semantically correct, but factually inaccurate.
The use of an additional, secondary source of data means the AI can base its response on known, verified information. Or it can double-check its output. Google and OpenAI are already using this technology to improve their products – see Gemini’s Check with Google feature and ChatGPT’s browsing mode. These developments markedly improve the utility of AI chatbots, particularly for news and current events.
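For readers who want to see the mechanics, grounding can be sketched in a few lines of Python. This is a toy illustration only, not how Google or OpenAI implement it: the article list, the publisher names, the word-overlap relevance score and the function names (`retrieve`, `build_grounded_prompt`) are all hypothetical, and a production system would use a proper search index and a real model call.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, articles, top_k=1):
    """Rank articles by the number of words shared with the query (toy score)."""
    query_tokens = tokenize(query)
    ranked = sorted(
        articles,
        key=lambda a: len(query_tokens & tokenize(a["text"])),
        reverse=True,
    )
    return ranked[:top_k]

def build_grounded_prompt(query, articles):
    """Place the retrieved, attributed sources ahead of the user's question."""
    sources = "\n".join(
        f"[{a['publisher']}, {a['published']}] {a['text']}"
        for a in retrieve(query, articles)
    )
    return (
        "Answer using only the sources below and cite the publisher.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}"
    )

# Hypothetical, made-up articles standing in for a licensed real-time feed.
ARTICLES = [
    {"publisher": "Example Times", "published": "2024-03-01",
     "text": "The city council approved the new tram line on Friday."},
    {"publisher": "Example Post", "published": "2024-03-02",
     "text": "Local bakery wins national award for sourdough."},
]

prompt = build_grounded_prompt("Was the tram line approved?", ARTICLES)
print(prompt)
```

The point for publishers is in the prompt itself: the fresh article text, not the model's training data, supplies the facts, and whether the publisher's name and a link survive into the answer is a design choice made by the developer.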
Real-time remuneration
Google, OpenAI and Anthropic have all given website owners the option to disallow their scraping bots. Despite this, AI firms will be looking for real-time access to premium, trusted content. A few licensing deals have been signed and more are there to be done. The reality though is that they will only be available to some publishers; global, premium content will be in demand.
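The opt-out works through a site's robots.txt file. The sketch below, using Python's standard-library parser, shows what such a file might look like and how a well-behaved crawler would interpret it. The user-agent tokens (`GPTBot`, `Google-Extended`, `ClaudeBot`) are assumed from each company's public documentation and should be checked against the current versions, and `news.example` is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block AI-related tokens, allow everything else.
# Token names are assumptions to verify against each vendor's documentation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url="https://news.example/story"):
    """Return True if the given user agent may fetch the URL under these rules."""
    return parser.can_fetch(agent, url)

for agent in ("GPTBot", "Google-Extended", "ClaudeBot", "Googlebot"):
    print(agent, "allowed" if allowed(agent) else "blocked")
```

Note that robots.txt is a voluntary convention: it signals the publisher's wishes but relies on the crawler choosing to honour them.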
Securing licences for grounding data creates substantial value for AI developers and the users of their services. And it does so without prejudicing the ongoing legal fights around training. Paying for training data is likely a red line for developers: having to negotiate and buy licences for training would threaten to crush even OpenAI, with its $80bn valuation (ChatGPT-4 was trained on 570GB of textual data from websites, books, articles and so on).
News publishers need to approach these deals with caution, though. Whilst they have the potential to deliver much-needed incremental revenue, the strategic interests of the parties are not aligned. Despite what they may say, ultimately AI developers want to provide a one-stop, ‘answer anything’ assistant. That is the fundamental purpose of the technology and it doesn’t sit comfortably with publishers’ business models, which rely on driving engagement on owned-and-operated platforms.
The devil will be in the detail and careful consideration will need to be given to the exact terms. For publishers the substitution risks are high and whether a particular agreement makes sense will hinge on price, summarisation format, attribution, branding and links back to the source.
The fight around training data is a noble one. And it is critical for the creative industries – and, I would argue, society – that the principles of intellectual property prevail. But it would be better for the news media sector if we focused on how we are remunerated for journalism in real time, which is where the true value lies anyway.