A House of Lords committee says the UK Government “cannot sit on its hands” for a decade waiting for sufficient copyright case law to emerge from legal fights between AI companies and news publishers.
Instead, the Government has been urged to share its view on whether copyright law provides “sufficient protections” for rightsholders whose work is used to train large language models (LLMs) and, if necessary, set out options to future-proof the legislation.
The recommendation came in the House of Lords Communications and Digital Committee’s report on Large Language Models, in which peers said they “do not believe it is fair for tech firms to use rightsholder data for commercial purposes without permission or compensation, and to gain vast financial rewards in the process”.
The report said the potential benefits to society from LLMs do “not warrant the violation of copyright law or its underpinning principles”.
“The application of the law to LLM processes is complex, but the principles remain clear,” it said. “The point of copyright is to reward creators for their efforts, prevent others from using works without permission, and incentivise innovation.
“The current legal framework is failing to ensure these outcomes occur and the Government has a duty to act. It cannot sit on its hands for the next decade until sufficient case law has emerged.”
So far the most high-profile legal action being taken in the UK is Getty Images’ claim that AI image generator Stability AI infringed its intellectual property rights by using its pictures for training purposes. In December a High Court judge said the case should proceed to full trial as its legal principles remained untested.
In the US The New York Times is suing OpenAI and Microsoft for the use of its content for the training of ChatGPT and Bing’s generative search responses. However, the outcome will not necessarily have a direct implication for the UK: the US “fair use” doctrine has no direct equivalent in UK copyright law, which instead relies on narrower “fair dealing” exceptions.
Mail, Metro and i publisher DMG Media, Guardian Media Group and the Financial Times all made submissions to the committee arguing AI companies have ignored the legal licensing avenues available to them for the use of their content.
DMG Media said it was “actively seeking advice on potential legal action” at the time of its submission, which was published in October.
It added: “The issue that must concern legislators and regulator[s] is this: now that machines have been trained to absorb information gathered by other parties, and organise, present and monetise it as a news publisher would, what effect will this have on the incentive for media organisations to continue to create and publish high quality, reliable news content – and what will be the impact on society and democracy should news organisations be forced to reduce investment or cease to operate.”
The Financial Times urged the Government not to wait for the courts before sharing its policy position: “Whilst there are ongoing legal proceedings which will establish case law in this area, this process will likely take years and much damage to publishers, and the broader creative sector, may be done in the interim.”
Similarly, Condé Nast chief executive Roger Lynch told US Congress this month that “a major concern is that the amount of time it would take to litigate, appeal, go back to the courts, appeal, maybe ultimately make it to the Supreme Court to settle – between now and then there would be many, many companies, media companies that would go out of business”.
The Financial Times continued: “Although it is for the courts to interpret the law, a clear statement from government, articulating a policy position to the effect that developers require licences to ingest copyright protected materials for LLM training purposes would support the establishment of genuine, good faith licensing negotiations between rights-holders and LLM developers.”
Guardian Media Group said: “This one sided bargain risks undermining the willingness of individuals and businesses to invest the time and resources required to maintain a vibrant open web. Given the mission of the Guardian to make our journalism as widely available as possible, this challenge to the future of the open web arguably matters to us more than any other news publisher.”
Copyright question ‘can and should be tackled promptly’
Baroness Stowell, chairman of the House of Lords Communications and Digital Committee, said: “One area of AI disruption that can and should be tackled promptly is the use of copyrighted material to train LLMs. LLMs rely on ingesting massive datasets to work properly but that does not mean they should be able to use any material they can find without permission or paying rightsholders for the privilege. This is an issue the Government can get a grip of quickly and it should do so.”
However Baroness Stowell also said the risks of AI should be addressed in a “proportionate and practical” manner so the UK does not miss out “on a potential AI goldrush”.
“One lesson from the way technology markets have developed since the inception of the internet is the danger of market dominance by a small group of companies,” she said.
“The Government must ensure exaggerated predictions of an AI-driven apocalypse, coming from some of the tech firms, do not lead it to policies that close down open-source AI development or exclude innovative smaller players from developing AI services.”
Opt-in or opt-out?
OpenAI told the committee it believes it “complies with the requirements of all applicable laws, including copyright laws” as “legally copyright law does not forbid training”.
But it also believes “there is still work to be done to support and empower creators”, which is why it has created an “easy way” for publishers to opt their sites out of being crawled and why it is doing deals with the likes of Associated Press and Axel Springer.
The report discussed these “difficult decisions” over whether access to data should be on an opt-in or opt-out basis.
Getty Images argued that “ask for forgiveness later” opt‑out mechanisms were “contrary to fundamental principles of copyright law, which requires permission to be secured in advance”.
Guardian Media Group said it is vital websites can opt-in, rather than opt-out after training has already begun.
It said it had discovered that the Chinese TikTok owner ByteDance had “deployed crawlers to scrape UK news sites for what is described as search optimisation purposes. While we understand that its scraper does abide by robots.txt files once that code is added to a publisher site, the code required to activate the block is not published anywhere on the web.
“The deployment of such bots without a clear and transparent way for IP owners to block them demonstrates why it is so important that the presumption should be on IP owners opting-in to scraping, not the other way round.”
DMG Media said in its submission it had the question of whether to block OpenAI’s web crawler GPTBot “under active consideration”. However it would be “more complex” to do so with Google, it added.
“While it [Google] has acknowledged the need for a debate around how publishers can control the use of their content, at the moment it is extremely difficult for publishers to block Google crawlers,” DMG Media said. “This is because Google operates a number of crawlers, and does not make it clear to what extent they feed search or AI, or both.
“Although search referral traffic is less valuable than direct traffic, it is still a significant contributor to publisher revenue, especially for smaller publishers. Therefore, very few publishers have as yet blocked Google’s crawlers. This in turn raises the possibility that Google could use its dominance in search to lever dominance over OpenAI in generative AI.”
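The blocking mechanism publishers are weighing up here is the robots.txt file, a plain-text file served at a site’s root that tells compliant crawlers what they may fetch. As a sketch of how a publisher might block OpenAI’s crawler while leaving other bots unaffected (the `GPTBot` user-agent token is the one OpenAI has published; rules for unpublished crawler tokens, such as ByteDance’s at the time of the Guardian’s submission, cannot be written this way, which is the gap GMG highlights):

```
# robots.txt at https://example-publisher.com/robots.txt
# Block OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /

# All other crawlers (including search engines) remain unrestricted
User-agent: *
Disallow:
```

Note that robots.txt is purely advisory: it only works against crawlers that choose to honour it, and, as DMG Media points out, it offers no way to distinguish a search crawler from an AI-training crawler when an operator uses one token for both.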
The peers therefore said that a voluntary code for AI being developed by the Intellectual Property Office “must ensure creators are fully empowered to exercise their rights, whether on an opt‑in or opt‑out basis.
“Developers should make it clear whether their web crawlers are being used to acquire data for generative AI training or for other purposes. This would help rightsholders make informed decisions, and reduce risks of large firms exploiting adjacent market dominance.”
News Media Association chief executive Owen Meredith said in response to the Lords conclusions: “It is right that this report reflects growing concerns over tech and AI developers using copyrighted material to train Large Language Models without consent, and securing their own financial benefits as a result. Content creators and rights holders devote significant time, effort and resources to their work, and our copyright laws rightly protect their interests, creating wider economic, political and social benefits as they do so.
“Yet as AI technologies advance, so we must ensure our legal framework remains effective. For news brands who invest so heavily in their journalism, their content must not be used without transparency, consent, and fair remuneration. As the Committee recommends, we hope the government will support content creators by making clear the applicability of copyright law and setting out legislative options for future proofing our robust copyright regime if necessary.
“Journalism will play as vital a role as ever in combatting the dangers generative AI poses to our society, particularly as we enter a year of elections across the world. Robust copyright protections fit for the future will ensure news brands can continue this role for a long time to come.”