UK: ICO consultation series on generative AI and data protection part one - lawful basis for web scraping

The Information Commissioner's Office (ICO), the UK data protection authority responsible for enforcing the UK General Data Protection Regulation (UK GDPR), announced earlier this year its series of consultations on how aspects of data protection law should apply to the development and use of generative artificial intelligence (AI) models. The term 'generative AI' refers to AI models that create new content, which includes text, audio, images, or videos. The ICO recognizes that responsible deployment of AI has the potential to make a positive contribution to society, and intends to address any risks so that organizations and the public may reap the benefits generative AI offers.

The ICO guidance responds to a number of requests for clarification made by innovators in the AI field. These include the appropriate lawful basis for training generative AI models, how the purpose limitation principle plays out in the context of generative AI development and deployment, and the expectations around complying with the accuracy principle and data subjects' rights.

The ICO has published a series of chapters, which outline its emerging views on its interpretation of the UK GDPR and Part 2 of the Data Protection Act 2018, in relation to these questions. The ICO is in the process of seeking the views of stakeholders with an interest in generative AI to help inform its positions. In part one of this Insight series, James Castro-Edwards, from Arnold & Porter, delves into chapter one of the ICO's guidance, focusing on legitimate interests as a lawful basis, the risks involved in web scraping, and measures that developers can take to mitigate such risks.

Chapter one: The lawful basis for web scraping to train generative AI models

Chapter one of the ICO series concerns the lawful basis for web scraping to train generative AI models. It sets out the background and explains the collection of training data as part of the first stage of the generative AI lifecycle, which is divided into the following five stages:

  1. data collection;
  2. data pre-processing;
  3. training and model improvement;
  4. fine-tuning; and
  5. deployment.

Training data may be collected from a variety of publicly accessible online sources, including blogs, social media, forum discussions, product reviews, and personal websites. It may include images, video, text, and individuals' contact details. The process is frequently referred to as 'web scraping,' and involves the collection of data by way of automated software that 'crawls' web pages. Developers of generative AI may collect the data directly from the sources themselves, from third-party vendors that carry out web scraping on their behalf, or a combination of the two.

Training data may include personal data, which must be processed in accordance with applicable data protection laws. Developers must bear in mind that personal data included within training data may have been uploaded to a website by individuals themselves or it may have been posted online by someone other than the data subject. They must also ensure that their collection and use of training data does not breach other laws such as intellectual property or contract law.

Where training data includes personal data relating to individuals who are located in the UK, the UK GDPR is likely to apply. The UK GDPR specifies that, in order to collect personal data for generative AI model training purposes, developers must ensure that they have a lawful basis for processing. The UK GDPR provides six potential lawful bases for processing personal data. However, the ICO currently takes the view that the only available lawful basis for the collection and processing of personal data for the purpose of training an AI model will be legitimate interests (Article 6(1)(f) of the UK GDPR). Accordingly, chapter one of the ICO consultation focuses on legitimate interests as a lawful basis for processing.

In order to establish legitimate interests as a lawful basis, developers must pass the 'three-part test.' Chapter one draws on existing ICO guidance on legitimate interests as a lawful basis for processing, which requires developers to demonstrate:

  • the purpose test: The purpose of the processing must be legitimate;
  • the necessity test: Processing must be necessary for that purpose; and
  • the balancing test: Individuals' interests must not override the interest being pursued.
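
For teams that record their assessment alongside their data pipeline, the three-part test above could be captured as a simple structured record. The following is a minimal sketch only: the class name, field names, and methods are hypothetical, none of them comes from the ICO guidance, and a passing result here is no substitute for a documented legal assessment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LegitimateInterestsAssessment:
    """Hypothetical record of a three-part test; all names are illustrative."""

    purpose: str                          # the specific interest being pursued
    purpose_is_specific: bool             # purpose test: stated in specific terms
    processing_is_necessary: bool         # necessity test: needed for that purpose
    individuals_interests_override: bool  # balancing test: True means the test fails
    mitigations: list[str] = field(default_factory=list)

    def failed_parts(self) -> list[str]:
        """Return the names of the parts that are not satisfied."""
        failed = []
        if not (self.purpose.strip() and self.purpose_is_specific):
            failed.append("purpose")
        if not self.processing_is_necessary:
            failed.append("necessity")
        if self.individuals_interests_override:
            failed.append("balancing")
        return failed

    def passes(self) -> bool:
        """All three parts must be satisfied."""
        return not self.failed_parts()
```

A record with an empty or open-ended purpose fails the first part regardless of the other answers, which mirrors the requirement that all three parts be demonstrated.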

The purpose test

Controllers must identify a specific interest for processing web-scraped data in the first place. This may be a business interest or wider social interest, but it must be expressed in specific rather than open-ended terms. Developers may deploy an AI model for commercial gain, either on their own platform or by making it available to third parties. To rely on any potential wider social benefits, a developer would need to be able to state exactly what these are. However, if a developer does not know what its AI model will be used for, it cannot ensure that its use will comply with data protection law and respect individuals' rights and freedoms.

The necessity test

The necessity test is a factual assessment to establish whether web scraping is necessary to achieve the interest stated in the purpose test. The ICO recognizes that generative AI training requires large datasets that can only be collected using large-scale data scraping, and that as the technology currently stands, training a generative AI model using smaller datasets is unlikely to be effective. 

The balancing test

Individuals' interests must be weighed against those of the controller, which requires an assessment of the likely impact that the processing would have on them. The ICO describes the collection of individuals' personal data through web scraping as 'invisible processing,' which is problematic since affected individuals are unlikely to be aware that it is taking place. Individuals whose personal data is subject to invisible processing will find it more difficult to retain control over their data or exercise their rights (which the ICO describes as 'upstream risks and harms'). Generative AI can also be used to generate inaccurate information about people, resulting in distress or reputational harm. It may also be used by cybercriminals as a social engineering tool to generate phishing communications tailored to individuals in order to perpetrate fraud (the ICO describes these as 'downstream risks and harms').

Mitigation strategies

The ICO identifies a number of mitigating measures that AI developers may employ to protect individuals' interests and thereby pass the third element of the three-part test.

Deployment by the original developer

Where an AI developer has relied on the public interest of the wider society as their legitimate interest (i.e., part 1 of the three-part test) and deploys their model on their own platform, they should be able to control and evidence whether the generative AI model is actually used for the stated wider benefit. The developer should also be able to assess risks to individuals in the development and post-deployment phases and implement measures to address such risks.

Third-party deployment through an API

A developer may make their model available to a third party through an API, such that the third party does not have their own copy but can raise queries through the API. This can be described as a 'closed-source' approach. A closed-source approach enables the developer to take steps to ensure that the third party's use of their model aligns with the objective identified as the legitimate interest (i.e., part 1 of the three-part test). For instance, the developer could limit queries that could result in a risk of harm to individuals, and monitor the third party's use of the model. The developer could also bind the third party with contractual protective measures.
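
As a rough illustration of the query limiting and monitoring described above, a developer's API layer might log each request and refuse prompts that match patterns it has chosen to block. This is a minimal sketch under stated assumptions: the blocked terms, function names, and the `generate` callable are all hypothetical, and a simple substring check is far cruder than anything a production system would rely on.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("model_api")

# Hypothetical examples of query patterns a developer might choose to refuse.
BLOCKED_TERMS = ("home address of", "write a phishing email")


def is_allowed(prompt: str) -> bool:
    """Return False if the prompt contains any blocked term (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def handle_query(
    client_id: str, prompt: str, generate: Callable[[str], str]
) -> Optional[str]:
    """Log every request, refuse blocked ones, otherwise call the model."""
    allowed = is_allowed(prompt)
    # The log gives the developer a record of how each third party uses the model.
    logger.info("client=%s allowed=%s", client_id, allowed)
    if not allowed:
        return None
    return generate(prompt)
```

Because every request passes through `handle_query`, the developer keeps a single point at which to enforce limits and gather usage evidence, which is the control that a closed-source deployment preserves.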

Open-source approach

By contrast, in an open-source approach, the developer provides third parties with more detailed information so that customers can run their own instance of the model. This results in the developer having far less control. An open-source generative AI model may be implemented in unlimited ways, such that it would be impossible for developers to restrict or monitor how the model is used, and hence its impact on individuals. The developer is also unlikely to know whether the broad societal aims have been achieved.

Conclusion

The ICO concludes that training generative AI models using web-scraped data may be feasible if developers adhere to their legal responsibilities and evidence that they have done so. A key aspect of this will be the legitimate interests test, which requires developers to identify a clear and valid interest, carry out the balancing test against individuals' interests (particularly where they do not retain control of the model), and implement appropriate mitigation measures. The ICO's guidance on legitimate interests as a lawful basis for large-scale data scraping for the purposes of training generative AI models draws on the ICO's existing detailed guidance on legitimate interests. As such, it should already be familiar to developers of generative AI models. The key challenges for developers will be identifying a specific legitimate interest and balancing this against individuals' interests. The deadline for comment on the ICO's first call for evidence was March 1, 2024.

James Castro-Edwards, Counsel
Arnold & Porter, London