EU: Overview of the AEPD-EDPS joint paper on machine learning
Artificial intelligence ('AI') has been identified by the EU as one of the most relevant technologies of the 21st century, and a key strategic component of the EU's digital transformation. For its part, machine learning ('ML'), a sub-discipline of AI, relies heavily on accurate and representative data sets.
With the aim of clearing up common misconceptions surrounding ML systems (with special emphasis on the protection of personal data), the Spanish data protection agency ('AEPD') and the European Data Protection Supervisor ('EDPS') have convened again to prepare a joint paper with technology as the guiding thread, this time titled '10 misunderstandings about machine learning' ('the joint ML paper')1. This document follows on from the AEPD-EDPS joint paper on '10 misunderstandings related to anonymisation'2.
Bárbara Sainz de Vicuña, Isabela Crespo Vitorique, and Mercedes Ferrer Bernal, from GÓMEZ-ACEBO & POMBO ABOGADOS, S. L. P., provide an overview of the joint ML paper and how AI and ML interplay with data protection.
Concepts of AI and ML
Before discussing the joint ML paper itself, it is worth recalling the general definitions of AI and ML.
AI
Although there is currently no legal definition of AI at EU level, pursuant to Article 3(1) of the European Commission's Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act) and Amending Certain Union Legislative Acts ('the AI Act')3, an AI system would be a 'software that is developed with one or more of the techniques and approaches listed in Annex I and can, for a given set of human-defined objectives, generate outputs such as content, predictions, recommendations, or decisions influencing the environments they interact with'. Among these techniques and approaches, Annex I of the referred Proposal includes:
- '(a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;
- (b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;
- (c) Statistical approaches, Bayesian estimation, search and optimization methods'.
In December 2022, the Council of the EU slightly narrowed down this definition in its general approach4, defining AI as 'a system that is designed to operate with elements of autonomy and that, based on machine and/or human-provided data and inputs, infers how to achieve a given set of objectives using machine learning and/or logic- and knowledge based approaches, and produces system-generated outputs such as content (generative AI systems), predictions, recommendations or decisions, influencing the environments with which the AI system interacts'. The European Parliament must now adopt its own position before the next phase of the legislative process (i.e. the interinstitutional negotiations, the so-called 'trilogue') can proceed in order to reach an agreement on the concept of AI and, more broadly, the final text of the proposed AI Act.
With an endless range of applications, AI can contribute, for example, to improving healthcare, sustainability, anti-money laundering, or cybersecurity.
As observed, conceptually, AI is an umbrella term for technologies that seek to imitate human cognitive capabilities (perception, reasoning/decision making, and actuation). Indeed, AI includes several approaches and techniques, including ML, machine reasoning, and robotics (although robotics also includes techniques that are outside AI)5. Particularly, the power of AI comes from ML, which has been a primary focus of AI research since the 1980s6.
ML
As with AI, there is currently no legal definition of ML at EU level. ML is a branch of AI which focuses 'on the development of systems capable of learning and inferring from data to solve an application problem without being explicitly programmed with a set of step-by-step instructions from input to output', as introduced by the Council of the EU in its general approach to the proposed AI Act (Recital 6), which also includes examples of ML approaches such as 'supervised, unsupervised and reinforcement learning, using a variety of methods including deep learning with neural networks, statistical techniques for learning and inference and search and optimisation methods'.
In other words, ML enables computers to learn without being explicitly programmed: they train themselves on data and, in effect, program themselves through experience. ML therefore starts with data (e.g. texts, images, etc.) that is collected and prepared for use as training data. Programmers then select an ML model, provide the data, and let the model train on its own, with the aim of explaining what happens (descriptive function), making predictions (predictive function), or suggesting what action to take (prescriptive function).
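The workflow just described (collect data, select a model, let it fit itself to the data, then use it to predict) can be sketched in a few lines. This is a deliberately minimal illustration with made-up numbers, not an example from the joint ML paper: the 'model' is a simple straight line fitted by least squares, standing in for far more complex ML models.

```python
# Minimal sketch of the ML workflow described above (hypothetical data).

# 1. Training data: input/output pairs, collected and prepared.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x, with small errors

# 2. "Select a model": a linear model y = a*x + b, whose parameters
#    the computer "learns" from the data via ordinary least squares.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) \
    / sum((x - mean_x) ** 2 for x in train_x)
b = mean_y - a * mean_x

# 3. Predictive function: apply the learned parameters to unseen input.
def predict(x):
    return a * x + b

# predict(5.0) is close to 10, although 5.0 never appeared in training.
```

No step-by-step rule from input to output was ever written by the programmer; the mapping was inferred from the data, which is the defining trait of ML noted above.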
Some sectors where ML systems are of significant importance include, as a matter of example, healthcare (e.g. medical diagnostics), digital marketing (e.g. client personalisation), education (e.g. knowledge assessment), or search engines (e.g. enhancement of functionalities).
Main ideas drawn from the joint ML paper
Briefly, the following ideas may be highlighted from each of the ten misunderstandings about ML systems pointed out by the AEPD and the EDPS in their joint ML paper:
Correlation does not imply causality
While the term 'causality' refers to the link between cause and effect, 'correlation' refers to the connection between two or more factors that occur with a certain synchronisation. ML systems are particularly efficient at identifying correlations, but lack analytical ability, hence the need for a human supervisory component to establish the relevant factors or causes behind a prediction or classification.
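The point can be made concrete with a classic textbook illustration (the numbers below are invented, and the example is not taken from the joint ML paper): two quantities that both rise in hot weather correlate almost perfectly, yet neither causes the other.

```python
# Illustrative sketch: strong correlation without causation.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ice_cream_sales = [20, 35, 50, 70, 90]  # rises in summer (made-up figures)
sunburn_cases = [5, 12, 18, 30, 41]     # also rises in summer

r = pearson(ice_cream_sales, sunburn_cases)
# r is close to 1: the series are strongly correlated, but ice cream does
# not cause sunburn; a third factor (hot weather) drives both. Only human
# analysis, not the correlation itself, can establish the actual cause.
```

An ML system fed these two series would happily exploit the correlation for prediction; establishing that the weather, not the ice cream, is the cause remains a human task.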
Adding more data may not necessarily improve the performance of ML models (on the contrary, it could introduce more bias)
Training ML systems requires large amounts of data, depending on the complexity of the task at hand. Nevertheless, performance may not always be enhanced by ML model creation processes that incorporate additional training data. In this regard, the AEPD and the EDPS note that the General Data Protection Regulation (Regulation (EU) 2016/679) ('GDPR') mandates that the processing of personal data be proportionate to its purposes, and that, from a data protection standpoint, 'it is not a proportionate practice to increase substantially the amount of personal data in the training data set to have only a slight improvement in the performance of the systems'.
High-performing ML systems require training data sets over a certain quality threshold (but do not need completely error-free training data sets)
Statistical science suggests that, regardless of individual errors in input data, it is still possible to calculate the average result accurately when processing large volumes of data. Indeed, as ML models rely on the overall quality of the vast data sets used to train them, they are tolerant of occasional inaccuracies in individual records. To illustrate this, the joint ML paper notes that ML models trained with synthetic data (i.e. artificially generated training data sets that emulate real-world data) or using differential privacy (i.e. a technique aimed at preserving privacy by injecting noise into training data sets) achieve good performance.
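The noise-injection idea behind differential privacy can be sketched as follows. This is a simplified, textbook-style Laplace mechanism applied to a count query, with an illustrative epsilon value; it is not drawn from the joint ML paper itself.

```python
# Sketch of differential-privacy-style noise injection (simplified).
import random

def noisy_count(records, predicate, epsilon=1.0):
    """Return a count perturbed with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    # A count changes by at most 1 when one person's record is added or
    # removed (sensitivity 1), so Laplace noise with scale 1/epsilon is
    # the standard calibration. The difference of two exponential samples
    # with rate epsilon follows a Laplace distribution with that scale.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

ages = [23, 35, 41, 52, 29, 60, 38]  # hypothetical records
result = noisy_count(ages, lambda a: a > 30)
# `result` hovers around the true count (5), but the randomisation means
# no single individual's record can be inferred with certainty from it.
```

The per-query answer is slightly wrong, yet averaged over many records or queries the aggregate statistics remain accurate, which is exactly why ML models trained this way can still perform well.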
The development of ML systems does not necessarily require large repositories of data or the sharing of data sets from different sources
On the basis that no single learning architecture fits all tasks, the AEPD and the EDPS note that, although centralised learning (i.e. 'pooling of both data and ML system into a cloud computing infrastructure controlled by the ML developer') is a common development architecture, it may not always be the best solution, and may entail certain risks and considerations from a data protection standpoint in connection with the purpose limitation principle, security matters, exposure to data breaches, etc. As an alternative to a centralised learning architecture, the joint ML paper encourages stakeholders to explore other options, including distributed on-site and federated learning.
Distributed on-site learning refers to cases where 'each data controller server downloads a generic or pretrained ML model from a remote server. Then each local server uses its own data set to train and improve the performance of the generic model. After the remote server has distributed the initial model to the devices, no further communication is necessary'. Conversely, federated learning includes cases where 'each data controller server trains a model with its own data and sends only its parameters to a central server for aggregation'.
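The federated pattern quoted above (each controller trains locally and sends only parameters, never raw data, to a central server for aggregation) can be sketched numerically. In this deliberately simplified illustration the 'parameters' are just local means and the aggregation is a weighted average; real systems exchange model weights, but the data-protection point is the same.

```python
# Rough numeric sketch of the federated learning pattern described above.

def local_train(local_data):
    # Stand-in for local training: the "parameter" here is simply the
    # mean of the local data set (a real system would fit model weights).
    return sum(local_data) / len(local_data)

def federated_average(params, weights):
    # Central server: weighted average of parameters; no raw data is seen.
    return sum(p * w for p, w in zip(params, weights)) / sum(weights)

# Three data controllers, each keeping its data set on-site.
site_data = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 5.0, 5.0, 5.0]]
local_params = [local_train(d) for d in site_data]
sizes = [len(d) for d in site_data]
global_param = federated_average(local_params, sizes)
# global_param equals the mean over all 9 records (4.0 here), even though
# the central server never received any individual record.
```

Each controller's individual records stay on its own infrastructure, which is precisely what mitigates the purpose-limitation, security, and data-breach concerns raised against the centralised architecture.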
Once deployed, ML models' performance may deteriorate and may not improve without further training
No matter how much data is provided to it, an ML model that has been deployed and is no longer trained will not continue to learn further correlations from incoming data. In other words, the performance of ML models may decline, as they cannot be expected to evolve unless they are continually trained. The accuracy of the system may thus be at risk, as its growing mismatch with reality could jeopardise its ability to make appropriate and fair judgments. Considering that the processing environment in which the ML system operates can change over time, it is key to monitor the system to spot any degradation and take appropriate action (e.g. by further training the model with new data, always in compliance with the applicable data protection regulations).
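One simple way to operationalise the monitoring just described is to track the model's recent accuracy against a baseline measured at deployment time and flag degradation. The function names, window size, and thresholds below are hypothetical placeholders, not prescriptions from the joint ML paper.

```python
# Minimal sketch of post-deployment accuracy monitoring (illustrative).

def rolling_accuracy(outcomes, window=100):
    """Share of correct predictions among the most recent `window` ones."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent)

def needs_retraining(outcomes, baseline=0.90, tolerance=0.05):
    # Alert when observed accuracy falls more than `tolerance` below the
    # accuracy measured when the model was deployed.
    return rolling_accuracy(outcomes) < baseline - tolerance

# 1 = correct prediction, 0 = incorrect (hypothetical monitoring log).
healthy = [1] * 95 + [0] * 5    # ~95% correct: no alert
drifted = [1] * 70 + [0] * 30   # environment changed, ~70% correct: alert
assert not needs_retraining(healthy)
assert needs_retraining(drifted)
```

When the alert fires, the appropriate action noted above applies: retrain the model on new data, in compliance with the applicable data protection rules.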
A well-designed ML model may produce decisions that all relevant stakeholders can understand
There are several approaches to explaining decisions based on ML models. Depending on the individuals and the context, different degrees of explanatory elaboration may be required. The best strategy will be one that clearly and effectively explains the steps taken to reach decisions, from the training and creation of the ML model onwards.
It is possible to provide meaningful transparent information to data subjects without damaging intellectual property
The GDPR generally requires controllers to adequately inform data subjects of the processing carried out on their personal data (Articles 13 and 14 of the GDPR). In the context of ML processing, controllers should take care to inform data subjects about the potential impact of the processing on their daily lives, by providing meaningful information that explains the logic applied, its significance, and its expected consequences (Articles 13(2)(f) and 14(2)(g) of the GDPR). However, transparency in the context of ML systems does not necessarily imply disclosing detailed technical information, as in most cases this would be meaningless to users. Among others, the AEPD and the EDPS cite input personal data and data generated as output, disclosures to third parties, and risks to rights and freedoms as examples of meaningful information.
ML systems are subject to a variety of biases (and some of these may come from human biases)
In theory, ML models could be free from human bias or favouritism toward an individual or a group based on their inherent or acquired characteristics. Nonetheless, ML systems are selected, designed, tuned, and trained with data that is typically selected by humans. Studies show that ML systems may be susceptible to more than 20 types of bias stemming from their data processing. For instance, some of these may derive from human decisions, or may replicate them (e.g. a model trained with historical CEO profiles may be biased toward male candidates).
Predictions made by ML systems may only be accurate when future events follow past trends
ML processes data to project potential future outcomes. As a result, ML systems do not make guesses about the future, but rather forecasts based on past events. Most ML models may require a significant amount of new data in order to adjust their predictions in completely new scenarios or rapidly changing circumstances.
The capacity of ML to detect non-evident correlations in data may lead to the discovery of new data, which may not be known to the data subject
Data correlation is a strong suit of ML systems, as they are able to identify patterns in personal data that have not been explicitly sought and may even be unknown to the persons affected (e.g. a predisposition to a disease). From a data protection perspective, this possibility gives rise to a number of concerns. For instance, data subjects may be impacted by decisions made on the basis of information they were not aware of, or were unable to foresee or respond to. When ML systems use personal data to draw conclusions that go beyond the stated purpose of the processing (e.g. predictions via profiling), controllers must still observe all data protection principles established in the GDPR, including lawfulness, transparency, and purpose limitation (Articles 5(1)(a) and 5(1)(b) of the GDPR). Moreover, '[a]ny type of further processing of personal data requires a legal basis and a clear purpose'.
Conclusion
As we enter the new year, and in connection with the technological trends set to mark 2023, we have taken the opportunity to discuss the AEPD-EDPS joint ML paper. Indeed, before designing, developing, and implementing a project involving ML, and to the extent personal data is processed, a comprehensive analysis should be conducted from a data protection standpoint, with special emphasis on its impact on data subjects' rights and freedoms. To that end, understanding the ML system at hand from a technical perspective, including how it operates, its possibilities, and its risks, becomes crucial.
Bárbara Sainz de Vicuña Senior Associate
[email protected]
Isabela Crespo Vitorique Senior Associate
[email protected]
Mercedes Ferrer Bernal Associate
[email protected]
GÓMEZ-ACEBO & POMBO ABOGADOS, S. L. P., Madrid
1. Available at: https://edps.europa.eu/data-protection/our-work/publications/papers/2022-09-20-aepd-edps-joint-paper-10-misunderstandings-about-machine-learning_en
2. Available at: https://edps.europa.eu/data-protection/our-work/publications/papers/aepd-edps-joint-paper-10-misunderstandings-related_en
3. Available at: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:52021PC0206&from=EN
4. Available at: https://data.consilium.europa.eu/doc/document/ST-14954-2022-INIT/en/pdf
5. See 'A definition of AI: main capabilities and disciplines' of the Independent High-level Expert Group on Artificial Intelligence set up by the European Commission (p. 5): https://digital-strategy.ec.europa.eu/en/library/definition-artificial-intelligence-main-capabilities-and-scientific-disciplines
6. See 'The impact of artificial intelligence on the future of workforces in the European Union and the United States of America. An economic study prepared in response to the US-EU Trade and Technology Council Inaugural Joint Statement' of the Council of Economic Advisers of the White House (p. 4): https://www.whitehouse.gov/cea/written-materials/2022/12/05/the-impact-of-artificial-intelligence/