Articles

The articles are not legal advice and they contain simplifications. They are intended for information purposes only. Merkurius is not responsible in any way if the articles are used as a basis for any action, decision or inaction, and they are provided without any warranties. The writers gladly provide further information on the subjects discussed in the articles.

Machine learning and data protection

- what should be considered when using personal data in algorithms?

Machine learning is today’s buzzword. It is the most researched branch of artificial intelligence and it can be applied for numerous purposes, ranging from health care diagnostics to fraud detection and the granting of loans. Machine learning is about algorithms that learn from the data used to train them. In the future, such algorithms will learn and develop independently, i.e. they will work without being explicitly programmed.

The EU’s General Data Protection Regulation (GDPR) will apply from 25 May 2018. The key concept in the regulation is personal data, and its definition is broad. According to the GDPR, personal data means any information relating to an identified or identifiable natural person. In practice, such a broad definition means that several machine learning applications use personal data both in the training phase and later when the algorithm is applied. Hence, we have compiled below seven tips that should be considered when using personal data in machine learning.

Assess the nature of the training data and input data

If the training data or input data (meaning the data that is fed into the algorithm when applying it) contains any personal data, the GDPR applies in full. Only anonymized data falls outside the scope of the GDPR. Pseudonymized data, by contrast, is considered personal data.
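To illustrate the distinction, the following is a minimal Python sketch of pseudonymization, assuming a simple keyed-hash scheme and illustrative field names. Because the key holder can re-link the pseudonym to the person, the output remains personal data under the GDPR:

```python
import hashlib
import hmac

# Hypothetical secret key; whoever holds it can re-link pseudonyms to people.
SECRET_KEY = b"store-separately-under-strict-access-control"

def pseudonymize(identifier: str) -> str:
    # Replace a direct identifier with a keyed hash. The mapping is
    # reproducible for anyone holding SECRET_KEY, so re-identification
    # remains possible: this is pseudonymization, not anonymization.
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "loan_amount": 12000}
record["email"] = pseudonymize(record["email"])
print(record)  # the identifier is masked, but not irreversibly removed
```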

Please note that anonymization remains an uncertain measure under the GDPR, because statements of the authorities imply that only risk-free anonymization, which irreversibly prevents identification, would be compliant with the GDPR. In practice, especially in the era of big data, there is no completely irreversible way to anonymize data. However, the wording of the regulation implies that the regulation’s general risk-based approach also applies to anonymization. Thus, anonymization would be sufficient if identification is not reasonably likely. Because of these divergent positions, it remains to be seen how anonymization will be treated by the courts.

Check the background of the data

The principle of purpose limitation restricts the use of personal data. According to the principle, data that has already been collected can be used only for the purpose it was originally collected for. Hence, later use of personal data in machine learning as part of the training data is not necessarily possible without obtaining consent from the data subject.

However, it is possible to use personal data if the further processing is not incompatible with the original purpose. When assessing compatibility, account should be taken of, inter alia, the context in which the personal data was collected and the relationship between the data subject and the controller. In practice, as regards machine learning, it might be hard to identify an original purpose that fulfills the regulation’s requirements for further processing. Machine learning is an unfamiliar concept to most people, and it is therefore difficult to imagine situations where further processing for machine learning could reasonably be inferred from, for example, the relationship between the data subject and the controller.

However, there is an exception to the principle of purpose limitation: the use of data for statistical purposes is deemed not to violate the principle. It has been argued that most big data analytics, which also includes machine learning, is statistical in nature. Hence, the GDPR offers an explicit pathway for machine learning to use retained data. Nevertheless, according to the recitals of the regulation, if machine learning is used for automated decision-making, it is not possible to invoke the statistical purposes exception. The recitals are not legally binding, so it remains to be seen what practical effect a limitation placed only in the recitals will have.

Does the algorithm make decisions concerning individuals?

The GDPR regulates automated decision-making in Art. 22. The starting point is that automated decisions cannot be made without obtaining consent from the data subject. However, for a situation to fall under Art. 22, the decision must be based solely on automated processing. In practice, this requirement is often not met in machine learning, since the algorithm provides only suggestions for decisions, i.e. it is merely a decision-support tool for a human being, who considers the suggestions and makes the final decision. In these situations, the decision is not based solely on automated data processing, and thus Art. 22 and the prohibition of automated decision-making do not apply.
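The following is a minimal sketch of such a decision-support pattern, with hypothetical feature names and a stand-in scoring rule. The point is only the control flow: the model suggests, the human decides. Note that the human review must be genuine for the decision not to count as solely automated:

```python
def model_suggestion(applicant: dict) -> dict:
    # Stand-in for a trained model; the scoring rule is purely illustrative.
    score = 0.6 * applicant["income_score"] + 0.4 * applicant["history_score"]
    return {"score": round(score, 2), "suggestion": "approve" if score > 0.5 else "decline"}

def decide(applicant: dict, human_decision_fn) -> str:
    # The model output is shown to a human reviewer, who makes the final
    # decision; the decision is therefore not based solely on automated
    # processing, so Art. 22 does not apply.
    suggestion = model_suggestion(applicant)
    return human_decision_fn(applicant, suggestion)

def loan_officer(applicant, suggestion):
    # In a real system this must be a genuine review step, not a rubber stamp;
    # the officer may follow the suggestion or override it.
    print(f"Model suggests {suggestion['suggestion']} (score {suggestion['score']})")
    return suggestion["suggestion"]

print(decide({"income_score": 0.8, "history_score": 0.4}, loan_officer))
```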

Provided that the situation falls under Art. 22, the data subject has the right to express his or her point of view and contest the decision. In addition, the data subject has the right to obtain human intervention on the part of the controller.

Data protection impact assessment should be carried out

As a rule of thumb, due to the novel nature of machine learning and its applications, as well as the vast amounts of data involved, a data protection impact assessment should be carried out. This is especially clear when machine learning is used for decision-making concerning individuals.

Note the data subject’s right to withdraw consent

Using consent as the legal basis for data processing in machine learning involves risks. According to the GDPR, data subjects have the right to withdraw their consent and to be forgotten, i.e. to have their personal data erased. However, erasing a single piece of data from an algorithm is difficult in practice and sometimes even impossible, which means that in such situations the controller might not have the right to use the model, or the model might have to be re-created without this personal data. Nevertheless, only a few machine learning algorithms retain the data itself. A typical algorithm creates rules based on the training data and retains only those rules. It is likely that such rules are not regarded as personal data whose erasure could be demanded. Having said that, the GDPR does not offer an unambiguous answer here, and thus there remains a risk of having to erase and re-create the algorithm. Notwithstanding the aforementioned, if the processing is based on the statistical purposes exception, the data subject does not have the right to be forgotten insofar as the processing is necessary for statistical purposes.
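As a sketch of the conservative response to a withdrawal of consent, the snippet below drops every record of the data subject and re-creates the model from the remaining data. All names are hypothetical, and the trivial training routine stands in for whatever the controller actually uses; note that the resulting model keeps only derived rules, not the raw records:

```python
# Illustrative training set keyed by data subject; all names are hypothetical.
training_data = [
    {"subject_id": "s1", "features": [0.2, 1.0], "label": 0},
    {"subject_id": "s2", "features": [0.9, 0.3], "label": 1},
    {"subject_id": "s3", "features": [0.5, 0.7], "label": 1},
]

def train(records):
    # Stand-in for the real training routine: a trivial "model" that keeps
    # only the mean feature vector per label, i.e. rules rather than raw data.
    by_label = {}
    for r in records:
        by_label.setdefault(r["label"], []).append(r["features"])
    return {label: [sum(col) / len(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def handle_erasure_request(records, subject_id):
    # Remove every record of the data subject, then re-create the model
    # from the remaining data.
    remaining = [r for r in records if r["subject_id"] != subject_id]
    return remaining, train(remaining)

training_data, model = handle_erasure_request(training_data, "s2")
print(model)  # re-trained without the erased data subject's records
```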

Pay attention to ensuring non-discrimination

Both the GDPR and human rights instruments, such as the European Convention on Human Rights, require non-discrimination. As regards algorithms, discriminatory effects arise easily and often unintentionally. Attention should therefore be paid to the training data, which should be quantitatively adequate and non-discriminatory in quality. Certain groups of people should not be overrepresented or underrepresented in the training data. In addition, discriminatory effects might lurk in the criteria or technical policy that the algorithm is instructed to follow, and they can arise from matters that seem completely trivial. For example, if an algorithm is instructed to take a postal code into account, the postal code can in reality indicate both wealth and ethnicity.
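One simple first check is to measure how each group is represented in the training data. The sketch below does this for a hypothetical protected attribute on toy data; a real audit would use the actual training set and the groups relevant to the application:

```python
from collections import Counter

def representation_shares(records, group_key):
    # Share of each group in the training data; a heavily skewed share is a
    # warning sign of over- or under-representation.
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {group: round(count / total, 2) for group, count in counts.items()}

# Toy data with a hypothetical group attribute.
training_data = [
    {"group": "A"}, {"group": "A"}, {"group": "A"}, {"group": "B"},
]
print(representation_shares(training_data, "group"))  # {'A': 0.75, 'B': 0.25}
```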

Simplify communications and pay attention to transparency

The GDPR requires that personal data is processed transparently. Transparency materializes through the controller’s various obligations to keep the data subject informed. All communications to the data subject should be carried out so that the information given is easily understandable and the language used is clear and plain. For example, in the context of machine learning, all information should be provided without any technical jargon. Transparency should also be borne in mind when designing an algorithm: it should remain possible, also in the future, to provide information about how the algorithm functions.
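As one possible design for jargon-free communication, the sketch below maps a model’s internals to plain-language wording. It assumes a simple linear model whose per-feature weights are accessible; all feature names and phrasings are illustrative:

```python
def explain_suggestion(weights, applicant, plain_names, top_n=2):
    # Rank features by their contribution to this applicant's score and
    # describe the strongest ones in everyday language.
    contributions = {f: weights[f] * applicant[f] for f in weights}
    top = sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)[:top_n]
    reasons = " and ".join(plain_names[f] for f in top)
    return f"The suggestion was mainly influenced by {reasons}."

weights = {"income": 0.7, "late_payments": -0.9, "age_of_account": 0.1}
applicant = {"income": 0.6, "late_payments": 0.8, "age_of_account": 0.5}
plain_names = {
    "income": "your reported income",
    "late_payments": "earlier late payments",
    "age_of_account": "how long your account has existed",
}
print(explain_suggestion(weights, applicant, plain_names))
```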

Many of the GDPR’s requirements are challenging due to the characteristics of machine learning. However, by taking the regulation into account from the design phase of the algorithm onwards, many of the potential problems can be resolved.

Ida Koskinen, Lawyer

Jussi Lampinen, Partner | CFO