Mieke De Ketelaere, external AI expert
I have personally experienced the harsh laws of television towards the end of that episode. As I was explaining that there are several ways to ensure privacy without compromising on innovation, we were told there was not enough time to delve deeper into that observation. But that’s what I love about the variety of media available today: I can share these options with you here.
Below are my three favorite methods for combining privacy with innovative data usage.
1. Synthetic data generation
Synthetic data is computer-generated data that mimics real data; in other words, data created by a computer, not by humans. It is not collected through any real-life survey or experiment, and it does not refer to any existing individual.
Initially, synthetic data was created to improve machine learning algorithms. Nowadays, we are starting to understand that its purpose can be much more far-reaching: it can be used to get around security and privacy concerns with real datasets, when those data cannot be used or acquired for learning purposes. Typical examples include medical or military data, which are by default very sensitive. If you are looking for even more depth and detail, I gladly refer you to this page.
Synthetic data, however, does not come without limitations. While synthetic data can mimic many properties of authentic data, it does not copy the original content exactly. Models look for common trends in the original data when creating synthetic data and, in turn, may not cover the corner cases that the authentic data did. In some instances this is not a critical issue, but in some system-training scenarios it can severely limit a model's capabilities and negatively impact output accuracy. Something we need to be aware of.
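To make this concrete, here is a minimal sketch of one of the simplest synthesis strategies: fit a distribution to each column of a (hypothetical) real dataset and sample new rows from it. The dataset and column choices below are illustrative, not from any real source; note how the sketch captures each column's mean and spread but, as discussed above, loses correlations and corner cases.

```python
import random
import statistics

# Toy "real" dataset: hypothetical (age, income) pairs.
real_rows = [(34, 52000), (45, 61000), (29, 48000), (52, 75000), (41, 58000)]

def synthesize(rows, n):
    """Sample n synthetic rows from per-column Gaussians fitted to real data.

    Each column's mean and standard deviation are preserved, but joint
    structure (e.g. the age/income correlation) and rare corner cases are not.
    """
    columns = list(zip(*rows))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(random.gauss(mu, sigma) for mu, sigma in params)
            for _ in range(n)]

synthetic_rows = synthesize(real_rows, 100)
```

Real-world synthetic data generators (GANs, variational autoencoders, copula models) are far more sophisticated, but they share the same fundamental trait: they learn and reproduce common trends, not the original records themselves.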
2. Federated Learning
There is no doubt that self-learning is a powerful tool for innovation. But when a model needs training and is fed private data, data centralization can become a serious issue. Privacy advocates may oppose the idea of bundling data that could lead to identifying individuals. If this bundling is done by a single party, everybody needs to fully trust that party with the datasets. But if the data are not centralized, the learning effects may be significantly limited.
This is where federated learning comes into play. Federated learning was first proposed by Google researchers in a paper published in 2016. They describe it as an alternative to centralized AI training: a shared global model is trained under the coordination of a central server, using data from a federation of participating devices. In this setup, the different devices can contribute to the training and knowledge of the model while keeping most of the data on the device.
Google describes the approach to federated learning in four simple steps:
- A subset of existing clients is selected, each of which downloads the current model.
- Each client in the subset computes an updated model based on its local data.
- The model updates are sent from the selected client devices to the server.
- The server aggregates these models (typically by averaging) to construct an improved global model.
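The four steps above can be sketched in a few lines. The example below is a toy illustration (not Google's implementation): the "model" is a single weight for the linear relation y = w·x, each client holds its own private (x, y) pairs, and the server only ever sees the locally trained weights, never the raw data.

```python
import random

# Toy setup: four clients, each holding private (x, y) pairs generated
# from y ≈ 2x. Federated averaging should recover the true weight, 2.0.
random.seed(0)
clients = [[(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(1, 6)]
           for _ in range(4)]

def local_update(w, data, lr=0.01, steps=20):
    """Step 2: a client refines the downloaded model on its own data only."""
    for _ in range(steps):
        # Gradient of the mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

w_global = 0.0  # Step 1: every selected client downloads this model.
for _ in range(10):
    # Steps 2-3: clients train locally and send back only their weights.
    local_weights = [local_update(w_global, data) for data in clients]
    # Step 4: the server averages the updates into an improved global model.
    w_global = sum(local_weights) / len(local_weights)

print(w_global)  # converges to roughly 2.0
```

The key privacy property to notice: the raw (x, y) pairs never leave the `clients` lists; only the averaged weight updates travel to the server.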
Federated learning clearly combines the best of both worlds: it distributes the training work across a large number of devices and thus avoids the need to centralize the data used to optimize and train the model. It can improve the quality of centralized machine learning models while preserving the privacy of the training datasets.
However, federated learning does not come without problems. Like any other distributed software architecture, decentralization introduces challenges in work coordination, management and monitoring. Federated learning should therefore be viewed as an interesting complement to, rather than a replacement for, traditional centralized learning architectures.
3. Differential Privacy
Differential privacy can solve problems that arise when the use of sensitive data is needed and anonymization of the data isn't good enough. For example, in 2006, Netflix released a dataset of user ratings as part of a competition to see whether anyone could outperform its collaborative-filtering algorithm. The dataset contained no personally identifying information, but researchers were still able to breach privacy: using auxiliary information from public sources such as IMDb reviews, they showed that 99% of the supposedly anonymous records could be uniquely re-identified.
Differential privacy offers a solution in this context. Differentially private algorithms are resilient to adaptive attacks that use auxiliary information. They work by injecting calibrated random noise, so that everything an adversary receives becomes noisy and imprecise, making it much more difficult (if feasible at all) to breach privacy. The technique is in use at Google and Apple, among others. A differential privacy method ensures the anonymity of each member of the group throughout the information retrieval process.
An important caveat of this approach, however, is that you still need to make a trade-off between utility and information leakage: the more you protect individual privacy, the less accurately you can compute aggregate statistics about the collection.
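A minimal sketch of the classic Laplace mechanism makes both the idea and the trade-off tangible. The dataset and the query below are made up for illustration; the mechanism itself (adding Laplace noise scaled to the query's sensitivity divided by a privacy budget epsilon) is the standard textbook construction.

```python
import math
import random

def dp_count(values, predicate, epsilon):
    """Return a differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    # Draw Laplace(0, 1/epsilon) noise by inverse-transform sampling.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical sensitive attribute: ages of eight individuals.
ages = [34, 45, 29, 52, 41, 38, 60, 27]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
```

The trade-off discussed above lives in `epsilon`: a small epsilon means more noise (stronger privacy, less accurate counts), while a large epsilon means less noise (better utility, weaker privacy).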
I guess the conclusion is that there's no free lunch. I hope, however, that all stakeholders involved in the AI business will continue testing these technical solutions, and perhaps even develop new ones, so that innovation and privacy can walk hand in hand.
In a future blog post, I will gladly share some more mechanisms, such as ‘fair algorithms’, that contribute to an environment in which new innovations can be fully explored without compromising data subjects’ privacy.
Curious to find out more? Then you have just found one more reason to join us on our Curiosity forum! Do not hesitate, join us on 13 June by registering here.