Layman’s Guide to Machine Learning and Customer Data Privacy By Jack Mardack January 16, 2020 It feels like only yesterday that “Machine Learning” and “Artificial Intelligence” were equated with science fiction by most regular folks outside of tech. As a movie-going culture, we have a long history of associating these concepts with self-aware robots or rogue computer programs that could escape the laboratories where they were created and threaten all humanity. But in 2020, ML and AI have been considerably demystified. And yet, even if it seems less likely now that the Singularity will be triggered by the reckless ambition of some tech startup, consumers have new reasons to be concerned. Yes, ML and AI are going mainstream, and it’s not just disruption-minded startups that are leveraging these technologies. Enterprises in a wide range of established industries are finding solid business reasons to fund these advanced projects and bring them out of the laboratory into production, with lots of exciting implications for their customers. One implication, which is the subject of this article, is the creation of a new class of personal data privacy vulnerabilities. And a majority of businesses that want to leverage ML are going to have to learn to protect their customers from these new vulnerabilities. These concerns arise in the first place because the “models” that make ML work have to be trained with data – lots of it. As enterprises seek to create business value from these new ML programs (such as conversational agents, real-time risk and fraud analysis, and predictive healthcare), they are going to look to train their models with customer data of some sort. In many cases, deeply private customer data. As we usher in what is certainly a new era in consumer awareness of data privacy rights, combined with the advent of new regulations such as GDPR and CCPA, it is timely to contemplate how ML and consumer data privacy will co-exist. No longer a hypothetical problem Unfortunately, some of the toothpaste has already escaped the tube. A number of recent controversies expose the potential scale of the ML + Customer Data Privacy problem. Google (whose health data-sharing arrangement with Ascension became the subject of scrutiny in November) ditched its plans to publish chest X-ray scans over concerns that they contained personally identifiable information. The Royal Free London NHS Foundation Trust, a division of the UK’s National Health Service based in London, provided Alphabet’s DeepMind with data on 1.6 million patients without their consent. This past summer, Microsoft quietly removed a data set (MS Celeb) with images of more than 10 million people after it was revealed that some weren’t aware they had been included. And it turns out those of us who’ve been getting a creepy feeling whenever we expressed our deepest desires to an AI-based wish fulfillment engine, had good reason to. Apple and Google have been the subject of recent reports that revealed the potential misuse of recordings collected to improve artificial agents like Siri and Google Assistant. In April, Bloomberg revealed that Amazon had been using contractors to transcribe and annotate audio recordings pulled from Alexa-powered devices, prompting the company to roll out new user-facing tools that let you delete your cloud-stored data. Why ML exacerbates data privacy Within a database, the various data points associated with an individual can be distinguished, from a privacy standpoint, by what class of information they contain. A dataset is made up of “data points” (specific members of a population) and “features” (the values of the attributes associated with each person). In the case of medical records, for example, features might be their name, age, gender, state, religion, and disease. The first column represents Personally Identifiable Information (PII), which uniquely identifies a person, eg: their full name or social security number. The second type of feature contained is termed Quasi-Identifiers (QI), which are categories like age or gender that may be attributable to more than one individual. Therefore, this information on its own is not sufficient for identification. However, if combined with other QIs and external information, it is sometimes possible to “re-identify” an individual. Traditionally, removing the column containing sensitive information in a dataset meant that this specific information could not be re-inferred from the dataset itself, but only by combining and querying external information. AI, however, can recreate identities even with the identity-indicator removed. From a set of job applicant resumes, for example, gender might be removed to protect against gender discrimination during the candidate evaluation process. Although the resumes have been de-identified in that sense, an ML tool might be able to pick up subtle nuances in language use and from this infer the candidate’s gender. Here, removing the column is not enough to strip out sensitive information securely. AI technologies have not historically been developed with privacy in mind. To reach reliable levels of accuracy, models require large datasets to ‘learn’ from. In order to shield individual privacy in the context of big data, different anonymization techniques have conventionally been used. The three most relevant are K-anonymity, L-diversity, and T-closeness, of which we will briefly examine the first. In K-anonymity, selected Quasi-Identifiers (eg: name, religion) of certain individuals are removed or generalized (eg: replacing a specific age with an age span) so that every combination of identity-revealing characteristics occurs in at least k different rows of the dataset. K-anonymity is a “hiding in the crowd” approach to protecting privacy. If each individual is part of a larger group, any of the records could correspond to a single person. L-diversity and T-closeness are extensions of this concept, which are described in more detail here. These modifications would be applied before data is shared or used in a training model. This is called Privacy-Preserving Data Publishing. However, with the rise of AI, this form of protection is insufficient. Conventional statistical modeling methods would only be able to consider a limited number of variables. But today, because of regularization techniques and the declining cost of cloud-based computation, it has become possible for ML models to consider thousands of variables from which to make a single prediction. With algorithms that can make inferences from such large and complex datasets, three new conceptual issues arise. Firstly, with the expanded dimensionality in ML training sets, implicitly there is now a greater likelihood of sensitive information being included. Second, these powerful new models are more likely to be able to discern that sensitive information (eg: reconstructing gender from subtle differences in word choice). And thirdly, ensuring comprehensive privacy and anonymity for the vast amounts of data incorporated into complex ML models itself presents a major challenge. Intro to Privacy-Preserving Machine Learning To address the above challenges, there are a number of promising techniques that are being tested to provide suitable protection of individual data privacy in ML. These include Federated Learning, Differential Privacy, and Homomorphic Encryption. For the most part, these are all in the preliminary stages of exploration as regards their potential use to protect consumer data privacy in ML at scale, and are in the hands of researchers in academe or at the largest technology players. Which of these becomes the standard and how they will bridge the gap to meet the needs of ML in production remains to be seen. Federated Learning Federated Learning is an example of the more general approach of “bringing the code to the data, instead of the data to the code”, and thus addresses some of the basic problems of privacy, ownership, and physical location of data. Federated Learning is a collaborative approach that involves training ML models on a large set of decentralized data present on multiple client devices. The model is trained on client devices and thus there is no need for transferring the user’s data. Keeping their personal data on the client’s device enables them to preserve direct and physical control of their own data. Holding the data samples on client devices, without the need for exchanging those samples, enables multiple parties to develop a common ML model without having to share all the data amongst themselves which creates the increased vulnerability that comes from putting all the data together in any one place. Google, a Federated Learning pioneer, has used FL for personalization in its Gboard predictive keyboard across tens of millions of iOS and Android devices. And together with the launch of the Pixel 4, Google debuted an improved version of its Now Playing music-recognition feature that aggregates the play counts of songs in a federated fashion, identifying the most popular songs in a given geographic location. Among the drawbacks of the Federated Learning approach are the fact that it requires lots of processing power and memory from the federated devices. Also, because the models can only be trained when the devices are connected and able to transfer data, this may introduce a situational bias to the data that enters the model. For example, a user may listen to different music sources (and therefore different songs) when on WiFi versus cellular data. And lastly, Federated Learning is vulnerable to “Poisoning Attacks”, where a generative adversarial net (or GAN) may pretend to be a benign participant to gain control of the model. Differential Privacy Differential Privacy is a promising, if not new, approach to the preservation of privacy in ML. Developed by Cynthia Dwork et al at Microsoft in 2006, DP attempts to ensure that no individual can be linked to the data used to train an ML model. This doesn’t mean you can’t discover anything about an individual in a dataset. For example, publishing data that shows a strong correlation between smoking and lung cancer would reveal sensitive information about an individual known to smoke. Rather, the ultimate privacy goal is to ensure that anything that can be learned about an individual from the released information, can be learned without that individual’s data being included. In general terms, an algorithm is “differentially private” if an observer examining the output is not able to determine whether a specific individual’s information was used in the computation. DP works to protect individual privacy by adding random noise to the dataset in a carefully determined distribution, which will lead to the “perturbation” of the true answer. The true answer plus noise is always returned as output to the user. The degree of perturbation can be accounted for so that overall accuracy does not significantly decrease, while for individual data there always remains a degree of “plausible deniability” due to the randomness of the noise. A desirable aspect of DP is that it is mostly compatible with, or even beneficial to, meaningful data analysis despite its protective strength. Within empirical science, there is often the threat of “overfitting” data to permit conclusions that are specific to the dataset, and lose accuracy when predictions are generalized to the larger population. Because DP also offers protection from such overfitting, its benefits thus go even beyond data security. Apple has been using some form of DP since 2017 to identify popular emojis, media playback preferences in Safari, and more. The company combined DP with Federated Learning in its latest mobile operating system release (iOS 13). Both techniques help to improve the results delivered by Siri, as well as apps like Apple’s QuickType keyboard and iOS’ Found In Apps feature. The latter scans both calendar and mail apps for the names of contacts and callers whose numbers aren’t stored locally. Homomorphic Encryption Homomorphic Encryption, like DP, is not new but is enjoying renewed relevancy for its potential utility in privacy preservation for Machine Learning. The essential idea is that we can use data in encrypted form to train and run the ML model. From Wikipedia: “Homomorphic Encryption is a form of encryption that allows computations to be carried out on ciphertext, thus generating an encrypted result which, when decrypted, matches the result of operations performed on the plaintext.” For example, this means you could use “Qhjr Thykhjr“ (using the Ceasar cipher) in an ML training model in place of my name (Jack Mardack), and return a similarly encrypted output. You can also encrypt the ML model itself, which is valuable in the case of Federated Learning, where it is necessary to transfer the model to data (eg: to the customer’s device). This means you can protect the model itself with encryption, as well as the training data. It may seem obvious to use encryption to protect data, but the usefulness of the approach is significantly reduced because of performance implications. Homomorphic Encryption libraries don’t take advantage of modern hardware acceleration, making the ML models ~10X slower than other approaches. But there are research teams at Intel, Facebook, and IBM (among others) that are working to help close the gap. There is naturally a lot of interest in Homomorphic Encryption in ML use cases from the more highly-regulated industries, such as healthcare and banking, where the possibility of end-to-end encryption is highly desirable. Conclusion We are at an interesting crossroads, to be sure. There is near-universal agreement that ML and AI are poised to radically transform human experience on multiple life-changing dimensions, from how we stay healthy, to how we work and create, to the facilitation of myriad mundane human activities. But it seems both sides of the risk/reward scale are changing for consumers. Up until now, the benefits of sharing our data with commercial enterprises have been comparatively modest – relating to more personalized news feeds in our social apps, or more relevant recommendations from the e-commerce sites we buy from. Very soon, however, the value we stand to gain from ML and AI is going to be much, much greater. That these technologies will mean the difference between life and death for many of us is not an exaggeration. But, the nature of the data we will have to share in order to take advantage is also much more sensitive, creating unprecedented exposure for consumers. The interplay between both sides of this equation is going to drive both our adoption (the willingness with which we’ll share our most personal data), as well as the further evolution of privacy-protection methods, such as those described above. In that regard, Privacy-Preserving ML is very much in its infancy. The work to date on these fronts has been done almost entirely by researchers based on their own speculations about the nature of likely attacks or breaches. Unlike, for example, the state of protection we enjoy from computer viruses today (which draws on decades of real-world attacks), we have no idea what the “bad guys” will actually do. We’ll have to wait and see, and then learn, improve, and catch up. Further reading: A Major Drug Company Now Has Access to 23andMe’s Genetic Data. Should You Be Concerned? About Jack Mardack Jack is the VP of Marketing at Actian.