Data Privacy in the Age of AI
At present, AI's accessibility and capabilities have outpaced the laws that regulate it and the use of our private data. Here are a few of the ways AI may erode privacy until the law catches up.
Privacy concerns existed long before the internet. The threats deepened as everyday internet use made data more accessible and its uses more flexible. But those earlier threats pale in comparison with the possibilities the AI explosion has created.
Here are some, but by no means all, of the potential privacy concerns associated with AI systems, contrasted with what they were before generative AI models like OpenAI's ChatGPT:
Collection of Personal Data – Data gathering in some form has existed as long as commerce itself, and personal data collection has grown more and more pervasive. For example, about ten years ago I bought five boxes of newborn diapers, each a gift for friends who were all having babies at the same time. As I paid, the cash register began spitting out coupons for things like baby formula, baby lotion, and baby blankets, clearly because of the link between purchasing diapers and purchasing formula. This was a form of machine intelligence that tracked what I was purchasing with the aim of getting me to purchase related items at the same retailer. That was over a decade ago, when this type of data tracking and analysis was available only to those with exceptional wealth, such as large corporations, or with governmental power. More data, better analytics and algorithms, increased computing power, data scraping, information tracking, and the wide accessibility of AI all arrived with a steep price tag: less data privacy. The privacy I lost when a shopping app or rewards program let the retailer know what I needed at the store before I did looks completely innocuous compared to the privacy loss we now experience every day. Still, that seemingly innocuous transaction exposed a deeper tradeoff: my privacy decreased as the corporation added details to its profile of me.
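That register was almost certainly running some form of market-basket analysis: if baskets containing diapers usually also contain formula, print a formula coupon for the diaper buyer. Here is a minimal sketch of the idea; the transactions, the rule, and the confidence threshold are all hypothetical illustrations, not any retailer's actual system:

```python
# A minimal market-basket sketch. The transactions, the rule, and the
# 0.5 confidence threshold are all hypothetical.

transactions = [
    {"diapers", "formula", "baby lotion"},
    {"diapers", "formula"},
    {"diapers", "baby blankets"},
    {"bread", "milk"},
    {"diapers", "formula", "baby blankets"},
]

def confidence(antecedent, consequent, baskets):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    matching = [b for b in baskets if antecedent <= b]
    if not matching:
        return 0.0
    return sum(1 for b in matching if consequent <= b) / len(matching)

# Three of the four diaper baskets also contain formula -> confidence 0.75.
if confidence({"diapers"}, {"formula"}, transactions) > 0.5:
    print("Print coupon: baby formula")
```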
For example, let's imagine I want to track everyone who used a taxicab to travel to Katz's Delicatessen™ in New York City. I'm not sure why I would want to do that, but go with me on this illustrative example. And besides, their pastrami sandwich is delicious. Before generative AI, my process would look like this. I could assume that everyone dropped off by a taxi within 200 feet of the deli was headed to Katz's. To track who traveled there by taxi, I would first need to figure out the location parameters. Not too difficult. Then I would need access to taxi information. Which taxis were taking people to Katz's? Where did the trips originate? Who were the individuals paying for the rides? This type of information would likely be accessible only with a court order or subpoena. Next, I would need extensive knowledge of computer programming, and I would need to spend hours and hours writing code. Then I would need a way to feed all of the information into the program, perhaps by hiring a data entry specialist if I didn't have an electronic version of the information. And finally I would need hardware sufficient to execute the program. That's thousands of dollars in cost, unless I'm a computer genius, which I'm not. Probably not worth the time or effort.
Compare that with the same task using today's technology. With free and low-cost AI models like ChatGPT accessible to nearly everyone, my task would be far simpler, cost little to no money, and be easy to extend in scope. First, the data would be much easier to obtain; New York City publishes taxi trip records online.[1] (Thanks, Big Data.) I wouldn't even need to write a program to execute that one task. I could simply copy the taxi information into ChatGPT and ask for a list of all riders dropped off within 200 feet of Katz's. With a little more AI savvy, I could have ChatGPT or another AI model cross-reference multiple data sets to obtain personally identifiable information (PII) such as the actual names and home addresses of the people who took a taxi to Katz's. That would take about five minutes, ten tops. For fun, I could have an AI model analyze the data and find patterns: Do some people go once a week? Are there four friends who meet there the first Monday of each month? I need only give the appropriate prompts to an AI model that has scraped large amounts of data off the internet. And the most astounding part is that all of this could be done in under an hour with an inexpensive laptop. Of course, this is a fun, harmless example, but you can imagine more nefarious uses of such capabilities.
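To make the before-and-after concrete, the "hours and hours" of programming now amounts to something like the sketch below, whether written by hand or generated by an AI model from a one-sentence prompt. It is illustrative only: the file name and column names are assumptions modeled on the older NYC TLC trip records, which included drop-off latitude and longitude (newer releases publish only coarser zone IDs), and Katz's coordinates are approximate.

```python
# Illustrative sketch: list taxi trips dropped off within 200 feet of
# Katz's. File name, column names, and coordinates are assumptions.
import csv
import math

KATZ_LAT, KATZ_LON = 40.7223, -73.9874  # approximate location of Katz's
RADIUS_FEET = 200
EARTH_RADIUS_FEET = 20_902_000  # mean Earth radius (~6,371 km) in feet

def distance_feet(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in feet."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))

with open("taxi_trips.csv", newline="") as f:
    for trip in csv.DictReader(f):
        drop = distance_feet(float(trip["dropoff_latitude"]),
                             float(trip["dropoff_longitude"]),
                             KATZ_LAT, KATZ_LON)
        if drop <= RADIUS_FEET:
            # Each matching row reveals when and from where someone
            # traveled to the deli.
            print(trip["pickup_datetime"], trip["pickup_latitude"],
                  trip["pickup_longitude"])
```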
The requirements of time, money, specialized knowledge, and equipment that once acted as natural deterrents to compiling that list of guests taxied to Katz's no longer exist.
Security Risks – Mishandling and misuse of private data is likewise nothing new. The town gossip or telephone operator might have announced to everyone they spoke with who was expecting a baby, who was cheating on a spouse, and who was in financial or medical trouble. Increased technological capability brought increased opportunity to obtain far larger amounts of personal information. Before generative AI, if you suspected you were the victim of such a breach, you would worry that someone would assume your identity and steal money from your bank account, or obtain loans using your Social Security number and good credit score. You would need to notify your state's motor vehicle department, change all of your passwords, get new credit and debit cards, work with the IRS to get a special PIN for filing your taxes, and take a few other steps. Scary, frustrating, and time-consuming, but manageable.
Now, with generative AI models, if a record containing your Social Security number is stolen, along with your name, address, phone number, and likely credit card payment information, a wrongdoer's options are much broader. For example, consider donotpay.com, a site that "uses artificial intelligence to help you fight big corporations, protect your privacy, find hidden money, and beat bureaucracy." Users provide a basic set of information, and the site will manage the cancellation of unwanted subscriptions. Now consider the nefarious version: with a few data points and some readily available AI, a wrongdoer could steal assets, enter into unwanted agreements, deceive family and friends, and otherwise wreak havoc in your life. The technology for both is roughly the same, but the consequences are extremely different.
Informed Consent – In years past, companies were not held to a standard requiring consent to obtain and use our personally identifiable information (PII). And when consent was obtained, it was impossible to verify that the company used your data for the stated purpose. Many people know how Facebook let Cambridge Analytica scrape data without informing users that the information would be collected or how it would be used; Cambridge Analytica then used the data for political purposes.[2] But there are special privacy concerns when it comes to how current AI models might misuse this information.
There are two distinct categories of informed consent in the context of data and AI. First, users may not fully understand how their data is used to build and train AI models. This includes any personal data that was included in the foundational training, plus the many ways your PII could have been exposed to the AI model or its human operators. Second, and perhaps more challenging, is the use of AI by individuals or entities as a component of some larger system. That could be as complex as using medical records to make predictions or as simple as a website harvesting cookie data to profile user preferences. With the passage of laws like the CCPA,[3] we at least have the hope that tracking cookies can be limited and that requests to destroy our PII will be honored. But this is not foolproof. It is too easy for consumers, especially young or inexperienced ones, to click "Agree" or "Accept All" without considering what that means for their data, especially now that data has a permanence it did not have before (see the section on Data Persistence, below). This now-common practice of blindly accepting usage agreements can lend a veneer of legality to data harvesting.
Cross-Referencing or "Reconstituting" Data – Before AI models became so accessible, data could reasonably be anonymized. For example, an insurance company could obtain the driving records of people involved in car accidents or issued speeding tickets in a given jurisdiction, broken down by age, over a certain time period. But unless you were a client of that specific insurance company, the company would have only aggregated cohort data.
Now, with the advent and accessibility of AI models, AI can identify patterns and make inferences. Most troubling, AI can combine data from disparate sources in new ways. Thirty years ago, for example, the fingerprint was the only common biometric. Now DNA can be sequenced cheaply, and facial and retinal scans are widely used. AI can even "fingerprint" a person by watching how they walk. Data that has been anonymized can be combined with another data set or two and "un-anonymized," producing what I call Reconstituted Data. This reconstituting of data could render informed consent meaningless, to say nothing of data used without consent in the first place. Once reconstituted, the data would allow its users to discern protected attributes such as health conditions, ethnicity, sexual orientation, personal habits, business information, and political or religious beliefs, all from data patterns. It could in turn be used for political campaigns, insurance rates, interest rates on bank or government loans, school admissions, and various forms of more subtle discrimination. De-anonymizing data and re-identifying individuals is increasingly possible with AI, as is quickly and cheaply discovering patterns in that data. There is a chilling example of this from China.[4] In 2022, a hacker offered for sale 23 terabytes of PII covering millions of Chinese nationals, and even foreign tourists who had visited Shanghai. The data, reportedly stolen from the Shanghai police authority, included details such as names, passport information, addresses, and even food delivery orders. It is not hard to see how such a dataset could be combined with AI for mass exploitation.
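To see how little machinery "reconstituting" actually requires, here is a minimal sketch of the classic linkage attack: join an anonymized dataset to a public one on shared quasi-identifiers. (Researcher Latanya Sweeney famously showed that ZIP code, birth date, and sex alone uniquely identify most Americans.) Every record and name below is fabricated:

```python
# Fabricated linkage-attack sketch: re-identify "anonymized" medical
# records by joining on quasi-identifiers shared with a public roll.

anonymized_medical = [
    {"zip": "10002", "birth_date": "1987-03-14", "sex": "F",
     "diagnosis": "congenital colon-cancer risk"},
    {"zip": "10009", "birth_date": "1990-07-02", "sex": "M",
     "diagnosis": "type 2 diabetes"},
]

public_roll = [  # e.g., a voter roll or a leaked customer list
    {"name": "Jane Doe", "zip": "10002", "birth_date": "1987-03-14", "sex": "F"},
    {"name": "John Doe", "zip": "10009", "birth_date": "1990-07-02", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def key(record):
    """The combination of quasi-identifiers acts as a de facto ID."""
    return tuple(record[k] for k in QUASI_IDENTIFIERS)

names_by_key = {key(person): person["name"] for person in public_roll}

for record in anonymized_medical:
    name = names_by_key.get(key(record))
    if name:  # a unique match means the record is no longer anonymous
        print(f"{name}: {record['diagnosis']}")
```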
Imagine that a fictitious person, Belle, has a congenital propensity for a certain type of colon cancer. While Belle is a minor, a parent takes her to the doctor and fills out the health questionnaire the medical office requires, listing the congenital propensity along with a relative's diabetes and the anxiety and depression prevalent in Belle's extended family. Of course, the medical provider needs this information to give Belle the best healthcare. The information is likely aggregated, anonymized, and given to a governmental entity or even sold to a large insurance company. No worries, because no one will know all of that about Belle except the workers in that office. Except that others will. Many companies, entities, and even individuals could discover it. It would take only a few prompts to combine that anonymized data and compare it against other personally identifiable data.
Sound far-fetched? Researchers at the Jinling Institute of Science and Technology have already examined AI and machine learning (ML) technology to improve cancer prognosis and treatment selection.[5]
And once data is Reconstituted, it can be used by a medical insurer to charge higher premiums or deny coverage. Likewise it could be used by a life insurance company. The information could also be sold to a "data broker" and then obtained by Belle's prospective employer, who might decide not to offer Belle a job out of worry that she could miss work due to anxiety, depression, or cancer treatment. What if Belle wants to run for political office? Or be named pastor at her church? There might be information in Belle's Reconstituted Data that can be used to her detriment. Each of these ancillary uses is an example of so-called Function Creep,[6] where data gathered for one purpose is used for entirely different purposes. Function Creep is especially worrisome since anonymized or "cleaned" data can be Reconstituted.
These are not far-fetched hypotheticals; many AI creators will not even disclose the amount or nature of the data used to train their models. There are constitutional and statutory protections against discrimination based on race, gender, nationality, or marital status. But whether Belle is offered a job is not, by itself, protected by the U.S. Constitution or other laws. Unless Belle can prove a causal connection between her failure to get the job and some protected status such as gender, age, or race, she will have no recourse.
Data Persistence – Before the internet, we could reasonably hope our PII would eventually be destroyed, both because there was little incentive to keep it and because physical storage space was a very real constraint. I worked at a law firm as a file clerk while in college. The firm rented a large storage shed to keep boxes of physical documents for our cases, maintained by year and kept for a set number of years. Once a year, office staff had the job of clearing out the boxes of files we were no longer keeping and making sure the documents were shredded.
Shortly after I began practicing law, my firm adopted an electronic data storage system. Some physical evidence and documents were still retained and disposed of at the end of a holding period, but we now had an electronic copy of every document, receipt, exhibit, CD, picture, and other item. It made working from various locations much more manageable and made accessing information much easier. I can still pull up cases that I worked on decades ago.
Now, let's go back to our example of Belle. The medical office she visited almost certainly has a data storage provider, and there is no guarantee that the data, including Belle's, is not fed into a data-hungry transformer and used for AI purposes.
One of the biggest problems surrounding data persistence, or permanence, is that it is not entirely clear where all the data fed into transformers to train AI models has come from. To stay current and usable, these systems will continue to need a great deal of information. Some of that data was, and will be, obtained with consent. But some was merely anonymized, which can be undone, and some came with no clear limitations or consent at all. Further, the lack of informed consent discussed above is compounded by the fact that data used in AI training and fine-tuning has a persistence that can survive our maturity, our change of heart about consent, and even our deaths. Laws, rules, and regulations will catch up to this use and misuse of data, but it is naïve to think every bit of PII will be stripped from AI models or from the persistent internet. And data persistence feeds the Function Creep discussed above.
Law Enforcement Concerns – Most people would likely agree that technology such as body cameras and dash cameras has made law enforcement better for everyone. But AI can enable new forms of automated mass surveillance, enhanced facial recognition, behavior analysis, and predictions about whether a person convicted of a crime will re-offend.[7]
This will be examined in much greater detail in a subsequent article, but here is a brief example. Currently, most law enforcement vehicles are equipped with an internet-connected computer. Before pulling you over, the officer can run your plates and get information on the registered owner of the vehicle, search for outstanding warrants, and check whether and when citations were issued.
With AI, these functions will be increasingly automated. Instead of an officer typing a license plate number into a computer, an AI-enabled camera will scan not only the plate but the entire car, determining the make and model and comparing that data to the online registration. Police body camera footage will be analyzed, with faces recognized and cataloged along with GPS data. The correlation of data described in the sections above will make policing more effective while building the infrastructure for far more expansive surveillance.
AI has vast capabilities for data analysis, but this is not meant to be an alarmist article. Strict data governance, security, testing, and oversight are important. As a society we are behind in making the laws and rules that will provide for the safe and fair use of AI and PII. But as has happened in the past, the law will catch up. Penalties for misuse will be fashioned and enforced, and technology providers will incorporate more sophisticated safeguards. Until then, pay attention to how your PII is collected, managed, and possibly exploited. We are only on the cusp of realizing all the ways AI can make our lives better, and also how it might make our lives worse, sometimes in lasting ways.
[1] https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
[2] Facebook (through its owner Meta) agreed to pay $725 million in compensation to users for allowing their data to be used without consent, after being investigated by the Federal Trade Commission. About 50 million Facebook profiles were harvested and used to sway voters in the 2016 presidential campaign. See https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election; https://www.ftc.gov/legal-library/browse/cases-proceedings/182-3107-cambridge-analytica-llc-matter (FTC action against Cambridge Analytica, LLC).
[3] California Consumer Privacy Act, which has given us, among other things, the choice of whether we are willing to accept the website owner’s tracking cookies on our computers. https://oag.ca.gov/privacy/ccpa#:~:text=This%20landmark%20law%20secures%20new,them%20(with%20some%20exceptions)%3B
[4] https://www.forbes.com/sites/zengernews/2022/07/26/what-every-ceo-needs-to-know-about-the-shanghai-data-breach/?sh=45a1d4b9570e
[5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312208/#:~:text=In%20fact%2C%20AI%20and%20ML,various%20illnesses%2C%20not%20just%20cancer.
[6] https://www.tandfonline.com/doi/full/10.1080/17579961.2021.1898299
[7] These issues will be addressed in greater detail in a subsequent article.