Wednesday, August 16, 2006

the AOL debacle and the user search records

recently AOL majorly goofed and released poorly anonymized search records of hundreds of thousands of their users. the records contain over 30 million queries. you can see some news about it from google news at this link.

just to help people understand this: imagine if all your google queries for the past year were made publicly available. ie, for user bob, i would assign a number, say 42, and for each of his queries i publicly release that user 42 queried for such and such. so if you did 10,000 queries over the past year, i know those 10,000 queries (though i don't know that your name is bob, unless you queried for your name, or unless i do some digging). NOTE: it's actually more than just that... there are click records as well.
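to make this concrete, here's a tiny hypothetical sketch in python of the kind of pseudonymization described above (the usernames, queries, and ID scheme are all made up, and this is not AOL's actual pipeline). the point is that replacing a name with a number doesn't break the link between a user's queries:

```python
import secrets

# a hypothetical sketch of "anonymization by pseudonym": each user is
# replaced by a random number, but every query stays tied to that number.
def pseudonymize(logs):
    """Replace usernames with distinct random IDs, keeping queries linked."""
    ids = {}
    released = []
    for user, query in logs:
        if user not in ids:
            n = secrets.randbelow(10**6)      # e.g. bob -> 42
            while n in ids.values():          # keep IDs distinct
                n = secrets.randbelow(10**6)
            ids[user] = n
        released.append((ids[user], query))
    return released

logs = [
    ("bob", "cheap flights to minneapolis"),  # hints at where bob lives
    ("bob", "bob smith"),                     # a vanity search names him
    ("alice", "weather"),
]

released = pseudonymize(logs)
# all of bob's queries share one ID, so his whole trail (including the
# name he searched for) can still be pieced back together.
```

the usernames are stripped, but the linkage is exactly what let people re-identify individuals in the real logs.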

i posted a comment about my problems with how individuals are treating this data over at a friend's blog, and i'm just so riled about all this that i'm going to post most of my comment here.

here's one problem: if it's clear that someone was doing some serious personal abortion research in these queries, i wouldn't say to my friends "look look, look at what this person is doing..." especially when the data is poorly anonymized (ie people can figure out where this person roughly lives, and they can use all their other queries to improve understanding of the user's context).

here's my thing.. i don't think people think of these records as they perhaps should think of them: they are like medical records. extremely personal! if suddenly a bunch of medical records were released on the internet, would you go scouring through them and post that a person in minnesota has a rare defect which causes him to fart whenever the word "gas" is said? it's funny, sure, but it's very personal.

here's a quote from the Technology Review blog which, if true, i find really irresponsible:
At the same time, though, other people -- Internet researchers, statisticians, sociologists, and political scientists -- silently cheered.
scientists, especially social scientists, work so hard in their studies to get consent and gather data ethically. now this gift is dropped on them: they can use it, but they should really think about how to use it responsibly. for instance, there's a site that now lets you search over the AOL logs, and they will remove data if someone determines that it is personally identifiable. but i really think that's backwards: when such sensitive data is going to be made public, unless the anonymization is fantastic (which might not even be possible!) you get consent before revealing any of it, not after. and poor anonymization plus no consent is exactly what we have with this data.


on the wall street journal web site, there's an interesting online discussion going on that was prompted by a posting of a discussion between a lobbyist for internet firms, and a lawyer with the electronic frontier foundation (eff). in a response to the lobbyist, the eff lawyer writes:

Mr. Bankston responds: Markham, you wrote that "companies, with feedback from their users, are in the best position to determine how long such data should be kept." I'd like to think that the market could take care of this problem, and that companies insensitive to privacy concerns would be punished by the market. But the only way that'll work is if consumers actually have enough information about companies' practices to make rational choices about which search engines to use, and enough information about how the law protects their online privacy. On both scores, however, the consumer is completely lacking in information.

i completely agree with this and hope internet firms become more transparent.


James said...

I think the real question that we should ask is "how private is your search query?". I disagree that searches should be accorded the same privacy as medical records. Most searches are generic and innocent enough that they are valuable sources for market research. There are, however, corner cases where this assumption fails, as we saw. Maybe this incident is a step toward answering that question.

On a related note, I don't know how people choose which search engine to use, but my guess is that privacy policy is very low on the list. A search engine with the strictest (perceived) privacy policy but only average-relevance results probably won't fly.

omar said...

this misses the point. certain medical record information is also released at times for studies. all information can be released, with the proper consent. for instance, how do you suppose we know the incidence rate of HIV/AIDS in a country? all kinds of medical statistics are gathered, even without consent! namely, aggregate statistics.

however, if someone told me that they'd wipe records of my sensitive medical conditions, and only release information on my inconsequential medical visits, i'd still be pretty peeved, because i think those visits are private too. similarly, if you told me you'd toast my sensitive queries, but leave in my queries for innocuous things, i'd still be peeved. why are you associating this data with me at all? why keep a trail of my unimportant queries?

i think your question already gives up the hope for real privacy with my online search activity. who is going to make the decision on what is private and what isn't?

privacy policy is low on the list because people don't even understand what's going on. when i tell people that the google toolbar auto-updates itself without your consent, and that its advanced features send your URLs to google, they completely freak out. and the ones who don't just say "privacy is dead." depressing either way.