just to help people understand this: imagine if all your google queries for the past year were made publicly available. ie, for user bob, i would assign a number, say 42, and for each of his queries i publicly release that user 42 queried for such and such. so if you did 10000 queries over the past year, i know those 10000 queries (though i don't know that your name is bob, unless you queried for your name, or unless i do some digging). NOTE: it's actually more than just that.. there are click records as well.
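to make that concrete, here's a rough sketch in python of the kind of "anonymization" i'm describing: swap each user name for an arbitrary number and release everything else untouched. this is only my illustration, not the actual process used on the released logs; the function names (`pseudonymize`, `group_by_user`) and the sample log are made up.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each user name with an arbitrary number; queries are left as-is."""
    ids = {}             # user name -> assigned number (never released)
    next_id = count(1)   # hypothetical numbering scheme: 1, 2, 3, ...
    released = []
    for user, query in log:
        if user not in ids:
            ids[user] = next(next_id)
        released.append((ids[user], query))
    return released


def group_by_user(released):
    """What anyone can do with the released data: rebuild a per-user history."""
    profiles = defaultdict(list)
    for uid, query in released:
        profiles[uid].append(query)
    return dict(profiles)


# Made-up sample log, purely for illustration.
raw_log = [
    ("bob", "clinics near springfield"),
    ("alice", "cheap flights"),
    ("bob", "bob smith springfield"),
    ("bob", "rare medical condition symptoms"),
]

released = pseudonymize(raw_log)
profiles = group_by_user(released)
```

the name "bob" is gone from the user column, but every one of his queries still hangs off the same number, so a single query that mentions his name or town ties the whole history back to him.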
i posted a comment about my problems with how individuals are treating this data over at a friend's blog, and i'm just so riled about all this that i'm going to post most of my comment here.
here's one problem: if it's clear that someone was doing some serious personal abortion research in these queries, i wouldn't say to my friends "look look, look at what this person is doing..." especially when the data is poorly anonymized (ie people can figure out where this person roughly lives, and they can use all their other queries to improve understanding of the user's context).
here's my thing.. i don't think people think of these records as they perhaps should think of them: they are like medical records. extremely personal! if suddenly a bunch of medical records were released on the internet, would you go scouring through them and post that a person in minnesota has a rare defect which causes him to fart whenever the word "gas" is said? it's funny, sure, but it's very personal.
here's a quote from the Technology Review blog which, if true, i find really irresponsible:
At the same time, though, other people -- Internet researchers, statisticians, sociologists, and political scientists -- silently cheered.
scientists, especially social scientists, work so hard in their studies to get consent and gather data ethically. now this gift is dropped on them: they can use it, but they should really think about how they can use it responsibly. for instance, there's a site which now lets you search over the AOL logs, and they will remove data if someone determines that the data is personally identifiable. but i really think that's the wrong way around: when such sensitive data is going to be made public, unless the anonymization is fantastic (which might not even be possible!) you get consent before revealing any of it, not the other way around. and poor anonymization, and no consent, are what we have with this data.
on the wall street journal web site, there's an interesting online discussion going on, prompted by a posted exchange between a lobbyist for internet firms and a lawyer with the electronic frontier foundation (eff). in a response to the lobbyist, the eff lawyer writes:
Mr. Bankston responds: Markham, you wrote that "companies, with feedback from their users, are in the best position to determine how long such data should be kept." I'd like to think that the market could take care of this problem, and that companies insensitive to privacy concerns would be punished by the market. But the only way that'll work is if consumers actually have enough information about companies' practices to make rational choices about which search engines to use, and enough information about how the law protects their online privacy. On both scores, however, the consumer is completely lacking in information.
i completely agree with this and hope internet firms become more transparent.