Data Mining Spurs Innovation, Threatens Privacy
IRA FLATOW, host:
Up next, if you have a smartphone in your pocket right now, think about how much it knows about you and how much your phone company could learn from it. The phone's GPS tells it where it is every night, which is probably where you live. The GPS capability could track your movements to the city. Matched with timestamps, you could possibly - it's possible to see how fast you're traveling. And if you're walking or driving or riding a bus, it could even tell others which congested highways to avoid from your unfortunate vantage point, being right there in the congestion.
Of course, we've all been there and done that, and we just heard from Sid talk about how, you know, these folks twittering about - using your GPS to tell folks where the earthquake is happening.
Well, computer scientists are thinking big thoughts about what else they could do. What could they do with all that crowdsourced data? That's a new term, I think, that's entering the lexicon: crowdsourced data. So keep your ears open for it and ways to make all that real-time information more useful to you and to everyone else. And not just mapping traffic patterns across the Bay Bridge, but monitoring air pollution all over the city, providing better health care to patients.
Will it be possible, though, to get benefits like these without sacrificing your privacy? Because you're going to be giving away, with your cell phone, a lot of information about yourself. And how can we keep the data safe and anonymous if it's being used by third parties?
And that's one of the topics being discussed in Science magazine this week. And my first guest was the author of that piece. Tom Mitchell is the Fredkin University professor and chair of the Machine Learning Department in the School of Computer Science at Carnegie Mellon in Pittsburgh. He joins us from WQED in Pittsburgh. Welcome to SCIENCE FRIDAY, Dr. Mitchell.
Dr. TOM MITCHELL (Fredkin University Professor; Chair, Machine Learning Department, School of Computer Science, Carnegie Mellon University): Thank you. Good to be here.
FLATOW: You're welcome. Deborah Estrin is a professor of computer science at the University of California Los Angeles. She's also director of the Center for Embedded Network Sensing there, and she joins us from NPR West in Culver City. Welcome to SCIENCE FRIDAY, Dr. Estrin.
Dr. DEBORAH ESTRIN (Professor of Computer Science; Director, Center for Embedded Network Sensing, University of California Los Angeles): Delighted to be here.
FLATOW: Can you give us, Dr. Estrin, how many cell phones are out there now?
Dr. ESTRIN: There are billions, multiple billions of cell phones out there. Not all of them are smartphones. I think the latest statistic is about one of four phones that are being purchased these days in the U.S. are smartphones.
FLATOW: Mm-hmm. Do you need a smartphone for us to gather data from it?
Dr. ESTRIN: Actually not. We maybe got the ideas from smartphones, but we see from Twitter - as you were just describing - the minute you send a text message, it's time stamped. And already, you have the beginning of a - as a scientist would say - a time series.
FLATOW: Mm-hmm. And Tom, what kind of projects - I mentioned this geological survey idea, what kind of projects already are using this technology, and what kinds of things might we imagine down the road?
Dr. ESTRIN: Well, I can't pass up mentioning - oh, sorry.
FLATOW: Go ahead. You can go, Deborah, and then I'll let Tom jump in.
Dr. ESTRIN: I can't pass up mentioning that Martin Lukatch(ph), a Ph.D. student of ours, was the first one to capture, that I know of, exactly as you suggested, with that iPhone - only it wasn't an iPhone, it was a Nokia device with the accelerometer, and captured data on an active earthquake and then ran it through some seismological analysis, and it was an exciting example.
I don't think it's the most prevalent example. We see a lot going on in increased citizen science, civic engagement. Really, everyone can now be systematic and scientific to make their local and smaller issues documented.
FLATOW: Well, we have to go to the break. So Tom, I'm going to jump in here. But, you know, some of the trivial uses that we have today that I mentioned before is now you can now check traffic, you know, where - if you're on the right road going into the city because all these other people are tweeting and also entering data into these smart-apps that are on the phones now.
But we're going to come back and talk more about what more futuristic things might happen with Tom Mitchell and also with Deborah Estrin. So stay with us. We'll be right back. Our number: 1-800-989-8255. What kinds of things would you like to see your cell phone do? Or maybe you're afraid that your cell phone knows too much, knows too much about what you're doing and the records that you're sending out. Stay with us. We'll be right back after this break.
(Soundbite of music)
FLATOW: I'm Ira Flatow. This is SCIENCE FRIDAY from NPR News.
(Soundbite of music)
FLATOW: You're listening to SCIENCE FRIDAY from NPR News. I'm Ira Flatow. We're talking this hour about mining data from your cell phone and the Internet and some of the privacy issues with my guests: Tom Mitchell, he's at Carnegie Mellon University. Deborah Estrin - she's from UCLA. Our number: 1-800-989-8255.
Tom, give us some idea of some of the potential that this allows us, what potential openings we have to use this data.
Dr. MITCHELL: Well, there are several already in use. You mentioned, for example, using cell phone GPS data to monitor traffic congestion, which could have a huge impact in terms of pollution, congestion, productivity, traffic.
But one of my favorites these days that's in use is the Google Flu Trends site. What they do there is they just capture the search queries that millions of us make to Google, and they look for search queries that seem to be related to influenza-like symptoms. So if you, you know, search for the nearest pharmacy, for example.
And they monitor that and digest that into a summary, which turns out to be a pretty good reflection of the influenza outbreak frequency that the CDC reports. And, in fact...
FLATOW: You mean by seeing what drugs are coming out of the pharmacies?
Dr. MITCHELL: By seeing what search queries people are making. So if you come down with some flu-like symptoms, you might go search for something about drugs or nearby pharmacies. And it turns out that when you integrate the millions of such queries, you can get a pretty interesting map of the ups and downs of the incidence of flu.
And people have looked at that and related it to the CDC reports and found that it's actually a pretty good indicator of what the CDC is going to report a week later.
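(Editor's illustration, not part of the broadcast)
One way to check the "good indicator a week later" idea is to correlate weekly query counts with official reports shifted by one week. The sketch below is a minimal example under stated assumptions: the weekly numbers are made up for illustration, and the function names `pearson` and `lagged_correlation` are hypothetical. It is not a description of how Google Flu Trends actually works.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_correlation(queries, reports, lag_weeks):
    """Correlate queries in week t with reports from week t + lag_weeks."""
    if lag_weeks == 0:
        return pearson(queries, reports)
    # Drop the last queries and the first reports so the weeks line up.
    return pearson(queries[:-lag_weeks], reports[lag_weeks:])

# Made-up weekly counts, for illustration only (not real data).
weekly_flu_queries = [120, 150, 310, 640, 900, 700, 380, 200]
weekly_cdc_reports = [10, 12, 15, 30, 65, 92, 70, 40]

same_week = lagged_correlation(weekly_flu_queries, weekly_cdc_reports, 0)
one_week_lead = lagged_correlation(weekly_flu_queries, weekly_cdc_reports, 1)
print(f"same week: {same_week:.2f}, queries leading by one week: {one_week_lead:.2f}")
```

In this toy data the reports trail the queries by a week, so the lagged correlation comes out higher than the same-week one.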
FLATOW: Wow. And so - and it's no cost to anyone.
Dr. ESTRIN: That's right. In fact, there's not even, you know, aware involvement by any of us in contributing to this. It just happens to be data that's there.
FLATOW: Mm-hmm. Deborah, what else are you working on?
Dr. ESTRIN: So I - it's a really interesting spectrum here because what Tom just talked about was where you have really large aggregates of information and what can you mine from it. And we've been looking at applications that really scale down, as well as scale up, meaning there's some return on it, even if just one or two or small numbers of people make use of it.
So we have applications. Your listeners can go to whatsinvasive.com and see an application we built for the National Park Service, where the employees of the National Park Service or hikers can go around with their smartphone. And when they see one of the top 10 invasive, most-wanted, invasive weeds, they can take a quick picture of it, tag it, and it becomes - automatically goes up as a geo-coded piece of evidence for the National Park Service as to where they want to focus the eradication in order to protect the local flora and fauna.
We have examples in the health arena, as well, where it's not so much that you're looking at what's going on across a whole community and doing some kind of community-wide epidemiology, but just imagine an individual who's had some adjustment in their drugs - whether it's medication for high blood pressure, diabetes, antidepressants - and it's very difficult to get the feedback on knowing exactly - it takes time. There's delay in the system. How am I actually doing? How is it affecting my sleep patterns, my appetite, my anxiety level? And really, if you think of it as having people sort of Twitter to themselves and generate this little time-stamped record of how they're doing, correlated with that adjusted medication.
So it's interesting because it's at the other end of the spectrum. It works even just for that one person doing it.
FLATOW: Question from Second Life from Lorgina(ph). It says: How does crowdsourcing deal with ideas about data ownership? I mean, if I'm taking these pictures and I'm sending you this information, who owns that tweet, or who owns that photo? Or how do I make sure that, you know, that it goes to the right place? How do we deal with those issues?
Dr. MITCHELL: Well, the privacy laws vary greatly as you go from continent to continent, and I think it's fair to say - Deborah can chime in - but I think it's fair to say in this country, it's a little bit of the Wild West right now.
We have privacy laws that were developed before we had this kind of pervasive electronic record of our individual lives, and they were okay for that time, but now it's really a new era. And I think it's time to rethink those.
For example, in Europe, there's kind of a guiding principle about privacy that says data, personal data about you, is controlled by you. It might be housed by somebody else, but its uses and so forth are largely under your own control. We don't have that here.
FLATOW: We don't have that.
Dr. ESTRIN: It is a really interesting time because, often, people point to, for example, the cell phone carriers. You did a ditty in your introduction that they have so much information about you. But they are a regulated industry.
Now, we really have so much information about ourselves that we can easily opt in and give away. And so it's an interesting time. And it's about raising people's consciousness and also giving them mechanisms to opt in to giving certain information, or more anonymized information or statistical features of that information instead of just having opt-in be an all-or-nothing.
FLATOW: How do we prevent gaming, what I call gaming of the system? You know, people who are just out there to wreck the data. They do that, you know, with Trojan horses or, you know, with any kinds of things. How do we know that the data that's coming back is not just being made up? People like to do that sort of thing, you know.
(Soundbite of laughter)
Dr. MITCHELL: The thing is, it's hard, if you're looking at millions of individually contributed data points. Even if you have a few people who are dishonest, they're going to be overwhelmed by the other million data points. So in many of these applications, it really would take a gang of liars to mislead the system.
Dr. ESTRIN: And I think that's one of the things that's interesting about some of these applications that initially feed back to the individual, because if you're looking at your own data as you're passing it on to your clinician or even as you're contributing to the National Park Service, you have that ability to do a sanity check on your own data. And also, as Tom says in his Science article, the same machine learning techniques that make us able to pull so much information out of this data can help us to identify the spam and the low-integrity data.
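(Editor's illustration, not part of the broadcast)
The point that a few dishonest contributors get overwhelmed by many honest ones can be sketched with a robust summary such as the median. The example below is only a sketch under stated assumptions: the speed reports are synthetic, and the names `flag_suspicious` and `tolerance` are hypothetical. It stands in for, and is far simpler than, the machine learning techniques mentioned in the conversation.

```python
from statistics import mean, median

# 1,000 synthetic "honest" speed reports between 45 and 55 mph.
honest_reports = [45 + (i % 11) for i in range(1000)]
# Five fabricated reports claiming an absurd speed.
fake_reports = [500] * 5
all_reports = honest_reports + fake_reports

def flag_suspicious(reports, tolerance=20):
    """Return reports further than `tolerance` from the median of all reports."""
    center = median(reports)
    return [r for r in reports if abs(r - center) > tolerance]

print("median without fakes:", median(honest_reports))
print("median with fakes:   ", median(all_reports))
print("mean without fakes:  ", round(mean(honest_reports), 2))
print("mean with fakes:     ", round(mean(all_reports), 2))
print("flagged:", flag_suspicious(all_reports))
```

The median does not move when the five fabricated reports are added, while the mean shifts by a couple of miles per hour. That is why a robust statistic is a reasonable first choice when some inputs may be made up.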
FLATOW: Tom, is it possible to make the data anonymous, to strip out all the identifying features that would tell us who it is?
Dr. MITCHELL: The answer is that it depends on what you want to use that data for. So if you think, for example, of the Google Flu Trends example that I brought up a moment ago, that's perfectly fine in terms of anonymous data. The search queries, Google, just, you know, doesn't have any personally identifying information that they report about those search queries other than the general community, physical location where they came from. And so there are applications like that, or traffic congestion, that can be done with totally anonymous data.
But there are other kinds of applications, including some that I think would be tremendously beneficial to society, that require more personally identifying information.
FLATOW: Well, here's an interesting tweet that came from MTatter(ph) to scifri, and our tweet number - identity, so to speak - is @scifri, @-S-C-I-F-R-I. And here it is, interesting: I wonder if law enforcement will soon be sending subpoenas to nearby potential witnesses to crimes.
Dr. ESTRIN: So I've worried about this a lot because over the past couple of years, we've been having the technical staff and faculty and students who are experimenting with this, we, many of us are continuously recording our GPS traces. And while we largely keep that data private to the individual, it is actually available by subpoena.
And so there is this - such an interesting duality about how telling these traces are because they're so compelling and easy to gather and so much you can infer from them and learn from them about your own patterns or community patterns. And on the other hand, they can be very telling in a privacy-concerning way.
We've been trying to introduce an approach that we refer to as a personal data vault so that your - that raw, personal data stream about your time, location, the little Twitters you do about your latest check on your blood glucose or your symptom, goes into a personal data vault that you own, and then you selectively release. Because as Tom was saying, if you're trying to use some of this for personal use, you can't actually anonymize it.
FLATOW: But if you're near a crime or a scene of a crime and law enforcement wants to find you, they can get access to the database?
Dr. ESTRIN: They can subpoena it.
FLATOW: They can subpoena it.
Dr. ESTRIN: As they could your diary.
FLATOW: Does any of this bother you, Tom?
Dr. MITCHELL: Oh, absolutely. In fact, the issue of subpoena is very interesting because those laws and those legal precedents were established again in a different era than we have today, when there was much less data that was permanently recorded about our lives, and you know, through some analysis and tradeoff, the legal system came to a decision about what was subject to subpoena: written records.
Now we have, effectively, electronically written records about our, you know, very detailed movements, phone calls, conversations. And it may be time to just reassess that legal tradeoff.
FLATOW: 1-800-989-8255. Holly in Greensboro, Florida. Hi, Holly.
HOLLY (Caller): Hi.
FLATOW: Hope you're surviving the weather today.
HOLLY: Yeah. At least it's a warm rain.
(Soundbite of laughter)
FLATOW: You got a question for us?
HOLLY: Well, y'all have been talking about security, and that is probably the thing that concerns me most. I just received information from BlueCross BlueShield of Tennessee, which used to be my insurance carrier, that said some of their records had been stolen and now I was subject to identity theft. Nothing in your discussion of security has made me feel more secure that that would not happen with these smartphones, which I do not own one because I'm a neo-luddite. So I guess that's my comment. And until they find a way to do something about computer security, I'm not sure I like all of this stuff.
FLATOW: Hmm. If they can't secure your computer, what are they going to do with your phone?
HOLLY: Yeah. I mean, really.
FLATOW: Yeah. A good question. Let's go to...
HOLLY: And if an insurance company can't secure its computers, how am I, the classic neo-luddite, ever going to secure my own? And I'll take my answer off the air. Thank you.
FLATOW: Thanks for calling. Good question, huh?
Dr. MITCHELL: I have to say I agree with the caller on this one. I've received similar letters myself from my bank. And, you know, really, it's different. We're used to thinking of private information that we tuck in our desk drawer as under our own control. If it's stored elsewhere, it's not.
FLATOW: Mm-hmm. Deborah, you have any comment?
Dr. ESTRIN: It's clearly a really difficult question, and there's so much already out there. I do think that the more we can know where our data is and have clarity about how it's being used and who has access to it, what's in sort of private repositories in the way that our assets are in our bank account, we can get a little bit more control over it. But fundamentally, yes, clearly, the nation faces cyber security issues.
FLATOW: Are we talking about storing all this data in a central depository or some - is it scattered around? And would it be safer if it were scattered around?
Dr. ESTRIN: Yes and no.
Dr. MITCHELL: You know...
Dr. ESTRIN: Go ahead, Tom.
Dr. MITCHELL: In a way, it is scattered around right now, right? If you think about all of the information about you, some of it's with your phone company. Some of it's with your grocery store through your loyalty cards. Some of it - you know, it's all over the place. And, in fact, it'll probably continue that way because I don't think people would put up with, honestly, at least a nationwide effort to integrate - to aggregate that data.
But there's a very interesting technical possibility of doing data mining on data that's actually stored inside different organizations and getting the same kind of effect that you would get in terms of data mining - get the same effect that you would if you could aggregate it into a central repository. But instead, leave it in individual organizations and then use what people call privacy preserving data mining methods to essentially just do the data mining in a distributed and cryptic - encrypted way.
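(Editor's illustration, not part of the broadcast)
One well-known building block for computing over data that stays inside separate organizations is additive secret sharing, which lets several parties learn a combined total without any of them revealing its own number. The sketch below simulates all parties in one process and rests on assumptions: the organization names, the counts, and the function names `make_shares` and `secure_sum` are hypothetical, and this is only one simple technique, not necessarily the methods Dr. Mitchell has in mind.

```python
import secrets

# Arbitrary large modulus chosen for this sketch; totals must stay below it.
MODULUS = 2**61 - 1

def make_shares(value, n_parties):
    """Split `value` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last_share = (value - sum(shares)) % MODULUS
    return shares + [last_share]

def secure_sum(private_values):
    """Simulate the protocol: each party only ever sees shares, never raw values."""
    n = len(private_values)
    # Row i holds the shares that organization i hands out, one per party.
    share_matrix = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that subtotal.
    subtotals = [
        sum(share_matrix[i][j] for i in range(n)) % MODULUS
        for j in range(n)
    ]
    # Combining the published subtotals reveals the total and nothing else.
    return sum(subtotals) % MODULUS

# Hypothetical per-organization counts of some event of interest.
private_counts = {"phone_company": 1200, "grocery_chain": 340, "clinic": 75}
total = secure_sum(list(private_counts.values()))
print("combined count:", total)
```

Each individual share is a uniformly random number, so a party that sees one share learns nothing about the value it came from. Only the final sum is meaningful.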
FLATOW: I just have to interrupt and remind everybody that I'm Ira Flatow, and this is SCIENCE FRIDAY from NPR News.
Dr. ESTRIN: And, of course, there is that - as Tom was pointing out, there's institutional distribution and there's physical distribution. And there are different forms of safety. You sort of want your data physically distributed so that it's robust in case there is, you know, a hardware problem or an earthquake, for example.
At the same time, people like the idea of some level of institutional distribution for the reason Tom mentioned. And yet we hear about all the time the increasing consolidation in the media business and great benefits that come from consolidation in insurance and economies of scale to the concerned industry. So it is an interesting time for those questions. I think everybody looks for approaches that can be distributed and yet somehow take advantage of aggregation.
FLATOW: Do we have enough data out there now? I mean, you mentioned about all the billions of cell phones. Is there enough data coming in that we can do these things without having to get more data coming in? Deborah.
Dr. ESTRIN: Well, I think what's - it's about actually getting more people engaged. So it's not that it's - you're - that there's a lot more we can do to help people with their personal health management, preventative health management, engaging citizens. It's about really having an informed and engaged populace, if you will. And I don't think anyone would claim that we have enough of that going on.
FLATOW: Mm-hmm. Tom, you agree?
Dr. MITCHELL: Well, I think we will have more data in 10 years. There's just no way around that. It's an unstoppable trend. But I think what's not as predictable is how clever and effective we will be at using that in ways that we all agree are useful. If I can give you one example of something we are not doing today, but which I really hope we are doing within five years, it's this: Consider the GPS data you get from your cell phone and imagine, for example, that I were to walk into the emergency room this afternoon and I'm diagnosed with H1N1, swine flu.
By looking at my GPS cell phone records and other people's GPS cell phone records, we could figure out who are the people who were physically close to me in the past three days and they could be called automatically and be told that, oh, Tom, by the way - you might be interested in knowing, since you were around him - just came down with the swine flu. Now, that's a service we could provide if we integrated data which is currently in the phone company organization and in the health care organization.
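(Editor's illustration, not part of the broadcast)
The matching step described here, finding people whose phones were near the patient's phone at about the same time, can be sketched as a comparison of timestamped GPS points. Everything specific below is an assumption made for illustration: the trace format `(seconds, latitude, longitude)`, the 10-meter and 5-minute thresholds, the names, and the coordinates. A real service would also need the access controls discussed in the conversation.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def nearby_contacts(patient_trace, other_traces, max_meters=10.0, max_seconds=300):
    """Return people with at least one point close to the patient in space and time."""
    contacts = set()
    for person, trace in other_traces.items():
        for t1, lat1, lon1 in patient_trace:
            close = any(
                abs(t1 - t2) <= max_seconds
                and haversine_m(lat1, lon1, lat2, lon2) <= max_meters
                for t2, lat2, lon2 in trace
            )
            if close:
                contacts.add(person)
                break
    return contacts

# Hypothetical traces: (seconds since some start time, latitude, longitude).
patient = [(0, 40.44330, -79.94360)]
others = {
    "alice": [(60, 40.44331, -79.94361)],      # about a meter away, a minute later
    "bob": [(60, 40.45330, -79.94360)],        # about a kilometer away
    "carol": [(100_000, 40.44330, -79.94360)], # same spot, more than a day later
}
contacts = nearby_contacts(patient, others)
print(sorted(contacts))
```

Only the trace that is close in both distance and time is reported. The nested loops are fine for a toy example but would need spatial indexing to scale to a city's worth of phones.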
FLATOW: On the other hand, if you walked in and I diagnosed you with a venereal disease, you might not want to put that into the public.
Dr. MITCHELL: Exactly. And so there are questions of who gets access to that. You know, and the thing is these decisions are being made every day right now without the technological backup. So when somebody does get diagnosed, maybe they'd make a personal decision about who to call and tell. But at least they could have the backup to know who were the people affected.
FLATOW: All right, we've run out of time, but we've opened the door to this topic, which I'm sure we'll be following in months to come. Tom Mitchell is professor and chair of the Machine Learning Department in the School of Computer Science at Carnegie Mellon. Deborah Estrin, professor of computer science at UCLA, also director of the Center for Embedded Network Sensing there. Thanks for your attention today, and have a good holiday weekend.
(Soundbite of music)
FLATOW: We're going to take a short break, come back, switch gears and talk more about technology. So stay with us. We'll be right back after this break.