Internet Search Trends and New York Times
ROBERT SIEGEL, host:
How well can technology really figure out what we're interested in? Well, here's one answer. It is a top 10 list. See if you can figure out what it represents. Number one: United Health Care. Number two: May 4th, 2009. Number three: cancer. Number four: China. Number five: Korea. Six: Obama. Seven: swine flu. Eight: modern love. Nine: India. And number 10: Maureen Dowd.
Those are the words and phrases most frequently searched by readers of NewYorkTimes.com in the last 24 hours. It's one of the measurements kept by Derek Gottfrid, who is senior software architect for the Times Web site.
Welcome to the program, Mr. Gottfrid.
Mr. DEREK GOTTFRID (Senior Software Architect, NewYorkTimes.com): Thank you.
SIEGEL: That first place item on the search list is United Health Care. What do you make of - there were more searches for that phrase in the past 24 hours on NewYorkTimes.com than for any other. I've searched for it - I can't find an article about United Health Care.
Mr. GOTTFRID: Right. So, we actually cluster all the related terms together. So it will bubble up things that are about health care, as well as United Health Care. United Health Care would be the top result within that cluster. So each of those items that you went through is actually a cluster. So there's a bunch of related terms.
SIEGEL: When you speak of clusters, clusters that are determined by somebody who monitors the searches, or by algorithm? How are you…
Mr. GOTTFRID: By an algorithm. So if you actually clicked on most searched, you'll see an expanded view of all the most searched terms. And then from there, each of the terms is linked. And when you click on one of those linked terms, you can see all the related terms. And those terms are the ones that are grouped together. It's an algorithm. It uses something called cosine similarity to determine how closely related these queries are.
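(Illustration, not part of the broadcast: a minimal Python sketch of how cosine similarity can group related search queries. The interview does not say how the Times represents queries or forms clusters, so the word-count vectors, the 0.4 threshold, the greedy grouping step, and the function names cosine_similarity and cluster_queries are all assumptions made for this example.)

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between two queries' word-count vectors."""
    # Assumption: each query is represented by lowercase word counts.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_queries(queries, threshold=0.4):
    """Greedy grouping: a query joins the first cluster whose first
    member is at least `threshold` similar; otherwise it starts a new one.
    This grouping rule is hypothetical, chosen only for brevity."""
    clusters = []
    for query in queries:
        for members in clusters:
            if cosine_similarity(query, members[0]) >= threshold:
                members.append(query)
                break
        else:
            clusters.append([query])
    return clusters


if __name__ == "__main__":
    searches = [
        "cancer", "brain cancer", "throat cancer",
        "36 hours in portland", "36 hours in washington",
        "obama",
    ]
    for members in cluster_queries(searches):
        print(members)
```

Under these assumptions, queries that share most of their words, such as "brain cancer" and "cancer", score well above the threshold and land in the same group, while queries with no words in common score zero.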
SIEGEL: So one obvious one would be if people were searching several different columns which were all part of the "Modern Love" series, those would pop up as "Modern Love," (unintelligible).
Mr. GOTTFRID: Yep. And so, "36 Hours" is another column from our travel section, and so it's always "36 Hours in Portland," "36 Hours in Washington." And so those all get clustered together. On the cancer term, you'll see cancer, but then it's also brain cancer and throat cancer and prostate cancer. You know, recently, pink boxers was a popular term on the Web site.
SIEGEL: Yes, that was up on the top 10 last week.
Mr. GOTTFRID: Yeah. Well, and so that…
SIEGEL: Well, what was that about? Yeah.
Mr. GOTTFRID: That was the story out of Afghanistan that was widely covered, where we had a - I think it was an AP photo on the homepage of the New York Times featuring a young soldier in a firefight, wearing pink boxers.
SIEGEL: But, you know, I've been looking at this list now for - carefully for a week. I've glanced at it before that. May 4th, 2009 just hangs in there, you know, for a long time. Something was going on.
Mr. GOTTFRID: Yeah. I mean, the date's a really unique one. And one of the first things that we discovered when we started creating this list was we started seeing this kind of recurring pattern of dates. So I think, you know, April 30th or something - there's an April date somewhere far down on the list as well. But, yeah, dates for news, not surprisingly, really are important to users.
SIEGEL: And these are instances of people entering that particular date.
Mr. GOTTFRID: This is exactly what people have typed in.
SIEGEL: So, for reasons that shall remain mysterious, May 4th, 2009 was a point of curiosity with lots of people.
Mr. GOTTFRID: We would love to solve that mystery as well, if anyone could write us some insight, please do.
SIEGEL: Well, thanks a lot for talking with us, Mr. Gottfrid.
Mr. GOTTFRID: Thank you.
SIEGEL: That is Derek Gottfrid, who is the senior software architect for NewYorkTimes.com.
NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.