FLORA LICHTMAN, HOST:
J.K. Rowling's first Harry Potter book, "The Philosopher's Stone," came out 16 years ago. Can you believe it? Since then, 450 million Harry Potter books have been sold worldwide, and J.K. Rowling has become, of course, a household name. And so taking a page out of her own fiction, Rowling created an invisibility cloak. She wrote her latest book, "The Cuckoo's Calling," her first detective novel and second book aimed at adults, under Robert Galbraith. That was her pen name.
She wrote on her website recently that she was yearning to work without hyper-expectation and to receive totally unvarnished feedback. Well, that cloak was lifted when a reporter at the Sunday Times received a tip through Twitter. That reporter contacted our next guest to determine if "The Cuckoo's Calling" had the same linguistic fingerprint as Rowling's other work.
Patrick Juola is a professor in the Department of Math and Computer Science at Duquesne University in Pittsburgh, and he's also the lead architect for the Java Graphical Authorship Attribution Program. That's the program he used to solve this literary mystery. He joins us from WFPG in Northfield, New Jersey. Welcome to the show.
PATRICK JUOLA: Hi. Thanks for having me.
LICHTMAN: So we only have a minute before the break, but just maybe you can start with a rough sketch of your role in solving this whodunit.
JUOLA: Well, basically, Cal Flyn of the Sunday Times wrote an email and asked if I would look into the puzzle. And I've been working in this kind of authorship attribution for something like a dozen years, so I was happy to oblige her. I looked at several different books, and it came up that Rowling was the most likely of the people whom I was looking at, and so a likely author.
LICHTMAN: And this is a field called forensic linguistics? Do I have that right?
JUOLA: That's one of the names. Forensic linguistics covers - more specifically, forensic linguistics is when it gets into a court. If you want to call it stylometry, because this never got to a court, that'll work, too.
LICHTMAN: Forensic stylometry is the other version. Well, we're going to hear all of the details of how this mystery was solved when we come back from this break.
(SOUNDBITE OF MUSIC)
LICHTMAN: This is SCIENCE FRIDAY, from NPR.
(SOUNDBITE OF MUSIC)
LICHTMAN: This is SCIENCE FRIDAY. I'm Flora Lichtman. We're talking this hour about forensic stylometry, and my guest is Patrick Juola. He's a professor of math and computer science at Duquesne University and the author of the Java Graphical Authorship Attribution program. OK. So walk us through it. You actually - you compared this new book, Rowling's new book - which no one knew was her book - to a few other texts. Is that right?
JUOLA: Yes. Well, it's not quite fair to say that no one knew, because I think that Cal Flyn, the reporter from the Times, had a pretty good idea, because she'd done a lot of background research herself.
LICHTMAN: OK. Right. So a select few knew. And her publisher knew...
JUOLA: Obviously. And, of course, she knew. But the idea behind stylometry is that everyone has a particular way of writing that's almost impossible to hide. You know, everyone learns a slightly different version of language, which is why we Americans drive trucks and our British colleagues drive lorries.
But there are some other examples. If I asked you, when you set the table, where do you put the salad fork, what would you tell me?
LICHTMAN: On the left.
JUOLA: OK. Well, some people would say on the left. Some people will say to the left, and some people will say at the left. It's the same fork in the same spot, but you'll notice the preposition is a little different. And I don't think there's a wrong answer to that question, but there's a lot of different right answers, that people don't even notice they're making those choices.
LICHTMAN: Hmm. Mm-hmm.
JUOLA: So if you look at those kind of words, you can actually - by comparing the sort of words that people use in the different situations, you can actually get a pretty good profile of how any specific person uses language, and compare that profile against any other documents that you don't know the author of.
LICHTMAN: Well, what are some of the other tests that you did to the...
JUOLA: Well, we did the test of the most common words, which is the one that I just illustrated. The most common words tended to be, like, prepositions, pronouns, little words that don't mean much, but they come up all the time. Something else that we did was we did word pairs. So we looked at all of the words in context with the word that followed them immediately to see what kind of concepts got linked in the writing of the various authors.
We also looked at what are called character four-grams. That's basically groups of four adjacent characters - either as part of a word or possibly crossing two adjacent words - which lets us look at things like word stems. So if you're the sort of person who uses the word jump a lot, you might also be the sort of person who uses jumps, jumping, jumper.
And finally, we looked at word lengths, because if you're the sort of author who uses a large vocabulary consisting of huge words, that might be the sort of thing that carries across from different writings.
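The four measurements described above (most common words, adjacent word pairs, character four-grams, and word lengths) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of the Java Graphical Authorship Attribution Program; the function names and the simple tokenizer are assumptions made for the example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(text, n=50):
    """Return the n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(n)


def word_pairs(text):
    """Count each word together with the word that immediately follows it."""
    words = tokenize(text)
    return Counter(zip(words, words[1:]))


def char_ngrams(text, n=4):
    """Count groups of n adjacent characters, including ones that cross word boundaries."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_lengths(text):
    """Count how many words of each length appear."""
    return Counter(len(word) for word in tokenize(text))
```

On a sentence such as "The jumper jumps. The jumper is jumping.", the four-gram "jump" is counted once for each of jumper, jumps, jumper and jumping, which is how four-grams pick up shared word stems.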
LICHTMAN: And then...
JUOLA: None of the...
LICHTMAN: Oh, go ahead.
JUOLA: I was going to say, none of these are conclusive individually, but you look at - but if you look at a big enough pattern you can get a pretty reliable estimate.
LICHTMAN: It's amazing to me that, you know, the most often used words, like the, how often I use the is actually that different from somebody else.
JUOLA: Well, it's not that different as a single identifier, but if we look at the hundred most frequent words or the 50 most frequent words, then you'll see there'll be a lot of little differences that add up. So you use the a little bit more commonly than average, and you use of a little bit less commonly than average. And you use less a little bit more, and you use beside a little bit less. And those kind of little differences will add up.
And that's why the computer is so good, because the computer can keep track of all these little differences.
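The idea that many small frequency differences add up can be sketched as a comparison of word-frequency profiles. The word list, the sum-of-absolute-differences distance, and the function names below are assumptions chosen for illustration; the interview does not say which distance measure or word list was actually used.

```python
from collections import Counter
import re

# Hypothetical short list of common words; a real analysis would use 50 to 100.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "on", "at", "less", "beside"]


def profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


def distance(p, q):
    """Add up the small per-word differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_texts):
    """Return the candidate whose profile is nearest to the unknown text."""
    unknown = profile(unknown_text)
    return min(known_texts, key=lambda name: distance(unknown, profile(known_texts[name])))
```

As in the interview, a result from a comparison like this only says which candidate writes most like the unknown text; it does not prove authorship.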
LICHTMAN: Right. This would be almost - this would be impossible without that, almost.
JUOLA: Well, it wouldn't be impossible but it would be very, very difficult. There was actually a team back in the 1960s, Mosteller and Wallace at Harvard, that did this on the Federalist Papers by hand. The difference is it took them three years.
LICHTMAN: So how confident were you that this was J.K. Rowling?
JUOLA: Not actually that confident. I mean, I was very confident that it was somebody who writes an awful lot like J.K. Rowling. And J.K. Rowling has a fairly distinctive style. We all have fairly distinctive styles. But it might have been somebody who just writes a lot like J.K. Rowling. So all I could really tell the Sunday Times was that it was either Rowling, or somebody who writes surprisingly like Rowling.
But that was enough for them to ask Ms. Rowling, and she said, yeah, it was me. I've been rumbled.
LICHTMAN: Is there a way to do this kind of investigation when you don't have an author in mind?
JUOLA: Well, it's harder to do when you don't have an author in mind, because you don't have anyone to compare against. What you can do is you can say, well, tell me what you know about the author. So I can say, well, the author is probably a man, or the author is probably a woman. Or maybe I can say the author is probably Canadian, or the author is probably college-educated, or the author is probably not a native speaker of English. The author looks like their native language may be French, or something like this.
LICHTMAN: Well, let's do...
JUOLA: We're actually...
LICHTMAN: Let me just stop you for a second. Tell me a little - some specifics here. What's different about man-writing than woman-writing?
JUOLA: Well, that's actually something that's been studied extensively by linguists for 30 years. The vocabulary is different. Women use more hedges. I believe that women use more color adjectives. Women use more scent adjectives, if I remember right. But really, this isn't the sort of thing necessarily that I personally know, because it's the computer that does most of the analysis. And what the computer will do is it will put together these - in some cases - thousands of factors to make judgments.
LICHTMAN: Could you disguise your writing? Or would you need to write a program to do that?
JUOLA: It's very difficult to disguise your writing, because a lot of these things are unconscious. Like you're probably not aware of what pronoun - what preposition you're using from moment to moment. You know, you probably wouldn't have thought that there's anything significant in saying it's at the left, for example. But there is some work being done. There's some very good work being done, for example, out of Drexel University on trying to fool this kind of stylometric analysis.
LICHTMAN: Have you ever looked at - investigated famous authors, maybe, and tried to understand sort of what gives them their unique flavor, that maybe you couldn't intuit as a reader, that you would really need a computer to tease out?
JUOLA: I've tried. I'm involved in a project entitled "The Riddle of Literary Quality" with a team out of the Netherlands. But it's actually very difficult, because so much that we think of as good writing is something that's very difficult to capture by computer. Computers are very good at counting things, like they can count the number of times you use the word of. But they're not very good at understanding things like symbolism or exciting chase scenes or realistic dialogue.
So all of the things that we think of at a high level being humans, computers don't really understand.
Computers don't really understand the difference between a CD that pays interest and a CD that plays music.
LICHTMAN: Are you working on any other mysteries right now?
JUOLA: Actually, our big project right now is a project for the Department of Defense. I'm doing this through a startup company called Juola and Associates that's trying to help commercialize this technology. We're looking at developing a computer security system. The idea is if you have to leave your computer for a few minutes, say to go visit the water fountain, somebody can't come in and start typing email, because the computer will say, wait a minute. The person who's typing this email isn't you.
LICHTMAN: Hmm. What about Lincoln? I heard you were looking into some Lincoln papers, too.
JUOLA: Yeah. We've got a project going with the Papers of Abraham Lincoln project. Back in the 1830s there was a tradition of anonymous letters to the editor - things signed Concerned Citizen or American Patriot and all of that - that were huge discussions of the political issues of the day. And we believe - historians believe - that among these authors was a then-unknown author - then-unknown lawyer named Abraham Lincoln.
Of course, this was before he was president of the United States. This was when he was just no one. But how many of these letters to the editor - which we've still got copies of in the archives of the newspapers - how many of them were actually written by Abe Lincoln, as opposed to his contemporaries? And what would this say if we could figure out which ones were by him?
LICHTMAN: Yeah. That sounds like a pretty high profile case. But I imagine this one was, too. What was it like to be part of this media storm?
JUOLA: Oh, it was really exciting. I'm a big Rowling fan. I really liked her Harry Potter books. And so I was absolutely delighted to be able to work on this at all. And then when Rowling admitted that it was her who had written "The Cuckoo's Calling," I was flabbergasted. And, you know, the media storm has been really intense.
LICHTMAN: Well, thank you for participating in at least one more interview with us here today.
JUOLA: Well, I'm happy to talk about this. It's a great project, and it's a lot of fun.
LICHTMAN: It's fun to listen to. Patrick Juola is a professor of math and computer science at Duquesne University.