STEVE INSKEEP, Host:
If you use email at all, you've probably received unsolicited ads for hot stocks or schemes to shrink your waistline. And you may have received spam of a newer sort. They include ads, but also nonsense phrases strung together, and occasionally entire passages from literature like Charles Dickens or George Bernard Shaw.
NPR's David Kestenbaum has this report on how the great writers got mixed up with spam.
DAVID KESTENBAUM: One email begins like this:
Unidentified Man: How could you tell that they would make their attempt tonight? If they fire, Watson, have no compunction about shooting them down.
KESTENBAUM: Of course, it also contains this:
Man: Now listen, this stock could help you make huge amounts of money in weeks.
KESTENBAUM: Recognize that first passage? Greg Newby did. He's the director of Project Gutenberg, a nonprofit organization that has been putting the full text of books online since the very early days of the Internet.
GREG NEWBY: If they fire, Watson - sounds like it must be from a Sherlock Holmes.
KESTENBAUM: Yes, The Red-Headed League.
NEWBY: There you go.
KESTENBAUM: Newby says people sometimes contact him to complain about these kinds of emails, but he says they're not his fault.
NEWBY: No, we don't send spam. We're not doing anything other than trying to give away good literature.
KESTENBAUM: So who to blame for literary spam? Try Paul Graham. He's not a spammer, he's a programmer famous for creating one of the first really good spam filters. This is back in 2002, and he began by writing a little program to separate spam from ordinary email.
It did what you would expect - it looked for keywords like click, as in, click here to buy our product. But he says it didn't work very well.
PAUL GRAHAM: Well, for one thing, spammers could just replace the I in click with a 1, and you'd be out of luck. And they did, in fact, start doing that.
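[Editor's illustration, not part of the broadcast: a minimal sketch of the kind of keyword check described above, showing why swapping one character defeats it. The function name and the one-word keyword list are assumptions made for this example; this is not Graham's original program.]

```python
# Hypothetical keyword list; the broadcast mentions only "click".
SPAM_KEYWORDS = {"click"}

def keyword_filter(message: str) -> bool:
    """Return True if any word in the message exactly matches a spam keyword."""
    words = message.lower().split()
    return any(word.strip(".,!?") in SPAM_KEYWORDS for word in words)

# The plain spelling is caught; the same word with a 1 for the i slips through.
caught = keyword_filter("Click here to buy our product")
missed = keyword_filter("Cl1ck here to buy our product")
```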
KESTENBAUM: So Graham tried something different. He wrote a program to find out the best way to separate spam from real email. To train it, he would feed it lots of spam and lots of real email. The program learned that words like lunch tend to be in legitimate email, and words like Viagra or click spelled with a 1 are more likely to be in spam.
GRAHAM: You know, it was like 50 lines of code. It took me like a day to write.
KESTENBAUM: He ran the filter on his incoming email, and based on all the words in the email, it decided whether it was spam or legitimate. It caught over 99 percent of new spam, and it let all his real email through.
GRAHAM: Oh my God, I was so delighted. It got practically all my spam the first time through.
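[Editor's illustration, not part of the broadcast: a minimal sketch of a filter that learns from examples of spam and real email and then scores a new message using all of its words. It is a generic word-frequency (naive Bayes style) scorer with add-one smoothing and equal weight given to both classes; those choices, the function names, and the tiny training messages are assumptions for this example, not Graham's actual code.]

```python
import math
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Count how often each word appears in spam and in legitimate mail."""
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    return spam_counts, ham_counts

def spam_score(message, spam_counts, ham_counts):
    """Sum per-word log-odds; a positive total leans spam, negative leans legitimate."""
    vocab = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for w in message.lower().split():
        # Add-one smoothing so an unseen word does not zero out a probability.
        p_spam = (spam_counts[w] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[w] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score

# Made-up training data, for illustration only.
spam_counts, ham_counts = train(
    ["cl1ck here for viagra", "click here to buy viagra now"],
    ["lunch at noon tomorrow", "are we still on for lunch"],
)
spam_like = spam_score("cl1ck here viagra", spam_counts, ham_counts)
ham_like = spam_score("lunch tomorrow", spam_counts, ham_counts)
```

Because the word list is learned from the training mail rather than typed in by hand, a misspelling such as cl1ck becomes evidence of spam once it has shown up in a few spam messages.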
KESTENBAUM: And this is why the spammers have had to resort to literature. Filters like the one Graham wrote are everywhere now. In order to get past them, spammers try to make the text of emails look more like something you'd actually write, and to do that they've turned to literature.
There are thousands of books online. Spammers often need each email to look different, so sometimes the spam-making programs rearrange sentences. Other times they make up fake sentences out of pairs of words that tend to occur together. Graham says this is called Markov chaining, after the Russian mathematician, Andrei Markov. This explains the word salads you may see in spam.
GRAHAM: Hath last done my firmness gains to more glad heart where violent and from forage drives that glimmering of all sun new begun. Every pair of words in there, every two-word window in that text, actually occurs in Paradise Lost.
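[Editor's illustration, not part of the broadcast: a minimal sketch of word-level Markov chaining, in which each next word is picked at random from the words that follow the current word somewhere in a source text. As a result, every two-word window in the output also occurs in the source. The function names are assumptions for this example, and the source text is the Sherlock Holmes passage quoted earlier in the report rather than Paradise Lost.]

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that directly follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, choosing a random recorded follower each step."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break  # the last word of the source has no follower
        out.append(rng.choice(followers))
    return " ".join(out)

source = ("How could you tell that they would make their attempt tonight? "
          "If they fire, Watson, have no compunction about shooting them down.")
chain = build_chain(source)
salad = generate(chain, "they", 12, random.Random(0))
```

With a source this short there is little room to wander, since only the word they has more than one follower. Run over a whole book, the same procedure produces the kind of word salad Graham reads above.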
KESTENBAUM: Graham says the filtering technique still works pretty well, because the great authors of old use different words than usually appear in modern email. The word Bolshevism, he says, turns out to be a very good indicator of spam.
Still, some spam squeezes through, and the results can be pretty entertaining. Read a line like, at this moment, a door in the tapestry opened, and you want to find the original book. What door? What tapestry?
Greg Newby at Project Gutenberg says one spam caught his interest. It contained these mysterious lines:
NEWBY: No civilians, no outsiders. This is a secret operation all the way. It is now, but it goes public in seven days. All we do is invent a cover story.
KESTENBAUM: He'd love to read the book, but he says he can't figure out where the text is from.
David Kestenbaum, NPR News.
INSKEEP: This is MORNING EDITION from NPR News. I'm Steve Inskeep.
RENEE MONTAGNE, Host:
And I'm Renee Montagne.