Yesterday I was talking to Robert Scoble about context-aware computing and we ended up on the topic of computer analysis of text. I’ve done some work in this area over the years for author attribution of ancient texts, cheating detection in college and professional essay exam scenarios, and sentiment and mood analysis. A technique common to authorship studies is statistical stylometry, which aims to quantify linguistic style. Subtle but persistent differences between texts written by different authors, even when they write about the same topic or in the same genre, often emerge from statistical analysis of their writings.
Robert was surprised to hear that Ted Kaczynski, the Unabomber, was caught because of linguistic analysis done not by computer but by Kaczynski’s brother and sister-in-law. Contrary to stories circulating in the world of computational linguistics and semantics, computer analysis played no part in getting a search warrant or prosecuting Kaczynski. It could have, but Kaczynski pleaded guilty before his case went to trial. The FBI did hire James Fitzgerald, a forensic linguist, to compare Kaczynski’s writings to the Unabomber’s manifesto, and Fitzgerald’s analysis was used in the case against him.
Analysis of text has uses beyond author attribution. Google’s indexing and search engine relies heavily on discovering the topic and contents of text. Sentiment analysis tries to guess how customers like a product based on their tweets and posts about it. But algorithmic sentiment analysis is horribly unreliable in its present state, failing to distinguish positive and negative sentiments the vast majority of the time. Social media monitoring tools have a long way to go.
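To make the failure concrete, here is a minimal sketch of the crudest kind of sentiment scoring, a word-list lookup. The tiny lexicon and the `naive_sentiment` function are my own illustrative assumptions, not the method of any particular monitoring tool.

```python
import re

# Tiny hypothetical sentiment lexicon; real tools use far larger ones.
POSITIVE = {"good", "great", "love", "kind"}
NEGATIVE = {"bad", "awful", "hate", "disaster", "sick", "toxic"}

def naive_sentiment(text: str) -> int:
    """Score text as (positive word count) - (negative word count), ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# The negation is ignored, so this complaint gets a positive score.
print(naive_sentiment("This movie was not good"))

# A slang compliment is scored as negative.
print(naive_sentiment("That trick was sick"))
```

Both examples come out with the wrong sign, because a word count knows nothing about negation or about who is speaking.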
The problem stems from the fact that human speech and writing are highly evolved and complex. Sarcasm is common, and relies on context to reveal that you’re saying the opposite of what you mean. Subcultures have wildly different usage for overloaded terms. Retirees rarely use “toxic” and “sick” as compliments like college students do. Even merely unwinding phrases to determine the referent of a negator is difficult for computers. Sentiment analysis and topic identification rely on nouns and verbs, which are only sometimes useful in authorship studies. Consider the following sentences:
1) The twentieth century has not been kind to the constant human striving for a sense of purpose in life.
2) The Industrial Revolution and its consequences have been a disaster for the human race.
The structure, topic, and sentiment of these sentences are similar. The first is a quote from Al Gore’s Earth in the Balance (1992). The second is the opening statement of Kaczynski’s 1995 Unabomber manifesto, “Industrial Society and Its Future.” Even using the total corpus of works by Gore and Kaczynski, it would be difficult to guess which author wrote each sentence. However, compare the following paragraphs, one from each of these authors:
1) Modern industrial civilization, as presently organized, is colliding violently with our planet’s ecological system. The ferocity of its assault on the earth is breathtaking, and the horrific consequences are occurring so quickly as to defy our capacity to recognize them, comprehend their global implications, and organize an appropriate and timely response. Isolated pockets of resistance fighters who have experienced this juggernaut at first hand have begun to fight back in inspiring but, in the final analysis, woefully inadequate ways.
2) It is not necessary for the sake of nature to set up some chimerical utopia or any new kind of social order. Nature takes care of itself: It was a spontaneous creation that existed long before any human society, and for countless centuries, many different kinds of human societies coexisted with nature without doing it an excessive amount of damage. Only with the Industrial Revolution did the effect of human society on nature become really devastating.
Again, the topic, mood, and structure are similar. Who wrote which? Lexical analysis immediately identifies paragraph 1 as Gore and paragraph 2 as Kaczynski. Gore uses the word “juggernaut” twice in Earth in the Balance and once in The Assault on Reason. Kaczynski never uses the word in any published writing. Fitzgerald (“Using a forensic linguistic approach to track the Unabomber”, 2004) identified “chimerical,” along with “cool-headed logician,” as Kaczynski signatures.
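The marker-word check described above can be sketched in a few lines. This is only an illustration: the `MARKERS` table hard-codes the terms mentioned in this post, whereas a real study would derive candidate markers from each author's full corpus, and `marker_hits` is a hypothetical helper name.

```python
# Marker terms taken from the discussion above; a real study would derive
# them from each author's corpus rather than hard-coding them.
MARKERS = {
    "Gore": ["juggernaut"],
    "Kaczynski": ["chimerical", "cool-headed logician"],
}

def marker_hits(text: str) -> dict[str, list[str]]:
    """Return, for each candidate author, the marker terms found in the text."""
    lowered = text.lower()
    return {author: [m for m in terms if m in lowered]
            for author, terms in MARKERS.items()}

p1 = "Isolated pockets of resistance fighters who have experienced this juggernaut at first hand"
p2 = "It is not necessary for the sake of nature to set up some chimerical utopia"

print(marker_hits(p1))  # only Gore's marker list is non-empty
print(marker_hits(p2))  # only Kaczynski's marker list is non-empty
```

A hit like this is suggestive, not proof. The absence of a word from someone's published writing only counts for much when the corpus is large.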
Don’t make too much – as some of Gore’s critics do – of the similarity between those two paragraphs. Both writers have advanced degrees from prestigious universities, share an interest in technology and the environment, and are roughly the same age. Reading further in the manifesto reveals a great difference in attitudes. Though algorithms would have a hard time with it, few human readers would attribute the following excerpt to Gore (it caught my eye because of its apparent reference to Thomas Kuhn, discussed a lot here recently – Kuhn and Kaczynski were both professors at UC Berkeley):
Modern leftist philosophers tend to dismiss reason, science, objective reality and to insist that everything is culturally relative. It is true that one can ask serious questions about the foundations of scientific knowledge and about how, if at all, the concept of objective reality can be defined. But it is obvious that modern leftist philosophers are not simply cool-headed logicians systematically analyzing the foundations of knowledge.
David Kaczynski, Ted’s brother, describes his awful realization about the similarity between his brother’s language and that used in the recently published manifesto:
“When Linda and I returned from our Paris vacation, the Washington Post published the Unabomber’s manifesto. After I read the first few pages, my jaw literally dropped. One particular phrase disturbed me. It said modern philosophers were not ‘cool-headed logicians.’ Ted had once said I was not a ‘cool-headed logician’, and I had never heard anyone else use that phrase.”
And on that basis, David went to the FBI, who arrested Ted in his cabin. It’s rare that you’re lucky enough to find such highly distinctive terms in authorship studies, though. In my statistical stylometry work, I looked for unique or uncommon 2- to 8-word phrases (“rare pairs”, etc.) used by only two people in a larger population, and detected unwanted collaboration by that means. Most of my analysis, and that of experts far more skilled in this field than I, is not concerned with content. Much of it centers on function-word statistics: usage of pronouns, prepositions and conjunctions. Richness of vocabulary, rate of introduction of new words, and vocabulary frequency distribution also come into play. Some recent, sophisticated techniques look at characteristics of zipped text (which obviously does include content), and use Markov chains, principal component analysis and support vector machines.
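Here is a rough sketch of three of the simpler measures just mentioned: function-word frequencies, a crude vocabulary-richness score, and phrases shared by exactly two writers in a group. The short `FUNCTION_WORDS` list, the tokenizer, and all function names are illustrative assumptions of mine, not the tools I or anyone else actually used. A serious study would use a much longer word list and would also check phrase rarity against a reference corpus.

```python
import re
from collections import Counter
from itertools import combinations

# A small illustrative function-word list; published studies use longer ones.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "but", "by", "on", "with"]

def tokens(text: str) -> list[str]:
    """Lowercased words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def function_word_profile(text: str) -> dict[str, float]:
    """Relative frequency of each function word (occurrences per token)."""
    toks = tokens(text)
    counts = Counter(toks)
    total = len(toks) or 1
    return {w: counts[w] / total for w in FUNCTION_WORDS}

def type_token_ratio(text: str) -> float:
    """Crude vocabulary richness: distinct words divided by total words."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def phrases(text: str, n_min: int = 2, n_max: int = 8) -> set[tuple[str, ...]]:
    """All word n-grams of length n_min..n_max in the text."""
    toks = tokens(text)
    return {tuple(toks[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(toks) - n + 1)}

def phrases_shared_by_exactly_two(docs: dict[str, str]) -> dict[tuple[str, str], set]:
    """Map each pair of authors to the phrases that they, and nobody else, use."""
    by_author = {name: phrases(text) for name, text in docs.items()}
    # Each author contributes a phrase at most once, so this counts authors.
    usage = Counter(p for ps in by_author.values() for p in ps)
    result = {}
    for a, b in combinations(sorted(docs), 2):
        shared = {p for p in by_author[a] & by_author[b] if usage[p] == 2}
        if shared:
            result[(a, b)] = shared
    return result

# Made-up essays from three hypothetical exam takers.
docs = {
    "alice": "he was never a cool-headed logician about money",
    "bob": "critics are not a cool-headed logician among them",
    "carol": "she was never worried about money",
}
for pair, shared in phrases_shared_by_exactly_two(docs).items():
    print(pair, sorted(shared))
```

On texts this short, ordinary phrases like “about money” show up as shared too, which is why the rarity check against a larger population matters.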
Statistical stylometry has been put to impressive use with startling and unpopular results. For over a century people have been attempting to determine whether Shakespeare wrote everything attributed to him, or whether Francis Bacon helped. More recently, D. I. Holmes showed rather conclusively, using hierarchical cluster analysis, that the Book of Mormon and Book of Abraham both arose from the prophetic voice of Joseph Smith himself. Mosteller and Wallace differentiated, using function-word frequency distributions, the writing of Hamilton and Madison in the Federalist Papers. Stylometric studies have also shown clear literary fingerprints in the writings of Jane Austen, Arthur Conan Doyle, Charles Dickens, Rudyard Kipling and Jack London. And for real fun, look into New Testament authorship studies.
Computer analysis of text is still in its infancy. I look forward to new techniques and new applications for them. Despite false starts and some exaggerated claims, this is good stuff. Given the chance, it certainly would have nailed the Unabomber. Maybe it can even determine what viewers really think of a movie.