Initially, Matthew Gerber didn't believe Twitter could help predict where crimes might occur. For one thing, Twitter's 140-character limit leads to slang and abbreviations and neologisms that are hard to analyze from a linguistic perspective. Beyond that, while criminals occasionally taunt law enforcement via Twitter, few are dumb or bold enough to tweet their plans ahead of time. "My hypothesis was there was nothing there," says Gerber.
But then, that's why you run the data. Gerber, a systems engineer at the University of Virginia's Predictive Technology Lab, did indeed find something there. He reports in a new research paper that public Twitter data improved the predictions for 19 of 25 types of crime that occurred early last year in metropolitan Chicago, compared with predictions based on historical crime patterns alone. Predictions for stalking, criminal damage, and gambling saw the biggest bump.
"I was surprised," says Gerber. "In the thousands of tweets that I've read, you don't see people saying things like, 'I'm going to rob somebody tonight.'"
The experiment began with Gerber collecting more than 1.5 million public tweets tagged with GPS coordinates within the city limits between January and March 2013. (Important privacy side note: Twitter users must opt in to GPS tagging.) Meanwhile, he gathered information on all the documented crimes that occurred over that same period.
Next, Gerber created a computer algorithm that separated the tweets into 1-kilometer-by-1-kilometer neighborhoods, then analyzed the content of the tweets in each neighborhood to find out what people were tweeting about. The content was then lumped into hundreds of "topics." For instance, the foremost topic in the neighborhood around Chicago O'Hare pertained to travel, with tweets including words like gate, plane, flight, and of course, delayed.
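For readers who want to see what that first step might look like, here is a rough Python sketch of sorting geotagged tweets into 1-kilometer squares and tallying the words in each. It is not Gerber's code. The function names, the southwest origin point, and the sample tweets are all made up for illustration, the distance math is a flat-map approximation, and plain word counts stand in for the more sophisticated "topics" his software produces.

```python
import math
from collections import Counter, defaultdict

# Rough conversion factor; good enough for a city-sized area (assumption).
KM_PER_DEGREE = 111.32


def cell_for(lat, lon, origin_lat, origin_lon, cell_km=1.0):
    """Map a GPS point to a (row, col) grid square, counted from a southwest origin."""
    km_north = (lat - origin_lat) * KM_PER_DEGREE
    # Degrees of longitude shrink with latitude, so scale by cos(latitude).
    km_east = (lon - origin_lon) * KM_PER_DEGREE * math.cos(math.radians(origin_lat))
    return (math.floor(km_north / cell_km), math.floor(km_east / cell_km))


def words_by_cell(tweets, origin_lat, origin_lon):
    """Count the words used in each grid square.

    `tweets` is an iterable of (lat, lon, text) tuples.
    """
    counts = defaultdict(Counter)
    for lat, lon, text in tweets:
        cell = cell_for(lat, lon, origin_lat, origin_lon)
        # Naive whitespace tokenization; real tweets need more careful cleanup.
        counts[cell].update(text.lower().split())
    return counts


# Hypothetical sample data: two tweets in one square, one in the square to the north.
ORIGIN = (41.6, -87.95)
sample = [
    (41.6, -87.95, "flight delayed"),
    (41.6, -87.95, "gate changed and delayed again"),
    (41.6135, -87.95, "bulls game tonight"),
]
counts = words_by_cell(sample, *ORIGIN)
top_word, top_count = counts[(0, 0)].most_common(1)[0]
```

In this toy data the most common word in square (0, 0) is "delayed," which is the same kind of signal as the travel topic around O'Hare.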
Things get a bit technical from here. In basic terms, Gerber's model compared topics in a neighborhood to the historical crime data from that same spot in the city for a given month. The model formed correlations between topics and crimes, then used those correlations to predict crime in the same neighborhood for a subsequent month. The method is similar to the way Google Flu Trends uses search terms to predict outbreaks.
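To make the idea concrete, here is a deliberately simplified Python sketch of that kind of comparison. It is a toy stand-in, not the model from Gerber's paper: it scores each word by the average number of crimes in the squares where the word showed up during one month, then blends that score with each square's past crime count to rank squares for the following month. The function names, the 50-50 blend, and all of the numbers are assumptions for illustration.

```python
from collections import defaultdict


def word_crime_rates(cell_words, cell_crimes):
    """For each word, average the crime count of the squares where it appeared."""
    totals = defaultdict(float)
    seen = defaultdict(int)
    for cell, words in cell_words.items():
        for word in set(words):
            totals[word] += cell_crimes.get(cell, 0)
            seen[word] += 1
    return {word: totals[word] / seen[word] for word in totals}


def predict(cell_words, rates, history, weight=0.5):
    """Blend each square's past crime count with the mean rate of its known words.

    `weight` is the share given to the Twitter-based score (an arbitrary choice here).
    Words never seen in the training month are ignored.
    """
    scores = {}
    for cell, words in cell_words.items():
        known = [rates[w] for w in set(words) if w in rates]
        text_score = sum(known) / len(known) if known else 0.0
        scores[cell] = (1 - weight) * history.get(cell, 0) + weight * text_score
    return scores


# Hypothetical data: words and crime counts per grid square for one month...
month1_words = {(0, 0): ["bulls", "game"], (0, 1): ["flight", "gate"]}
month1_crimes = {(0, 0): 4, (0, 1): 0}
rates = word_crime_rates(month1_words, month1_crimes)

# ...and the words seen in the same squares the following month.
month2_words = {(0, 0): ["bulls", "tonight"], (0, 1): ["flight", "delayed"]}
scores = predict(month2_words, rates, month1_crimes)
```

With these made-up numbers, the square full of game-night chatter scores higher than the square full of airport chatter. Setting `weight` to 0 reduces the prediction to past crime counts alone, which is the kind of history-only baseline the Twitter-informed version would be measured against.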
"So in the past maybe there's a cluster of thefts that occurred in a particular neighborhood," says Gerber. "The model will take that and say that's a neighborhood that's really prone to theft, now let's look at the Twitter content that's been generated. It's looking at that cluster of theft, and that Twitter content, and it's saying ok these words are highly associated with theft. And it makes a prediction on that basis."
For 19 types of crime that occurred during these months in Chicago, Gerber's model did a better job of predicting them than the historical crime data did alone. Of course, the method says nothing about why Twitter data improved the predictions. Gerber speculates that people are tweeting about plans that correlate highly with illegal activity, as opposed to tweeting about crimes themselves.
Let's use criminal damage as an example. The algorithm identified 700 Twitter topics related to criminal damage; of these, one topic involved the words "united center blackhawks bulls" and so on. Gather enough sports fans with similar tweets and some are bound to get drunk enough to damage public property after the game. Again, this scenario extrapolates far beyond what the data show, but it offers a possible window into the algorithm's predictive power.
From a logistical standpoint, it wouldn't be too difficult for police departments to use this method in their own predictions; both the Twitter data and the modeling software Gerber used are freely available. The big question, he says, is whether a department uses the same historical crime "hot spot" data as a baseline for comparison. If not, a new round of tests would have to be done to show that the addition of Twitter data still offers a predictive upgrade.
There's also the matter of public acceptance. Data-driven crime prediction tends to raise any number of civil rights concerns. In 2012, privacy advocates criticized the FBI for a similar plan to use Twitter for crime predictions. In recent months the Chicago Police Department's own methods have been knocked as a high-tech means of racial profiling. Gerber says his algorithms don't target any individuals and only cull data posted voluntarily to a public account.
"We lump everybody together and look at the aggregate of what are people talking about in this neighborhood," he says. "In that sense it feels a little more innocent than what people might immediately imagine when they hear about this kind of work."