Checking in on the latest advancements, and the challenges that remain.
There’s been no shortage of hype about the relationship between cities and data, especially so-called big data. For many tech companies, city governments, and even a growing number of urbanists, data promises to improve all manner of urban functions, from predictive policing to traffic flow to energy efficiency.
An even bigger potential role for new kinds of data lies in helping researchers and policy-makers better understand how cities and neighborhoods grow and evolve—but only if done right.
The legitimately exciting use of new data
A growing number of researchers are using data from internet sources such as Google, Twitter, and Yelp to develop new insights into cities and urban change. The sociologists Robert Sampson and Jackelyn Hwang have used Street View images to examine the role of race in the process of gentrification and neighborhood transformation. Similarly, a study from the U.K. Spatial Economics Research Centre used geo-tagged photos on Flickr to determine levels of urbanity in London and Berlin. Mobility data from Uber and Lyft—and even taxicabs—has also been used in several recent studies, which my CityLab colleague Laura Bliss and former colleague Eric Jaffe have chronicled. Data from real estate sites such as Zillow and Trulia is also being used to analyze housing price trends across neighborhoods, cities, and metro areas.
Other research has used reviewer data from Yelp to study gentrification and unequal urban consumption patterns. One study used Yelp reviews to shed light on the connection between gentrification and race in Brooklyn. Another NBER study employed Yelp data to find out how ethnic and racial segregation affects consumption levels in New York City.
Twitter data has been used to chart regional preferences and patterns of behavior. A study from the Oxford Internet Institute mapped the flow of online content and ideas across cultures. The cartography blog Floating Sheep has used data from Twitter, Google, and Wikipedia to map everything from beer and pizza to weed, bowling, and strip clubs. And my own team has used data from MySpace to track the leading centers for popular music genres across the U.S. and the world.
More recently, a team of Italian researchers combined data from Foursquare and OpenStreetMap, among other sources, to test Jane Jacobs’ theories of urban vitality and diversity in six Italian cities. Their study confirmed many of Jacobs’ key insights about the importance of short blocks, mixed land uses, walkability, dense concentrations of talented workers, and urban public spaces.
In addition to data from websites, satellite data offers the possibility of amassing systematic and comparable data across global cities (little, if any, has been previously available). Several studies (including my own) have used satellite data to get at the economic output of cities and metros around the world. And a 2012 study in the American Economic Review uses light emissions from satellites as a proxy for the spatial organization and economic size of global cities. While this data is subject to considerable limits, it provides at least rough estimates of the overall size and economic scale of cities across the world.
Accurately characterizing “big data”
Not all data from new sources qualifies as “big data,” which, as its name implies, refers to truly massive amounts of information. Max Nathan of the London School of Economics breaks down actual big data into three key categories: internet data from sites like Yelp, Twitter, or Google, along with other commercial data; government-sponsored data collected by cities or towns; and Census and related data. One example is a 2014 NESTA study, which used big data from the London-based firm Growth Intelligence to map patterns of information and technology businesses in the U.K. Another comes from a forthcoming study in the American Journal of Sociology, which uses data from millions of 3-1-1 service requests to examine neighborhood conflict among residents of different ethnicities.
According to Nathan, big data can be thought of in terms of “four Vs”: variety, volume (millions or billions of observations), velocity (real-time data), and veracity (raw data). Actual big data often requires data analytics methods like machine learning to process and derive meaning from such large troves of information. The ongoing Livehoods Project from the School of Computer Science at Carnegie Mellon University, for instance, uses machine learning to analyze 18 million check-ins on Foursquare to determine the structure and characteristics of eight different cities. When used appropriately, big data and new data analytics can help researchers discern urban structures and patterns that traditional data and methods might not uncover on their own.
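To make the idea of finding urban structure in check-in data concrete, here is a minimal sketch in Python. It is not the Livehoods method, which is not described here; it assumes a much simpler approach in which two venues are linked when they draw a similar set of visitors, and linked venues are then grouped together. The check-in records, venue names, and the 0.5 threshold are all hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user, venue) check-ins; real studies work with millions of rows.
CHECKINS = [
    ("ana", "cafe_a"), ("ana", "bar_a"), ("ana", "cafe_a"),
    ("ben", "cafe_a"), ("ben", "bar_a"),
    ("cho", "gym_b"), ("cho", "deli_b"),
    ("dev", "gym_b"), ("dev", "deli_b"), ("dev", "deli_b"),
]


def venue_vectors(checkins):
    """Return {venue: {user: number of check-ins}}."""
    vectors = defaultdict(dict)
    for user, venue in checkins:
        vectors[venue][user] = vectors[venue].get(user, 0) + 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse {user: count} vectors."""
    dot = sum(count * b[user] for user, count in a.items() if user in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_venues(checkins, threshold=0.5):
    """Group venues whose visitor overlap meets an assumed threshold."""
    vectors = venue_vectors(checkins)
    venues = sorted(vectors)

    # Link every pair of venues with sufficiently similar visitors.
    neighbors = {v: set() for v in venues}
    for i, v in enumerate(venues):
        for w in venues[i + 1:]:
            if cosine(vectors[v], vectors[w]) >= threshold:
                neighbors[v].add(w)
                neighbors[w].add(v)

    # Collect connected components of the resulting venue graph.
    groups, seen = [], set()
    for v in venues:
        if v in seen:
            continue
        stack, group = [v], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.append(current)
            stack.extend(neighbors[current] - seen)
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    for group in group_venues(CHECKINS):
        print(group)
```

In this toy data, the two venues visited by the same pair of users fall into one group and the remaining two into another. A real analysis would also need to account for geography, data quality, and who does and does not use such apps.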
A particularly good example of the use of big data is a recent NBER study by Harvard and MIT researchers, which uses computer vision to better understand geographic differences in income and housing prices. Although the paper covers plenty of ground, perhaps the most interesting section involves the use of Google Street View to predict income levels and housing prices in Boston and New York between 2007 and 2014. The study links 12,200 images of New York City and over 3,600 images of Boston to data on median family income and home values from the 2006-2011 American Community Survey. It then examines the extent to which the positive physical attributes shown in these images (such as size and green space) attract more affluent residents and predict incomes and housing prices.
Ultimately, the study finds that “images can predict income at the block group level far better than race or education does.” The study notes that a key purpose of big data is to help illuminate the role of smaller geographic areas in our urban economies, which are harder to get at with traditional Census data. The authors conclude that big data offers “some hope that Google Street View and similar predictors will enable us to better understand patterns of wealth and poverty worldwide.”
Problems and limitations
While big data may ultimately advance both our observation of cities and our theories about them, a growing number of scholars urge caution in using it. A 2014 workshop, which brought together 40 or so leading urban social scientists and data users, identified six key issues surrounding big data, spanning data quality and compatibility, the use of new analytical techniques, and questions of privacy and security. As the workshop summary notes:
Developing theory to go with the new methods and data is critical, and is often sidelined. Engineering and control theory (or big data “without theory”) work well when there is a measurable outcome, a simple policy to correct for it, and fast enough reaction time that the correction can be implemented while it is still appropriate. In cities, this is the process used to optimize service delivery. But this theory does not work well for complex systems with long time horizons, like most social systems.
In other words, big data and new data analytics are only as good as the questions we pose and the theories we generate to make sense of them. No matter how powerful they may be, new data sources and analytic techniques are no real substitute for nuanced human reasoning about cities. The real power, of course, lies in using these new tools to test and deepen the insights of cutting-edge urban theory. My own hope is that we can eventually combine them in ways that deepen our understanding of the underlying “urban genomics” of neighborhoods, cities, and urban areas.