
Graphing New Yorkers' Lives Through the Open Data Portal

The I Quant NY blog mines NYC's massive data clearinghouse to visualize issues facing city dwellers, from education to eating.

Ben Wellington broke down Citi Bike users by gender using NYC's Open Data Portal. (I Quant NY)

Ben Wellington is the man behind I Quant NY, a blog dedicated to telling the stories hidden in New York City’s Open Data Portal, a clearinghouse of more than 1,300 data sets from city agencies. Started by the city government in 2011, the open data initiative’s goal is to facilitate government transparency and increase civic engagement.

The blog itself comes out of a stats course Wellington teaches at Pratt Institute’s graduate program for city and regional planning, where he uses these data sets in coursework. The posts cover everything from gender divides in Citi Bike usage to the point in Manhattan farthest from a Starbucks, but Wellington’s larger mission is to get people thinking critically about the numbers that, if analyzed right, can be the key to understanding New York City.

He spoke to CityLab about his blog, his hope for the open data movement, and some of his favorite data sets.

Your statistics course at Pratt uses material that’s not in any textbook. How did you end up teaching it?

Learning about statistics can be quite painful. If you're coming in from a non-quantitative area, it's not why you went to school. So, I pitched a course to Pratt based on New York City Open Data. So instead of saying, "Bill had this test, Jane took this test, what's the difference?" you're studying changes in homelessness rates and things which are actually much more relevant to planners.

And the blog came out of the course?

The impetus of the class was to make statistics more interesting and point out the ways that data can be part of policymaking. So the class is about showing that and the blog is about doing it.

Every post comes with a straightforward graph and an explanation of your methodology. Are you deliberately trying to keep the data easily understandable and teachable?

It's impossible to make sure that people don't misinterpret things. When you look at a data set, you have to be able to figure out what the data says, not what the person who made it thinks that it should say. Those are usually two different things. There are lots of tricks people use about the way they build graphs to tell different stories with the same data. What's most important to me is that people stop and critically think, as a consumer, about what it is that they are seeing and ask questions.

When it comes to visualizations and videos, people like to see beautiful data moving around everywhere. If you cram too much in, you leave saying, “Wow, that looked cool,” but don't necessarily have a story to tell.

NYC restaurant inspection results. A ratings are in green, B ratings in blue, and C ratings in red.
(I Quant NY)

Your analysis of Department of Health restaurant grading data suggested there was evidence of grade inflation on the part of health inspectors—you even offered a policy recommendation on how to fix this problem. Data analyses often stop at just presenting the data. Why did you offer a policy recommendation here?

In a perfect world I would keep everything to myself. The health inspection scores were an example where you're really seeing a big spike of grades at the very lowest of the A compared to the highest B, and my best explanation for it was grade inflation. I could be wrong, there are possibly other reasons. But if that's the case, one way you can get around that is to have two inspectors come in and add their scores together.

Only the Department of Health would know what's actually going on. I don’t have the granularity to see and I don't know who the inspectors are, so I can't analyze this, but maybe I've got a conversation going at that agency. Hopefully I did.
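
For readers who want to try the boundary check Wellington describes, here is a minimal sketch of one way it might be done. This is not his published method. The cutoff comes from the Department of Health's grading rules (0 to 13 violation points is an A, 14 to 27 is a B, lower is better), which the interview itself does not spell out. The function name `boundary_ratio` is hypothetical, and the scores are invented for illustration.

```python
from collections import Counter

# Assumption: 13 is the worst score that still earns an A under the
# Department of Health's grading rules (lower scores are better).
A_MAX = 13


def boundary_ratio(scores, cutoff=A_MAX):
    """Count inspections scoring exactly at the cutoff for each one just past it.

    A ratio far above 1 means a pile-up on the favorable side of the line,
    the kind of spike described in the interview.
    """
    counts = Counter(scores)
    just_inside = counts[cutoff]       # e.g. 13 points, the last A
    just_outside = counts[cutoff + 1]  # e.g. 14 points, the first B
    if just_outside == 0:
        # No inspections just past the cutoff, so a ratio is undefined
        # (or unbounded if there is a pile-up just inside it).
        return float("inf") if just_inside else None
    return just_inside / just_outside


# Invented scores for illustration only; this is not real inspection data.
example_scores = [9, 10, 11, 12, 12, 13, 13, 13, 13, 13, 13, 14, 15, 16, 17]
print(boundary_ratio(example_scores))
```

A high ratio alone does not prove inflation. As Wellington says, there could be other explanations, and only inspector-level data could separate them.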

Expectations of parents, teachers, and students in New York City public and charter schools.
(I Quant NY) 

One of your most compelling posts looked at the way people viewed charter schools and public schools. You noticed that the biggest difference might come down to the “parent factor”—that charter-school parents have higher expectations and ratings for the academic experience at their children’s charter schools compared to parents’ views of their children’s public schools. What's the story there?

It was a very small, simple look at data. It basically showed that, when you talk about the academic experience, there's no discernible difference between charter schools and public schools [in] the way students and teachers are interpreting how good of an education students are getting. But [the charter school] parents think that charter schools are better.

What does it mean? It's not clear, but there's a lot of talk about the parent involvement in charter schools and the "cream-skimming" effect, which is the idea that charter schools will take away the best students from public schools because, inherently, charter schools are going to have parents that are more involved because of the fact that they went and found that charter school and entered the lottery.

You're hitting pretty much every touchstone issue in New York City. How do you decide what topics to explore?

I keep an eye on what the current policy issues are, and if I can add to that conversation by doing some quick data analysis then I'm excited to do that. And then generally, I just browse the New York City data portal. You can take a data set like permits given for vending—that doesn't sound very interesting, but you can do all sorts of things. Where do the permit holders live? Where are they operating? How long have they had it? Any data set might have interesting information in it. I'll download it and explore it and see if there’s a story in there. For every post I've probably looked at three or four data sets.

Is there any data set you wish you had better access to?

The list is endless. There are two directions you can go with that. For example, taxi-ride data. That's an example where you actually can get the data, but you have to do a Freedom of Information [Act] Request, and then you have to bring a hard drive to them. The fact that the TLC [Taxi and Limousine Commission] will put stuff on a hard drive for you, relatively speaking, is impressive. But in the spirit of easy access, it's a little disappointing.

The other side of the coin is that the data they do put out is in all different formats. It's not particularly standardized. I don't want to complain because it's great that it's available, but at the same time it would be nice if they had a format that they tried to share. I went to a talk and learned that one of the biggest users of New York City Open Data is New York City, because agencies, for the first time, can actually look at what other agencies are doing.

We think they'd learn from something like the MTA [Metropolitan Transportation Authority]. At first, the MTA would license all the subway data, and they would be like, “You can't use our maps, you can't do this, you can't do that.” Eventually, they changed their minds and let people build apps and all of a sudden you have this really cool ecosystem. I think that is something the city can learn from.

Do you hold your blog to this same standard of openness?

All science, especially computer science, should provide the raw data that they work with. This is not something that's picked up in academia. I always try to lay out my methods so that somebody could do the exact same thing and get the same data. Otherwise, people publish papers, and let's say they have a bug or a mistake—no one will ever know. I think it’s a hugely important culture change that people should be moving towards. In light of the city opening up data, we should all be equally open.
