Quartz

How one linguist used Twitter to map the geography of "filler words."

Every language has filler words that speakers use in nervous moments or to buy time while thinking. Two of the most common of these in English are “uh” and “um.” They might seem interchangeable, but data show that their usage breaks down along surprising geographic lines. Hmm.

The map above shows a preliminary attempt to use the tremendous amount of linguistic data being produced on the web to understand how language works. Jack Grieve, a forensic linguist at Aston University in the U.K., has been looking through 6 billion words collected from Twitter. Following a discussion with fellow linguist Mark Liberman, a prolific blogger who has long been interested in the “um”/“uh” divide, Grieve decided to look through his collection of tweets to see how the two words compared. They started their exploration with data from the United States.

If a county on the map is bright blue or bright pink, its tweets show a clear tendency toward “um” or “uh,” respectively. The purplish colors in between mean that a county’s results leaned one way, but weren’t clear representations of a regional trend.

To uncover the geography of filler words, Grieve ran through the Twitter corpus to find how often tweets from each American county use “um” rather than “uh,” and vice versa. After that, he used an algorithm known as “hot-spot testing” to smooth out the results and make them more meaningful.
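The counting step can be illustrated with a minimal Python sketch. It is not Grieve's code: the function name `um_share_by_county`, the input shape (pairs of county and tweet text), the decision to count stretched spellings such as “ummm,” and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Match "um"/"uh" as whole words; stretched spellings ("ummm", "uhhh") count too.
# Treating stretched forms as the same word is an assumption, not the article's rule.
FILLER = re.compile(r"\b(um+|uh+)\b", re.IGNORECASE)

def um_share_by_county(tweets):
    """tweets: iterable of (county, text) pairs.
    Returns {county: share of "um" among all "um" and "uh" tokens}."""
    counts = defaultdict(Counter)
    for county, text in tweets:
        for match in FILLER.findall(text):
            # Collapse "Ummm" -> "um" and "UHH" -> "uh" before counting.
            counts[county][match[:2].lower()] += 1
    return {
        county: c["um"] / (c["um"] + c["uh"])
        for county, c in counts.items()
        if c["um"] + c["uh"] > 0
    }

# Hypothetical sample data, invented for this example.
sample = [
    ("County A", "Um, no thanks"),
    ("County A", "uh what"),
    ("County A", "ummm okay"),
    ("County B", "Uh, sure"),
]
shares = um_share_by_county(sample)
```

Counties with no filler words at all never appear in the result, which is one reason raw county-level figures can be patchy before any smoothing is applied.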

The smoothed-out version has a lot to say. The regional breakdown is clear, and it doesn’t look much like other maps that try to show where some phenomenon or another is happening in the United States. Grieve said the use of “um” looks to follow the elusive “Midland dialect,” which linguists have suspected follows the Ohio River southwest from central Pennsylvania. That accounts for most of the blue that sweeps from West Virginia all the way to Arizona. Grieve said the “uh” and “um” analysis is the first time his research has shown clear evidence of the Midland dialect.

The map also shows that usage on the West Coast is harder to pin down. The purplish color leans toward “um” in most of California, aside from the Bay Area, but there is no clear winner west of Arizona.

Hot-spot testing explained

Hot-spot testing has a variety of applications in statistics, but the goal is to put individual data points in geographical context to uncover broader tendencies. A retailer might be interested, for example, in knowing whether a new product is selling better in certain parts of the country. Let’s say it’s selling reasonably well at a location surrounded by dozens of stores where almost nobody is buying it. By comparing this store to its neighbors, a hot-spot test can identify this broader region as one where the retailer’s strategy is not working.

The same technique is used to reveal the regional scope of spoken dialects: Grieve compared each county’s “um”/“uh” split to those of several nearby geographical areas. “We do this because dialect data is generally very messy, so this is a way of extracting the underlying regional signal,” Grieve said in an email. To test the algorithm’s validity, he tried hot-spot testing on sets of random data. These tests revealed no trends whatsoever, he said.
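To make the neighbor comparison concrete, here is a small Python sketch of one widely used hot-spot statistic, the Getis-Ord Gi* score. The article does not say which statistic Grieve used, so the choice of Gi* is an assumption, as are the function name, the binary neighbor weights, and the seven made-up areas arranged in a row.

```python
import math

def getis_ord_gi_star(values, neighbors):
    """values: {area: number}; neighbors: {area: list of adjacent areas}.
    Returns {area: Gi* z-score} using binary weights, with each area
    counted as part of its own neighborhood."""
    n = len(values)
    mean = sum(values.values()) / n
    std = math.sqrt(sum(v * v for v in values.values()) / n - mean * mean)
    scores = {}
    for area in values:
        window = set(neighbors.get(area, ())) | {area}
        w = len(window)  # sum of binary weights (and of their squares)
        local_sum = sum(values[a] for a in window)
        denom = std * math.sqrt((n * w - w * w) / (n - 1))
        scores[area] = (local_sum - mean * w) / denom
    return scores

# Hypothetical "um" shares for seven areas in a row: a-b-c-d-e-f-g.
values = {"a": 0.8, "b": 0.7, "c": 0.9, "d": 0.5, "e": 0.2, "f": 0.6, "g": 0.1}
order = list(values)
neighbors = {
    area: [order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)]
    for i, area in enumerate(order)
}
scores = getis_ord_gi_star(values, neighbors)
overall_mean = sum(values.values()) / len(values)
```

Area `f` mirrors the store in the retail example: its own value sits above the overall average, but its neighbors are low, so its score comes out negative and it is grouped with the surrounding low region. A positive score marks an area whose neighborhood runs above average.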

This is what the data look like before hot-spot testing, with a percentage of “um” versus “uh” for each county:

More, uh, possibilities

Geography is not the only possible answer to the “um” versus “uh” mystery. Earlier research by Liberman suggests that women use “um” more often than men. Also, using these words in writing is very different from using them in, uh, person. People on Twitter, for example, often use them to express awkwardness or condescension.

Here’s an “um” that pokes fun at Apple executives, for example:

And an “uh” that does the same for the Obama administration’s view on its authorization to use military force against terrorists:

Nevertheless, the tone of Twitter prose is informal, meaning colloquialisms and linguistic quirks come through. And even if people use “um” and “uh” for snark, they still have to choose one over the other. Grieve is looking to mine the Twitterverse for ever more linguistic insights. We’ll, erm, keep you updated as new data come out.

This post originally appeared on Quartz, an Atlantic partner site.
