Dear Kris: a graph of words

Nearly 20 years ago I was roommates with Open Set contributor Kris Cohen. He went on to be a professor of art history; I ended up in San Francisco as a programmer for startups. This is an age, though, where technologists freely borrow from anthropology, art, and design, and when art and culture are shaped in part by technology and technology’s economic effects. Whenever we see each other, which is too rarely, we have excellent conversations, ones where we examine topics of common interest from perspectives (and using vocabularies) that are often very different. This is the first of a series of letters that I hope will interest both him and Open Set readers.

Dear Kris:

The other day I got a bee in my bonnet about something that I thought would interest you: an easy way to try to look at the cultural balance between men’s and women’s voices. Perhaps you’ll find a way to apply the technique to some of your work. And if not, at least I’ll get to complain about how hard good graphs are to make.

It started with an article in The Guardian’s data journalism blog. They were looking at the frequency of various phrases over time in the New York Times. They had several graphs showing how often particular phrases were used in articles each year, like “Brooklyn” vs “Manhattan” or “Britain” vs “France”. This is the one that really struck me:

NY Times

There’s clearly something going on there. It fits with my hazy intuition that historically we have paid a lot more attention to what men say. In the right side of the graph it has clearly gotten better, but there’s still a dramatic difference. But it leaves me wondering how much this is just the New York Times.

Lately I’ve been reading Lundy Bancroft’s book “Why Does He Do That?: Inside the Minds of Angry and Controlling Men”. It is possibly the most astute thing I have ever read. It’s by a counselor in a program for male domestic abusers. After years of doing that work, he developed a keen nose for bullshit; the book exposes so much about how male abuse works not just as an individual behavior but as a cultural dynamic. At the same time I’ve been listening to an (excellent) audiobook of The Iliad. The two together made it very clear to me how much the things Bancroft abhors are rooted in what Homer praised.

Curious, I turned to Google’s excellent Ngram Viewer. As part of their (now less prominent) quest to “organize the world’s information”, they started scanning books through the controversial Google Books Library Program. At this point, they have more than 30 million books digitized (out of an estimated 130 million that exist), and they say they’ll have them all by decade’s end. The Ngram Viewer lets you look for occurrences of particular phrases. (The name, by the way, comes from computational linguistics. Single words are unigrams. Two-word phrases are bigrams. Three, trigrams. Our nerdy love of generalizing turns that into n-gram. And Google mangles it as Ngram. Remind me of that next time I complain about jargon in your field.)

For example, here’s their graph for postmodern, modernism, and postmodernism:


You’ll note the percentages are minuscule; that’s because they’re comparing against all words used in books. You get larger numbers (and an interesting graph) if you feed it “a, an, the”.

So I fed in “he said” and “she said” for a longer time period and got this:


Which is interesting enough, but I didn’t like the graph much. Before 1650 is kind of a mess. (I figure the problem there is just a relatively small number of books; in the decade 1640-1649, a total of 31 of Google’s scanned books contained “she said”.) But more generally, a lot of the ups and downs of the two lines seem unrelated to what really interested me. I found myself trying to compare the lines.

Happily, Google makes the raw n-gram data available. I downloaded a couple of the bigram files (which they unpoetically list under “2-gram”) and started digging. The files, at hundreds of millions of lines each, are large, but since I only cared about a tiny slice of them, I extracted the bits for “he said” and “she said”, pulled them into a spreadsheet, and started playing around. Pretty quickly, I got to something I liked better:
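For anyone who wants to try the same kind of extraction without a spreadsheet, here is a minimal Python sketch. It is an illustration, not the exact steps used for this letter. It assumes each line of a bigram file is tab-separated and begins with the phrase, the year, and the match count; any further columns are ignored. The function name `extract_counts` and the sample rows are made up for the example.

```python
from collections import defaultdict

# Phrases to pull out of the much larger bigram file.
TARGETS = ("he said", "she said")


def extract_counts(lines, targets=TARGETS):
    """Sum match counts per year for each target phrase.

    Assumes each line is tab-separated and starts with:
    ngram, year, match_count (further columns are ignored).
    """
    counts = {phrase: defaultdict(int) for phrase in targets}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram in counts:  # exact, case-sensitive match
            counts[ngram][int(year)] += int(match_count)
    return {phrase: dict(by_year) for phrase, by_year in counts.items()}


# Invented sample rows, for illustration only (not real Google data).
sample_rows = [
    "he said\t1900\t1000\t400\t90",
    "he ran\t1900\t70\t60\t30",
    "she said\t1900\t250\t120\t60",
    "he said\t1901\t1100\t410\t95",
]
counts = extract_counts(sample_rows)
```

Because the function only iterates over lines, an open file handle can be passed in place of the sample list, so a file with hundreds of millions of lines never has to fit in memory. The match is case-sensitive, so “He said” at the start of a sentence would need its own entry in `TARGETS`.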



Here, instead of seeing just comparisons with the total corpus, we can see the ratio. Now it’s starting to tell a story. I don’t know exactly what the story is, but it’s good enough that I want to start asking other people what they read in it. But as I’m preparing to ask, I realize I’ve done something dumb. I set the top end of the graph to be parity, where “he said” is as common as “she said”. But that’s no limit. As I’m attempting to make a point about sexism, I’m also sustaining it.
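The ratio itself is simple arithmetic: for each year, divide one phrase’s count by the other’s. A minimal Python sketch follows, with a hypothetical function name and invented numbers. Which phrase goes on top is a choice, and the sketch deliberately leaves it to the caller.

```python
def ratio_by_year(numerator_counts, denominator_counts):
    """Return {year: numerator / denominator} for years present in both.

    Years where the denominator is zero are skipped rather than guessed at.
    """
    shared_years = sorted(set(numerator_counts) & set(denominator_counts))
    return {
        year: numerator_counts[year] / denominator_counts[year]
        for year in shared_years
        if denominator_counts[year] > 0
    }


# Invented counts, for illustration only (not real Google data).
he_said = {1900: 1000, 1901: 1100, 1902: 0}
she_said = {1900: 250, 1901: 550, 1902: 3, 1903: 9}

she_per_he = ratio_by_year(she_said, he_said)
```

Nothing in the arithmetic caps the result at 1.0, which is the same point as the parity mistake above: the ceiling was on the graph, not in the data.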

Once I’ve fixed that, I’m still faced with an ugly spreadsheet graph. So I pull the lines into Inkscape and start tinkering. After a while, I end up with the graph below, with which I am pleased as punch. Shades! Colors! Lines! Visual Interest!


I of course think it’s awesome. To double-check, I exercise my user-testing powers and show the graph to a number of people. Universally, the reaction is, “What are all those lines?” And watching their faces as they puzzle over it, I cannot deny: I may understand the graph, but there is no way this will make sense to anybody who hasn’t followed the process. So it’s back to the drawing board. Crying inside, I strip out the “he said” and “she said” lines. It is undeniably clearer, and I spend some time resenting that. But eventually I get something that I think makes the point in a way that people skimming Twitter can still get it:


It still has problems. In particular, most people don’t understand logarithmic scales, but a linear y axis compresses the last century so as to make the differences look irrelevant. I could do it as a series of graphs, but that ruins sharing on Twitter and Facebook.
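The trade-off between the two kinds of axis can be shown with a little arithmetic. The sketch below uses hypothetical helper names and invented ratios. It computes how far up an axis a value sits, as a fraction of the axis height, for a linear and a logarithmic axis.

```python
import math


def linear_position(value, low, high):
    """Fraction of the way up a linear axis running from low to high."""
    return (value - low) / (high - low)


def log_position(value, low, high):
    """Fraction of the way up a log-scaled axis running from low to high."""
    return (math.log10(value) - math.log10(low)) / (
        math.log10(high) - math.log10(low)
    )


# Invented ratios on an axis from 1 to 100: one halving from 20 to 10,
# and another halving from 4 to 2.
early_linear = linear_position(20, 1, 100) - linear_position(10, 1, 100)
late_linear = linear_position(4, 1, 100) - linear_position(2, 1, 100)
early_log = log_position(20, 1, 100) - log_position(10, 1, 100)
late_log = log_position(4, 1, 100) - log_position(2, 1, 100)
```

On the linear axis the first halving takes five times as much vertical space as the second, so changes among small values get squashed. On the log axis both halvings take the same space, which is what makes it useful here and also what makes it unfamiliar.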

Grumbles aside, I can look at this and try to fit it to what history I know. There’s the long, steady rise leading up to women’s suffrage. Then we see a substantial decline that doesn’t really reverse until the late ’60s. The rise slows starting circa Reagan’s election, plateaus during the Clinton years, and declines in the Bush era.

What I like about this graph is that it leaves me with more questions than I started with. Are the events that occur to me truly causes? Symptoms of some underlying cause? Or is it just coincidence? To what extent does the graph represent cultural factors vs economic ones? How much can be explained by shifts in the publishing industry, and how much do those reflect real shifts in society at large? How does this gap persist even though women buy 58% of books?

I have no idea, of course. I’m reminded again how much I and my fellow techies are prone to the streetlight effect, taken from the joke about the drunk who is looking for his car keys under the streetlight even though he lost them half a block down in the dark. So much more comforting to look under the streetlight of hard data than to have to grapple with the darkness of qualitative analysis.

But still, I like the concreteness of this. One of the most infuriating things for me in talking about sexism with people in my industry is the number of them who deny that it exists as a significant problem. Not that this will prevent that, but at the very least I can make it a little harder. One more dumb denial eliminated on the road to them saying, “Well, maybe I need to rethink this a bit.”