Monday, February 26, 2007

Word of the day: Statistic

There has been a lot of sound and fury recently over a statistic that appeared on the front page of the New York Times: that more than half of women in America are without a spouse.

One of the key parts of every economist's training is a very deep understanding of statistics. It was in one of those fundamental classes, pondering the definition of the term "sufficient statistic", that I got a better appreciation of just what exactly a "statistic" actually is. A statistic is a way to represent one set of numbers drawn from the real world (the data) using another, smaller and more manageable set of numbers computed from it. If that other set is "sufficient", then it tells you everything the data can tell you about the question you are asking. Unfortunately, that is typically impossible.
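To make "sufficient" a little more concrete, here is a minimal sketch in Python using the textbook case of yes/no (Bernoulli) data, where the count of ones together with the sample size is a sufficient statistic for the underlying probability. The samples are made up for illustration, and the function names are my own, not from any library.

```python
from math import log

def bernoulli_log_likelihood(data, p):
    """Log-likelihood of a list of 0/1 observations given success probability p."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

def sufficient_statistic(data):
    """For Bernoulli data, (number of ones, sample size) is sufficient for p."""
    return sum(data), len(data)

def log_likelihood_from_statistic(stat, p):
    """The same log-likelihood, computed from the statistic alone."""
    k, n = stat
    return k * log(p) + (n - k) * log(1 - p)

# Two made-up samples: different orderings, same number of ones.
sample_a = [1, 1, 0, 0, 1, 0, 1, 0]
sample_b = [0, 0, 0, 1, 1, 1, 1, 0]

stat_a = sufficient_statistic(sample_a)
stat_b = sufficient_statistic(sample_b)

# Both samples collapse to the same two numbers, and those two numbers
# reproduce the full-data likelihood for any candidate p.
full = bernoulli_log_likelihood(sample_a, 0.3)
reduced = log_likelihood_from_statistic(stat_a, 0.3)
```

Eight numbers shrink to two without losing anything about p. The catch is that this only works because the model is so simple; real-world data rarely cooperates this neatly.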

The problem is that the data you care about is typically multi-dimensional. Very multi-dimensional. To understand whether the coupling habits of American women are changing with time requires a very complex, rich data set, with at least the age distribution of women (+2 dimensions) over time (+1 dimension). That's 3 dimensions at least. A more complete picture would include whether these women were widowed (+1 dimension), gay (+1 dimension), or living longer (+2 dimensions for the life expectancies of each age cohort). In fact, to really get a complete picture, some would say you need to understand the stories of each of the 150 million women in America (+150 million dimensions), each with their own set of characteristics and life histories (++++ dimensions).

The problem is that humans can barely picture 3 dimensions (the real world) and can only readily print 2 dimensions (a graph), and really only a 1-dimensional statistic (a number) can fit into a headline. And if you want to make that headline snappy, you make it 0-dimensional (a binary yes-no factoid).
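The collapse from table to number to factoid can be sketched in a few lines of Python. The counts below are entirely hypothetical, invented to show the mechanics, and are not census figures; the age groups and function names are likewise my own assumptions.

```python
# Hypothetical counts, NOT real data.
# Rows are age groups, columns are two survey years,
# and each cell is (women without a spouse, total women).
table = {
    "15-24": [(80, 100), (85, 100)],
    "25-64": [(25, 100), (30, 100)],
    "65+":   [(40, 100), (45, 100)],
}

def share_by_group(table, year_index):
    """A small table: one share per age group for the chosen year."""
    return {
        group: cells[year_index][0] / cells[year_index][1]
        for group, cells in table.items()
    }

def overall_share(table, year_index):
    """A single number: the share across all age groups combined."""
    without = sum(cells[year_index][0] for cells in table.values())
    total = sum(cells[year_index][1] for cells in table.values())
    return without / total

def headline(table, year_index):
    """A yes-no factoid: is the overall share above one half?"""
    return overall_share(table, year_index) > 0.5

by_group = share_by_group(table, 1)
number = overall_share(table, 1)
factoid = headline(table, 1)
```

In this made-up table the factoid flips from "no" to "yes" between the two years, even though two of the three age groups stay below one half in both. Each step down throws away exactly the detail that would explain why.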

Which is where the New York Times article caused so many problems. They tried to make a big deal out of this arbitrary 0-dimensional statistic (that more than 50% of women are without a spouse). It is the sort of thing newspapers and the media often do, and I quickly dismissed it. And so they were hit with people bringing up all the dimensions they missed.

If only we were built to picture n-dimensional hyperspaces and could fit them into a headline, we wouldn't have this problem. Edward Tufte (admittedly) has come closest, going so far as to try to fit highly multidimensional data in-line with text.
