Wednesday, May 28, 2014

Bad Data Science of Love

I was sorta shocked when I received an invitation to take a course on Data Science from the "data scientist" behind the OkCupid blog. The blog is whimsical and fun, and it seems to have been largely responsible for OkCupid's popularity, but its statistics were largely ungrounded and amateurish. That is fine for a blog, but now he's going to teach these methods as "science"?!

A good example of this junk data science was a recent TED talk featured on NPR. It starts with a common Fermi problem: estimating how many suitable partners there are in your city (so common that I stumbled into this very conversation the one time I visited the Stanford physics lounge). Usually, by adding more and more criteria (your age, right gender, right education, food tastes, physical attractiveness, height), you find there is only one. But those estimates usually exaggerate the scarcity, because they multiply the fractions together as if the criteria were independent, which is not true.
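To make the independence point concrete, here is a rough sketch in Python. Every number in it is made up for illustration, and the way the traits are correlated (a single shared factor, with `shared_weight` as a hypothetical knob) is an assumption, not something taken from the talk. It only shows that when desirable traits are positively correlated, the share of people who pass every filter can be far larger than the naive product suggests.

```python
import math
import random
from statistics import NormalDist


def independent_estimate(population, fractions):
    """Naive Fermi estimate: multiply the population by every pass rate."""
    return population * math.prod(fractions)


def simulated_joint_fraction(fractions, shared_weight, n=100_000, seed=0):
    """Share of simulated people who pass all criteria.

    Assumption (hypothetical model): each trait is a standard normal built as
    shared_weight * common + sqrt(1 - shared_weight**2) * noise, so any two
    traits have correlation shared_weight**2. Each criterion keeps the top
    `fraction` of people on that trait, so the marginal pass rates match
    `fractions` regardless of the correlation.
    """
    rng = random.Random(seed)
    thresholds = [NormalDist().inv_cdf(1.0 - f) for f in fractions]
    own_weight = math.sqrt(1.0 - shared_weight ** 2)
    passed_all = 0
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)
        if all(
            shared_weight * common + own_weight * rng.gauss(0.0, 1.0) > t
            for t in thresholds
        ):
            passed_all += 1
    return passed_all / n


# Illustrative numbers only: four criteria, each passed by 20% of people.
fractions = [0.2, 0.2, 0.2, 0.2]
naive_share = math.prod(fractions)
correlated_share = simulated_joint_fraction(fractions, shared_weight=0.7)
print("naive share:", naive_share)
print("simulated share with correlated traits:", correlated_share)
print("naive head count in a city of 1,000,000:",
      independent_estimate(1_000_000, fractions))
```

Setting `shared_weight=0.0` in the simulation recovers roughly the naive product, which is a quick sanity check that the gap comes from the correlation and not from the simulation itself.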

The speaker then uses an arbitrary formula to optimize for her ideal mate: she made up arbitrary weights for arbitrarily selected characteristics. It is much like the simple scoring systems my mom used to use to pick jobs or things when I was growing up. A useful tip, but it's not science.
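A scoring system of this kind is just a weighted sum. The sketch below is a minimal illustration with invented traits, ratings, and weights (none of them come from the talk). It shows why the arbitrariness matters: the same two candidates swap places depending on which made-up weights you pick.

```python
def score(candidate, weights):
    """Weighted sum of trait ratings; traits missing from a candidate count as 0."""
    return sum(w * candidate.get(trait, 0) for trait, w in weights.items())


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda name: score(candidates[name], weights),
                  reverse=True)


# Hypothetical ratings on a 0-10 scale.
candidates = {
    "A": {"humor": 9, "education": 4, "height": 6},
    "B": {"humor": 5, "education": 9, "height": 7},
}

# Two equally arbitrary weightings of the same traits.
humor_heavy = {"humor": 3, "education": 1, "height": 1}
education_heavy = {"humor": 1, "education": 3, "height": 1}

print(rank(candidates, humor_heavy))
print(rank(candidates, education_heavy))
```

Nothing in the data says which weighting is right, which is the point: the ranking reflects the weights you chose, not anything you learned.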

She then collected data from OkCupid profiles to figure out which are most attractive but, again, did nothing to separate causation from correlation, and most of her conclusions were based on qualitative assessments. Again, interesting, but not science.

A recent Wired story profiled a recent math PhD who at least used machine learning algorithms to construct the optimal profile. There was at least some science there, but it still rested mostly on ad hoc assumptions not grounded in theory.

I guess all this reminds me of the difference between econometrics and statistics. I was recently interviewed for a consulting opportunity and was asked, "Why should we hire you as opposed to a statistician?"

I think the answer is that to really do "science" about behavior, you can't look at data alone; you need theory. You need theory to separate causation from correlation. And on questions of choice, economists have 100 years of thinking hard and formally about the axiomatic primitives behind how choices are made.
