Data Analysis Alone is Not Enough

🔥Kareem Carr🔥 @kareem_carr · Jul 09

AS A STATISTICIAN, I respect scientific expertise because I need it to do my job right. In the early days of the pandemic, I saw a debate develop between those advocating we listen to experts and those who felt we should listen to the data. This is a false dichotomy. Let me say why...

This false dichotomy was concerning to me because analyzing data without scientific expertise is dangerous. Analysis needs to be grounded in real world experience (derived from experiments) of what is plausible. Without this, we can be worse off for having looked at the data.

A lot of people think asking questions of data is a foolproof way of learning more about the world. They think that as long as your statistical methodology is good and you have high-quality data, you will be smarter after you try to answer a question with data than before. This is not true.

You have to be careful about asking questions of data because badly framed questions can smuggle assumptions into your thought processes that actively make you more wrong than if you never tried to answer those questions at all.

"What kind of cheese is the moon made of?" is a bad question that will actively make you stupid. No amount of investigating what cheese most closely matches the properties of the moon will help you here. Good cheese data won't help you. State of the art algorithms won't help you.

The only thing that will help is to reject the question and go back to where you were before you asked it. You need to ask, "What material is the moon made of? Could it be cheese or something else?" This seems obvious, but people make this mistake in data analysis all the time.

"Are COVID-19 infection rates affected by racial genetics?" is potentially one of these kinds of bad questions. In the context of looking at broad patterns between racial populations, it assumes too much. The point here is subtle, so let me illustrate.

Is being likely to have spent time in my childhood home genetic? Presumably, my nuclear family has genes specific to us and we all lived there. My relatives have genes that are more like mine, and they are more likely to have spent time in the house than the general population.

I can promise you that if you get a dataset of people and try to figure out if being closely related to me makes you more likely to have spent time in my family home, you will see a beautiful pattern: my 1st cousins are more likely to have been there than my 2nd cousins, and so on.

This is the genetics version of the cheese question. We assumed that living in the house was genetic, and now we are narrowing down the genetic variants that cause it. This model predicts the past data, but it doesn't predict future data because it's not a causal pattern.

If you put those genes in a new person, they won't instinctively wander off in search of my childhood home. This kind of thing is why we need to be careful about looking at patterns in human data and thinking that these patterns automatically mean something is genetic.
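
The childhood home example can be sketched as a toy simulation. This is only an illustration under made-up assumptions: the visit probabilities, the `SHARED_GENES` values, and every function name below are hypothetical, not taken from any real dataset. In the sketch, visits are caused only by social ties, yet the observed data still show visit rates rising with genetic relatedness, and giving a stranger a cousin's genes changes nothing.

```python
import random

# Hypothetical numbers, chosen only for illustration.
SHARED_GENES = {"first_cousin": 0.125, "second_cousin": 0.03125, "stranger": 0.0}
VISIT_PROB_BY_TIES = {"first_cousin": 0.8, "second_cousin": 0.3, "stranger": 0.01}


def visited_home(shared_genes, social_ties, rng):
    """Return True if this person spent time in the home.

    shared_genes is deliberately ignored: in this toy world,
    only social ties cause visits.
    """
    return rng.random() < VISIT_PROB_BY_TIES[social_ties]


def visit_rate(people, rng):
    """Fraction of (shared_genes, social_ties) pairs that visited."""
    visits = [visited_home(genes, ties, rng) for genes, ties in people]
    return sum(visits) / len(visits)


def observed_rates(n, rng):
    """Observational data: genes and social ties always travel together."""
    return {
        kin: visit_rate([(SHARED_GENES[kin], kin)] * n, rng)
        for kin in SHARED_GENES
    }


def intervention_rate(n, rng):
    """Give strangers a first cousin's genes, leaving social ties unchanged."""
    people = [(SHARED_GENES["first_cousin"], "stranger")] * n
    return visit_rate(people, rng)


rng = random.Random(0)
observed = observed_rates(10_000, rng)
transplanted = intervention_rate(10_000, rng)

for kin, rate in observed.items():
    print(f"observed visit rate, {kin}: {rate:.3f}")
print(f"strangers given first-cousin genes: {transplanted:.3f}")
```

In the observed rows, relatedness looks like a strong predictor of visits. The last row is the experiment: with the genes changed and the social ties left alone, the visit rate stays at the stranger baseline.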

We need more evidence outside of just patterns. We need experiments to help us figure out plausible types of causation and we need to be careful about assuming everything that differs between races is genetic. Not only does it needlessly promote racism, it's often bad science.

Thanks for coming to my TED talk.
