Welcome once again to the fetid den of iniquity that is my blog-post, dear readers. As I mentioned last week, I have been conducting analyses on my second experiment, and progress could be called 'rocky'.

Well, if last week was rocky, it is now firmly ensconced in the Himalayas. Initially, I had performed an analysis of my data while eliminating outliers above and below certain thresholds across all participants and conditions. Statistical significance was achieved after this cleaning was done. However, based on previous research we also deleted outliers based on standard deviations of each participant. Statistical significance vanished.
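For readers who want to see how the two cleaning approaches can disagree, here is a minimal Python sketch. The reaction times, the 200 ms and 1500 ms cutoffs, the 2 SD criterion, and the function names are all invented for illustration; they are not the values or code used in the experiment.

```python
from statistics import mean, stdev

# Hypothetical reaction times in ms, keyed by participant (made-up numbers).
rts = {
    "p1": [420, 450, 470, 480, 510, 530, 560, 1200],
    "p2": [610, 640, 650, 700, 720, 750, 790, 150],
}

def trim_absolute(values, low=200, high=1500):
    """Keep values inside fixed cutoffs shared by every participant."""
    return [v for v in values if low <= v <= high]

def trim_by_sd(values, k=2.0):
    """Keep values within k standard deviations of this participant's own mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

for pid, values in rts.items():
    print(pid, "absolute:", trim_absolute(values))
    print(pid, "per-SD:  ", trim_by_sd(values))
```

In this toy data the 1200 ms trial for p1 survives the fixed cutoffs but is dropped by the per-participant rule, which is one way a result can change between the two cleaning passes.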

Now, understand that before all this, I had forgotten to perform some of the most basic and elementary forms of data cleaning known to psychology before doing an analysis. I was feeling like a bit of a fool.

It transpired that certain participants had data that did not make the slightest bit of sense (infeasible reaction times, etc.), and these were placing rather large spanners in the works of the data. Being the overenthusiastic (and also perhaps cerebrally-challenged) lad that I was, I had not noticed this. My supervisor did, and drew it to my attention. Having received this information, I proceeded to pummel my own head into nearby solid objects located in my office. To forget to deal with outliers, AND to forget to observe individual means (a lesson I was taught in no uncertain terms by my Honours supervisor) is an impressive brain failure, I thought. As a quick aside, I am my own worst critic when it comes to my career. These days, I am unforgiving (perhaps to a fault) when it comes to mistakes such as the aforementioned.

Generally I avoid directly giving advice in this blog, as I prefer people to take whatever insight they wish from the posts (e.g. "why are they allowing this coffee-addicted nutcase into the sunlight?"). However, I will say this; it's not in your interest to be an unforgiving disciplinarian when you make mistakes. The PhD learning curve is a steep one, and if you continue in academia it apparently doesn't shallow out that often. Mentally injuring yourself simply hamstrings your ability to learn from your mistakes. It's also why your supervisor is there; they're good at this stuff. My supervisor has been in academia for almost two decades. I have been a PhD student for 8 months. When I took the data to him, he saw the problem after a few minutes of perusal and basic analysis. Go figure.

The outcome of all this is that I must return to testing, both replacing the faulty data (and finding out how it happened) and expanding the sample size. The rest of the week is filled with testing schedules and importing the new data. Lessons have been learned, shoulders squared, heads un-pummeled and egos deflated (at least a little . . . :D), and the science goes on! See you next week, when hopefully I have new results to speak of!

-Harrison

Nice post. Two thoughts. If after two decades your supervisor doesn't know about some basic issue of data "cleaning" (yes, we have to deal with completely unmotivated tools that only add noise), then you should get a new supervisor.

That's what I figured after being told to go do so.

Second, and more generally, your mentor may not go back to basics and always consider different, unpublished ways of looking at data and thinking about the patterns in it. This is why it's fun to have motivated students, because the job of new blood is to have new ideas, or at least stimulate them. Most people plot means or medians and throw out all of the information about the shape (the higher order moments). That's a lot of lost information, and it matters because chances are you're never actually looking at normally distributed data (gasp!). What you can do is actually plot all of your data, and look at the patterns it (does or doesn't) form. Then you think about how to account for the structure in your data, and about which analyses are appropriate, which depends on which aspects of the data you're trying to explain.

Thinking about data can be a lot more interesting than what is taught in most of the psych stat courses.

Not sure what your first point means; I made all the errors I mentioned here, not him. He was the one who picked them up.

And I'm not sure what your second point is trying to say. Could you re-word it? I'll admit it's been a long day and my brain is halfway out my ears at present. . .

Good post. Don't worry about making mistakes - we all do it at some point. The main thing is to change your habits so that it can't happen again. For example, statisticians encourage making a habit of ALWAYS eyeballing all of your data (descriptive stats, histograms, stem plots, scatter plots, coplots, etc.) across all dimensions before running ANY statistical tests.

I think what Bart is saying is that basing exploratory data analyses on finding statistically significant differences between means or medians of samples relies upon making a LOT of assumptions about your data and the nature of the underlying process/relationship you're investigating (e.g. that it comes from a normal distribution).

Quite often, the shape of the probability distribution is actually more interesting than the actual value of the mean/median - i.e. is it a bell curve (normal distribution), exponential, uniform, or some other distribution? Is it a skewed variant of a typical distribution? Some distributions such as the Cauchy don't even have a concept of a mean - even though you can calculate a "mean" from data (it'd be wrong).
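The Cauchy point can be seen in a small simulation. This is only a sketch: the seed, the sample size, and the helper names `cauchy_sample` and `running_means` are assumptions made for illustration. It draws standard Cauchy values by passing uniform random numbers through the inverse CDF, then prints the sample mean at a few sample sizes alongside the sample median.

```python
import math
import random
from statistics import median

def cauchy_sample(n, seed=0):
    """Draw n standard Cauchy values via the inverse CDF: tan(pi * (u - 0.5))."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def running_means(values, checkpoints):
    """Return {n: mean of the first n values} for each n in checkpoints."""
    out = {}
    total = 0.0
    for i, v in enumerate(values, start=1):
        total += v
        if i in checkpoints:
            out[i] = total / i
    return out

samples = cauchy_sample(100_000)
checkpoints = {100, 1_000, 10_000, 100_000}
for n, m in sorted(running_means(samples, checkpoints).items()):
    print(f"n={n:>6}  sample mean={m: .3f}")
print("sample median:", round(median(samples), 3))
```

The sample median settles close to 0 as the sample grows, whereas the sample mean has no guarantee of settling anywhere: a single extreme draw can drag it around no matter how much data has already been collected.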

Here's a simple example - take the bimodal distribution. It's basically like a normal/bell curve, but with two peaks instead of one. You might get bimodal data if there are two ways some event/process can occur (e.g. a quick simple way versus a long complex way). The interesting thing is that the usual interpretations of the mean, median, and standard deviation are all "wrong": they can still be calculated, but they no longer describe a typical observation or its spread! Tests that rely upon an assumption of normality (e.g. the two-sample t-test) are invalid for this type of data (even though you can easily throw it into a stats package without thinking and it'd happily calculate a p-value for you).

http://en.wikipedia.org/wiki/Bimodal_distribution
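A rough sketch of the bimodal example, using only the Python standard library. The numbers are assumptions chosen for illustration: half the simulated responses come from a "quick" process centred on 400 ms and half from a "slow" one centred on 900 ms, each with a 50 ms spread.

```python
import random
from statistics import mean, median, stdev

rng = random.Random(1)

# Hypothetical mixture: roughly half "quick" responses, half "slow" ones (ms).
data = [
    rng.gauss(400, 50) if rng.random() < 0.5 else rng.gauss(900, 50)
    for _ in range(10_000)
]

m, med, sd = mean(data), median(data), stdev(data)

# Share of observations that actually fall close to the computed mean.
near_mean = sum(abs(x - m) <= 50 for x in data) / len(data)

print(f"mean={m:.0f}  median={med:.0f}  sd={sd:.0f}")
print(f"share of observations within 50 ms of the mean: {near_mean:.4f}")
```

The mean lands in the valley between the two peaks, where almost no observations lie, and the standard deviation mostly reflects the distance between the peaks rather than the spread within either one. A histogram of `data` would show this at a glance, which is the argument for plotting before testing.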

By the way, a quick nitpick... "eliminating outliers above and below certain thresholds" is probably not the best way of explaining what you're doing for data cleaning (I hope not, anyway). Outliers should never be summarily deleted since it changes the shape of the data (it's a bit like cherry picking). I'm sure what you're doing is fine - but it's worth being careful with semantics and terminology.

They are both intended as words of encouragement. The first point is just that your adviser has a lot more experience than you, and has learned from it. S/he probably also didn't initially know about not using crap data, or how to "clean" data, until someone pointed it out to him or her. So don't be hard on yourself.

The second point is more about innovation. Young brains may think of some way of looking at data that hasn't been thought about before. Much of your post is about data analysis, so I was just pointing out that there's always room for innovation there.