16 December 2014

Simulated or Real: What type of data should we use when teaching social science statistics?

I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.

What did we do and why?

Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data: data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.

Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for addressing these warts.

This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.

This leaves students underprepared for actually using statistics in an undirected project (a thesis, a job). Typically, when confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students become frustrated and may never actually succeed in using any of the statistical tools they did learn.

What kind of data?

How does this course fit into a broader social science statistics education?

Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument is that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.

Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:

simulated --- real world

Prepackaged vs. student-generated data

As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course status quo and less commonly taught courses such as mine and (what I think) Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?

Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.

Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.

Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.

On the other hand, I've definitely had the experience Zachary describes when teaching with student-simulated data:

I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
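
His point is easy to demonstrate in a few lines of R. Here is a minimal sketch (not from my course materials; the true mean, sample size, and number of simulations are arbitrary choices) that simulates many studies and counts how often the 95% confidence interval covers the known truth:

    # Simulate many samples from a normal distribution with a known mean,
    # then check how often the 95% CI for the mean covers the true value.
    set.seed(1234)
    true_mean <- 2     # the 'truth' we control
    n_sims <- 5000     # number of simulated studies
    n_obs <- 30        # observations per study

    covered <- replicate(n_sims, {
        x <- rnorm(n_obs, mean = true_mean, sd = 1)
        ci <- t.test(x)$conf.int   # 95% confidence interval
        ci[1] <= true_mean && true_mean <= ci[2]
    })

    mean(covered)   # coverage proportion; should be close to 0.95

Students can then vary the sample size or swap in a skewed data generating distribution and watch what happens to coverage--something that is much harder to show with a single fixed real world data set.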

Still, the distinction that actually matters for thinking about what is more or less effective in social science statistics education is:

student gathered/generated --- instructor gathered/generated

Simulated vs. Real?

There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.

10 December 2014

Set up R/Stan on Amazon EC2

A few months ago I posted the script that I use to set up my R/JAGS working environment on an Amazon EC2 instance.

Since then I've largely transitioned to using R/Stan to estimate my models. So, I've updated my setup script (see below).

There are a few other changes:

  • I don't install/use RStudio on Amazon EC2. Instead, I just use R from the terminal. Don't get me wrong, I love RStudio. But since what I'm doing on EC2 is just running simulations (I handle the results on my local machine), RStudio is overkill.

  • I don't install git anymore. Instead I use source_url (from devtools) and source_data (from repmis) to source scripts from GitHub (see the sketch below). Again, all of the manipulation I do to these scripts happens on my local machine.
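
For example, sourcing a script and loading a data set straight from GitHub looks something like this (the URLs are placeholders--substitute your own repository):

    library(devtools)
    library(repmis)

    # Run a simulation script directly from a GitHub repository:
    source_url("https://raw.githubusercontent.com/USER/REPO/master/simulations.R")

    # Load a plain-text data file from GitHub into a data frame:
    main_df <- source_data("https://raw.githubusercontent.com/USER/REPO/master/main.csv")

Both functions can take a sha1 hash argument, so the sourced file is pinned to a known version--handy for reproducibility.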

4 December 2014

Our developing Financial Regulatory Transparency Index

Here is a presentation I just gave on a work in progress. We are developing a new Financial Regulatory Transparency Index using a Bayesian IRT approach.
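
For anyone curious what a Bayesian IRT setup looks like in code, below is a generic one-parameter sketch in Stan, fit from R with rstan. To be clear, this is not our index's actual model: the variable names (y, jj, kk), the priors, and the structure are purely illustrative.

    library(rstan)

    irt_code <- "
    data {
      int<lower=1> N;                  // observed item responses
      int<lower=1> J;                  // countries
      int<lower=1> K;                  // transparency items
      int<lower=1, upper=J> jj[N];     // country for response n
      int<lower=1, upper=K> kk[N];     // item for response n
      int<lower=0, upper=1> y[N];      // response (1 = item disclosed)
    }
    parameters {
      vector[J] theta;                 // latent transparency scores
      vector[K] beta;                  // item difficulties
    }
    model {
      theta ~ normal(0, 1);            // fixes the latent scale
      beta ~ normal(0, 3);
      for (n in 1:N)
        y[n] ~ bernoulli_logit(theta[jj[n]] - beta[kk[n]]);
    }
    "

    # irt_data would be a named list supplying N, J, K, jj, kk, and y:
    # fit <- stan(model_code = irt_code, data = irt_data, chains = 4, iter = 2000)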

13 October 2014

Do Political Scientists Care About Effect Sizes? Replication and Type M Errors

Reproducibility has come a long way in political science since I began my PhD all the way back in 2008. Many major journals now require that replication materials be made available either on their websites or on a service such as the Dataverse Network.

This is certainly progress. But what are political scientists actually supposed to do with this new information? It does help avoid duplicated effort--researchers don't need to gather data or program statistical techniques that have already been gathered or programmed. It promotes better research habits. It definitely provides "procedural oversight": we would be highly suspicious of results from authors who were unable or unwilling to produce their code/data.

However, there are lots of problems that data/code availability requirements do not address. Apart from a few journals like Political Science Research and Methods, most journals have no standing policy of checking the veracity of replication materials. Reviewers rarely have access to manuscripts' code/data. Even if they did, few reviewers would be willing or able to undertake the time-consuming task of reviewing this material.

Do political science journals care about coding and data errors?

What do we do if someone replicating published research finds clear data or coding errors that have biased the published estimates?

Note that I'm limiting the discussion here to honest mistakes, not active attempts to deceive. We all make these mistakes. To keep it simple, I'm also only talking about clear, knowable, and non-causal coding and data errors.

Probably the most responsible action a journal could take on finding results that are clearly biased by coding/data errors would be to directly adjoin to the original article a note detailing the bias. This way readers will always be aware of the correction and will have the best information possible. It is a more efficient way of getting corrected information out than relying on some probabilistic process in which readers may or may not stumble upon the information posted elsewhere.

As far as I know, however, no political science journal has a written procedure (please correct me if I'm wrong) for dealing with this new information. My sense is that there is a series of ad hoc responses that closely correspond to how the bias affects the results:

Statistical significance

The situation in which a journal is most likely to do anything is when correcting the bias makes the results no longer statistically significant. This might get a journal to append a note to the original article. But maybe not; the journal could just ignore it.

Sign

It might be that once the coding/data bias is corrected, the sign of an estimated effect flips--the result of what Andrew Gelman calls a Type S error. I really have no idea what a journal would do in this situation. It might append a note, or maybe not.

Magnitude

Perhaps the most likely outcome of correcting honest coding/data bias is that the effect size changes--what Gelman calls a Type M error. My sense (and experience) is that, in a context where novelty is greatly privileged over facts, journal editors will almost certainly ignore this new information. It will be buried.
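
For intuition about why magnitude matters, here is a hypothetical R simulation (all of the numbers are invented for illustration). It shows the classic mechanism behind Gelman's Type M errors: when estimates are noisy relative to the true effect, the subset that happens to reach statistical significance systematically exaggerates that effect.

    set.seed(1234)
    true_effect <- 0.2   # assumed small true effect
    se <- 0.5            # assumed standard error (a noisy, low-powered design)
    n_sims <- 10000

    # Simulate the sampling distribution of the effect estimate
    estimates <- rnorm(n_sims, mean = true_effect, sd = se)
    significant <- abs(estimates / se) > 1.96

    mean(significant)                  # power: few estimates are significant
    mean(abs(estimates[significant])) / true_effect   # exaggeration ratio

In this setup the average statistically significant estimate overstates the true effect several-fold, which is exactly the kind of error that replication is well placed to catch.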

Do political scientists care about effect size?

Due to the complexity of what political scientists study, we rarely (perhaps with the exception of election forecasting) think that we are very close to estimating a given effect's real magnitude. Most researchers are aiming for statistical significance and a sign that matches their theory.

Does this mean that we don't care about trying to estimate magnitudes as closely as possible?

Looking at political science practice pre-publication, there is a lot of evidence that we do care about Type M errors. Considerable effort goes into finding new estimation methods that produce less biased results. Questions of omitted variable bias are very common at research seminars and in journal reviews. Most researchers carefully build their data sets and code to minimise coding/data bias. Sure, many of these efforts are focused on the headline stuff--whether or not a given effect is significant and what its direction is. But, and perhaps I'm being naive here, these efforts are also part of a desire to estimate an effect as accurately as possible.

However, the review process and journals' responses to finding Type M errors caused by honest coding/data errors in published findings suggest that perhaps we don't care about effect size. Reviewers almost never look at code and data. Journals (as far as I know, please correct me if I'm wrong) never append information on replications that find Type M errors to original papers.

Prescription

I have a simple prescription for demonstrating that we actually care about estimating accurate effect sizes:

Develop a standard practice of including a short authored write-up of the data/code bias, with corrected results, in the original article's supplementary materials. Append a notice to the article pointing to this.

Doing this would not only give readers more accurate effect size estimates, but also make replication materials more useful.

Standardising the practice of publishing authored notes will incentivise people to use replication materials, find errors, and publicly correct them. Otherwise, researchers who use replication data and code will be replicating easy-to-correct errors.

22 July 2014

Note to self: brew cleanup r

Note to self: after updating R with Homebrew, remember to clean up old versions:

    brew cleanup r

Otherwise I'm liable to get a segfault. (see also)

28 June 2014

Simple script for setting up R, Git, and JAGS on an Amazon EC2 Ubuntu instance

Just wanted to put up the script I've been using to create an Amazon EC2 Ubuntu instance for running RStudio, Git, and JAGS. There isn't anything really new in here, but it has been serving me well.

The script begins after the basic instance has been set up in the Amazon EC2 console (yhat has a nice post on how to do this, though some of their screenshots are a little old). Just SSH into the instance and get started.