Feeling unfulfilled pushing 98% accuracy on State Farm

So I’ve been trying to repurpose VGG for the State Farm competition. For many hours, I was stuck at 10% - like a lot of folks. Couldn’t even get close to Jeremy’s worst model from class. I finally figured out how to mess with the learning rate, but even that didn’t help.

After rereading forum posts yet again, I just decided to set all the dense layers as trainable, and re-ran things on my sample - got over 90% accuracy on the first shot. Then on the full data set - first epoch got to val_acc: 0.735, and #2 hit 98% accuracy (admittedly with plenty of overfitting).

Now that the initial shock has worn off, this isn’t as satisfying as I’d hoped. It’s not like I have a better understanding of the data, or a deep understanding of why retraining the dense layers is so effective (at least not yet). I just tried something, and it sort of worked.

I have an economics background. In that field, a model is typically informed by theory. Data mining is looked down upon as mere correlation, when what you need for policy making is a theory about (and statistics demonstrating) causality. At least correlation often has an intuitive interpretation even if interpretation can literally be dangerous. But this … it’s just a mashup of simple math and a ton of data.

Is anyone else struggling with this? I have trouble trusting what I’m seeing. Is there any external validity? I know there are a million blogposts about this issue, but I’d love to hear your thoughts, or be directed to some writings that you think grapple with this issue effectively.

I have a feeling that this is partly a generational thing - a generation growing up with self-driving cars and hacking neural nets from high school or college will trust them because they do work. Is my suspicion unreasonable? According to these papers, it’s not entirely unreasonable. But these may be outliers.


Still, as someone who will likely be depending on neural networks to solve difficult real-world problems for real customers spending real money, I am a bit troubled by their black-box nature. They must be terrible to debug!


You are for sure not alone in this struggle: http://meshwired.com/engineers-unable-understand-working-googles-search-ai/

But what a result!! Well done!

Yes, the difference between a good result and a bad result is difficult to explain, and usually it's just a hyperparameter difference.

I’m fairly sure you don’t have 98% accuracy on State Farm. If you do, you’ve smashed the competition winners - but more likely, I suspect, is that you haven’t created a validation set containing a different set of drivers from the ones in your training set. This is what makes the competition interesting, and will force you to think long and hard about how to handle this tricky problem…
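To make the point about drivers concrete, here is a minimal sketch of splitting by driver rather than by image. It assumes you have a list of (driver_id, image_name) pairs; the function name `split_by_driver` and the record layout are hypothetical, not something from the course notebooks.

```python
import random

def split_by_driver(records, val_fraction=0.2, seed=0):
    """Split (driver_id, image_name) pairs so no driver is in both sets.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    # Collect the distinct drivers and shuffle them reproducibly.
    drivers = sorted({driver for driver, _ in records})
    rng = random.Random(seed)
    rng.shuffle(drivers)

    # Hold out whole drivers, not individual images.
    n_val = max(1, int(len(drivers) * val_fraction))
    val_drivers = set(drivers[:n_val])

    train = [r for r in records if r[0] not in val_drivers]
    val = [r for r in records if r[0] in val_drivers]
    return train, val, val_drivers
```

If images are split at random instead, near-identical frames of the same driver land on both sides, and validation accuracy can look far better than the leaderboard score.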

Try submitting your answer to the Kaggle competition - if you’re not #1, you’ll know that your validation set isn’t created correctly.

Overall, I’d say that to really crack these more challenging deep learning problems, you’ll need to become adept at visualizing and interrogating deep models, to get a good understanding of what they’re doing. I’m hoping that working through the State Farm dataset will help you learn these tools.


Oh yeah, I know I wasn’t actually getting 98% - but going from 10% for hours to 98% (even with overfitting) was pretty exhilarating.

And @jeremy I’m still curious about how you think about the external validity of these models.

@robin I agree that neural nets are currently very frustrating to debug, and this is an area that Jeremy and I are interested in researching and improving.

Although NNs are black-box-like in some regards, there are ways to get insight, such as looking at what examples our model is most/least confidently wrong about, using PCA to see the top/bottom 10 movies for each latent factor in an embedding, or using nearest-neighbors on a layer of your NN to see which training examples a test case is most similar to. In general, I think this has been an under-studied area (since there are so many more incentives around winning ImageNet than around better understanding the models we already have), and I’m hopeful that there are still a lot of useful techniques to be discovered.
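Two of these ideas can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `probs` is a list of per-class probability lists from your model, `labels` holds the true class indices, and the vectors are activations you have already pulled out of some layer. Both function names are hypothetical.

```python
import math

def most_confident_mistakes(probs, labels, k=3):
    """Return (index, predicted_class, confidence) for the k wrong
    predictions the model was most sure about."""
    mistakes = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if pred != y:
            mistakes.append((p[pred], i, pred))
    mistakes.sort(reverse=True)  # highest confidence first
    return [(i, pred, conf) for conf, i, pred in mistakes[:k]]

def nearest_training_examples(test_vec, train_vecs, k=3):
    """Return indices of the k training activations most similar
    (by cosine similarity) to the test activation."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    order = sorted(range(len(train_vecs)),
                   key=lambda i: cosine(test_vec, train_vecs[i]),
                   reverse=True)
    return order[:k]
```

Looking at the actual images behind the returned indices is usually where the insight comes from; the code only tells you which ones to open.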

As with most things, Leo Breiman said it much better than I can, so I’ll start by referring you all to this paper, which I think is perhaps the most important paper in data science written in the last 20 years: http://projecteuclid.org/euclid.ss/1009213726 .

Basically, the contention is this: given the choice between a simple “interpretable” model, and a complex but extremely accurate model, I’d take the latter every time. Because if it’s really accurate, we can ask it questions, like:

  • What would you predict if I gave you this data (to which I already know the answer)?
  • Which groups would you give similar predictions to?
  • If I vary just one data item up and down, how do your predictions change?
  • If I randomly shuffle one data item, how much less accurate do you become?

These kinds of questions allow us to get comfortable that a model is accurate under a range of conditions, understand what clusters it sees in the data, see how each input variable impacts the outcome of the model (using a partial plot), and see which variables are most important (using a variable importance plot). These are just a few examples of the kinds of ways we can interrogate an accurate model.
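The last two questions in the list map onto fairly short code. The sketch below is model-agnostic and makes some assumptions: `rows` is a list of feature lists, `predict` returns a class label for one row, and `predict_score` returns a number for one row. The function names are made up for illustration and do not come from any particular library.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows where the predicted label equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, col, seed=0):
    """Drop in accuracy after randomly shuffling one column.

    A large drop suggests the model relies on that variable.
    """
    rng = random.Random(seed)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with only the chosen column replaced.
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

def partial_dependence(predict_score, rows, col, grid):
    """Average model output when one column is forced to each grid value.

    Plotting the result against grid gives a simple partial plot.
    """
    averages = []
    for value in grid:
        varied = [r[:col] + [value] + r[col + 1:] for r in rows]
        averages.append(sum(predict_score(r) for r in varied) / len(varied))
    return averages
```

For images the "columns" are less obvious than in tabular data, so in practice the same idea tends to be applied to patches or regions of the input rather than to single pixels.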

I agree with Rachel that we still have a lot to learn about how best to understand (and debug) these kinds of models. But I think that Breiman’s basic insights hold, and that gives me a lot of confidence that the answers exist - we just have to find them.


This goes some way in helping understand what models see: https://homes.cs.washington.edu/~marcotcr/blog/lime/

That’s an interesting paper! Never seen it before. A deep-learning-specific approach to solving that problem is here: http://cnnlocalization.csail.mit.edu/ .

Thanks all for the thoughtful responses. I particularly like Jeremy’s insight about the value of being able to interrogate a highly accurate model even if it’s a bit opaque. I started reading the Breiman paper last night and it’s excellent.

For the record, this tweaked-VGG model put me into the top 35-40% depending on the epoch (actually got worse on the second epoch). I’m going to try some of Jeremy’s techniques for handling overfitting and see if I can boost the score a bit!
