Statefarm kaggle comp

Did anyone try retraining several layers of VGG and using it for the StateFarm sample?
I’m getting very low accuracy: loss: 14.5922 - acc: 0.0947 - val_loss: 14.5546 - val_acc: 0.0970
I was wondering whether this result is expected even when retraining several layers.

This is what I did:
Using learning rate 0.01, and working on the sample

  1. Correctly split the dataset (I removed 3 drivers and used them for validation)
  2. Created the VGG model, fine-tuned the last layer, and fit the model
  3. Retrained the Dense layers as in the lesson2.ipynb example, and fit the model
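For anyone curious, step 1 can be sketched in plain Python. The (driver, classname, img) rows mirror the subject/classname/img columns of the dataset’s driver_imgs_list.csv; the toy rows and the choice of validation drivers below are made up for illustration:

```python
def split_by_driver(rows, val_drivers):
    """Split rows so that no driver appears in both training and validation.

    Holding out whole drivers (rather than random images) stops the model
    from simply recognising a driver it already saw during training.
    """
    train, valid = [], []
    for driver, classname, img in rows:
        (valid if driver in val_drivers else train).append((classname, img))
    return train, valid

# Toy stand-in for the rows of driver_imgs_list.csv (subject, classname, img).
rows = [('p002', 'c0', 'img_1.jpg'), ('p002', 'c1', 'img_2.jpg'),
        ('p012', 'c0', 'img_3.jpg'), ('p014', 'c5', 'img_4.jpg')]

train, valid = split_by_driver(rows, val_drivers={'p012', 'p014'})
```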

I was having a similar issue with low accuracy. It may be that your training examples are shuffled differently from your labels: you have to set shuffle=False when precomputing layers, otherwise each label will not match its training example.
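A tiny numpy sketch (toy data, nothing Keras-specific) of why the order matters when features are precomputed separately from the labels:

```python
import numpy as np

rng = np.random.RandomState(0)
labels = np.arange(10)        # label i belongs to example i
examples = labels * 2         # "features" derived from each example

# Precompute pass with shuffling on: features come out in a new order...
order = rng.permutation(10)
shuffled_feats = examples[order]

# ...but the labels are still in their original order, so pairs no longer match.
assert not np.array_equal(shuffled_feats, labels * 2)

# With shuffle=False the order is preserved and each feature matches its label.
unshuffled_feats = examples
assert np.array_equal(unshuffled_feats, labels * 2)
```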


This was exactly the same problem I was having. In my case, reducing the learning rate made the model train. See the other thread where I mention this here:

I’ve been having a strange issue trying to reproduce the statefarm.ipynb. I’m getting a strange “MemoryError:” when trying to create convolutional test features. Everything’s working smoothly with training & validation features, but for some reason predict_generator(..) just won’t work on the test_batches.

I’ll put a link to my notebook below, & attach relevant screenshots of it. If I figure it out I’ll edit this post. This is running Keras 1.2.2, Theano 0.9.0, Python 2.7.

Showing how the notebook was setup at start:

Showing how the convolutional model was instantiated & the train/valid features working properly:

Finally the problem portion: the cryptic MemoryError: when running predict_generator(..) on the test set:

Out of Memory? Doubtful: I get the error even with a batch size of 1. There’s no change in system RAM, but there is a small (< 40%), ~1 second rise in CPU activity. So the computer is trying to set up something but it’s falling apart.

The GPU doesn’t seem to play a part, via watch -n 0.5 nvidia-smi, its RAM isn’t changing, and it’s easy enough to force a GPU out-of-memory error by bumping the batch_size up to 64 or so.

I tried setting class_mode='categorical' just to see, & no change. It looks like @idano’s github gist was using class_mode='binary'. I tried that too, with no change.

So the question then is: what’s so special about the test set that it’s causing problems? Its path is (path)/test/unknown/<~79k imgs>

I’m going to go over a few successful classmates’ notebooks & see if things work. I’m hoping the ghost of Bill Gates isn’t haunting me on Linux, but we’ll see.

Link to my (current) notebook on github:

Update: ran @idano’s code on getting convolutional test features from predict_generator. Same error. I’m wondering if there’s an issue with directory structure… but previous class examples had the test data in ../test/unknown/<imgs> as well…

Okay, the traceback’s pointing to .pyc files… these are compiled python files yeah? Maybe something got put together wrong. That’s my Linux machine, I’ll see if predict_generator will work on test_batches on my Mac.

Update: Well here’s something interesting. Running only the conv-layer test-features part from the statefarm notebook on my Mac… so far so good… Not sure if I’ll run it to completion since this is only via CPU… but it hasn’t crashed or shown any signs of anything wrong. It’s just working. I could try switching to CPU on my Linux machine to build test-features… seems clunky… but this is pretty far into the weeds already. I could check the versions of all packages in my fastai conda environments on the Mac vs Linux… but besides that there seems to be 2 viable workarounds:

  1. Build test predictions through a completed (Conv layers connected to FC blocks) model only
  2. Switch to CPU-backend, build conv_test_features, switch back to GPU to continue. (not tested yet)

Mac’s CPU’s chuggin’ along:

If anyone has any insight, much appreciated. Otherwise, I’m going to leave this as is w/ the workarounds and continue on.

Update June 6: This is very strange. Same memory error on the Linux machine even when using CPU. If having both conv_feat & conv_val_feat in memory uses over 10.5/16 GB at ~20k imgs, then the 80k test set is definitely going over… but that doesn’t explain why it would work on my mac which has half the RAM.

Yeah, even trying to save (bcolz) straight to disk doesn’t work. I figure this way it’d still need to hold everything in memory at once, so that may be why:

save_array(path + '/results/conv_test_feat.dat', conv_model.predict_generator(test_batches, test_batches.nb_sample))

Update June 7: Neverending story, eh? I keep coming back to this because it feels very important… generating conv-test-features … and no good reason for failure.

I ran essentially the same code as in the above example on the cats-dogs-redux test data. It works just fine. One note of interest: total RAM usage at 7.5/16GB… for 12,500 images? What about StateFarm’s 80,000? Yeah. Doesn’t fit in memory.

Well after that journey I know what question to ask.

How do I save convolutional test features directly to disk, in batches, as they are created, so that I don’t run into memory limits?

I didn’t get this error on my Mac, which only has 8GB of RAM, but maybe there’s a difference in how OSX vs Ubuntu handle memory in this case? I’m thinking I’d cause a MemoryError Mac-side too, if I let the predict_generator run long enough to make enough test-features. I think I’ll test that when I’m not working on this machine.

Maybe this thread will help: State Farm Full: how not to run Out of Memory with VGG + (da_batches.samples*5)?


This is great, thanks Bahram. Thing is, this is training the model - I just want to create intermediate features without blowing up the RAM. Looking through documentation & forums now, but I haven’t quite found how.

So now I know how to create features/predictions on individual batches, or on a specific number of batches:

for batch in test_batches: ...
# or:
xyz = model.predict_generator(test_batches, step)

where step is (I think) a multiple of batch_size, and/or ≥ batch_size and ≤ dataset size.

Now how are these saved to be used for the next step?

I just found out that bcolz saves things in a directory structure, not a single file - so it has a way of keeping track of what’s where. Again, feels like getting too far into the weeds.

There has to be a way to take what @bahram1 said and save features to disk as they’re created.


Oh hello…

append(self, array) Append a numpy array to this instance.

The carray class
class bcolz.carray
A compressed and enlargeable data container either in-memory or on-disk.

carray exposes a series of methods for dealing with the compressed container in a NumPy-like way.

That sounds a lot like what I was asking for. Will update based on what I find. If anyone has wisdom, please feel free to share.
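In the meantime, the create-once-then-append idea can be sketched with plain numpy file appends, no bcolz needed (the shapes, batch counts, and file name below are made up):

```python
import numpy as np
import tempfile, os

# Pretend "conv features" arrive one batch at a time; append each batch's raw
# bytes to a single file instead of holding everything in RAM.
path = os.path.join(tempfile.mkdtemp(), 'feats.bin')
n_batches, batch, width = 5, 4, 3

with open(path, 'ab') as f:
    for i in range(n_batches):
        feats = np.full((batch, width), i, dtype=np.float32)  # stand-in batch
        feats.tofile(f)        # append-as-you-go: only one batch in memory

# Read it all back (or slice it lazily with np.memmap) once writing is done.
all_feats = np.fromfile(path, dtype=np.float32).reshape(-1, width)
assert all_feats.shape == (n_batches * batch, width)
```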

Update: feels like I’m getting closer.

Same link as above, bcolz.carray class:

rootdir : str, optional

The directory where all the data and metadata will be stored. If specified, then the carray object will be disk-based (i.e. all chunks will live on-disk, not in memory) and persistent (i.e. it can be restored in other session, e.g. via the open() top-level function).

That looks promising. Taking a look at load_array/save_array in utils.py and at the bcolz documentation, bcolz.carray.flush() is how bcolz actually saves data to disk. I at first thought it was flushing a buffer/stream like in C++. Nope.

Furthermore, in the first line of utils.save_array:

bcolz.carray(arr, rootdir=fname, mode='w')

‘w’ erases & overwrites whatever was at fname, but ‘a’ just appends. However, that’s for a ‘persistent carray’, and specifying rootdir is what makes the carray disk-based…

Both are specified in the implementation, so I’m going to guess that ‘persistent’ isn’t necessarily limited to memory: it just exists… which makes me think: if Howard is using 'w', bcolz isn’t keeping the carray in memory; that’s just the convolutional-features variable living in memory. Hopefully carray.flush() doesn’t torpedo this line of thinking, and bcolz is doing some other buffer-witchcraft that doesn’t require keeping everything in memory at once. Fingers crossed.

So, the point?:

Maybe I can use ‘rootdir=..’ and ‘a’ to write my test-convolutional-features to disk using the bcolz carray, as they are created batch by batch. We’ll see.

Update June 9: Done.

Finally got it working; submitted predictions - which also blew my previous best out of the water.

The code to save convolutional features to disk as they are created in batches:

fname = path + 'results/conv_test_feat.dat'
# %rm -r $fname     # if you had a previous file there you want to get rid of. (mode='w' would handle that maybe?)
for i in xrange(test_batches.n // batch_size + 1):
    conv_test_feat = conv_model.predict_on_batch(next(test_batches)[0])
    if not i:
        c = bcolz.carray(conv_test_feat, rootdir=fname, mode='a')
    else:
        c.append(conv_test_feat)
c.flush()

The code for generating predictions on the saved convolutions:

idx, inc = 4096, 4096
conv_test_feat = bcolz.open(fname)[:idx]
preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
    preds = np.concatenate([preds, next_preds])
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])
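Since the idx/inc bookkeeping is the fiddly part, here’s that chunking logic isolated with an identity function standing in for bn_model.predict, so it’s easy to check every row gets covered exactly once (the helper name and toy sizes are mine):

```python
import numpy as np

def predict_in_chunks(data, inc, predict):
    """Run predict over data in chunks of inc rows, concatenating the results."""
    idx = inc
    preds = predict(data[:idx])                  # first chunk
    while idx < len(data) - inc:                 # middle chunks
        preds = np.concatenate([preds, predict(data[idx:idx + inc])])
        idx += inc
    return np.concatenate([preds, predict(data[idx:])])   # tail

data = np.arange(100).reshape(25, 4)             # 25 "rows" of fake features
out = predict_in_chunks(data, inc=4, predict=lambda x: x)
assert np.array_equal(out, data)                 # every row covered exactly once
```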

And that’ll do it. A few notes: this assumes the usual imports, that things like test_batches are already defined, etc. My inability to open an already-existing bcolz carray by defining c = bcolz.carray(..), regardless of mode, and my success in using c.append(..) after the carray is opened, make it clear that the first code block can be cleaned up, especially to remove the if-else block. Also, idx and inc (index & increment) are user-defined; I picked them because they seemed big enough to not take too many disk accesses, but small enough to not put too much into memory at once. Lastly, I take the zeroth index of each batch because the generator returns (images, labels) as a tuple.

Perhaps a few other notes, but that’s all off the top of my head at the moment. I’d love to see a ‘proper/pro’ way to do this (something straight out of keras would be nice!) from J.Howard or someone, but: it works. Looks like it’ll work for big stuff. It’s unconstrained by memory limits (video-mem not-included), so I’m happy with it.

Ah, another note: Doing the above, and running it through the bn_model after training that for I think only 5 epochs (1x 1e-3, 4x 1e-2), got a Kaggle score of 0.70947 at 415/1440 ranking. That’s top 28.9%.

Another thing I haven’t tested is using .fit_generator(..) on conv train/valid features pulled from disk, but that shouldn’t be a huge hassle compared to the above. May update this post down here to include jupyter notebooks for a full implementation, later.

Alright, that’s about it for this one!


So I’ve got two questions regarding the statefarm notebook. I’ve tried searching whether this has already been answered, but the Discourse search function isn’t wildly powerful, so there’s a chance this is a duplicate.

My question pertains to the model setup; I’ve included a copy below:

def get_bn_layers(p):
    return [
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p/2),
        Dense(128, activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(10, activation='softmax')
        ]

Here are my two questions:

  1. Why do we split the convolutional part of the model from the fully-connected part right at a MaxPooling2D layer? I could understand it if it were split right after the Flatten() layer, but I’m failing to understand this particular choice. My argument for splitting after the Flatten layer is that the only layers that need to be re-trained are the fully connected ones; there’s no real reason to ‘retrain’/include the MaxPooling2D and Flatten() layers. Or is there?
  2. Why do the BatchNormalisation layers come after the Dense layers? From what I gather, BatchNormalisation re-centers/regularises inputs. Wouldn’t we want to regularise the inputs before putting them into an activation layer? To quote the paper (emphasis mine):

We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs

Why does the notebook decide to do this after the Dense layers?

Well, I’ve figured out the answer to my 2nd question. Though it’s up for debate, it’s mostly preferable to apply a BN layer after the activation. Here’s the Keras author discussing it; there are more links in there for those who are curious:
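For intuition, the core of what a BN layer does at train time (ignoring the learned scale/shift parameters and the running statistics used at test time) is just per-feature standardisation over the batch; a minimal numpy sketch:

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Standardise each feature (column) over the batch dimension."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)   # eps guards against zero variance

# Skewed, badly-scaled "activations": mean ~3, std ~5 per feature.
x = np.random.RandomState(0).randn(64, 128) * 5 + 3
y = batchnorm_forward(x)

# After BN each feature is roughly zero-mean, unit-variance across the batch.
assert np.allclose(y.mean(axis=0), 0, atol=1e-6)
assert np.allclose(y.std(axis=0), 1, atol=1e-3)
```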


I’m having trouble downloading the statefarm dataset. I have accepted the terms, but trying to unzip the downloaded zip files gives the following error:

nbuser@jupyter:~/nbs/data$ unzip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.

What’s going on?


I am having the same issue. I downloaded the files directly from the website, and looked at them in a hex viewer since I couldn’t unzip them.
All that the files contain are zeros.

Any luck so far? I am getting the same error. I saw in the kaggle forums couple of people have posted with the same error.

I’m afraid not. I contacted Kaggle support, but I don’t expect to hear anything from them during the weekend.

Ahh. I tried downloading today also. No Luck… Hopefully it will get resolved tomorrow.

I was able to download it today

Me too. Guess it got fixed. :slight_smile:

Has anyone encountered memory issues with StateFarm on AWS?

I store the predictions of convolutional features in an array (it will be 4 times the actual size of the training data because of data augmentation). After that I run the following code to combine that array with the actual training data, and I am getting an error.

da_conv_feat2 = np.concatenate([da_conv_feat, conv_feat])

Here is the error which I am getting. Any ways to deal with it?

MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 da_conv_feat2 = np.concatenate([da_conv_feat, conv_feat])


Would appreciate the help.
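Not sure it’s the fix here, but one general workaround: np.concatenate has to allocate the output array in RAM on top of both inputs, so writing the combined result into a disk-backed np.memmap keeps the peak lower, especially if each source is deleted as soon as it’s copied (the shapes and path below are toy stand-ins):

```python
import numpy as np
import tempfile, os

a = np.ones((8, 4), dtype=np.float32)    # stand-in for da_conv_feat
b = np.zeros((2, 4), dtype=np.float32)   # stand-in for conv_feat

# Disk-backed output: the combined array lives in a file, not in RAM.
out_path = os.path.join(tempfile.mkdtemp(), 'combined.dat')
combined = np.memmap(out_path, dtype=np.float32, mode='w+',
                     shape=(len(a) + len(b), 4))
combined[:len(a)] = a          # in the real case: del a here to free RAM
combined[len(a):] = b          # ...and del b here

# Same contents as np.concatenate would produce, without the big allocation.
assert np.array_equal(np.asarray(combined), np.concatenate([a, b]))
```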

@jeremy in the statefarm.ipynb you say:

I’m shocked by how good these results are! We’re regularly seeing 75-80% accuracy on the validation set, which puts us into the top third or better of the competition.

I don’t understand how you came up with the 75%-80% accuracy number based on the output above the quote. Could anyone explain?

Hey @adhamh,

I think he’s referring to the column “val_acc” in the 15-epoch run, where it peaks at 74-78%.


How do we know that an image is that of a particular driver? Is it encoded in the filename? I cannot make out how to derive it.

Hey all,

I’m having issues submitting to kaggle.

All of my submissions are getting rejected, with the error message “Evaluation Exception: Submission must have 79726 rows”

When I run the following command on my file

cat state-farm-submission.csv | wc
I do indeed get
79726 79726 4548287
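One thing worth double-checking (just a guess): wc counts the header line too, so 79726 total lines would mean only 79725 prediction rows, while Kaggle may be expecting 79726 data rows plus a header. A quick stand-alone check (the toy file below is made up; swap in state-farm-submission.csv):

```python
import tempfile, os

def count_data_rows(path):
    """Count non-empty lines in a CSV, excluding the header."""
    with open(path) as f:
        return sum(1 for line in f if line.strip()) - 1

# Toy check: a header plus three prediction rows -> 3 data rows.
p = os.path.join(tempfile.mkdtemp(), 'sub.csv')
with open(p, 'w') as f:
    f.write('img,c0,c1\n'
            'img_1.jpg,0.9,0.1\n'
            'img_2.jpg,0.5,0.5\n'
            'img_3.jpg,0.1,0.9\n')
assert count_data_rows(p) == 3
```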

Help? What am I missing?

The code I’m using to generate the file is the following: