Statefarm kaggle comp

Philipp · March 24, 2017, 7:45am

I tried the opposite of pseudolabelling, I kicked out all training data that ‘confused’ the network. I thought that means pictures that are a) mislabelled or b) weird in another way (for example: a picture of a guy labelled as ‘driving normally’ who has his hand next to his head to adjust his glasses looks a lot like he’s on the phone).

I predicted classes for all training pictures and removed those where the highest class ‘probability’ was below 0.9, that removes about 8% of training pictures. I then trained the same model from scratch after restarting the kernel.

I get a slighly better loss on the test set:

Before removing pictures:
Private score: 0.94359
Public score: 1.18213
After removing pictures:
Private score: 0.92506
Public score: 1.11814

Is this difference small enough to be due to random initialization of weights? Is it normal to test your training set to remove noise?

I put the notebook here: https://github.com/philippbayer/cats_dogs_redux/blob/master/Statefarm.ipynb

jeremy · March 24, 2017, 4:31pm

@Philipp - I like the way you’re thinking about this. The good news is that there’s a terrific paper that talks about how to deal with just that issue: http://arxiv.org/abs/1412.6596 . The ‘soft loss’ they discuss is similar to your idea - but it’s done in a more dynamic way during training.

justinho · April 18, 2017, 3:07am

Does the dense layer output dimension affects the result?

Hi guys, I’m working on statefarm using the vgg16 conv_model to precompute the conv_feature, and then define the bn_model just like jeremy’s noteboook.

When I first run the cell, the result is terrible, the val acc is always less then 9%, but when I change the dense layer output dimension form 128 to 100, the val acc suddenly pop to 14%, does the dense layer output dimension affcects the result? If yes, how it affects the result?

Btw, I used the same architecture, with p=0.5, dense(100), lr=1e-6, after running 72 epoch, I got this follow result:

How’s that look like? I think it’s overfitting right? I’m working on the data augmentation, let me see if it can help.

Edit1:
I used the data augmentation, but I can only add the aug data that 3 times larger than training set. @jeremy said not to concatenate the aug data to the conv_feature directly , otherwise it will raise memory error, and he’ll introduce new method to concatenate the data. But I can’t find this method…

Anyway, after I tripled the data, I set the dropout param to 0.8 and run 4 ecophs, I got this result:

Edit2:
I submit the result to kaggle, what ??? Kaggle loss is different from my val_loss ? Any suggestions ?

davidc · April 25, 2017, 11:24pm

I have been running the statefarm-sample.ipynb where a model is build up using 1500 sample images from the main training set. The model by the end of the notebook is hitting 50% accuracy when it was run by the previous user - that is I can see the old results in the notebook before I run it again on my copy of the data. When I run the notebook I get very high accuracy - for example where jeremy get 50% I get much higher

1500/1500 [==============================] - 26s - loss: 1.2677 - acc: 0.9307 - val_loss: 1.4245 - val_acc: 0.8360

With this block:

model = Sequential([
        BatchNormalization(axis=1, input_shape=(3,224,224)),
        Flatten(),
        Dense(100, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(batches, batches.nb_sample, nb_epoch=2, validation_data=val_batches, 
                 nb_val_samples=val_batches.nb_sample)
model.optimizer.lr = 0.01
model.fit_generator(batches, batches.nb_sample, nb_epoch=5, validation_data=val_batches, 
                 nb_val_samples=val_batches.nb_sample)

I get:
1500/1500 [==============================] - 25s - loss: 0.1288 - acc: 0.9940 - val_loss: 0.4701 - val_acc: 0.9050

much higher than expected so something’s wrong these results are too good. How is this possible?

edaena · May 21, 2017, 3:22am

Did anyone try retraining several layers of VGG and use it for the StateFarm sample?
I’m getting very low accuracy: loss: 14.5922 - acc: 0.0947 - val_loss: 14.5546 - val_acc: 0.0970
I was wondering if this result was also expected even when retraining several layers.

This is what I did:
Using learning rate 0.01, and working on the sample

Correctly split the dataset (I removed 3 drivers and use them for validation)
Created vgg model, fine-tune last layer, fit model
Retrain Dense layers as the example in lesson2.ipynb, fit model

dukeofyork · May 21, 2017, 6:50pm

i was having a similar issues with low accuracy, it may be that your training examples are shuffled a different way form your labels. you have to set shuffle=False when precomputing layers. otherwise each label will not match each training example.

kensmithzzz · May 25, 2017, 7:00pm

This was exactly the same problem I was having. In my case, reducing the learning rate made the model train. See the other thread where I mention this here:

Borz · June 5, 2017, 4:26am

I’ve been having a strange issue trying to reproduce the statefarm.ipynb. I’m getting a strange “MemoryError:” when trying to create convolutional test features. Everything’s working smoothly with training & validation features, but for some reason predict_generator(..) just won’t work on the test_batches.

I’ll put a link to my notebook below, & attach relevant screenshots of it. If I figure it out I’ll edit this post. This is running Keras 1.2.2, Theano 0.9.0, Python2.7

Showing how the notebook was setup at start:

Showing how the convolutional model was instantiated & the train/valid features working properly:

Finally the problem portion: the cryptic MemoryError: when running predict_generator(..) on the test set:

Out of Memory? Doubtful: I get the error even with a batch size of 1. There’s no change in system RAM, but there is a small (< 40%), ~1 second rise in CPU activity. So the computer is trying to set up something but it’s falling apart.

The GPU doesn’t seem to play a part, via watch -n 0.5 nvidia-smi, its RAM isn’t changing, and it’s easy enough to force a GPU out-of-memory error by bumping the batch_size up to 64 or so.

I tried setting class_mode='categorical' just to see, & no change. It looks like @Idano’s github gist was using class_mode='binary'. I tried that too with no change.

So the question then is what’s so special about the training set that it’s causing problems? It’s path is (path)/test/unknown/<~79k imgs>

I’m going to go over a few successful classmates’ notebooks & see if things work. I’m hoping the ghost of Bill Gates isn’t haunting me on Linux, but we’ll see.

Link to my (current) notebook on github: https://github.com/WNoxchi/Kaukasos/blob/master/FAI/statefarm_full_codealong.ipynb

Update: ran @idano’s code on getting convolutional test features from predict_generator. Same error. I’m wondering if there’s an issue with directory structure… but previous class examples had the test data in ../test/unknown/<imgs> as well…

Okay, the traceback’s pointing to .pyc files… these are compiled python files yeah? Maybe something got put together wrong. That’s my Linux machine, I’ll see if predict_generator will work on test_batches on my Mac.

Update: Well here’s something interesting. Running only the conv-layer test-features part from the statefarm notebook on my Mac… so far so good… Not sure if I’ll run it to completion since this is only via CPU… but it hasn’t crashed or shown any signs of anything wrong. It’s just working. I could try switching to CPU on my Linux machine to build test-features… seems clunky… but this is pretty far into the weeds already. I could check the versions of all packages in my fastai conda environments on the Mac vs Linux… but besides that there seems to be 2 viable workarounds:

Build test predictions through a completed (Conv layers connected to FC blocks) model only
Switch to CPU-backend, build conv_test_features, switch back to GPU to continue. (not tested yet)

Mac’s CPU’s chuggin’ along:

If anyone has any insight, much appreciated. Otherwise, I’m going to leave this as is w/ the workarounds and continue on.

Update June 6: This is very strange. Same memory error on the Linux machine even when using CPU. If having both conv_feat & conv_val_feat in memory uses over 10.5/16 GB at ~20k imgs, then the 80k test set is definitely going over… but that doesn’t explain why it would work on my mac which has half the RAM.

Yeah, even trying to save (bcolz) straight to disk doesn’t work. Figure this way it’d still need to hold everything in memory at once, so that may be why:

save_array(path + '/results/conv_test_feat.dat', conv_model.predict_generator(test_batches, test_batches.nb_sample))

Update June 7: Neverending story, eh? I keep coming back to this because it feels very important… generating conv-test-features … and no good reason for failure.

I ran essentially the same code in the above example, on the cats-dogs-redux test data. It works just fine. 1 note of interest: total RAM usage at 7.5/16GB… For 12,500 images? What about for StateFarm’s 80,000? Yeah. Doesn’t fit in memory.

Well after that journey I know what question to ask.

TLDR:
How do I save convolutional test features directly to disk, in batches, as they are created, so that I don’t run into memory limits?

I didn’t get this error on my Mac, which only has 8GB of RAM, but maybe there’s a difference in how OSX vs Ubuntu handle memory in this case? I’m thinking I’d cause a MemoryError Mac-side too, if I let the predict_generator run long enough to make enough test-features. I think I’ll test that when I’m not working on this machine.

bahram1 · June 8, 2017, 4:07am

maybe State Farm Full: how not to run Out of Memory with VGG + (da_batches.samples*5)?

Borz · June 8, 2017, 3:51pm

This is great, thanks Bahram. Thing is, this is training the model - I just want to create intermediate features without blowing up the RAM. Looking through documentation & forums now, but haven’t quite found the way how.

So now I know how to create features/predictions on individual or on a specific number of batches:

for batch in test_batches: ...
or
for ...: test_batches.next()
or
xyz = model.predict_generator(test_batches, step)
where step is (I think) a multiple of batch_size, and/or ≥ batch_size and ≤ dataset size.

Now how are these saved to be used for the next step?

I just found out that bcolz saves things in a directory structure, not a single file - so it has a way of keeping track of what’s where. Again, feels like getting too far into the weeds.

There has to be a way to take what @bahram1 said and save features to disk as they’re created.

Update:

Oh hello…

http://bcolz.readthedocs.io/en/latest/reference.html#bcolz.carray.append

append(self, array) Append a numpy array to this instance.

…
http://bcolz.readthedocs.io/en/latest/reference.html#the-carray-class

The carray class
class bcolz.carray
A compressed and enlargeable data container either in-memory or on-disk.

carray exposes a series of methods for dealing with the compressed container in a NumPy-like way.

That sounds a lot like what I was asking for. Will update based on what I find. If anyone has wisdom, please feel free to share.

Update: feels like I’m getting closer.

Same link as above, bcolz.carray class:

rootdir : str, optional

The directory where all the data and metadata will be stored. If specified, then the carray object will be disk-based (i.e. all chunks will live on-disk, not in memory) and persistent (i.e. it can be restored in other session, e.g. via the open() top-level function).

That looks promising. Taking a look at load/save_array in utils.py & the documentation at Library Reference — bcolz 1.2.0 documentation, bcolz.carray.flush() is how bcolz actually saves data to disk. I at first thought it was cleaning a buffer/stream like in C++. Nope.

Furthermore, in the first line of utils.save_array:

bcolz.carray(arr, rootdir=fname, mode='w')

‘w’ erases & overwrites whatever was at fname, but ‘a’ just appends. However that’s for a ‘persistent carray’. Specifying rootdir makes the carray disk-based…

Both are specified in the utils.py implement., so I’m going to guess and say ‘persistent’ isn’t necessarily limited to memory: it just exists… which makes me think… if Howard is using 'w', bcolz isn’t keeping the carray in memory, that’s just convolutional-features variable living in memory. Hopefully carray.flush() doesn’t torpedo this line of thinking, and bcolz is doing some other buffer-witchcraft that doesn’t require keeping everything in memory at once. Fingers crossed.

So, the point?:

Maybe I can use ‘rootdir=..’ and ‘a’ to write my test-convolutional-features to disk using the bcolz carray, as they are created batch by batch. We’ll see.

Update June 9: Done.

Finally got it working; submitted predictions - which also blew my previous best out of the water.

The code to save convolutional features to disk as they are created in batches:

fname = path + 'results/conv_test_feat.dat'
# %rm -r $fname     # if you had a previous file there you want to get rid of. (mode='w' would handle that maybe?)
for i in xrange(test_batches.n // beatch_size + 1):
    conv_test_feat = conv_model.predict_on_batch(test_batches.next()[0])
    if not i:
        c = bcolz.carray(conv_feat, rootdir=fname, mode='a')
    else:
        c.append(conv_test_feat)
c.flush()

The code for generating predictions on the saved convolutions:

idx, inc = 4096, 4096
conv_test_feat = bcolz.open(fname)[:idx]
preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
while idx < test_batches.n - inc:
    conv_test_feat = bcolz.open(fname)[idx:idx+inc]
    idx += inc
    next_preds = bn_model.predict(conv_test-Feat, batch_size=batch_size, verbose=0)
    preds = np.concatenate([preds, next_preds])
conv_test_feat = bcolz.open(fname)[idx:]
next_preds = bn_model.predict(conv_test_feat, batch_size=batch_size, verbose=0)
preds = np.concatenate([preds, next_preds])

And that’ll do it. Few notes: of course, this assumes the usual imports, that stuff like test_batches are already defined, and etc. My inability to open an already existing bcolz carray by defining c = bcolz.carray(..), regardless of mode, & success in using c.append(..) after the carray is opened, makes it clear that the first code block can be cleaned up, especially to remove the if-else block. Also, idx and inc, index & increment, are user-defined; I picked them because they seemed big enough to not take too many disk-accesses, but small enough to not put too much into memory at once. Lastly, I take the zeroth index of test_batches.next() because the generator returns as a tuple

Perhaps a few other notes, but that’s all off the top of my head at the moment. I’d love to see a ‘proper/pro’ way to do this (something straight out of keras would be nice!) from J.Howard or someone, but: it works. Looks like it’ll work for big stuff. It’s unconstrained by memory limits (video-mem not-included), so I’m happy with it.

Ah, another note: Doing the above, and running it through the bn_model after training that for I think only 5 epochs (1x 1e-3, 4x 1e-2), got a Kaggle score of 0.70947 at 415/1440 ranking. That’s top 28.9%.

Another thing I haven’t tested is using .fit_generator(..) on conv train/valid features pulled from disk, but that shouldn’t be a huge hassle compared to the above. May update this post down here to include jupyter notebooks for a full implementation, later.

Alright, that’s about it for this one!

LaurentH · June 13, 2017, 8:29pm

So I’ve got two questions regarding the statefarm notebook. I’ve tried searching whether this had already been answered, but the Discourse search function isn’t wildly powerful, so there’s a chance this was a duplicate.

My question pertains to the model setup, I’ve included a copy below:

def get_bn_layers(p):
return [
    MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
    Flatten(),
    Dropout(p/2),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(p/2),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(p),
    Dense(10, activation='softmax')
    ]

Here’s my two questions:

Why do we split the convolutional part of the model from the fully-connected part, right at a MaxPooling2D layer? I could understand it if it were split right after the Flatten() layer, but I’m failing to understand this particular choice. My argument for splitting after the Flatten layer is that the only layers that need to be re-trained are the fully connected ones. There’s no real reason to ‘retrain’/include the MaxPooling2D and Flatten(). Or is there?
Why do the BatchNormalisation layers come after the Dense layers? From what I gather of what BatchNormalisation does, is that it re-centers/regularises inputs. Wouldn’t we want to regularise the inputs before putting them into an activation layer? To quote the paper (emphasis mine):

We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs

Why does the notebook decide to do this after the Dense layers?

LaurentH · June 14, 2017, 12:34pm

Well, I’ve figured out the answer to my 2nd question. Though it’s up for debate, it’s mostly preferable to apply a BN layer after the activation. Here’s the Keras author discussing it, there’s more links in there for those who are curious: https://github.com/fchollet/keras/issues/1802.

dilan · June 17, 2017, 1:01am

I’m having trouble downloading the statefarm dataset. I have accepted the terms, but the trying to unzip the downloaded zip files give the following error:

nbuser@jupyter:~/nbs/data$ unzip sample_submission.csv.zip
Archive:  sample_submission.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.

What’s going on?

johnlu · June 17, 2017, 11:56am

I am having the same issue. I downloaded the files directly from the website, and looked at them in a hex viewer since I couldn’t unzip them.
All that the files contain are zeros.

jimmyloyola · June 17, 2017, 5:52pm

Any luck so far? I am getting the same error. I saw in the kaggle forums couple of people have posted with the same error.

johnlu · June 17, 2017, 7:42pm

I’m afraid not, I contacted kaggle support, but I don’t expect to hear anything from them during the weekend.

jimmyloyola · June 18, 2017, 4:22pm

Ahh. I tried downloading today also. No Luck… Hopefully it will get resolved tomorrow.

johnlu · June 20, 2017, 7:04am

I was able to download it today

jimmyloyola · June 21, 2017, 11:07am

Me too. Guess it got fixed.

jimmyloyola · June 21, 2017, 8:41pm

Has anyone encountered memory issue for Statefarm in aws?

I store the predictions of convolution features in array(It will be 4 times the actual size of training data because of data augmentation). After that I run the following code to combine that array with the actual training data and I am getting an error.

da_conv_feat2 = np.concatenate([da_conv_feat, conv_feat])

Here is the error which I am getting. Any ways to deal with it?

MemoryError Traceback (most recent call last)
in ()
----> 1 da_conv_feat2 = np.concatenate([da_conv_feat, conv_feat])

MemoryError:

Would appreciate the help.