Statefarm - BcolzArrayIterator and MixIterator

Hi all!

So I’ve been having trouble getting the above mentioned tools to work. I am saving VGG19 (keras) convolutional results with bcolz and loading them back to compute Dense/Convolutional tops.

It works great for the training and validation data (99.64% val accuracy) but I can’t import the test features because of the size of the file. On top of that, I have 120k augmented samples that are facing the same issue.

I’ve tried using BcolzArrayIterator to feed this data back into the second half of the model, but accuracy doesn’t go above random, which suggests to me I am doing something wrong.

I’m also trying to use MixIterator to combine the two BcolzArrayIterators (training and augmentation) for my initial training (which would be followed by the test predictions for pseudo labeling), but I’m getting an error stating that the BcolzArrayIterator is not compatible.

I really want to use data augmentation and pseudo labeling here (not to mention submitting my results), and this is really making that difficult.

Has anyone made this work? I don’t want to break things into multiple files because I feel like that is avoiding the bigger question of how to “stream” data into your model to train and make predictions.

EDIT: Also - can someone point me in the direction of an example of how MixIterator works? I thought I knew how to use it, but after these troubles am pretty convinced I do not. Thanks!
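For anyone wondering what a mixing iterator does in principle, here is a minimal sketch: draw one batch from each source iterator and concatenate them into a single batch. `SimpleMix` and `batch_cycle` are hypothetical names made up for illustration, not the course's MixIterator or BcolzArrayIterator, and the real classes may expect extra attributes on the iterators they wrap.

```python
import numpy as np

def batch_cycle(x, y, batch_size):
    """Endlessly yield consecutive (X, y) batches, wrapping around at the end."""
    n = len(x)
    start = 0
    while True:
        idx = [(start + i) % n for i in range(batch_size)]
        yield x[idx], y[idx]
        start = (start + batch_size) % n

class SimpleMix:
    """Hypothetical stand-in: take one batch from each source and concatenate them."""
    def __init__(self, iters):
        self.iters = list(iters)

    def __iter__(self):
        return self

    def __next__(self):
        batches = [next(it) for it in self.iters]
        xs = np.concatenate([b[0] for b in batches])
        ys = np.concatenate([b[1] for b in batches])
        return xs, ys

# Toy data: "training" rows are zeros, "augmented" rows are ones
train_x, train_y = np.zeros((8, 2)), np.zeros(8, dtype=int)
aug_x, aug_y = np.ones((12, 2)), np.ones(12, dtype=int)

mixed = SimpleMix([batch_cycle(train_x, train_y, 4),
                   batch_cycle(aug_x, aug_y, 2)])
xb, yb = next(mixed)  # 4 training rows followed by 2 augmented rows
```

In this sketch the batch size given to each source sets its share of every combined batch (here 4 training rows to 2 augmented rows).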

Hi… I’m assuming you tried predict_generator with Bcolz batches of test convolution features and found your submissions to have poor accuracy.
From how you explained your Bcolz iterator usage, it sounds similar to what I did. Due to my tiny RAM, I had to use BcolzIterator for train and val data as well. I’m not sure where you went wrong, but I would suggest changing your model.fit call to fit_generator with Bcolz batches of the train and val convolution features, then verifying that you get the same accuracy as without batches. That might give you a clue if there’s any mistake in the BcolzIterator usage.
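A rough way to run that kind of check is to compare batched results against a single in-memory pass. The sketch below uses a NumPy memmap as a stand-in for the on-disk bcolz array and a fixed linear map as a stand-in for the model; `iter_batches` and `toy_predict` are hypothetical helpers, not Keras or bcolz functions.

```python
import os
import tempfile
import numpy as np

def iter_batches(features, labels, batch_size):
    """Yield (X, y) slices in order, loading only one batch into memory at a time."""
    n = len(features)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # np.asarray pulls just this slice out of the disk-backed array
        yield np.asarray(features[start:stop]), np.asarray(labels[start:stop])

def toy_predict(x, weights):
    """Stand-in for model.predict: a fixed linear map."""
    return x @ weights

rng = np.random.RandomState(0)
full = rng.rand(10, 4).astype("float32")
labels = np.arange(10)
weights = rng.rand(4, 3).astype("float32")

# Write the features to disk, then reopen them read-only as a memmap
path = os.path.join(tempfile.mkdtemp(), "features.dat")
disk = np.memmap(path, dtype="float32", mode="w+", shape=full.shape)
disk[:] = full
disk.flush()
disk = np.memmap(path, dtype="float32", mode="r", shape=full.shape)

pred_chunks, label_chunks = [], []
for x, y in iter_batches(disk, labels, batch_size=3):
    pred_chunks.append(toy_predict(x, weights))
    label_chunks.append(y)
batched_preds = np.concatenate(pred_chunks)
batched_labels = np.concatenate(label_chunks)

# Batched and unbatched results should agree, and labels should stay aligned
reference = toy_predict(full, weights)
same_preds = np.allclose(batched_preds, reference, atol=1e-5)
same_labels = np.array_equal(batched_labels, labels)
```

If the batched and unbatched outputs disagree, or the labels come back in a different order than the features, the iterator is the first place to look.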

I actually had to give up using BcolzArrayIterator because it was somehow not working right. I would get nearly random accuracy with it, which suggests it wasn’t reloading the data correctly. I instead did the following:

  1. My augmented convolutional features are saved in multiple bcolz files, each the size of the training set.

  2. Computed the convolutional features batch by batch using train_on_batch, and then had an option to take that prediction and load it into a pandas DataFrame (for test data).

It works and I have working models now, but I wish I could figure out this streaming data a little better.

If you post your code, I’ll look it over for any issues I can spot. It’s really tough to debug with no code, though. I had an issue with the same symptoms as yours (train and valid look good, test is crap). My problem was that I was matching up filenames and predictions incorrectly, because flow_from_directory shuffles the images by default but doesn’t shuffle the filenames. I’m not sure if bcolz has a similar setting, but like I said, if you post some code, I’d be willing to look into it because it will probably help me long-term as well.
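To make that mismatch concrete, here is a small NumPy-only sketch (no Keras involved; `toy_predict` and `predict_in_batches` are made-up helpers). When batches are drawn in shuffled order, prediction k belongs to the k-th shuffled index rather than to the k-th filename, so the predictions have to be put back in file order, or the data iterated unshuffled, before pairing them with filenames.

```python
import numpy as np

filenames = np.array(["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg", "img_4.jpg"])
features = np.arange(5, dtype="float32").reshape(5, 1)  # row i belongs to filenames[i]

def toy_predict(x):
    """Stand-in for model.predict: prediction depends only on the row's own value."""
    return x[:, 0] * 10

def predict_in_batches(x, batch_size, order):
    """Predict batch by batch, visiting rows in the given index order."""
    chunks = [toy_predict(x[order[i:i + batch_size]])
              for i in range(0, len(order), batch_size)]
    return np.concatenate(chunks)

in_order = np.arange(len(features))
shuffled = np.random.RandomState(1).permutation(len(features))

preds_ordered = predict_in_batches(features, 2, in_order)    # aligned with filenames
preds_shuffled = predict_in_batches(features, 2, shuffled)   # aligned with filenames[shuffled]

# Undo the shuffle: prediction k goes back to position shuffled[k]
realigned = np.empty_like(preds_shuffled)
realigned[shuffled] = preds_shuffled
```

The simplest fix for test-time prediction is usually to avoid shuffling at all, so that prediction i lines up with filename i.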

I’ve moved on to part two and think I know what to do now. If not, I’ll post my code.

  1. Precompute the training data and save it into a bcolz array. Also put the training labels into a bcolz array. The augmented data can be computed into its own bcolz array as well.

  2. Call get_batches on the bcolz training array and the augmented data.

  3. Put the batches into MixIterator.

I’ll try it later tonight and see if I can get it working.