Vgg.get_batches is returning more classes than I expect

Hello,

After successfully submitting to dogs_vs_cats redux by working through lesson 1 and the dogs vs cats redux notebook, I thought it would be a good learning experience to do it all over with my own small data set (and it really has been!).

The issue I am having is that vgg.get_batches() is returning one more class than it should (I only have two classes, but it tells me I have 3).

I have mimicked the directory setup:

  • /train/
      • class1/
          • class1.1.jpg
          • class1.2.jpg
          • class1.n.jpg
      • class2/
          • class2.1.jpg
          • class2.2.jpg
          • class2.n.jpg

When I run…

#Fine tune the model
batches = vgg.get_batches(train_path, batch_size = batch_size)
val_batches = vgg.get_batches(valid_path, batch_size = batch_size*2)

vgg.finetune(batches)

vgg.model.optimizer.lr = 0.01

I get…

Found 34 images belonging to 3 classes.
Found 10 images belonging to 2 classes.

My “valid” directory and my “train” directory have identical folders in them.

Here is the path I am defining…

#Change the directory
%cd $DATA_HOME_DIR

#create relative path names
path = DATA_HOME_DIR + '/'
test_path = path + '/test/'
results_path = path + '/results/'
train_path = path + '/train/'
valid_path = path + '/valid/'

So I guess my question is: does vgg.get_batches determine the number of classes based on the number of folders in the directory? I looked at vgg16.py and tried to understand get_batches, and from there I went to Keras trying to track down my issue, but after a couple of hours of troubleshooting I thought I would reach out here.

Thanks for any help

Yes, get_batches returns classes based on the number of directories, so checking that might help. I have not really noticed it returning extra classes when my directory setup is right. In this case it should ideally be
train/
class1/
class2/
and the same for valid. For test data you can put all images in an ‘unknown’ folder.
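
To see what get_batches is likely to count, here is a minimal sketch. It assumes class discovery boils down to treating every subdirectory of the data folder as one class, hidden ones included; that is my understanding of how the underlying Keras directory iterator works, not something confirmed in this thread. The helper name list_class_dirs is made up for illustration.

```python
import os

def list_class_dirs(data_dir):
    """Return every subdirectory of data_dir, sorted, hidden ones included.

    Assumption: the batch generator treats each subdirectory as one class.
    Plain files sitting directly in data_dir are ignored.
    """
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )
```

Calling list_class_dirs(train_path) and comparing the length of the result with the number in the "Found ... classes" message should show which folder is the extra one.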

Thanks @karthik_k314 - this will keep me headed in the right direction!

I know that if I add a directory (even an empty one), vgg.get_batches shows an extra class. It is almost like I have a hidden directory in my train/ directory that get_batches sees but I don’t. I have used the terminal to list all the directories in the train folder, and it only shows two. I’ll figure it out yet :)
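
A plain ls skips names that start with a dot, so a hidden folder would not show up in a default terminal listing (ls -a does show them). Below is a minimal sketch for checking from Python instead. hidden_subdirs is a hypothetical helper name, and the .ipynb_checkpoints folder mentioned in the comment is only one possible culprit that Jupyter is known to create, not something verified for this data set.

```python
import os

def hidden_subdirs(data_dir):
    """Return subdirectories of data_dir whose names start with a dot.

    Example of a name that could turn up here (an assumption, not verified
    in this thread): ".ipynb_checkpoints".
    """
    return sorted(
        name for name in os.listdir(data_dir)
        if name.startswith(".") and os.path.isdir(os.path.join(data_dir, name))
    )
```

An empty list from hidden_subdirs(train_path) would rule this explanation out.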

Thanks again!

Well - I found out the issue - or at least I tried something different and it worked.

I am writing this here so that hopefully if someone has the same issue in the future, it can help them.

When I created my /train directory, I actually created it using the file browser in the Jupyter web “environment”.

I did this because I am not too great at bash commands (yet), and I was having trouble loading a zip file of my personal data set from my computer to the AWS server via bash. I was able to make directories fine from within a Jupyter notebook, but for the /train folder above I used the Jupyter web “environment” (I don’t think I am using the right words here, but I hope you know what I mean).

So what I did was manually create the /train folder in jupyter (like I said above) and then manually upload the photos using the upload feature. It all seemed fine, until vgg.get_batches seemed to find 1 more class than I had in the train/ directory.

I ended up creating a new directory using bash and copying the files over (also via bash), and now everything is working as expected.

So I guess what I am saying is that if you use the Jupyter web “environment” (I am not referring to commands inside a Jupyter notebook), then vgg.get_batches might find a “hidden folder” which it deems a class.

I could be totally off here - but I thought I would throw it out there in case it might help someone in the future.
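
If a hidden folder does turn out to be the cause, deleting just that folder may be a lighter fix than recreating everything. Here is a sketch under that assumption. remove_hidden_subdirs is a hypothetical helper, and it defaults to a dry run because shutil.rmtree deletes permanently.

```python
import os
import shutil

def remove_hidden_subdirs(data_dir, dry_run=True):
    """Delete dot-prefixed subdirectories of data_dir and return their names.

    With dry_run=True (the default) nothing is deleted; the names are only
    reported so they can be reviewed first. Symlinks are skipped.
    """
    removed = []
    for name in sorted(os.listdir(data_dir)):
        full_path = os.path.join(data_dir, name)
        is_real_dir = os.path.isdir(full_path) and not os.path.islink(full_path)
        if name.startswith(".") and is_real_dir:
            if not dry_run:
                shutil.rmtree(full_path)
            removed.append(name)
    return removed
```

Running it once with the default dry_run=True and reading the returned names before passing dry_run=False keeps the class folders safe from an accidental delete.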

Had the same problem. Only solution was to completely delete the directories and recreate in Unix. Jupyter notebook must have a bug.
