Help: Trouble using get_preds for Kaggle submissions

Hello everyone,

This might be a noob question that has already been answered, but here goes.

I am currently working through chapters 5 and 6 of fastbook and trying the tricks out on a Kaggle contest. Since Kaggle primarily provides the image names and targets in .csv files, I used the approach recommended in the book, as shown below:

dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_x=get_x, get_y=get_y, 
                   splitter=ColSplitter(),
                   item_tfms=Resize(460),
                   batch_tfms = [Resize(224), Normalize.from_stats(*imagenet_stats)])
dblock.summary(df)

dls = dblock.dataloaders(train_df)

Here get_x and get_y are defined as below:

def get_x(r): return os.path.join(configs["train_img_dir"], r["Image"])
def get_y(r): return r["Id"]

This works great, and I’m able to train and validate the model just fine. Now comes the tricky part. To create a submission, I have to read the file names from a CSV (sample_submission.csv) and then generate a prediction for each of them.

Here’s what I’ve tried so far:

  1. When I tried to use

    test_dl = dls.test_dl(test_df)
    preds, _ = learner.get_preds(dl=test_dl)
    

I got an error that no image matched the path, and I realized this was because get_x references the train directory and not the test directory.

  2. I tried creating a new dataloader, test_dls, with a new get_test_x function that returns the correct image paths, but that didn’t work.

  3. I also tried dls.test_dl(get_image_files(configs["test_img_dir"])). This worked, but I don’t know which file order get_preds uses, so I can’t line up the predictions with the submission CSV.

  4. My current approach is to call learner.predict(img) for each image in the submission CSV. This works but is painfully slow (15-20 minutes for all predictions).
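A possible way around the ordering concern with the get_image_files attempt, offered as a minimal sketch rather than a tested solution: build the list of paths from the submission CSV's own rows, so that entry i in the list corresponds to row i in the CSV. The helper name ordered_test_paths and the sample file names below are hypothetical; in practice the names would come from something like test_df["Image"].tolist().

```python
import os

def ordered_test_paths(image_names, test_img_dir):
    """Return full paths for image_names, preserving their order."""
    return [os.path.join(test_img_dir, name) for name in image_names]

# Hypothetical file names standing in for the "Image" column of the
# submission CSV; the order here is the order predictions should follow.
names = ["b.jpg", "a.jpg", "c.jpg"]
paths = ordered_test_paths(names, "test")
```

The resulting list could then be passed to dls.test_dl(...) in place of get_image_files(...). Test loaders are not shuffled by default, so the predictions should come back in the same order, though it is worth spot-checking a few images to confirm.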

Earlier threads on this forum about this question show approaches that use older versions of fastai.

I would really appreciate any help or pointers here.

Thanks!

Hi, try this:

testdl = learn.dls.test_dl(test_df, shuffle=False)
preds = learn.get_preds(dl=testdl)

Hi @Mugnaio ,

Thanks for the response. When I tried that, I got an error that the image files could not be found. This is because my get_x function prepends the train directory to the file name, while my test images are in the test directory.

Would you have any ideas for a workaround? Should I instead add a separate directory column (e.g. "dir") to my df, so that I can change my get_x function as follows?

def get_x(r): return os.path.join(r["dir"], r["Image"])

I see. I would probably modify the df to include the folder, as you wrote. I’m not sure this is the best solution, but it should work.

I’m going to give that a go now. Will let you know how that turns out. :)

Ok, so here’s how I fixed it:

  1. Add a column to both the train and test df as follows:
    train_df["img_path"] = configs["train_img_dir"] 
    test_df["img_path"] = configs["test_img_dir"]
    
  2. Change the get_x function to:
    def get_x(r): return os.path.join(r["img_path"], r["Image"])
    
  3. When running inference, use:
    testdl = learn.dls.test_dl(test_df, shuffle=False)
    preds, _ = learn.get_preds(dl=testdl)
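
For completeness, here is a rough sketch of how the predictions from step 3 might be turned into a submission file. It is written in plain Python so it runs on its own, and it makes several assumptions not stated in the thread: that the contest expects one label per image, that the output columns are named "Image" and "Id" (mirroring the columns used by get_x and get_y), and that the helper names top_labels and write_submission are purely illustrative. With fastai, the inputs would likely be preds.tolist(), list(learn.dls.vocab) and test_df["Image"].tolist().

```python
import csv
import io

def top_labels(probs, vocab):
    """Pick the highest-probability class name for each row of probs."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in probs]

def write_submission(image_names, labels, out):
    """Write one (Image, Id) row per test image, in the order given."""
    writer = csv.writer(out)
    writer.writerow(["Image", "Id"])  # assumed column names
    writer.writerows(zip(image_names, labels))

# Hypothetical stand-ins for the model outputs and class vocabulary.
probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
vocab = ["cat", "dog", "fox"]
names = ["0001.jpg", "0002.jpg"]

labels = top_labels(probs, vocab)
buf = io.StringIO()  # swap for open("submission.csv", "w", newline="") to write a file
write_submission(names, labels, buf)
```

Because the names and predictions are zipped row by row, this only gives a correct file if both lists are in the same order, which is what building the test loader directly from test_df is meant to guarantee.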