Help: Trouble using get_preds for kaggle submissions

sairam6087 · June 10, 2021, 5:23pm

Hello everyone,

Might be a noob question that was already answered before, but here goes.

I am currently working through chapters 5 and 6 of fastbook and trying the tricks out on a Kaggle contest. Since Kaggle primarily provides the image names and targets in .csv files, I used the approach recommended in the book, as shown below:

dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_x=get_x, get_y=get_y, 
                   splitter=ColSplitter(),
                   item_tfms=Resize(460),
                   batch_tfms = [Resize(224), Normalize.from_stats(*imagenet_stats)])
dblock.summary(df)

dls = dblock.dataloaders(train_df)

Here get_x and get_y are defined as below:

def get_x(r): return os.path.join(configs["train_img_dir"], r["Image"])
def get_y(r): return r["Id"]

This works great and I’m able to train and validate the model just fine. Now comes the tricky part. When I want to create a submission, I have to read the file names from sample_submission.csv and then create predictions in that order.

Here’s what I’ve tried so far:

When I tried to use

test_dl = dls.test_dl(test_df)
preds, _ = learner.get_preds(dl=test_dl)

I got an error that there was no image matching the path and I realized that this was because get_x was referencing the train directory and not the test.

I tried creating a new dataloader, test_dls, with a new get_test_x function for getting the correct image paths, but that didn’t work.
I also tried dls.test_dl(get_image_files(configs["test_img_dir"])). This worked, but I don’t know which file order is being used by get_preds, so I can’t line up the predictions with the submission csv.
My current approach is to use learner.predict(img) for each image in the submission csv. This works but is painfully slow (15-20 minutes for all predictions).
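One way around the file-order problem is to match predictions to the submission csv by file name rather than by position. The sketch below uses only the standard library and toy data; the helper name align_preds_to_submission is hypothetical, and it assumes you can get the list of files in the order the predictions were produced (in fastai this is usually available as test_dl.items, but treat that as an assumption to check against your version).

```python
from pathlib import Path

def align_preds_to_submission(files, preds, submission_names):
    """Return preds reordered to match submission_names.

    files: paths in the order the predictions were produced
    preds: one prediction per entry in files
    submission_names: file names in the order of the submission csv
    """
    if len(files) != len(preds):
        raise ValueError("files and preds must have the same length")
    # Key each prediction by bare file name, since the csv lists names only.
    by_name = {Path(f).name: p for f, p in zip(files, preds)}
    missing = [n for n in submission_names if n not in by_name]
    if missing:
        raise KeyError(f"no prediction for: {missing[:5]}")
    return [by_name[n] for n in submission_names]

# Toy data standing in for real paths and model outputs.
files = ["test/b.jpg", "test/c.jpg", "test/a.jpg"]
preds = [[0.1, 0.9], [0.7, 0.3], [0.5, 0.5]]
submission_names = ["a.jpg", "b.jpg", "c.jpg"]
aligned = align_preds_to_submission(files, preds, submission_names)
```

Matching by name also surfaces missing or extra files early, which a purely positional match would silently hide.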

Earlier threads on this question in the forum show approaches that use older versions of fastai.

I would really appreciate any help or pointers here.

Thanks!

Mugnaio · June 10, 2021, 6:27pm

Hi, try this:

testdl = learn.dls.test_dl(test_df, shuffle=False)
preds = learn.get_preds(dl=testdl)

sairam6087 · June 10, 2021, 6:54pm

Hi @Mugnaio ,

Thanks for the response. When I tried that, I got an error that the image files could not be found. This is mostly due to the fact that my get_x function appends the train directory as a prefix to the file name and my test images are in the test directory.

Would you have ideas on workarounds for this? Should I instead have a separate “dir” column in my df so that I can modify my get_x function as follows?

def get_x(r): return os.path.join(r["dir"], r["Image"])

Mugnaio · June 10, 2021, 7:31pm

I see. I would probably modify the df to include the folder, as you wrote. I’m not sure this is the best solution, but it should work.

sairam6087 · June 10, 2021, 7:37pm

I’m going to give that a go now. Will let you know how that turns out.

sairam6087 · June 11, 2021, 12:46am

Ok, so here’s how I fixed it:

Add a column in the train and test df as follows:

train_df["img_path"] = configs["train_img_dir"] 
test_df["img_path"] = configs["test_img_dir"]

Change the get_x function to

def get_x(r): return os.path.join(r["img_path"], r["Image"])

When running inference, use:

testdl = learn.dls.test_dl(test_df, shuffle=False)
preds, _ = learn.get_preds(dl=testdl)
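For completeness, here is a rough sketch of turning the predictions into a submission file. It is plain Python with toy data, not fastai code: the helpers decode_predictions and write_submission are hypothetical names, and the "Image" and "Id" column headers are assumed to match the contest's sample_submission.csv. With fastai, preds from get_preds should be a tensor of class probabilities (preds.tolist() gives nested lists) and the class labels should be in learn.dls.vocab, but check both against your setup.

```python
import csv
import io

def decode_predictions(probs, vocab):
    """Map each row of class probabilities to its most likely label."""
    labels = []
    for row in probs:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(vocab[best])
    return labels

def write_submission(names, labels, out):
    """Write a header plus one (file name, label) row per image."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Image", "Id"])  # assumed column names
    writer.writerows(zip(names, labels))

# Toy data standing in for test_df["Image"], preds and the vocab.
vocab = ["cat", "dog"]
names = ["a.jpg", "b.jpg"]
probs = [[0.2, 0.8], [0.9, 0.1]]
labels = decode_predictions(probs, vocab)

buf = io.StringIO()  # swap for open("submission.csv", "w", newline="")
write_submission(names, labels, buf)
```

Because test_dl is built from test_df with shuffle=False, the rows of preds follow the row order of test_df, so names can be taken directly from the "Image" column.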