MultiCategoryBlock not working on Kaggle

Hello,
[EDIT: I found a way to make the code from the book work, namely:

dls0 = ImageDataLoaders.from_df(df, path, folder='train', valid_col='is_valid', label_delim=' ',
                               item_tfms=Resize(460), batch_tfms=aug_transforms(size=224))

but this doesn't explain why the code below fails. Not sure if I should open an issue. For what it's worth, you can sanity-check these loaders with something like:
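
dls0.show_batch(nrows=1, ncols=3)  # should display a few images with their labels
]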
I'm working through Chapter 6 on Kaggle (since I have no way to run fastai code locally).
It turns out that, for now, the code breaks as soon as I specify that my DataBlock handles MultiCategory problems. I put together a Kaggle notebook (https://www.kaggle.com/pierrevial/multicategory-attempt) with a minimal example. I give the whole code below this post because I may remove the notebook in the future.
As you can see, strange errors appear whether MultiCategory is specified (file not found) or not (maximum recursion depth exceeded when calling show_batch).
I would really appreciate it if someone could tell me what's going wrong here!
Best,
Pierre

!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
from fastbook import *

from fastai.vision.all import *
path = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(path/'train.csv')
df.head()

def splitter(df):
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train,valid
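
(The splitter just returns two lists of integer row indices; a quick check, with illustrative output:)

train_idx, valid_idx = splitter(df)
len(train_idx), len(valid_idx)  # two lists of row indices, e.g. (2501, 2510) for PASCAL 2007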

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
     get_x = lambda r: r['fname'],
     get_y = lambda r: r['labels'],
     splitter=splitter)

# Note: if blocks=(ImageBlock, MultiCategoryBlock) is commented out,
# then the variables dsets and dls below are well defined (in the sense
# that they do not raise an error)

dsets = dblock.datasets(df)
# error: file not found

dls = dblock.dataloaders(df)
# error: file not found

dls.show_batch(nrows=1, ncols=1)
# If blocks=(ImageBlock, MultiCategoryBlock) is commented out when defining dblock above,
# dls and dsets do work, but dls.show_batch(nrows=1, ncols=1) fails with the
# following error:
# maximum recursion depth exceeded while calling a Python object
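
For anyone debugging this, fastai's built-in DataBlock.summary steps one sample through the whole pipeline and prints each intermediate stage, which should pinpoint exactly where it fails:

dblock.summary(df)
# builds the datasets from df, then walks one row through get_x/get_y
# and every transform, printing the result of each step up to the error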

I think I figured out a solution by following the notebook 06_multicat.ipynb:

The reason it throws the file-not-found error is that get_x needs to return the full path to the image file, not just the filename. Here are the two “getter” functions defined in the lesson notebook:

def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')
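
A quick way to sanity-check the getters on a single row (the outputs shown are illustrative):

row = df.iloc[0]
get_x(row)  # a full path, e.g. Path('.../pascal_2007/train/000005.jpg'), which fastai can open
get_y(row)  # a list of labels, e.g. ['chair'], which is what MultiCategoryBlock expects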

That solves the file-not-found error, but show_batch then throws a different error (tensors are not the same size) because the images in the dataset have different sizes and need to be standardized. That can be done by passing the following item_tfms when creating your DataBlock:

item_tfms = RandomResizedCrop(128, min_scale=0.35)
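
If you don't want the random-crop augmentation and just need uniform sizes, a plain resize should work as well, e.g.:

item_tfms = Resize(128)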

So, in summary, here’s the full code to build your DataBlock:

def splitter(df):
    train = df.index[~df['is_valid']].tolist()
    valid = df.index[df['is_valid']].tolist()
    return train,valid

def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   get_x = get_x, 
                   get_y = get_y,              
                   # get_y = ColReader('labels', label_delim=' '),
                   splitter=splitter,
                   item_tfms = RandomResizedCrop(128, min_scale=0.35)
                  )
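
From there, building the DataLoaders and showing a batch works as before:

dls = dblock.dataloaders(df)
dls.show_batch(nrows=1, ncols=3)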

When I run that, show_batch displays a batch of images with their labels correctly.

Thanks a lot @vbakshi! This does indeed work.
I'm still wondering whether the code from the fastbook notebook would work exactly as written if run locally, and whether it's worth opening a pull request for the chapter (at least to include your remark in the markdown).
