Issue with trying to create databunch from datablock api for a multilabel image set

dreadloaf · May 31, 2019, 1:20am

Hello everyone, I have been stumped on this issue for quite a while so help would be very much appreciated. First and foremost, here is the error I am getting:

AssertionError                            Traceback (most recent call last)
<ipython-input-77-5021e6278664> in <module>
      3         .split_by_rand_pct()
      4         #How to split in train/valid? -> randomly with the default 20% in valid
----> 5         .label_from_df(label_delim=' ')
      6         #How to label? -> use the second column of the csv file and split the tags by ' '
      7         .databunch())        

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in _inner(*args, **kwargs)
    461         assert isinstance(fv, Callable)
    462         def _inner(*args, **kwargs):
--> 463             self.train = ft(*args, from_item_lists=True, **kwargs)
    464             assert isinstance(self.train, LabelList)
    465             kwargs['label_cls'] = self.train.y.__class__

/opt/conda/envs/fastai/lib/python3.6/site-packages/fastai/data_block.py in label_from_df(self, cols, label_cls, **kwargs)
    268         "Label `self.items` from the values in `cols` in `self.inner_df`."
    269         labels = self.inner_df.iloc[:,df_names_to_idx(cols, self.inner_df)]
--> 270         assert labels.isna().sum().sum() == 0, f"You have NaN values in column(s) {cols} of your dataframe, please fix it."
    271         if is_listy(cols) and len(cols) > 1 and (label_cls is None or label_cls == MultiCategoryList):
    272             new_kwargs,label_cls = dict(one_hot=True, classes= cols),MultiCategoryList

AssertionError: You have NaN values in column(s) 1 of your dataframe, please fix it.

Here is the snippet causing the issue:

data = (ImageList.from_csv(Path('/storage/bird-sounds/'), 'labels.csv', folder='train')
        #Where to find the data? -> in the 'train' folder under /storage/bird-sounds/
        .split_by_rand_pct()
        #How to split in train/valid? -> randomly with the default 20% in valid
        .label_from_df(label_delim=' ')
        #How to label? -> use the second column of the csv file and split the tags by ' '
        .databunch())

Here is a snippet of my labels.csv:

Filename,songs
nips4b_birds_trainfile001.png,Butbut_call Erirub_call Parate_call
nips4b_birds_trainfile002.png,Sylmel_song
nips4b_birds_trainfile003.png,Petpet_song Sylcan_song
nips4b_birds_trainfile004.png,Erirub_call Prumod_song Turmer_call
nips4b_birds_trainfile005.png,Erirub_call Prumod_song Turmer_call
nips4b_birds_trainfile006.png,Phofem_song Tetpyg_song
nips4b_birds_trainfile007.png,Fricoe_song Gargla_call Parate_song Siteur_song Turmer_song
nips4b_birds_trainfile008.png,Galthe_call

The error suggests that the labels have to be numbers. If this is the case, how would I use strings as labels?

Thank you in advance!

dipam7 · May 31, 2019, 5:10am

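Hey, the error says that you have missing values (NaN) in column 1 of your labels.csv, i.e. the second column, which holds the labels. It does not mean the labels have to be numbers, so string labels are fine. You can do the following to check that.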

# read the df
import pandas as pd
df = pd.read_csv('labels.csv')

# check which columns have null values
df.isna().any()

You will have to replace the null values with something in order to proceed. You will have to go back to the dataset to figure out what that something is. It can be a default value you use for all of these. You can also remove the rows with null values if there aren’t many. Read about the various ways to deal with null values on the internet. Cheers.
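
To make those options concrete, here is a minimal sketch of finding the offending rows and then either dropping or filling them. The inline CSV is made-up sample data in the same shape as labels.csv, with one label deliberately left empty to reproduce the NaN. The placeholder tag 'no_bird' is purely hypothetical, so pick whatever makes sense for your dataset.

```python
import io
import pandas as pd

# Hypothetical sample in the same shape as labels.csv; the row for
# trainfile003 has an empty label on purpose, which pandas reads as NaN.
csv_text = """Filename,songs
nips4b_birds_trainfile001.png,Butbut_call Erirub_call Parate_call
nips4b_birds_trainfile002.png,Sylmel_song
nips4b_birds_trainfile003.png,
nips4b_birds_trainfile004.png,Erirub_call Prumod_song Turmer_call
"""
df = pd.read_csv(io.StringIO(csv_text))

# Show exactly which rows are missing a label.
missing = df[df['songs'].isna()]
print(missing)

# Option 1: drop the rows without a label.
dropped = df.dropna(subset=['songs']).reset_index(drop=True)

# Option 2: keep the rows and fill in a placeholder tag
# ('no_bird' is an assumed name, not something from the dataset).
filled = df.fillna({'songs': 'no_bird'})
```

With your real file you would replace the StringIO part with pd.read_csv('labels.csv'). After cleaning, you could either write the frame back out with to_csv and keep using from_csv, or pass it to ImageList.from_df instead.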