Confused about the output of 'predict' (unsolved)

eof · January 9, 2019, 9:06pm

When I call predict on an image, I get a 3-tuple:

the predicted category
a tensor of shape [1]
a tensor of shape size_of_validation_set

I am confused by a couple of things. Why are there multiple identical categories even after loading them into a set? (Notice that ‘new whale’ comes up multiple times.)

In general, how can I get the “top N most likely categories”? My original thinking was that I would zip the output of the probability tensor with the labels from the databunch, but with multiple categories labeled the same thing I am completely confused.
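A minimal sketch of the zip idea, under one assumption: as far as I know, in fastai v1 the third item returned by predict holds one probability per class, in the same order as data.classes (a deduplicated list of class names, not one label per image). The function name `top_n_labels` and the toy values are hypothetical; with a real learner you would pass something like `probs.tolist()` and `data.classes`.

```python
from typing import List, Sequence, Tuple


def top_n_labels(probs: Sequence[float], classes: Sequence[str],
                 n: int = 5) -> List[Tuple[str, float]]:
    """Return the n (label, probability) pairs with the highest probability, best first."""
    if len(probs) != len(classes):
        # A length mismatch means the list being zipped is not the class list
        # (for example, it may hold one label per image instead).
        raise ValueError("need exactly one probability per class")
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical stand-ins for data.classes and the probability tensor.
classes = ["new_whale", "w_0001", "w_0002", "w_0003"]
probs = [0.55, 0.05, 0.30, 0.10]

top3 = top_n_labels(probs, classes, n=3)
print(top3)
```

If the two lengths differ, the check raises instead of silently pairing probabilities with the wrong names, which may help track down where repeated labels come from.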

I feel like I am missing a fundamental insight, so I would be happy to be pointed toward something to read. Thanks

Here is an image of what I am talking about.

eof · January 9, 2019, 9:23pm

The whole notebook, with training: https://github.com/gdoteof/neuralnet_stuff/blob/master/kaggle_whales.ipynb

sariabod · January 9, 2019, 9:48pm

Radek has a really good notebook for the whales competition:

In his utils.py file he has a couple functions to create the predictions. This should help sorting out what is being returned and how to get the TOP N labels.

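import numpy as np

def top_5_preds(preds):
    # preds: 2-D tensor (one row per image, one column per class).
    # Returns the indices of the 5 highest-scoring classes per row, best first.
    return np.argsort(preds.numpy())[:, ::-1][:, :5]

def top_5_pred_labels(preds, classes):
    top_5 = top_5_preds(preds)
    labels = []
    for i in range(top_5.shape[0]):
        # Join the five class names for this row into one space-separated string.
        labels.append(' '.join([classes[idx] for idx in top_5[i]]))
    return labels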

eof · January 10, 2019, 7:30pm

Thank you @sariabod

I was/am super excited to see that folder. Unfortunately, it gets me right back to the same place. My problem is that I don’t understand… something. Let me explain:

The default “fast ai way” of creating a databunch, which is what Radek does in his notebooks there, no longer works for this dataset (an update that came in December started checking the databunch for problems). The problem is that so many categories have only a single example that the random train/val split always puts at least one class in the validation set that is not in the training set, which results in an error:

Exception: Your validation data contains a label that isn't present in the training set, please fix your data.
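To make the failure concrete, here is a small sketch in plain Python (not fastai code) of the condition that error message describes. The helper names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def labels_missing_from_train(train_labels: Iterable[str],
                              valid_labels: Iterable[str]) -> Set[str]:
    """Labels that occur in the validation split but never in the training split."""
    return set(valid_labels) - set(train_labels)


def singleton_labels(labels: Iterable[str]) -> Set[str]:
    """Labels that have exactly one example in the whole dataset."""
    counts = Counter(labels)
    return {label for label, count in counts.items() if count == 1}


# Toy dataset: "w_a" has a single image, and the split put it in validation.
all_labels = ["new_whale", "new_whale", "new_whale", "w_a", "w_b", "w_b"]
train_labels = ["new_whale", "new_whale", "w_b", "w_b"]
valid_labels = ["new_whale", "w_a"]

missing = labels_missing_from_train(train_labels, valid_labels)
singles = singleton_labels(all_labels)
print(missing, singles)
```

A label with a single image that lands in the validation split can never also be in training, so with many such labels a purely random split is very likely to trip the check.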

I was eventually able to get around this, but now, because of the results I am getting from my ‘predict’ function, I believe I may have created my databunch wrong.

eof · January 10, 2019, 7:31pm

To be clear, Radek’s “first submission” notebook is no longer “compiling” (or whatever you call a notebook running successfully).

sariabod · January 10, 2019, 10:14pm

Hello @eof

I misunderstood your initial issue. This dataset is very imbalanced, and I also ran into issues doing a random split. I ended up splitting the data at the label level and skipping labels that did not have enough images. I will have to look at your code later tonight to see if I can spot the issue. In the meantime, below is a link to a notebook showing how I solved the problem (I know this doesn’t solve your issue, but I figured it couldn’t hurt).

It is meant to be run after test_val_split.py has split up the images.

https://github.com/sariabod/playground/blob/master/fastai.ipynb
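A rough sketch of what a label-level split that skips rare labels could look like. This is not the contents of test_val_split.py, which is not shown in the thread; the function name `split_by_label`, its parameters, and the toy file names are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (filename, label)


def split_by_label(items: List[Item], valid_pct: float = 0.2,
                   min_count: int = 2, seed: int = 0):
    """Split each label's images separately; skip labels with fewer than min_count images."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for fname, label in items:
        by_label[label].append(fname)

    rng = random.Random(seed)
    train: List[Item] = []
    valid: List[Item] = []
    skipped: List[str] = []
    for label in sorted(by_label):
        files = sorted(by_label[label])
        if len(files) < min_count:
            skipped.append(label)  # too few images to split
            continue
        rng.shuffle(files)
        n_valid = max(1, int(len(files) * valid_pct))
        n_valid = min(n_valid, len(files) - 1)  # always keep at least one image for training
        valid += [(f, label) for f in files[:n_valid]]
        train += [(f, label) for f in files[n_valid:]]
    return train, valid, skipped


items = [("a1.jpg", "w_a"), ("b1.jpg", "w_b"), ("b2.jpg", "w_b")]
items += [(f"n{i}.jpg", "new_whale") for i in range(5)]

train, valid, skipped = split_by_label(items)
print(len(train), len(valid), skipped)
```

Because every label is split on its own and always keeps at least one training image, no validation label can be absent from the training set. The cost is that labels below `min_count` are dropped entirely.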

eof · January 11, 2019, 5:10pm

If you do get a chance to look: the code below is my attempt at slightly modified versions of from_df and from_csv that would create a valid dataset.

from fastai.vision import *  # fastai v1: ImageItemList, ItemLists, ImageDataBunch, get_transforms
from sklearn.model_selection import train_test_split

def from_df_ws(path:PathOrStr, df:pd.DataFrame, folder:PathOrStr='.', sep=None, valid_pct:float=0.2,
               fn_col:IntsOrStrs=0, label_col:IntsOrStrs=1, suffix:str='',
               **kwargs:Any)->'ImageDataBunch':
    "Create from a `DataFrame` `df`, dropping validation rows whose label is absent from train."
    # Random split with a fixed seed.
    df_train, df_valid = train_test_split(df, test_size=valid_pct, random_state=420)

    # Keep only validation rows whose label also occurs in the training split.
    # Note: the label column name "Id" is hard-coded here (it matches train.csv).
    df_valid = df_valid[df_valid["Id"].isin(df_train["Id"])]

    train_iil = ImageItemList.from_df(df_train, path=path, folder=folder, suffix=suffix, cols=fn_col)
    valid_iil = ImageItemList.from_df(df_valid, path=path, folder=folder, suffix=suffix, cols=fn_col)

    src = (ItemLists(path, train_iil, valid_iil)
           .label_from_df(sep=sep, cols=label_col))

    return ImageDataBunch.create_from_ll(src, **kwargs)


def from_csv_ws(path:PathOrStr, folder:PathOrStr='.', sep=None, csv_labels:PathOrStr='labels.csv', valid_pct:float=0.2,
                fn_col:int=0, label_col:int=1, suffix:str='',
                header:Optional[Union[int,str]]='infer', **kwargs:Any)->'ImageDataBunch':
    "Create from a csv file in `path/csv_labels`."
    path = Path(path)
    df = pd.read_csv(path/csv_labels, header=header)
    return from_df_ws(path, df, folder=folder, sep=sep, valid_pct=valid_pct,
                      fn_col=fn_col, label_col=label_col, suffix=suffix, **kwargs)


# BASE (dataset root) and sz (image size) are defined elsewhere in the notebook.
bs = 768
tfms = get_transforms(max_rotate=20, max_zoom=1.3, max_lighting=0.4, max_warp=0.4, p_affine=1., p_lighting=1.)
data = from_csv_ws(path=BASE, folder='train', csv_labels="train.csv", ds_tfms=tfms, bs=bs, size=sz)