Tabular learning dataframe errors (chapter 9)

Hi all, I’m making my own small example data to follow along with chapter 9, to try to understand the fastai library better (and actually just understand pandas better). I’m making a dataframe like this:

import random
import pandas as pd

# 20 rows, 4 columns (labels 0..3), filled in row by row
df = pd.DataFrame(index=range(20), columns=range(4))
for dfRowIndex in range(df.shape[0]):
    df.loc[dfRowIndex, 1] = random.uniform(0, 100)
    df.loc[dfRowIndex, 2] = random.uniform(0, 100)
    df.loc[dfRowIndex, 3] = random.uniform(0, 100)
    # Target: 1 if column 1 > column 3, else 0
    df.loc[dfRowIndex, 0] = 1 if df.loc[dfRowIndex, 1] > df.loc[dfRowIndex, 3] else 0
print(df)

This makes column 0 equal to 1 when column 1 is greater than column 3, and 0 otherwise. Column 2 has no effect on anything.
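For comparison, the same kind of frame can be built without the row-by-row loop. This is only a sketch: it assumes numpy is available, and the seed is an arbitrary choice for repeatability. The columns come out with numeric dtypes, whereas the loop version appears to leave them as object dtype because the frame starts out empty.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Columns 1..3: uniform random floats in [0, 100), 20 rows
df = pd.DataFrame(rng.uniform(0, 100, size=(20, 3)), columns=[1, 2, 3])

# Column 0 (the target): 1 where column 1 > column 3, else 0
df.insert(0, 0, (df[1] > df[3]).astype(int))

print(df.dtypes)
print(df.head())
```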

Then I try this:

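from fastai.tabular.all import *

procs = [Categorify, FillMissing, Normalize]
numberOfValidationRows = 5
# splits = (training row indices, validation row indices): the first 5 rows are held out for validation
splits = (list(range(numberOfValidationRows, df.shape[0])),
          list(range(0, numberOfValidationRows)))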
cat_names = [0]
cont_names = list(range(1, 4))
y_names = [0]
to = TabularPandas(df, procs, cat_names, cont_names,
                   y_names=y_names, y_block=CategoryBlock, splits=splits)
to.show(1)

The last line causes an error: “ValueError: Columns must be same length as key”.

I can’t see why this happens. Separately, whether or not I call to.show(), if I do something like this:

dls = to.dataloaders(5)
learn = tabular_learner(dls, y_range=(0,1), layers=[500,250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()

it causes a different error: “RuntimeError: CUDA driver error: unknown error”.
(The code in the chapter 9 notebook works fine on the same server.)

Maybe the second error has the same cause as the first?

Any pointers to what I’m doing wrong here? Am I misunderstanding something to do with how a dataframe becomes a TabularPandas? Thanks!

The first error seems to be because of this line:

~/anaconda3/envs/fastai/lib/python3.8/site-packages/fastai/tabular/core.py in show(self, max_n, **kwargs)
--> 175     def show(self, max_n=10, **kwargs): display_df(self.new(self.all_cols[:max_n]).decode().items)

all_cols seems to add an extra column onto the end of the dataframe. I can test it like this:

print(f'--\n\n{to[:3]}\n\n--\n\n{to.all_cols[:3]}')

--

   0         1         2         3
5  0 -0.195872 -0.741504  0.556207
6  1  0.772629 -1.632479  1.315034
7  1  0.365106 -0.242548 -0.963825

--

   0         1         2         3  0
5  0 -0.195872 -0.741504  0.556207  0
6  1  0.772629 -1.632479  1.315034  1
7  1  0.365106 -0.242548 -0.963825  1

This then causes the “ValueError: Columns must be same length as key” error later. I’m still not sure how to fix it though, because I can’t call to.show() without it going through all_cols.
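The repeated label can be reproduced in plain pandas, without fastai. The sketch below assumes that all_cols is assembled by concatenating the categorical, continuous and y name lists in that order, which matches the output above but is not taken from the fastai source. The small frame and the name all_col_names are made up for illustration.

```python
import pandas as pd

# Tiny stand-in frame with the same column labels as in the post
df = pd.DataFrame({0: [0, 1, 1],
                   1: [0.1, 0.2, 0.3],
                   2: [0.4, 0.5, 0.6],
                   3: [0.7, 0.8, 0.9]})

cat_names, cont_names, y_names = [0], [1, 2, 3], [0]

# Assumption: this mirrors how all_cols picks its columns
all_col_names = cat_names + cont_names + y_names

dup = df[all_col_names]
print(list(dup.columns))      # label 0 appears twice
print(type(dup[0]).__name__)  # DataFrame rather than Series, because 0 is ambiguous
```

Once a label is duplicated, selecting it returns two columns instead of one, so any later assignment that expects a single column per name can end up with mismatched lengths.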

I see the problem now! The y columns should not appear in the cont/cat lists at all.
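A quick way to catch this before building the TabularPandas is to check the name lists for overlap. This is a minimal sketch; overlapping_names is a hypothetical helper written for this thread, not part of fastai.

```python
def overlapping_names(cat_names, cont_names, y_names):
    """Return the y names that also appear in cat_names or cont_names.

    Hypothetical helper for illustration; not a fastai function.
    """
    features = set(cat_names) | set(cont_names)
    return [name for name in y_names if name in features]

# Lists from the first post: column 0 is both a categorical feature and the target
print(overlapping_names([0], [1, 2, 3], [0]))

# Corrected lists: the target column is named only in y_names
cat_names = []
cont_names = [1, 2, 3]
y_names = [0]
print(overlapping_names(cat_names, cont_names, y_names))
```

With the corrected lists, the TabularPandas call from the first post can stay as it was. Whether this also clears the CUDA error is not confirmed in this thread.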
