Tabular Data Labelling/Training Issue

Mattlok · August 23, 2019, 3:22pm

I’m trying to train a model using tabular data and am running into a couple of issues whose root cause I can’t pin down. I’ve simplified my dataset to only three columns:
label, histogram and general
label is the dependent variable, histogram is an array of 256 floats and general is an array of 10 floats.

from fastai.tabular import *  # fastai v1

dep_var = 'label'
cont_vars = ['histogram', 'general']
procs = [FillMissing, Categorify, Normalize]

data = (TabularList.from_df(data, cat_names=[], cont_names=cont_vars, procs=procs)
        .split_by_rand_pct(0.2, seed=42)
        .label_from_df(cols=dep_var)
)

I get the following error before I get to create a databunch:
"AssertionError: Cannot normalize 'histogram' column as it isn't numerical.
Are you sure it doesn't belong in the categorical set of columns?"

Is this because it’s an array? If so, what is the best way to handle this feature array? Does every number require its own column?

Thank you!
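
A quick way to see why the assertion might fire: a pandas column whose cells hold lists or arrays has dtype `object`, which is not a numeric dtype. The sketch below uses only pandas, on a tiny made-up DataFrame standing in for the real dataset. It assumes the check behind the error is a numeric-dtype test along the lines of `pandas.api.types.is_numeric_dtype`; that is an assumption, not something stated in this thread.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up stand-in for the real dataset: each 'histogram' cell holds a list.
df = pd.DataFrame({
    "label": [0, 1],
    "histogram": [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],
})

# A column of lists is stored with dtype 'object', which is not numeric.
histogram_is_numeric = is_numeric_dtype(df["histogram"])
# A column of plain integers is numeric.
label_is_numeric = is_numeric_dtype(df["label"])

print(df["histogram"].dtype, histogram_is_numeric, label_is_numeric)
```

If the arrays were read from a CSV file they may actually be strings, which would also give an `object` column.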

muellerzr · August 23, 2019, 3:47pm

You’d probably want one column per number here for it to work best, or try some creative feature engineering to condense the histogram into a few features.
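
A minimal sketch of the "column per number" idea, using only pandas. The helper `expand_array_column` and the toy data are hypothetical, made up for illustration. It assumes every cell is a real list (not a string) and that all arrays in a column have the same length.

```python
import pandas as pd

def expand_array_column(df, col):
    """Replace an array-valued column with one scalar column per element."""
    # Build a frame with one column per array position, aligned on the index.
    expanded = pd.DataFrame(df[col].tolist(), index=df.index)
    expanded.columns = [f"{col}_{i}" for i in range(expanded.shape[1])]
    return pd.concat([df.drop(columns=[col]), expanded], axis=1)

# Toy data: short arrays stand in for the 256-float and 10-float columns.
df = pd.DataFrame({
    "label": [0, 1],
    "histogram": [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],
    "general": [[1.0, 2.0], [3.0, 4.0]],
})
for col in ["histogram", "general"]:
    df = expand_array_column(df, col)

# Every expanded column is a plain float column and can be listed as continuous.
cont_vars = [c for c in df.columns if c != "label"]
print(cont_vars)
```

The resulting `cont_vars` list could then be passed as `cont_names` in place of the two array columns.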

Mattlok · August 23, 2019, 5:18pm

Thank you for the response! Do you know if this was a design decision in the fastai library? For instance, would a library like TensorFlow be able to handle an array as a column value? This could just be my misunderstanding of how the libraries handle different data types.

muellerzr · August 23, 2019, 5:20pm

You could write your own custom item list that takes the data in the way you want. The library is very versatile; see the ItemList tutorial in the docs at docs.fast.ai

muellerzr · August 23, 2019, 5:21pm

But in general, yes: when dealing with tabular data in the fastai library, every column is either continuous or categorical.

Mattlok · August 23, 2019, 5:39pm

Would you suggest creating a custom item list that flattens this array into one number? Or something that is able to handle the array of 256 elements? Thanks again for the clarification!
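
As one illustration of condensing the array, here is a sketch that reduces a histogram to a few scalar features with NumPy. The function `summarize_histogram` and the choice of features (mean bin, spread, entropy) are hypothetical and not something proposed in this thread. It assumes the histogram values are non-negative counts or weights.

```python
import numpy as np

def summarize_histogram(hist):
    """Reduce one histogram to a few scalars (an illustrative feature choice)."""
    hist = np.asarray(hist, dtype=float)
    total = hist.sum()
    # Normalize to probabilities; fall back to uniform for an all-zero histogram.
    probs = hist / total if total > 0 else np.full(hist.shape, 1.0 / hist.size)
    bins = np.arange(hist.size)
    mean_bin = float((bins * probs).sum())
    std_bin = float(np.sqrt(((bins - mean_bin) ** 2 * probs).sum()))
    nonzero = probs[probs > 0]
    entropy = float(-(nonzero * np.log2(nonzero)).sum())
    return {"hist_mean_bin": mean_bin, "hist_std_bin": std_bin, "hist_entropy": entropy}

# Tiny 4-bin example; a real histogram here would have 256 bins.
features = summarize_histogram([0.0, 1.0, 1.0, 0.0])
print(features)
```

Each resulting scalar is an ordinary continuous column, at the cost of discarding most of the detail in the original 256 values.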

muellerzr · August 23, 2019, 5:44pm

Perhaps keep it as an array and pass it in as categorical. Play around with it and report back!

Mattlok · August 25, 2019, 7:22pm

Passing the variables as categorical resolved the issue! Thanks for the help, was able to get past my hitch.

muellerzr · August 25, 2019, 7:24pm

No problem! Happy to help

sidd.suresh97 · September 16, 2019, 9:42am

Hey, I have the same problem. I have a column in which each value is an array of numbers, so the column looks like this:

Column X
[.39039,1.39239,…]
[3.430940,-3.93902…]
…
…
[0.39,-34934-,…]

How do I pass these variables as categorical with the fastai library? Could you please elaborate?

Mattlok · September 17, 2019, 2:00pm

The TabularList.from_df function takes two lists in which you specify the column names of the categorical and continuous variables. From my example above:

cat_vars = ['here', 'are', 'categorical', 'variables']
cont_vars = ['histogram', 'general']
procs = [FillMissing, Categorify, Normalize]

data = TabularList.from_df(data, cat_names=cat_vars, cont_names=cont_vars, procs=procs)

@sidd.suresh97