Tabular Data Labelling/Training Issue

I'm trying to train a model on tabular data and I'm running into a couple of issues whose root cause I can't pin down. I've simplified my dataset to only three columns: Label, Histogram and General.
Label is the dependent variable, Histogram is an array of 256 floats and General is an array of 10 floats.

dep_var = 'label'
cont_vars = ['histogram', 'general']
procs = [FillMissing, Categorify, Normalize]

data = (TabularList.from_df(data, cat_names=[], cont_names=cont_vars, procs=procs)
        .split_by_rand_pct(0.2, seed=42)
        .label_from_df(dep_var))

I get the following error before I get to create a databunch:
"AssertionError: Cannot normalize 'histogram' column as it isn't numerical.
Are you sure it doesn't belong in the categorical set of columns?"

Is this because it's an array? If so, what is the best way to handle this feature array? Does every number require its own column?

Thank you!

You’d probably want a column per number here to work best with it, or try some creative feature engineering to translate the histogram over.
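To illustrate the column-per-number suggestion, here is a minimal sketch using plain pandas and NumPy. The helper `expand_array_column` and the toy frame are hypothetical (not part of fastai), and the arrays are shrunk from 256 and 10 elements to 4 and 2 to keep it readable. It assumes every array in a given column has the same length.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: each cell of 'histogram' and
# 'general' holds a list of floats (assumption: equal length within a column).
df = pd.DataFrame({
    "label": [0, 1, 0],
    "histogram": [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]],
    "general": [[0.5, 1.5], [1.0, 3.0], [1.5, 4.5]],
})

def expand_array_column(frame, col):
    """Replace an array-valued column with one float column per element.

    Hypothetical helper for illustration; returns the new frame and the
    names of the columns it created.
    """
    values = np.stack(frame[col].tolist())  # shape: (n_rows, n_elements)
    names = [f"{col}_{i}" for i in range(values.shape[1])]
    expanded = pd.DataFrame(values, columns=names, index=frame.index)
    return frame.drop(columns=[col]).join(expanded), names

df, hist_cols = expand_array_column(df, "histogram")
df, gen_cols = expand_array_column(df, "general")

# These names are what you would pass as the continuous column names.
cont_vars = hist_cols + gen_cols
print(df.dtypes)
```

After this step every feature column holds plain floats, so the list in `cont_vars` could be passed as `cont_names` in place of ['histogram', 'general'].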

Thank you for the response! Do you know if this was a design decision in the fastai library? For instance, would a library like TensorFlow be able to handle an array as a column value? This could just be my misunderstanding of how the libraries handle different data types.

You could write your own custom item list that takes the data in the way you want. The library is very versatile; look at the ItemList tutorial in the docs (docs.fast.ai) :slight_smile:

But in general yes, when dealing with tabular data, it’s continuous or categorical for the fastai library.
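As a rough illustration of why an array-valued column trips the "isn't numerical" check, the sketch below uses only pandas. It assumes the check is a numeric-dtype test on the column, which is what the error message suggests; the column names are made up for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame: one scalar column and one column whose cells are lists.
df = pd.DataFrame({
    "scalar_feature": [0.1, 0.2, 0.3],
    "array_feature": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# A column of floats has a numeric dtype; a column of lists is stored
# with dtype 'object', so a numeric-dtype test on it fails.
scalar_is_numeric = is_numeric_dtype(df["scalar_feature"])
array_is_numeric = is_numeric_dtype(df["array_feature"])

print(df.dtypes)
print(scalar_is_numeric, array_is_numeric)
```

So even though every element inside the arrays is a float, the column as a whole is not numeric from pandas' point of view.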

Would you suggest creating a custom item list that flattens this array into one number? Or something that is able to handle the array of 256 elements? Thanks again for the clarification!

Perhaps keep it as an array and pass it in as categorical. Play around with it and report back! :slight_smile:

Passing the variables as categorical resolved the issue! Thanks for the help, was able to get past my hitch.
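One caveat worth checking if you go this route, shown as a sketch in plain pandas rather than fastai itself: it assumes the arrays are stored as strings (as they typically are after reading from CSV) and that categorical handling boils down to a pandas category conversion. Under those assumptions each distinct array becomes a single category code, so the individual numbers inside the array are not seen as numeric values.

```python
import pandas as pd

# Hypothetical column where each array is stored as a string.
histogram = pd.Series(["[0.1, 0.2]", "[0.3, 0.4]", "[0.1, 0.2]"])

# Converting to a pandas categorical assigns one integer code per distinct value.
as_category = histogram.astype("category")
codes = as_category.cat.codes.tolist()
n_categories = len(as_category.cat.categories)

print(codes)
print(n_categories)
```

Identical arrays share a code and different arrays get different codes, which gets past the error but treats each whole array as an opaque label.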

No problem! Happy to help :slight_smile:

Hey, I have the same problem. I have an array of values as the value in a column, which means I have a column that looks like this:

Column X
[.39039,1.39239,…]
[3.430940,-3.93902…]


[0.39,-34934-,…]

How do I use the fastai library to pass these variables as categorical? Could you please elaborate?

The TabularList.from_df function takes two lists in which you specify the names of the categorical and continuous columns. From my example above:

cat_vars = ['here', 'are', 'categorical', 'variables']
cont_vars = ['histogram', 'general']
procs = [FillMissing, Categorify, Normalize]

data = TabularList.from_df(data, cat_names=cat_vars, cont_names=cont_vars, procs=procs)

@sidd.suresh97