How to save TabularPandas data to a regular Pandas Dataframe

I did a bunch of preproc work on a raw csv file using TabularPandas, and I’d like to save all that data into a csv file. What method do I invoke to save the processed data into a csv?

Is your data in a pandas.DataFrame? If so, it's as simple as df.to_csv(). You may also want to pass index=False to to_csv() so the row index isn't written out as an extra column.
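For instance, a minimal round trip (the frame here is just a stand-in for your processed data):

```python
import pandas as pd

# Stand-in for your processed frame
df = pd.DataFrame({"Store": [1, 2], "Sales": [5263.0, 6064.0]})

# index=False stops pandas writing the row index as an extra column
df.to_csv("proc_data.csv", index=False)

# Reading it back gives the same frame
assert pd.read_csv("proc_data.csv").equals(df)
```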


IIRC you can access the internal dataframe via to.items. You can then follow @BresNet's suggestion on how to save it :slight_smile:

(If you’re interested in just saving the tabular pandas, I can also link to this)


It’s a TabularPandas object, something from fastai

Well, there was an attribute called xs that had it. I saved it with to_nn.xs.to_csv('proc_data.csv')

I have another problem that popped up, if you wouldn't mind. After training the model, I took an entry from proc_data, added a batch dimension, converted it to a torch tensor, and passed it to the model. I was trying to mimic an inference situation.


But I got an error,

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.DoubleTensor instead (while checking arguments for embedding)

Isn’t Long an integer datatype? Why is it popping up here?

for info, I’m working on the Rossman Dataset, I merged the train and store data on the store column, and then just followed the book.

The issue here is that fastai's model has two inputs: the categorical and the continuous variables. Let's rewrite that so it's a bit more like what the model is expecting (as it currently thinks everything is a categorical variable):

(note there is no fastai magic here except for preprocessing):

row = prod.iloc[0]
cat_idxs = [prod.columns.get_loc(nm) for nm in cat_names] # SUPER IMPORTANT
cont_idxs = [prod.columns.get_loc(nm) for nm in cont_names]
cat = row.iloc[cat_idxs]   # .iloc, since these are integer positions, not labels
cont = row.iloc[cont_idxs]


cat = tensor(cat.values, dtype=torch.long).unsqueeze(0)
cont = tensor(cont.values, dtype=torch.float).unsqueeze(0)

learn.model(cat, cont)

Now let’s walk through the why:

We need to get cat_idxs and cont_idxs because fastai's tabular model has two inputs, a cat tensor and a cont tensor. On top of this, because of how the preprocessing works, fastai adds a _na column during FillMissing, so if you check your cat_names you'll notice _na columns have been added (potentially many, since you're doing Rossmann).
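To illustrate in plain pandas, this roughly mimics what FillMissing does (the column name is made up):

```python
import pandas as pd

# Hypothetical continuous column with a missing value
df = pd.DataFrame({"CompetitionDistance": [100.0, None, 300.0]})

# Roughly what FillMissing does: record where the NaNs were in a _na flag
# column (which then gets treated as categorical), then fill with the median
df["CompetitionDistance_na"] = df["CompetitionDistance"].isna()
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(df["CompetitionDistance"].median())

assert df["CompetitionDistance_na"].tolist() == [False, True, False]
assert df["CompetitionDistance"].tolist() == [100.0, 200.0, 300.0]
```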

Next, after we extract them from a row, we need to convert them to the datatypes the fastai model expects: Long for the categoricals and float for the continuous variables. On top of this, since the model works on batches, we need to unsqueeze our tensors by one dimension (i.e. turn something like [0,1,2] into [[0,1,2]]).
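A plain PyTorch sketch of both points, the Long requirement and the batch dimension (the values and sizes here are made up):

```python
import torch
import torch.nn as nn

cat = torch.tensor([3, 1, 0], dtype=torch.long)      # categorical codes must be Long (int64)
cont = torch.tensor([0.5, -1.2], dtype=torch.float)  # continuous values as float32

# unsqueeze(0) adds the batch dimension: shape (3,) -> (1, 3)
assert cat.unsqueeze(0).shape == (1, 3)
assert cat.unsqueeze(0).tolist() == [[3, 1, 0]]

# Passing Double indices to an embedding reproduces the error above
emb = nn.Embedding(10, 4)
assert emb(cat).shape == (3, 4)  # Long indices: fine
try:
    emb(cat.double())
    raised = False
except RuntimeError:
    raised = True
assert raised
```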

After we do this we can pass it to learn.model.

Now personally I’d take more of this approach (which is how we mimic an inference situation):

dl = learn.dls.test_dl(prod.iloc[:1]) # We take just one row so that fastai can properly process it
cat, cont, _ = dl.one_batch()

And just let fastai handle the preprocessing (especially since you're calling learn.model here). But the first way shows how to preprocess and use that data yourself.

So basically minimally in production you will need:

  • Your list of post-processed cat_names to get their indices (or be smart, plan ahead, and just store them away)
  • The list of cont_names and their indices
  • Convert the data to the proper types
  • Unsqueeze the dimension to “batchify” them
  • Pass it to the model

This is absolutely perfect. So embeddings by default expect integer inputs. Thank you so much!

Yup, since Embeddings are basically lookup tables. We can't have .1 of a King card, can we? :wink:
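To make that concrete, indexing an nn.Embedding just pulls out rows of its weight matrix (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(13, 4)  # e.g. 13 card ranks, each mapped to a 4-dim vector

# Looking up index 2 returns row 2 of the weight matrix, nothing more
out = emb(torch.tensor([2]))
assert torch.equal(out[0], emb.weight[2])
```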
