How to save TabularPandas data to a regular Pandas Dataframe

I did a bunch of preproc work on a raw csv file using TabularPandas, and I’d like to save all that data into a csv file. What method do I invoke to save the processed data into a csv?

Is your data in a pandas.DataFrame? If so, it's as simple as df.to_csv(). You may also want to pass index=False to to_csv() so the row index isn't written out as an extra column.
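For instance, a minimal round trip (the frame here is just a stand-in for your processed data):

```python
import pandas as pd

# Stand-in for your processed frame
df = pd.DataFrame({"Store": [1, 2], "Sales": [5263.0, 6064.0]})

# index=False stops pandas writing the row index as an extra column
df.to_csv("proc_data.csv", index=False)

# Reading it back gives the same frame
assert pd.read_csv("proc_data.csv").equals(df)
```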


IIRC you can access the internal dataframe via to.items. You can then follow @BresNet's suggestion on how to save it :slight_smile:

(If you’re interested in just saving the tabular pandas, I can also link to this)


It’s a TabularPandas object, something from fastai

Well, there was an attribute called xs that had it. I saved it with to_nn.xs.to_csv('proc_data.csv')

I have another problem that popped up, if you wouldn't mind. After training the model, I took an entry from proc_data, added a batch dimension, converted it to a torch tensor, and passed it to the model. I was trying to mimic an inference situation.


But I got an error,

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.DoubleTensor instead (while checking arguments for embedding)

Isn’t Long an integer datatype? Why is it popping up here?

for info, I’m working on the Rossman Dataset, I merged the train and store data on the store column, and then just followed the book.

The issue here is that fastai's model has two inputs: the categorical and the continuous variables. Let's rewrite that so it's a bit more like what the model is expecting (as it currently thinks everything is a categorical variable):

(note there is no fastai magic here except for preprocessing):

row = prod.iloc[0]
cat_idxs = [prod.columns.get_loc(nm) for nm in cat_names] # SUPER IMPORTANT
cont_idxs = [prod.columns.get_loc(nm) for nm in cont_names]
cat = row.iloc[cat_idxs]   # .iloc, since these are integer positions, not labels
cont = row.iloc[cont_idxs]


cat = tensor(cat.values, dtype=torch.long).unsqueeze(0)
cont = tensor(cont.values, dtype=torch.float).unsqueeze(0)

learn.model(cat, cont)

Now let’s walk through the why:

We need to get cat_idxs and cont_idxs because fastai's tabular model has two inputs, a cat tensor and a cont tensor. On top of this, because of how the preprocessing works, fastai adds a _na column during FillMissing, so if you check your cat_names you'll notice _na columns have been added (potentially many, since you're doing Rossmann).
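To illustrate in plain pandas, this roughly mimics what FillMissing does (the column name is made up):

```python
import pandas as pd

# Hypothetical continuous column with a missing value
df = pd.DataFrame({"CompetitionDistance": [100.0, None, 300.0]})

# Roughly what FillMissing does: record where the NaNs were in a _na flag
# column (which then gets treated as categorical), then fill with the median
df["CompetitionDistance_na"] = df["CompetitionDistance"].isna()
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(df["CompetitionDistance"].median())

assert df["CompetitionDistance_na"].tolist() == [False, True, False]
assert df["CompetitionDistance"].tolist() == [100.0, 200.0, 300.0]
```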

Next, after we extract them from a row, we need to convert them to the datatypes the fastai model expects: Long for the categoricals and float for the continuous variables. On top of this, since the model works on batches, we need to unsqueeze our tensors by one dimension (i.e. turn something like [0,1,2] into [[0,1,2]]).
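A plain PyTorch sketch of both points, the Long requirement and the batch dimension (the values and sizes here are made up):

```python
import torch
import torch.nn as nn

cat = torch.tensor([3, 1, 0], dtype=torch.long)      # categorical codes must be Long (int64)
cont = torch.tensor([0.5, -1.2], dtype=torch.float)  # continuous values as float32

# unsqueeze(0) adds the batch dimension: shape (3,) -> (1, 3)
assert cat.unsqueeze(0).shape == (1, 3)
assert cat.unsqueeze(0).tolist() == [[3, 1, 0]]

# Passing Double indices to an embedding reproduces the error above
emb = nn.Embedding(10, 4)
assert emb(cat).shape == (3, 4)  # Long indices: fine
try:
    emb(cat.double())
    raised = False
except RuntimeError:
    raised = True
assert raised
```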

After we do this we can pass it to learn.model.

Now personally I’d take more of this approach (which is how we mimic an inference situation):

dl = learn.dls.test_dl(prod.iloc[:1]) # We take just one row so that fastai can properly process it
cat, cont, _ = dl.one_batch()

And just let fastai handle the preprocessing (especially since you're calling learn.model here). But the first way shows how to preprocess and use that data yourself.

So basically minimally in production you will need:

  • Your list of post-processed cat_names to get their indices (or be smart, plan ahead, and just store them away)
  • The list of cont_names and their indices
  • Convert the data to the proper types
  • Unsqueeze the dimension to “batchify” them
  • Pass it to the model

This is absolutely perfect. So embeddings by default expect integer inputs. Thank you so much!

Yup, since Embeddings are basically lookup tables. We can't have .1 of a King card, can we? :wink:
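To make that concrete, indexing an nn.Embedding just pulls out rows of its weight matrix (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(13, 4)  # e.g. 13 card ranks, each mapped to a 4-dim vector

# Looking up index 2 returns row 2 of the weight matrix, nothing more
out = emb(torch.tensor([2]))
assert torch.equal(out[0], emb.weight[2])
```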
