Fastai v2 tabular

(Zachary Mueller) #43

The initial issue, yes. But then it gets stuck doing transforms. (The stack trace I posted where I quit after 10 minutes)



Except we can’t see anything in a stack trace interrupted like this, so I don’t know what caused the issue.


(Zachary Mueller) #45

Ah I understand now. Would it be better to do a %debug before running the cell? Or what would you recommend?

I’ll try that tonight and update you if I notice an origin


(Zachary Mueller) #46

@sgugger odd, after a reboot it started working again. I’ll let you know if I run into this bug again :frowning_face:
The issue seems to be with Categorify, as it takes quite a long time (if that’s known, it’s okay. Not trying to complain! Just trying to work out how to go about this :slight_smile: )



Normally it’s supposed to be fast, using pd.Series.unique to determine the unique categories. Not sure what’s holding you up…
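
The approach described can be sketched roughly like this (a simplification for illustration, not the actual fastai CategoryMap implementation):

```python
import pandas as pd

# Sketch: collect unique categories with pd.Series.unique, then map
# each value to its integer code — this should be fast on most columns
s = pd.Series(['b', 'a', 'b', 'c'])
classes = sorted(s.unique())                    # ['a', 'b', 'c']
o2i = {c: i for i, c in enumerate(classes)}     # category -> code
codes = s.map(o2i)                              # 1, 0, 1, 2
```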


(Zachary Mueller) #48

That was my thought too. I will most likely move on to other notebooks for now and come back to it later (if you haven’t fixed it by the time you redo course-v3). But let me see what I can do tonight.


(Zachary Mueller) #49

Doesn’t seem to bottleneck in CategoryMap, investigating further.

The problem is with the Events column and the PromoInterval column. The rest of the columns take &lt;2 seconds, but these two do not for some reason.
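
A quick way to confirm which columns are slow is to time the unique-category scan per column (a hypothetical timing snippet; `df` stands in for the training DataFrame):

```python
import time
import pandas as pd

# Hypothetical sketch: time pd.Series.unique per column to spot the
# slow ones; `df` is a tiny stand-in for the real training DataFrame
df = pd.DataFrame({'Store': [1, 2, 1],
                   'PromoInterval': ['Jan,Apr', None, 'Feb,May']})
timings = {}
for n in df.columns:
    t0 = time.perf_counter()
    df[n].unique()
    timings[n] = time.perf_counter() - t0
```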


(Zachary Mueller) #50

Aha! Found the issue: if a non-category dtype column is processed as a category (e.g. PromoInterval was not a category dtype), it does not perform well. A potential solution: if a column is categorical (in cat_names), convert it to a category dtype before calling encodes, e.g. in setup.

class Categorify(TabularProc):
    "Transform the categorical variables to that type."
    order = 1
    def setups(self, to):
        self.classes = {n:CategoryMap(to.iloc[:,n].items, add_na=(n in to.cat_names)) for n in to.all_cat_names}
    def _apply_cats(self, add, c):
        # Convert non-category columns to category dtype first, then use the codes
        if not is_categorical_dtype(c): c = self._to_cat(c)
        return c.cat.codes + add
    def _to_cat(self, c): return c.astype('category')
    def _decode_cats(self, c): return c.map(dict(enumerate(self[c.name].items)))
    def encodes(self, to):
        to.transform(to.cat_names, partial(self._apply_cats,1))
        to.transform(L(to.cat_y),  partial(self._apply_cats,0))
    def decodes(self, to): to.transform(to.all_cat_names, self._decode_cats)
    def __getitem__(self,k): return self.classes[k]

Let me know your thoughts.

Time Comparison (without the two problem columns):

Original: 35.2s
Mine: 17.8s

Original: 357ms
Mine: 244ms

Edit: @sgugger (sorry to @ you) should we just pre-process our categorical columns to a category type and call it a day? Or what are your thoughts :slight_smile:



I would also like an option to do an external cat-to-integer mapping and use that value rather than the cat.codes. In the case where you will be training on the same data repeatedly, it would be much faster to convert once and be done with it. This also has the benefit of much smaller dataframes: my dailies are 17 GB in memory as object dtype and 3.5 GB after conversion.

Not to mention cases where account numbers or whatever are already integers.


(Jeremy Howard (Admin)) #52

Pandas can do that for you.
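
For example (a sketch of the pandas side, not fastai code): converting once to category dtype both shrinks memory and gives you a stable integer mapping to reuse:

```python
import pandas as pd

# Sketch: convert an object column to category dtype once, then reuse
# the integer codes instead of the raw strings
s = pd.Series(['low', 'high', 'low', 'mid'])
cat = s.astype('category')
codes = cat.cat.codes                           # small int dtype, not objects
mapping = dict(enumerate(cat.cat.categories))   # code -> original value
```

On large frames this conversion is where the memory savings come from, as the 17 GB → 3.5 GB figure above suggests.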



I am mentally mixing my v1 and v2 - gotta remember where I am posting…

v1 gives a “Can only use .cat accessor with a 'category' dtype” error if you don’t categorify.

I did a version of v2 with an ‘if is_numerical_dtype’ line added to CategoryMap in 06_data_transforms, which I believe sorted things out, but that was several weeks ago.
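
That v1 error is just pandas refusing the .cat accessor on a non-category column; a minimal reproduction:

```python
import pandas as pd

# The .cat accessor only works on category dtype; on object dtype it
# raises the "Can only use .cat accessor with a 'category' dtype" error
s = pd.Series(['a', 'b', 'a'])      # object dtype
try:
    s.cat.codes
    raised = False
except AttributeError:
    raised = True
codes = s.astype('category').cat.codes  # works after conversion
```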



I am trying to use the TabularModel and started from a fresh checkout of fastai_dev. Currently TabularModel tries to use BnDropLin, but it seems that this was renamed to LinBnDrop. After adapting that piece I can run 41_tabular_model.

Why is the 41_tabular_model not exported yet? The cells are not marked for export. I guess this is intentional?

When I run the last cell with “notebook2script(all_fs=True)”, somehow I end up with all files under dev/local having changed their relative imports, e.g. a line
from .basics import *
is changed to
from ......basics import *

I can fix that manually, but I guess I did something wrong with the setup? I installed the package after cloning via pip install -e . Has anybody else run into a similar problem before?
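
Until the export bug is tracked down, a throwaway workaround (assuming the damage is only extra dots on relative-import lines) could collapse the over-dotted imports back to one level:

```python
import re

# Hypothetical cleanup sketch: collapse over-dotted relative imports
# like "from ......basics import *" back to "from .basics import *"
def fix_relative_imports(src: str) -> str:
    return re.sub(r'^(from\s+)\.{2,}(\w)', r'\1.\2', src, flags=re.M)
```

Note this assumes no legitimate multi-level relative imports exist in the affected files; inspect the diff before committing anything.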



I found one more issue with ReadTabBatch. It already says: TODO: use float for cont targ.

I changed the current line:

def encodes(self, to): return tensor(to.cats).long(),tensor(to.conts).float(), tensor(to.targ).long()

to:

def encodes(self, to): return tensor(to.cats).long(),tensor(to.conts).float(), tensor(to.targ).long() if to.cat_y else tensor(to.targ).float()

I am too new to this codebase to tell whether this is the correct way to fix the issue or not.
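
The intent of the change can be checked in isolation (a toy sketch with a hypothetical helper, not the real ReadTabBatch):

```python
import torch

# Toy sketch of the proposed fix: cast the target to long for a
# categorical target, and to float for a continuous one
def encode_targ(targ, cat_y):
    t = torch.tensor(targ)
    return t.long() if cat_y else t.float()

assert encode_targ([0, 1, 0], cat_y='Survived').dtype == torch.long
assert encode_targ([3.2, 4.1], cat_y=None).dtype == torch.float32
```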


(Jeremy Howard (Admin)) #56

Not only was it renamed, but also the layers were reordered (as the name suggests). It’s possible that the tabular model needs some tweaking as a result.


(Zachary Mueller) #57

I’m looking at the new tabular API, are you looking to do something like a FloatBlock for regression tasks? And if so, any pointers for trying to implement such a task? :slight_smile:



The regular TransformBlock should be fine for that.


(Zachary Mueller) #59

Thanks for the answer :slight_smile: Do you mean:

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=dep_var,
                   splits=splits, block_y=TransformBlock)

Doing so does:
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target' in call to _thnn_mse_loss_forward

I also tried setting type_tfms to Float, i.e. TransformBlock(type_tfms=Float)



It looks like you need to convert your targets to floats? Float is a type, not a transform, so replace that by a function lambda x: float(x).

Edit: Even better, use MSELossFlat() which should convert your target to float automatically.
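
A minimal sketch of what a “flat” MSE loss does (flatten the predictions and cast the target to float before calling MSE; assumed behavior for illustration, not the fastai source):

```python
import torch
import torch.nn.functional as F

# Sketch: flatten both tensors and cast the target to float so an
# integer-dtype target no longer raises a dtype error
def mse_loss_flat(inp, targ):
    return F.mse_loss(inp.view(-1), targ.view(-1).float())

# Long-dtype targets now work without manual conversion
loss = mse_loss_flat(torch.tensor([[1.0], [2.0]]), torch.tensor([1, 2]))
```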


(Zachary Mueller) #61

@sgugger finally getting back to look at this. I tried MSELossFlat() for my loss function, but oddly enough it did not convert the target. MSELossFlat did wind up working.

Here is what I am currently trying:

tab = TabularPandas(train_df, procs=procs, cat_names=cat_vars, cont_names=cont_vars, y_names=dep_var, splits=splits, block_y=TransformBlock(type_tfms=lambda x: float(x)))

model = TabularModel(get_emb_sz(tab), len(tab.cont_names), 1, [1000,500], y_range=y_range)
opt_func = partial(Adam, wd=0.01, eps=1e-5)
learn = Learner(tab.databunch(), model, MSELossFlat(), opt_func=opt_func, metrics=rmse)

It won’t really “train”, and epoch time is 37 minutes on a GPU! By “won’t really train” I mean the initial loss is 58017752.000000. I may wait a bit and move to NLP for my guides if you’re planning on getting to Rossmann (eventually), as it’s causing quite the headache for me :sweat_smile:


(Clive Pinfold) #62

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)