Fastai v2 tabular

Thanks @MicPie for making me realize I was actually doing something wrong (even if you didn't point it out explicitly)!

I essentially forgot to save the large dataframe away into a new variable and instead did df = df.append(df) (fun fact: your memory usage goes up astronomically by doing so)
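Here's roughly what that mistake looks like (a toy sketch with made-up sizes; note that DataFrame.append is deprecated in newer pandas in favour of pd.concat):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10))
print(df.memory_usage(deep=True).sum() / 1e9, "GB")  # baseline size

# Appending the frame to itself builds a whole new, doubled frame while the
# old one is still alive, so peak memory is much higher than the final size
df = df.append(df)
print(df.memory_usage(deep=True).sum() / 1e9, "GB")  # roughly double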

Here are my new numbers (still a good reduction):
3 million rows: 6.5 GB in memory, going up to 8.1 GB vs 10.2 GB before, so 2 GB saved
2 million rows: 4.5 GB going up to 5.7 GB, saving 1 GB

Much more rational numbers :slight_smile:


Question: is it possible to adjust the values inside of a TabularPandas object? I’ve tried doing something like so:

dl.items[col] = x
dl.process
but when I do dl.items.head(), more than just that one column gets adjusted
(I included the process call because otherwise dl.items.head() never changed)

Edit: when I do dl.items.iloc[0] it does show the change, so I'm unsure why head() didn't catch it

For those wondering why the heck that matters, here's how simple permutation importance becomes (without copying any DataFrames :slight_smile: ):

def measure_col(self, name:str):
    "Measures the change in the metric after shuffling a column"
    col = [name]
    # If FillMissing created a missing-value indicator, shuffle it too
    if f'{name}_na' in self.na: col.append(f'{name}_na')
    orig = self.dl.items[col].values
    perm = np.random.permutation(len(orig))
    # Shuffle the column(s) in place, score, then restore the original values
    self.dl.items[col] = self.dl.items[col].values[perm]
    metric = learn.validate(dl=self.dl)[1]
    self.dl.items[col] = orig
    return metric
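For context, here is a rough usage sketch; the class name and how na gets built are my assumptions, only dl, na, and the global learn come from the snippet above:

class PermutationImportance:
    "Hypothetical wrapper holding what measure_col needs"
    def __init__(self, dl):
        self.dl = dl
        # Assumption: the `_na` indicator columns that FillMissing added
        self.na = [c for c in dl.items.columns if c.endswith('_na')]

    measure_col = measure_col  # reuse the function defined above as a method

imp = PermutationImportance(learn.dls.valid)
baseline = learn.validate(dl=learn.dls.valid)[1]
scores = {c: imp.measure_col(c) - baseline for c in learn.dls.cont_names}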

From a quick glance at the pandas source, it's odd that head() would return something different from iloc, since head seems to use iloc: https://github.com/pandas-dev/pandas/blob/fd2e002e87eaabff3bd8d05bfaa037df468cd752/pandas/core/generic.py#L4685

It makes sense, but I can't quite explain its behavior, except that perhaps two different memory locations are being changed.

(For the record, the above code does work; it's not an issue so much as a strange set of interactions :slight_smile: )
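In case it helps, the classic pandas copy-vs-view situation produces exactly this kind of "one accessor sees it, the other doesn't" behaviour. A tiny illustration (I haven't verified this is what's happening inside TabularPandas):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
sub = df[df.a > 1]     # boolean indexing returns a copy, not a view
sub['b'] = 0           # triggers SettingWithCopyWarning: the write lands on the copy
print(df.head())       # the original frame still shows b = 4, 5, 6
print(sub.iloc[0])     # reading back through the copy shows b = 0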

You are probably right. The only other thing I could think of is that maybe it's a threading thing? Maybe you are in a different thread when you call

dl.items[col] = x

as opposed to

dl.items.head() 

Hi,

  1. While trying to do regression on my own data with tabular_learner, I am getting AssertionError: Could not infer loss function from the data, please pass a loss function (see the detailed error output below).

  2. Following on from the previous problem, when I specify loss_func=mse I get extreme train_loss and valid_loss values, while fastai v1 works fine with the same data.

  3. The TabularPandas object takes ages to create, while fastai v1's TabularList is quick.

I would appreciate your help. Thank you!

from fastai2.tabular.all import *

path = Path('tutorial_learn_path')

name = 'price'  # dependent variable
cont,cat = cont_cat_split(df, max_card=700, dep_var='price')
valid_inxs = df.sample(int(len(df)/5)).index  # hold out 20% of the rows
splits = IndexSplitter(list(valid_inxs))(range_of(df))
procs = [Categorify, FillMissing, Normalize]

to = TabularPandas(df, procs, cat, cont, y_names=name, splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=rmse)
learn.fit_one_cycle(10, 1e-3)

error message:

AssertionError                            Traceback (most recent call last)
in
----> 1 learn = tabular_learner(dls, metrics=rmse)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/tabular/learner.py in tabular_learner(dls, layers, emb_szs, config, n_out, y_range, ps, embed_p, use_bn, bn_final, bn_cont, **kwargs)
     35     model = TabularModel(emb_szs, len(dls.cont_names), n_out, layers, ps=ps, embed_p=embed_p,
     36                          y_range=y_range, use_bn=use_bn, bn_final=bn_final, bn_cont=bn_cont, **config)
---> 37     return TabularLearner(dls, model, **kwargs)
     38
     39 # Cell

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/learner.py in __init__(self, dls, model, loss_func, opt_func, lr, splitter, cbs, metrics, path, model_dir, wd, wd_bn_bias, train_bn, moms)
     78         if loss_func is None:
     79             loss_func = getattr(dls.train_ds, 'loss_func', None)
---> 80         assert loss_func is not None, "Could not infer loss function from the data, please pass a loss function."
     81         self.loss_func = loss_func
     82         self.path = path if path is not None else getattr(dls, 'path', Path('.'))

AssertionError: Could not infer loss function from the data, please pass a loss function.


As you're doing regression, you should use a RegressionBlock, like so:

block_y=RegressionBlock() (in your TabularPandas call)

I have tried the following, but got the same ‘Could not infer loss function’ error statement:

to = TabularPandas(df, procs, cat, cont, y_names=name, splits=splits, block_y=RegressionBlock())

TabularPandas is different from TabularList as a whole. We’ve done what we can to reduce the memory overhead and time, but this is not going to change much. You can set reduce_memory to False and inplace to True and it should speed things up a bit.
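Concretely, something like this (a sketch reusing the variable names from your code above):

to = TabularPandas(df, procs, cat, cont, y_names=name, splits=splits,
                   block_y=RegressionBlock(),
                   inplace=True,         # work on the dataframe directly, no copy
                   reduce_memory=False)  # skip the dtype-downcasting pass
dls = to.dataloaders(bs=64)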

Also, what does dls.show_batch() show?

dls.show_batch() shows 10 rows of data, as expected

For the time being, here is what I would recommend:

Follow the Rossmann example and pass a y_range to help your model narrow down its outputs; this should help with the losses. Also declare your loss function explicitly (not just the metric); see the sketch below. I'll look more into this issue on my side and see if I can recreate your bug :slight_smile:
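Something along these lines, loosely following the Rossmann notebook (df, name, and dls reused from your earlier code; the 1.2 headroom factor is just what Rossmann uses):

# Before building TabularPandas / dls: log the target and derive a y_range
df[name] = np.log(df[name])
max_log_y = df[name].max() * 1.2
y_range = (0, max_log_y)

learn = tabular_learner(dls, y_range=y_range,
                        loss_func=MSELossFlat(),  # explicit loss, not just the metric
                        metrics=rmse)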

Oh, I forgot to normalize my dependent variable, hence the inadequate losses. Now it is reasonable. Thank you!

Actually, we shouldn't do this, BTW :slight_smile: If your y's are very large, you can take the log of them though!

(Or at the very least it’s not normally done)

It varies from 25 to 400. I have used df[name] = np.log(df[name]+np.e)

I’d set a y_range then (similar to what was done for Rossmann) and see if this helps


While setting inplace=True for TabularPandas I get a pandas SettingWithCopy error

Yes, you should follow the warning message that pops up when generating the TabularPandas object :wink:

You mean I should use .copy() ? :thinking:

No. You need to set pandas' chained_assignment mode to None.

You should see this warning:

Using inplace with splits will trigger a pandas error. Set pd.options.mode.chained_assignment=None to avoid it
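That is, run this once before building the TabularPandas object:

import pandas as pd
pd.options.mode.chained_assignment = None  # disable pandas' SettingWithCopy check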
