Fastai v2 tabular

It varies from 25 to 400. I have used df[name] = np.log(df[name]+np.e)

I’d set a y_range then (similar to what was done for Rossmann) and see if this helps
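
For example, something like this (a rough sketch; the exact bounds, layer sizes, and dls are placeholders, and the bounds are in log space to match the log transform above):

import numpy as np
from fastai2.tabular.all import *

# the target was transformed with np.log(y + np.e), so bound predictions in that space
learn = tabular_learner(dls, layers=[200,100],
                        y_range=(np.log(25 + np.e), np.log(400 + np.e)))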

1 Like

While setting inplace=True for TabularPandas I get a pandas SettingWithCopy error

Yes, you should follow the warning message that pops up when generating the TabularPandas object :wink:

You mean I should use .copy() ? :thinking:

No. You need to set the pandas chained-assignment mode to None.

You should see this warning:

Using inplace with splits will trigger a pandas error. Set pd.options.mode.chained_assignment=None to avoid it
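
For example (a minimal sketch; df, the column lists, and splits are placeholders):

import pandas as pd
from fastai2.tabular.all import *

# silence the SettingWithCopy warning/error that inplace=True would otherwise trigger
pd.options.mode.chained_assignment = None

to = TabularPandas(df, procs=[Categorify, FillMissing],
                   cat_names=cat_names, cont_names=cont_names,
                   y_names='target', splits=splits, inplace=True)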

1 Like

Hi, what is the best practice for saving a model for production inference? When using fastai v1 I used to save the model data with the following snippet and run inference on a test set in production.

import pickle

with open('pickle_file.pickle', 'wb') as f:
    pickle.dump((data, cats, fill_miss, norm, cont_names, cat_names), f, protocol=4)

Is there a lighter way introduced in fastai v2?

Thank you!

Everything pickles, so we use pickle. If there are tensors involved, we use the PyTorch wrapper around it (torch.save/torch.load).
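
For instance (illustrative only; obj stands for whatever object holds the tensors):

import torch

torch.save(obj, 'artifacts.pth')   # serialize, handling any tensors inside
obj = torch.load('artifacts.pth')  # restore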

So I have to unpickle all of those pieces (model, data, cats, fill_miss, norm, cont_names, cat_names) to run inference in production, right? No other options?

You can pickle your whole Learner (or use Learner.export if you want to get rid of the data).
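
For example (a minimal sketch; file names and test_df are placeholders):

from fastai2.tabular.all import *

# on the training machine: strip the data and serialize the Learner
learn.export('export.pkl')

# in production: reload and predict on new rows
learn = load_learner('export.pkl')
dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=dl)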

1 Like

While dealing with big datasets that do not fit in memory, if one wants to use .fit_one_cycle, all the data must be fed in a single epoch, right? How to actually do it? Via a callback? If so, do you have one that feeds data while learning, or should I try to write my own?

There is nothing out of the box for dataframes that don’t fit into memory. You would need to write your own Transform to load it lazily.
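
Something along these lines could be a starting point (an untested sketch, not the solution from the thread linked below; csv_path and the single header row are assumptions about your file):

import pandas as pd
from fastcore.transform import Transform

class LazyRowReader(Transform):
    "Illustrative only: fetch one row of a large on-disk CSV instead of holding it all in RAM"
    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.columns = pd.read_csv(csv_path, nrows=0).columns  # header only
    def encodes(self, idx: int):
        # skip the header plus `idx` data rows, then read a single row
        row = pd.read_csv(self.csv_path, skiprows=idx + 1, nrows=1,
                          header=None, names=self.columns)
        return row.iloc[0]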

so, it is not a callback then?

While exporting a model with learn.export() I get the error OverflowError: cannot serialize a string larger than 4GiB.

The usual fix for this issue is protocol=4, like here: pickle.dump(object, file, protocol=4).
Please consider updating the .export method. Thanks.
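
Until then, a possible workaround (a sketch; unlike export(), this keeps the data attached to the Learner unless you strip it first) is to serialize the Learner yourself with a newer pickle protocol:

import pickle
import torch

# torch.save accepts the pickle protocol explicitly
torch.save(learn, 'export_big.pkl', pickle_protocol=4)

# or with plain pickle
with open('export_big.pkl', 'wb') as f:
    pickle.dump(learn, f, protocol=4)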

Assuming self.dl.items is a DataFrame, are you sure orig isn’t changed (and hence can be used to restore the values at the end)? I thought orig is first created as a view, not a copy (unless values.copy() or df = df.assign(...) is used):

df = pd.DataFrame({"X": [0,11,22,33,44]})
orig = df['X'].values
print(f"Original -- orig: {orig}, df['X'].values: {df['X'].values}")

df['X'] = df['X'].values[[4,0,2,3,1]]
print(f"Shuffled -- orig: {orig}, df['X'].values: {df['X'].values}")

gives:

Original -- orig: [ 0 11 22 33 44], df['X'].values: [ 0 11 22 33 44]
Shuffled -- orig: [44  0 22 33 11], df['X'].values: [44  0 22 33 11]
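
For comparison, taking an explicit copy keeps orig unaffected by the later column assignment:

import pandas as pd

df = pd.DataFrame({"X": [0,11,22,33,44]})
orig = df['X'].values.copy()  # copy, not a view
df['X'] = df['X'].values[[4,0,2,3,1]]
print(f"orig: {orig}, df['X'].values: {df['X'].values}")
# orig: [ 0 11 22 33 44], df['X'].values: [44  0 22 33 11]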

Hi, I have found a way to do it, see this thread: Fastai2 tabular for out of memory datasets

1 Like

Where in the order does FillMissing occur? I see that Categorify is 1 and NormalizeTab is 2, but FillMissing doesn’t have one. Should we presume 3 (or last)?
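
One way to check is to print the order attribute of each proc (a quick check, assuming the procs are importable from fastai2.tabular.core; Transform subclasses that don't set order explicitly inherit the default of 0):

from fastai2.tabular.core import Categorify, FillMissing

print(Categorify.order, FillMissing.order)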

It seems TabularDataLoaders.from_df() fails if one of the categorical variables has actual None values (my data had this for some strange reason).

To recreate:

import pandas as pd
from fastai2.data.transforms import CategoryMap

df = pd.DataFrame({'a':[1,2,None], 'b':[3,4,'tmp']})
df.iloc[2,1] = None  # set afterwards, since pandas casts None to NaN in the constructor
CategoryMap(df['a'], add_na=True)  # works fine
CategoryMap(df['b'], add_na=True)  # gives an error

The last line gives:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-d98dc3407d7a> in <module>
----> 1 CategoryMap(df['b'], add_na=True)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/transforms.py in __init__(self, col, sort, add_na)
    209             # `o==o` is the generalized definition of non-NaN used by Pandas
    210             items = L(o for o in col.unique() if o==o)
--> 211             if sort: items = items.sorted()
    212         self.items = '#na#' + items if add_na else items
    213         self.o2i = defaultdict(int, self.items.val2idx()) if add_na else dict(self.items.val2idx())

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/foundation.py in sorted(self, key, reverse)
    360         elif isinstance(key,int): k=itemgetter(key)
    361         else: k=key
--> 362         return self._new(sorted(self.items, key=k, reverse=reverse))
    363 
    364     @classmethod

TypeError: '<' not supported between instances of 'NoneType' and 'int'

This is perhaps best fixed in preprocessing with df.fillna(value=np.nan) which gets rid of the None values?
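
For instance (a quick check, assuming fillna replaces the remaining None with NaN so that CategoryMap's o==o filter can drop it):

import numpy as np

df = df.fillna(value=np.nan)       # None -> NaN in the object column
CategoryMap(df['b'], add_na=True)  # should no longer raise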

1 Like

I am trying to follow the MultiCategory examples in https://github.com/fastai/fastai2/blob/master/nbs/40_tabular.core.ipynb. My particular dataset is formatted similarly to the “not one hot encoded” section that contains _mock_multi_label (it’s formatted like its output). I managed to get it working by following the one-hot-encoded labels section and doing something like this:

vals = merged_df[y_names].unique()
c = []
for val in vals:
    c += val.split(' ')
c = list(set(c))

def _mock_multi_label(df, classes=c):
    targ_dict = {}
    for c in classes:
        targ_dict[c] = []
    for row in df.itertuples():
        labels = row.action.split(' ')
        for c in classes:
            targ_dict[c].append(c in labels)  # one boolean per row, not a single overwritten value
    for c in classes:
        df[c] = np.array(targ_dict[c])
    return df

df_main = _mock_multi_label(merged_df, c)

@EncodedMultiCategorize
def encodes(self, to:Tabular): return to

@EncodedMultiCategorize
def decodes(self, to:Tabular):
    to.transform(to.y_names, lambda c: c==1)
    return to

to = TabularPandas(merged_df, procs=[], cat_names=[], cont_names=cont_names,
                   y_names=c, y_block=MultiCategoryBlock(encoded=True, vocab=c), splits=splits)

This builds my DataLoaders just fine. From there, to avoid an error, we need to set dls.c to the length of c:

dls.c = len(c) :slight_smile:

I feel this process should not be this tedious; let me know if you have any ideas. (Should the encodes/decodes be in the actual library too?)
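
One possibly less tedious route for building the one-hot label columns (an untested sketch; it assumes the space-separated labels live in a column named action) is to let pandas do it directly:

import pandas as pd

one_hot = merged_df['action'].str.get_dummies(sep=' ').astype(bool)
merged_df = pd.concat([merged_df, one_hot], axis=1)
c = list(one_hot.columns)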