Fastai v2 tabular

It varies from 25 to 400. I have used df[name] = np.log(df[name]+np.e)

I’d set a y_range then (similar to what was done for Rossmann) and see if this helps
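
For example, something like this (a rough sketch; the exact bounds, layer sizes, and dls are placeholders, and the bounds are in log space to match the log transform above):

import numpy as np
from fastai2.tabular.all import *

# the target was transformed with np.log(y + np.e), so bound predictions in that space
learn = tabular_learner(dls, layers=[200,100],
                        y_range=(np.log(25 + np.e), np.log(400 + np.e)))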

1 Like

While setting inplace=True for TabularPandas I get a pandas SettingWithCopy error

Yes, you should follow the warning message that pops up when generating the TabularPandas object :wink:

You mean I should use .copy() ? :thinking:

No. You need to set the pandas chained-assignment mode to None.

You should see this warning:

Using inplace with splits will trigger a pandas error. Set pd.options.mode.chained_assignment=None to avoid it
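
For example (a minimal sketch; df, the column lists, and splits are placeholders):

import pandas as pd
from fastai2.tabular.all import *

# silence the SettingWithCopy warning/error that inplace=True would otherwise trigger
pd.options.mode.chained_assignment = None

to = TabularPandas(df, procs=[Categorify, FillMissing],
                   cat_names=cat_names, cont_names=cont_names,
                   y_names='target', splits=splits, inplace=True)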

1 Like

Hi, what is the best practice for saving a model for production inference? When using fastai v1 I used to save the model data with the following snippet and run inference on a test set in production.

import pickle

with open('pickle_file.pickle', 'wb') as f:
    pickle.dump((data, cats, fill_miss, norm, cont_names, cat_names), f, protocol=4)

Is there a lighter way introduced in fastai v2?

Thank you!

Everything pickles, so we use pickle. If there are tensors involved, we use the PyTorch wrapper around it (torch.save/torch.load).
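
For instance (illustrative only; obj stands for whatever object holds the tensors):

import torch

torch.save(obj, 'artifacts.pth')   # serialize, handling any tensors inside
obj = torch.load('artifacts.pth')  # restore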

So I have to unpickle all of those pieces (model, data, cats, fill_miss, norm, cont_names, cat_names) to run inference in production, right? No other options?

You can pickle your whole Learner (or use Learner.export if you want to get rid of the data).
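
For example (a minimal sketch; file names and test_df are placeholders):

from fastai2.tabular.all import *

# on the training machine: strip the data and serialize the Learner
learn.export('export.pkl')

# in production: reload and predict on new rows
learn = load_learner('export.pkl')
dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=dl)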

1 Like

While dealing with big datasets that do not fit in memory, if one wants to use .fit_one_cycle, all the data must be fed in a single epoch, right? How to actually do it? Via a callback? If so, do you have one that feeds data while learning, or should I try to write my own?

There is nothing out of the box for dataframes that don’t fit into memory. You would need to write your own Transform to load it lazily.
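
Something along these lines could be a starting point (an untested sketch, not the solution from the thread linked below; csv_path and the single header row are assumptions about your file):

import pandas as pd
from fastcore.transform import Transform

class LazyRowReader(Transform):
    "Illustrative only: fetch one row of a large on-disk CSV instead of holding it all in RAM"
    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.columns = pd.read_csv(csv_path, nrows=0).columns  # header only
    def encodes(self, idx: int):
        # skip the header plus `idx` data rows, then read a single row
        row = pd.read_csv(self.csv_path, skiprows=idx + 1, nrows=1,
                          header=None, names=self.columns)
        return row.iloc[0]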

so, it is not a callback then?

While exporting a model with learn.export() I get the error OverflowError: cannot serialize a string larger than 4GiB.

The usual fix for this issue is protocol=4, like here: pickle.dump(object, file, protocol=4).
Please consider updating the .export method. Thanks.
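
Until then, a possible workaround (a sketch; unlike export(), this keeps the data attached to the Learner unless you strip it first) is to serialize the Learner yourself with a newer pickle protocol:

import pickle
import torch

# torch.save accepts the pickle protocol explicitly
torch.save(learn, 'export_big.pkl', pickle_protocol=4)

# or with plain pickle
with open('export_big.pkl', 'wb') as f:
    pickle.dump(learn, f, protocol=4)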

Assuming self.dl.items is a DataFrame, are you sure orig isn’t changed (and hence can be used to restore the values at the end)? I thought orig is first created as a view, not a copy (unless values.copy() or df = df.assign(...) is used):

df = pd.DataFrame({"X": [0,11,22,33,44]})
orig = df['X'].values
print(f"Original -- orig: {orig}, df['X'].values: {df['X'].values}")

df['X'] = df['X'].values[[4,0,2,3,1]]
print(f"Shuffled -- orig: {orig}, df['X'].values: {df['X'].values}")

gives:

Original -- orig: [ 0 11 22 33 44], df['X'].values: [ 0 11 22 33 44]
Shuffled -- orig: [44  0 22 33 11], df['X'].values: [44  0 22 33 11]
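
For comparison, taking an explicit copy keeps orig unaffected by the later column assignment:

import pandas as pd

df = pd.DataFrame({"X": [0,11,22,33,44]})
orig = df['X'].values.copy()  # copy, not a view
df['X'] = df['X'].values[[4,0,2,3,1]]
print(f"orig: {orig}, df['X'].values: {df['X'].values}")
# orig: [ 0 11 22 33 44], df['X'].values: [44  0 22 33 11]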

Hi, I have found a way to do it, see this thread: Fastai2 tabular for out of memory datasets

1 Like

Where in the order does FillMissing occur? I see that Categorify is 1 and NormalizeTab is 2, but FillMissing doesn’t have one. Should we presume 3 (or last)?
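
One way to check is to print the order attribute of each proc (a quick check, assuming the procs are importable from fastai2.tabular.core; Transform subclasses that don't set order explicitly inherit the default of 0):

from fastai2.tabular.core import Categorify, FillMissing

print(Categorify.order, FillMissing.order)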

It seems TabularDataLoaders.from_df() fails if one of the categorical variables has actual None values (my data had this for some strange reason).

To recreate:

import pandas as pd
from fastai2.data.transforms import CategoryMap

df = pd.DataFrame({'a':[1,2,None], 'b':[3,4,'tmp']})
df.iloc[2,1] = None  # set afterwards, since pandas casts None to NaN in the constructor
CategoryMap(df['a'], add_na=True)  # works fine
CategoryMap(df['b'], add_na=True)  # gives an error

The last line gives:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-d98dc3407d7a> in <module>
----> 1 CategoryMap(df['b'], add_na=True)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/transforms.py in __init__(self, col, sort, add_na)
    209             # `o==o` is the generalized definition of non-NaN used by Pandas
    210             items = L(o for o in col.unique() if o==o)
--> 211             if sort: items = items.sorted()
    212         self.items = '#na#' + items if add_na else items
    213         self.o2i = defaultdict(int, self.items.val2idx()) if add_na else dict(self.items.val2idx())

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/foundation.py in sorted(self, key, reverse)
    360         elif isinstance(key,int): k=itemgetter(key)
    361         else: k=key
--> 362         return self._new(sorted(self.items, key=k, reverse=reverse))
    363 
    364     @classmethod

TypeError: '<' not supported between instances of 'NoneType' and 'int'

This is perhaps best fixed in preprocessing with df.fillna(value=np.nan) which gets rid of the None values?
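
For instance (a quick check, assuming fillna replaces the remaining None with NaN so that CategoryMap's o==o filter can drop it):

import numpy as np

df = df.fillna(value=np.nan)       # None -> NaN in the object column
CategoryMap(df['b'], add_na=True)  # should no longer raise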

1 Like

I am trying to follow the MultiCategory examples in https://github.com/fastai/fastai2/blob/master/nbs/40_tabular.core.ipynb. My particular dataset is formatted similarly to the “not one hot encoded” section that contains _mock_multi_label (it’s formatted like its output). I managed to get it working by following the one-hot-encoded labels section and doing something like this:

vals = merged_df[y_names].unique()
c = []
for val in vals:
    c += val.split(' ')
c = list(set(c))

def _mock_multi_label(df, classes=c):
    targ_dict = {}
    for c in classes:
        targ_dict[c] = []
    for row in df.itertuples():
        labels = row.action.split(' ')
        for c in classes:
            targ_dict[c].append(c in labels)  # one boolean per row, not a single overwritten value
    for c in classes:
        df[c] = np.array(targ_dict[c])
    return df

df_main = _mock_multi_label(merged_df, c)

@EncodedMultiCategorize
def encodes(self, to:Tabular): return to

@EncodedMultiCategorize
def decodes(self, to:Tabular):
    to.transform(to.y_names, lambda c: c==1)
    return to

to = TabularPandas(merged_df, procs=[], cat_names=[], cont_names=cont_names,
                   y_names=c, y_block=MultiCategoryBlock(encoded=True, vocab=c), splits=splits)

This builds my DataLoaders just fine. From there, to avoid an error, we need to set dls.c to the length of c:

dls.c = len(c) :slight_smile:

I feel this process should not be this tedious; let me know if you have any ideas. (Should the encodes/decodes be in the actual library too?)
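
One possibly less tedious route for building the one-hot label columns (an untested sketch; it assumes the space-separated labels live in a column named action) is to let pandas do it directly:

import pandas as pd

one_hot = merged_df['action'].str.get_dummies(sep=' ').astype(bool)
merged_df = pd.concat([merged_df, one_hot], axis=1)
c = list(one_hot.columns)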