Fastai v2 tabular

No. You need to set pandas chained mode to None.

You should see this warning:

Using inplace with splits will trigger a pandas error. Set pd.options.mode.chained_assignment=None to avoid it

1 Like

Hi, what is the best practice for saving a model for production inference? When using fasai v1 I used to save model data with the following snippet to inference with a test set in production.

with open(pickle_file.pickle', 'wb') as f:
    pickle.dump((data, cats, fill_miss, norm, cont_names, cat_names), f, protocol=4)

Is there a lighter way introdused in fastai v2 ?

Thank you!

Everything pickles, so we use pickle. If there are tensors involved, we use the PyTorch wrapper around it (torch.save/torch.load).

so I have to unpickle all the stuff model, data, cats, fill_miss, norm, cont_names, cat_names to inference in production, right? No other options?

You can pickle your whole Learner (or use Learn.export if you want to get rid of the data).

1 Like

While dealing with big datasets, that do not fit in memory, if one wants to use .fit_one_cycle, all the data must be fed in a single epoch, right? How to actually do it? Via callback? If so, do you have the one, that feeds data while learning, or should I try to write my own?

There is nothing out of the box for dataframes that don’t fit into memory. You would need to write your own Transform to load it lazily.

There is nothing out of the box for dataframes that don’t fit into memory. You would need to write your own Transform to load it lazily.

so, it is not a callback then?

While exporting a model with learn.export() I the error OverflowError: cannot serialize a string larger than 4GiB.

The usual fix for this issue is protocol=4, like here: pickle.dump(object, file, protocol=4).
Please consider updating the .export method. Thanks.

Assuming self.dl.items is a DataFrame, are you sure orig isn’t changed (hence can be used to restore value at the end)? I thought orig is first created as view, not a copy (unless values.copy(), or df = df.assign(...) is used):

df = pd.DataFrame({"X": [0,11,22,33,44]})
orig = df['X'].values
print(f"Original -- orig: {orig}, df['X'].values: {df['X'].values}")

df['X'] = df['X'].values[[4,0,2,3,1]]
print(f"Shuffled -- orig: {orig}, df['X'].values: {df['X'].values}")

gives:

Original -- orig: [ 0 11 22 33 44], df['X'].values: [ 0 11 22 33 44]
Shuffled -- orig: [44  0 22 33 11], df['X'].values: [44  0 22 33 11]

Hi, I have fond the way to do it, see in this thread Fastai2 tabular for out of memory datasets

1 Like

When in the order does FillMissing occur? I see that Categorify is 1 and NormalizeTab has 2, but FillMissing doesn’t have one. Would we presume 3 (or last?)

It seems TabularDataLoaders.from_df() fails if one of the categorical variables has actual None values (my data had this for some strange reason).

To recreate:

df = pd.DataFrame({'a':[1,2,None], 'b':[3,4,'tmp']})
df.iloc[2,1] = None # Pandas seem to cast None to NaN in the constructor
CategoryMap(df['a'], add_na=True) # works fine
CategoryMap(df['b'], add_na=True) # gives an error

The last line gives:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-d98dc3407d7a> in <module>
----> 1 CategoryMap(df['b'], add_na=True)

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastai2/data/transforms.py in __init__(self, col, sort, add_na)
    209             # `o==o` is the generalized definition of non-NaN used by Pandas
    210             items = L(o for o in col.unique() if o==o)
--> 211             if sort: items = items.sorted()
    212         self.items = '#na#' + items if add_na else items
    213         self.o2i = defaultdict(int, self.items.val2idx()) if add_na else dict(self.items.val2idx())

~/anaconda3/envs/fastai2/lib/python3.7/site-packages/fastcore/foundation.py in sorted(self, key, reverse)
    360         elif isinstance(key,int): k=itemgetter(key)
    361         else: k=key
--> 362         return self._new(sorted(self.items, key=k, reverse=reverse))
    363 
    364     @classmethod

TypeError: '<' not supported between instances of 'NoneType' and 'int'

This is perhaps best fixed in preprocessing with df.fillna(value=np.nan) which gets rid of the None values?

I am trying to follow the MultiCategory examples in https://github.com/fastai/fastai2/blob/master/nbs/40_tabular.core.ipynb. My particular dataset is formated similar to the “not one hot encoded” section that contains the _mock_multi_label. (it’s formatted like it’s output). I managed to get it working by following the one-hot-encoded labels and performed something like so:

vals = merged_df[y_names].unique()
c = []
for val in vals:
    c += val.split(' ')
c = list(set(c))
def _mock_multi_label(df, classes=c):
    targ_dict = {}
    for c in classes:
        targ_dict[c] = []
    for row in df.itertuples():
        labels = row.action.split(' ')
        for c in classes:
            targ_dict[c] = c in labels
    for c in classes:
        df[c] = np.array(targ_dict[c])
    return df

df_main = _mock_multi_label(merged_df, c)
@EncodedMultiCategorize
def encodes(self, to:Tabular): return to

@EncodedMultiCategorize
def decodes(self, to:Tabular):
    to.transform(to.y_names, lambda c: c==1)
    return to
to = TabularPandas(merged_df, procs=[], cat_names=[], cont_names = cont_names,
                  y_names=c, y_block = MultiCategoryBlock(encoded=True, vocab=c), splits=splits)

This builds my DataLoaders just fine. From there, to not get an issue, we need to set dls.c to be the len() of c:
dls.c = len(c) :slight_smile:

I feel this process should not be this tedious, let me know any ideas. (Should the encodes/decodes be in the actual library too?)

I don’t understand your question. Those patched encodes and decodes are in tabular.core.

1 Like

I guess my question is will we ever support non-encoded multi-categorize? Instead of having to pre-process the one hot encode. Similar to how in the PLANETs we can pass in a delimiter to get_y. (I understand the two are separate from just MultiCategoryBlock, so this may not be straightforward to do)

If not, I’ll emphasize this in the docs :slight_smile:

There is an example of that and the needed patches on MultiCategorize just after in the notebook.

Oh, though it’s not currently working. Guess there is a TODO to fix it missing.

1 Like

For those interested, I’m working on trying to get fastai2 tabular to support multiple datatypes. This is based on my NumPy tutorial, and as we go along I think we’ll find a good way to integrate it to the ecosystem towards something we could possibly push to the main repo. If you’re interested in helping, see here:

Currently we’re looking at NumPy, cuDF, and others.

1 Like

Does anyone know what the line

df[prefix + ‘Elapsed’] = field.astype(np.int64) // 10 ** 9

in add_datepart does exactly? I don’t really understand how to interpret the resulting feature.

Here is the full source code:

def add_datepart(df, field_name, prefix=None, drop=True, time=False):
        "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
        make_date(df, field_name)
        field = df[field_name]
        prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
        attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
                'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
        if time: attr = attr + ['Hour', 'Minute', 'Second']
        for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
        df[prefix + 'Elapsed'] = field.astype(np.int64) // 10 ** 9
        if drop: df.drop(field_name, axis=1, inplace=True)
        return df