Big Tabular

I’ve got a lot of data in daily files, each with around 3 million rows, 40 categorical columns, and 30 continuous ones. I’m trying to build profiles on several million entities so that I can score new transactions for how likely they are for a given individual.

I have access to a large server with plenty of RAM and two V100s, but even so there is no way to fit more than a few days’ worth of data in memory at once, so I am looping through the data a day at a time.

The key challenge is consistent categorization: every daily file has to map the same category to the same code. I solved this with a pre-processing step that builds an ordered dict of categories by looping through all the files before training. Initialize catdict from a single day, then loop through your list of files (octall):

import pickle
import pandas as pd

temp = {}
olddict = {}
for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)
    for n in cat_names:
        olddict[n] = catdict[n]
        temp[n] = list(df[n].unique())
        # append only the categories we haven't seen before
        catdict[n] = catdict[n] + list(set(temp[n]) - set(olddict[n]))
pickle.dump(catdict, open("/shared/rb/data/fs/cats.dict", "wb"))

This took an hour to process a year of files; using cudf would probably cut that time down considerably.
I converted the result to an OrderedDict afterwards.
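That last conversion is a one-liner; a minimal sketch (the column names here are made up for illustration, the real dict comes from the pickle above):

```python
from collections import OrderedDict

# illustrative per-column category dict; the real one is loaded from cats.dict
catdict = {"merchant": ["m1", "m2", "m3"], "country": ["US", "DE"]}

# OrderedDict fixes the iteration order of the columns and their category lists
catdict = OrderedDict(catdict)
print(list(catdict.keys()))  # ['merchant', 'country']
```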

Once that is ready, override Categorify to rely on the dict. CategoricalDtype simplifies this:

import pickle
from pandas.api.types import CategoricalDtype

categories = pickle.load(open("/shared/rb/data/fs/dict/cats.dict", "rb"))
# one dtype per column, since each column has its own category list
cat_types = {n: CategoricalDtype(categories=categories[n], ordered=True)
             for n in cat_names}

class Categorify(TabularProc):
    def apply_train(self, df: DataFrame):
        # df = df.to_cudf()
        for n in cat_names:
            df[n] = df[n].astype(cat_types[n])
        # df = df.to_pandas()

cudf does speed this up, but my model is huge and I would crash on larger files, so I am sticking with pandas. Uncomment the conversion lines if you can get away with it.
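The point of sharing one dtype across files is that codes stay stable no matter which values a particular day happens to contain; a quick sanity check with a toy category list:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=["a", "b", "c"], ordered=True)

day1 = pd.Series(["c", "a"]).astype(cat_type)
day2 = pd.Series(["b", "c"]).astype(cat_type)

print(day1.cat.codes.tolist())  # [2, 0]
print(day2.cat.codes.tolist())  # [1, 2]; "c" maps to 2 in both files
```

Note that a value missing from the category list becomes NaN (code -1), so the pre-built dict really does need to cover every file.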

Training: I initialize with one standard cycle, save the Learner, and then hit the loop.

for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)[cat_names + cont_names + ['TARGET']]
    procs = [Categorify]
    xx = int(len(df) * .04)
    valid_idx = range(len(df) - xx, len(df))  # hold out the last 4% of the day
    dep_var = 'TARGET'
    data = (TabularList.from_df(df, path='fs3', cat_names=cat_names,
                                cont_names=cont_names, procs=procs)
                       .split_by_idx(valid_idx)
                       .label_from_df(cols=dep_var, label_cls=FloatList)
                       .databunch())
    lr = 0.0002
    # point the saved Learner at the new day's data and train
    learn.data = data
    learn.fit_one_cycle(1, lr)

So that is currently training. Valid loss dropped for the first few days but has plateaued, with maybe a tiny downtrend. Train loss continues to drop, even though I don’t believe the model is seeing any data a second time. I also have a lot of dropout, and the target is rare.

Any mistakes or suggestions?

If you feel like using v2, I made a (very barebones) NumPy DataLoader that takes in preprocessed data, which you can do via TabularPandas; there’s also a way to get your raw data from here (if you’re interested, let me know). It gives about a 40% speedup per epoch on training.

And if you really want to, you could potentially preprocess with NumPy too, saving a ton of memory.
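A rough sketch of what that could look like: integer codes for categoricals and float32 for continuous columns shrink the footprint compared with object-dtype strings and float64 (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"merchant": ["a", "b", "a", "c"],
                   "amount": [1.0, 2.5, 0.5, 9.0]})

# categoricals -> small integer codes instead of Python strings
cats = {c: i for i, c in enumerate(pd.unique(df["merchant"]))}
merchant = df["merchant"].map(cats).to_numpy(np.int32)  # [0, 1, 0, 2]

# continuous -> float32 halves the footprint of float64
amount = df["amount"].to_numpy(np.float32)

print(merchant.nbytes, amount.nbytes)  # 16 16
```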


For whatever reason, the proxy at work just won’t work with pip. I have been able to clone from git and pip install locally in the past, but my recent attempts failed when trying to install things outside of the fastai universe. Now that this is functional I will fight that battle again.


If your installs are generally up to date with what fastai2 wants, you may be able to clone and move the directories around to get at the right code instead. (I’m doing something similar where I can’t directly install fastai2, so I’ve got fastcore stored away on disk from git.)


I think the problematic call was to something in Torch, and I couldn’t figure out the specific issue.

Hello @Ralph,

This could be expensive: creating an internal list() and appending to an ever-growing list inside a loop. The big-O of walking n days of data this way is (1+n)·n/2, i.e. O(n²).

Perhaps consider storing each df[n].unique() (an np.array) in a list, then doing a single set creation over all of them; that will automatically unique() them.

from itertools import chain

temp = {k: [] for k in cat_names}
for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)
    for n in cat_names:
        temp[n].append(df[n].unique())  # just grab the ref to the np array for now, don't listify it
        # don't add to catdict, the big dict, here
# outside of both for-loops, create catdict in one pass
catdict = {k: set(chain(*temp[k])) for k in cat_names}

Give it a try? (The speedup should be additive to the one from cudf.)


I’ll revisit with Numpy for a comparison. As it was, my hacky code was done when I came back from a conference call so I moved on to other issues.


Neat trick for categorifying a large dataset :+1:! Large tabular datasets present many interesting bottlenecks to optimize/streamline: this is one, @muellerzr’s NumPy as fastai2 tabular’s underlying array on the host vs cudf on the GPU is another, RAPIDS or PyTorch for matrix/CUDA manipulation, etc. Every trick counts, and thank you for sharing yours!


Increased my data to 15 months: 960 million rows x 42 cat_names.
My list method (with proper indentation on the pickle) starts pretty quickly but bogs down as the list comparisons grow longer.
The list loops took 2 hours and 44 minutes.
Your chain method dropped that to 49 minutes. Much better.

Tried adding cudf to the tests, but the list loop was very slow and chain crashed with a read_parquet error. Not sure what was going on there, as cudf.read_parquet is usually solid. For larger parquet files I often find opening in cudf and converting to pandas to be faster than opening in pandas directly.


Thank you so much for sharing the update! Wow, almost 4 hours of benchmarking time, some serious effort you put in! :bowing_man:t2:

Here are two things I discovered lately:

  1. pandas.unique() doesn’t sort, and is much faster than numpy.unique().
    And cudf.DataFrame.unique(), unfortunately, sorts just like numpy.

  2. Parquet uses quite a bit more memory during load time for decompression.
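Point 1 is easy to verify:

```python
import numpy as np
import pandas as pd

s = pd.Series(["b", "a", "b", "c", "a"])
print(pd.unique(s).tolist())             # ['b', 'a', 'c']  (first-seen order, no sort)
print(np.unique(s.to_numpy()).tolist())  # ['a', 'b', 'c']  (sorted)
```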

(search for “compare memory consumption during data process loading”.) Would cudf.read_parquet() decompress using GPU memory?! :face_with_raised_eyebrow:

If you happen to run this stage again, perhaps watch GPU memory usage with nvidia-smi -l 1 to the point of crash.

You may also check the memory usage of these thingies:
df.memory_usage().sum(), df[n].memory_usage(), and df[n].unique().memory_usage().

Since you only have 42 categorical columns and the loop already uses df[n].unique(), the stuff that chain() chains together should take very little memory.

[note: df[n].unique().memory_usage() is a cudf… errr, shall we say, feature, not a bug? It currently returns a cudf.Series instead of the np.ndarray that pandas and numpy would return.]
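In pandas, where unique() hands back an np.ndarray, those checks look roughly like this (toy frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x", "y"] * 500})

print(df.memory_usage().sum())  # whole-frame bytes
print(df["b"].memory_usage())   # one column
print(df["b"].unique().nbytes)  # unique() returns an ndarray, so .nbytes, not .memory_usage()
```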

By the way, do you mean it crashed inside the loop while reading from parquet file, or in the chain(...) call?


A couple minutes writing code before bed. Servers put in the serious effort while I was snoozing. :wink:

Revisited the cudf+chain version and it ran out of GPU memory after 13 months. The error given just referenced read_parquet rather than the usual memory errors. I only have 42 categorical columns, but one of them has millions of categories, so memory could add up. The test was flying through the daily files, though.

Apparently I misread, understood now.

One column with millions of values surely is a tough one to categorify…

So I managed to get around the proxy issues, got fastai2 working, and gave your NumPy dataloader a spin.
I substituted my data for adults in your example successfully, but when I tried to pre-categorize with my dict it really blew up the GPU memory. Couldn’t get it to work even after cutting the df by 75% and the batch size from 16k down to 16. It did work completely on CPU, but lordy was that slow.

What would it take to bypass TabularPandas and just preprocess into NumPy?

I’m working on that currently, give me a day or two :wink:


Sounds good. I’ll see what I can cook up and up my CPU batch size to 1024*512 in the meantime :laughing:

The TL;DR is you’d want to preprocess it all in NumPy and recreate the following 3 steps (also what I am working on finishing up today/tomorrow):

  1. Make two mapping dictionaries for taking the raw text category to an index for us to use
  2. Normalize our data and keep track of those statistics (for our continuous)
  3. Add a #na# tag for missing values and add an extra boolean is_missing column (true or false). We’d need to update our dictionary in (1) for this to work. FillMissing supports filling with the mean, the mode, or a particular value.
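A minimal sketch of those three steps in plain NumPy/pandas; this is my reading of the plan, not fastai2’s actual implementation, and all the names are illustrative (note the fill in step 3 has to happen before the normalization in step 2):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", None, "red"],
                   "size":  [1.0, np.nan, 3.0, 5.0]})

# 1. mapping dict: raw category -> index, with index 0 reserved for #na#
cats = ["#na#"] + sorted(c for c in df["color"].dropna().unique())
cat2idx = {c: i for i, c in enumerate(cats)}
color_codes = df["color"].fillna("#na#").map(cat2idx).to_numpy(np.int64)  # [2, 1, 0, 2]

# 3. FillMissing: boolean is_missing column, then fill the NaN (median used here)
size_missing = df["size"].isna().to_numpy()
median = float(df["size"].median())
size_filled = df["size"].fillna(median).to_numpy(np.float64)

# 2. Normalize, keeping the statistics for use at inference time
mean, std = size_filled.mean(), size_filled.std()
size_norm = (size_filled - mean) / std
```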

Leaving my largest categorical out of the model for now lets everything fit on the GPU. I still have a couple of categoricals with over half a million categories, so there is still a lot to deal with.

With 2-3 million rows/day, training takes only 14 seconds with BS = 2^14. The full cycle of opening a daily file in cudf, filtering target rows and concatting multiples to decrease rarity, looping through and categorizing with the big dict, converting back to pandas, TabularPandas, the NumPy dataloader, and then training takes about 50 seconds. That is so much better than I was getting with fastai v1.

You are really opening up the possibilities with tabular. Thanks! :+1:


:eyes: now that is a batch size right there! Glad it’s working out! Gives me extra motivation to finish the final features :wink:


There was even a RAPIDS/fastai post on Medium some time back with massive batch sizes. I had never considered going that big, but as long as you end up with plenty of batches it works pretty well.
