Big Tabular

I’ve got a lot of data in daily files, each with around 3 million rows, 40 categorical columns and 30 continuous columns. I’m trying to build profiles on several million entities so that I can assess how likely a new transaction is for the given individual.

I have access to a large server with plenty of RAM and 2 V100s, but even so there is no way to fit more than a few days’ worth of data in a single file, so I am looping through the data a day at a time.

The key challenge is consistent categorization. I solved this with a pre-processing step that builds an ordered dict by looping through all the files before training. Initialize catdict from a single day’s categories, then loop through your list of files (octall):

import pickle
import pandas as pd

# catdict is pre-initialized from a single day: {col: [that day's categories]} for each col in cat_names
temp = {}
olddict = {}
for x in octall:
    df = pd.read_parquet('../data/fsmodel/'+x)
    print(x)
    for n in cat_names:
        olddict[n] = catdict[n]
        temp[n] = list(df[n].unique())
        # append only the categories not seen in any earlier file
        catdict[n] = catdict[n] + list(set(temp[n]) - set(olddict[n]))
    # checkpoint the running dict after each daily file
    pickle.dump(catdict, open("/shared/rb/data/fs/cats.dict", "wb"))

This took an hour to process a year of data. Using cudf would probably cut that time down considerably.
I converted the result to an OrderedDict afterwards.
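
For reference, here is a minimal sketch of that conversion step (the sorting is my own assumption; the only real requirement is that the per-column category order is fixed once before training):

from collections import OrderedDict
import pickle

catdict = pickle.load(open("/shared/rb/data/fs/cats.dict", "rb"))
# freeze the category order per column; sorting is optional but makes runs reproducible
catdict = OrderedDict((n, sorted(catdict[n], key=str)) for n in cat_names)
pickle.dump(catdict, open("/shared/rb/data/fs/cats.dict", "wb"))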

Once that is ready, override Categorify to rely on the dict. CategoricalDtype simplifies this:

from pandas.api.types import CategoricalDtype

categories = pickle.load(open("/shared/rb/data/fs/dict/cats.dict", "rb"))
# one dtype per column, built from that column's fixed category list
cat_types = {n: CategoricalDtype(categories=categories[n], ordered=True) for n in cat_names}

class Categorify(TabularProc):
    def apply_train(self, df: DataFrame):
        #df = cudf.from_pandas(df)
        for n in cat_names:
            # same dtype every day, so category codes stay consistent across files
            df[n] = df[n].astype(cat_types[n])
        #df = df.to_pandas()

cudf does speed this up, but my model is huge and I would crash on larger files, so I am sticking with pandas. Uncomment the conversion lines if you can get away with it.
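
As a quick sanity check (my own addition, not part of the pipeline above), you can confirm that two different daily files map the same raw value to the same integer code:

# hypothetical check on the first two daily files and the first categorical column
df_a = pd.read_parquet('../data/fsmodel/' + octall[0])
df_b = pd.read_parquet('../data/fsmodel/' + octall[1])
n = cat_names[0]
map_a = dict(zip(df_a[n], df_a[n].astype(cat_types[n]).cat.codes))
map_b = dict(zip(df_b[n], df_b[n].astype(cat_types[n]).cat.codes))
shared = set(map_a) & set(map_b)
assert all(map_a[v] == map_b[v] for v in shared)  # same value -> same code on both days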

Training: I initialize with one standard cycle, save the Learner, and then hit the loop.

for x in octall:
    df = pd.read_parquet('../data/fsmodel/'+x)[cat_names+cont_names+['TARGET']]
    procs = [Categorify]
    xx = int(len(df)*.04)
    valid_idx = range(len(df)-xx, len(df))   # hold out the last 4% of the day as validation
    dep_var = 'TARGET'
    data = (TabularList.from_df(df, path='fs3', cat_names=cat_names, cont_names=cont_names, procs=procs)
                       .split_by_idx(valid_idx=valid_idx)
                       .label_from_df(cols=dep_var, label_cls=FloatList)
                       .databunch(bs=1024*32))
    learn.data = data   # point the existing learner at the new day's databunch
    learn.load('fs3')
    print(data.show_batch(1))
    learn.unfreeze()
    lr = 0.0002
    #learn.fit_one_cycle(3)
    learn.fit(1)
    learn.save('fs3')

So that is currently training. Valid loss dropped for the first few days but has plateaued, with maybe a tiny downtrend. Train loss continues to drop, even though I don’t believe the model is seeing data a second time. I also have a lot of dropout, and the target is rare.

Any mistakes or suggestions?

If you feel like using v2, I made a NumPy DataLoader (very barebones) that takes in pre-processed data (which you can do via TabularPandas, or there’s also a way to get your raw data from here; if you’re interested, let me know). It’s got a 40% speed-up per epoch on training.

And if you really want to, you could potentially preprocess with NumPy too, saving a ton of memory.
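
To make that concrete, here is a rough sketch of the idea rather than the actual DataLoader; it assumes you have already pulled your processed categorical codes, continuous values, and targets out as NumPy arrays (e.g. from a TabularPandas object):

import numpy as np
import torch

def numpy_batches(cats, conts, ys, bs=1024*32, shuffle=True):
    "Yield (x_cat, x_cont, y) tensors straight from NumPy arrays."
    idx = np.arange(len(ys))
    if shuffle:
        np.random.shuffle(idx)
    for i in range(0, len(idx), bs):
        b = idx[i:i+bs]
        yield (torch.from_numpy(cats[b]).long(),
               torch.from_numpy(conts[b]).float(),
               torch.from_numpy(ys[b]).float())

Indexing plain arrays like this skips most of the per-batch pandas overhead, which is roughly where the per-epoch speed-up comes from.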


For whatever reason, the proxy at work just won’t work with pip. I have been able to clone from git and pip install locally in the past, but my recent attempts failed trying to install things outside of the fastai universe. Now that this is functional I will fight that battle again.


If your installed packages are generally up to date with what fastai2 wants, you may be able to clone the repo and move the directories around to get at the right code instead (I’m doing something similar where I can’t directly install fastai2, so I’ve got fastcore stored away on disk from git).


I think the problematic call was to something with Torch and I couldn’t figure out the specific issue.

Hello @Ralph,

This could be expensive: creating an internal list() and appending to an ever-growing list inside a loop (the big-O of walking n days of data is (1+n)*n/2, i.e. O(n²)).

Perhaps consider storing each df[n].unique() (an np.array) in a list, then doing a single set creation over them at the end; that will automatically unique() them.

from itertools import chain

temp = {k: [] for k in cat_names}
....
for x in octall:
    ....
    for n in cat_names:
        temp[n].append(df[n].unique())  # just grab the ref to the np array for now, don't listify it
        # don't add to catdict (the big dict) here

# outside of both for-loops, create catdict in one pass
catdict = {k: set(chain(*temp[k])) for k in cat_names}

Want to give it a try? (The speedup should be additive to the one from cudf.)


I’ll revisit with NumPy for a comparison. As it was, my hacky code was done by the time I came back from a conference call, so I moved on to other issues.


Neat trick to categorify a large dataset :+1:! Large tabular datasets present many interesting bottlenecks to optimize/streamline: this is one, @muellerzr’s numpy as fastai2 tabular’s underlying array on the host vs. cudf on the GPU is another, RAPIDS or PyTorch for matrix/CUDA manipulation, etc., etc. Every trick counts, and thank you for sharing yours!


Increased my data to 15 months: 960 million rows x 42 cat_names.
My list method (with proper indentation on the pickle) starts pretty quickly but bogs down as the list comparisons grow longer.
The list loops took 2 hours and 44 minutes.
Your chain method dropped that to 49 minutes. Much better.

Tried adding cudf to the tests, but the list loop was very slow and the chain version crashed with a read_parquet error. Not sure what was going on there, as cudf.read_parquet is usually solid. For larger parquet files I often find opening in cudf and converting to pandas to be faster than opening in pandas directly.
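
For anyone curious, that last pattern is just the following (assuming cudf is installed and the file fits in GPU memory):

import cudf

# read on the GPU, then hand the result back to pandas
df = cudf.read_parquet('../data/fsmodel/' + x).to_pandas()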


Thank you so much for sharing the update. Wow, almost 4 hours of benchmarking time, some serious effort you put in! :bowing_man:t2:

Here are two things I discovered lately:

  1. pandas.unique() doesn’t sort, and is much faster than numpy.unique().
    And cudf.DataFrame.unique(), unfortunately, sorts just like numpy. (There’s a tiny demo of the difference just after this list.)

  2. Parquet uses quite a bit more memory during load time for decompression.

(From https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d, search for “compare memory consumption during data process loading”.) Would cudf.read_parquet() decompress using GPU memory?! :face_with_raised_eyebrow:
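
A small illustration of point 1, with made-up values:

import numpy as np
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
print(pd.unique(s))   # ['b' 'a' 'c'] -- order of first appearance, no sort
print(np.unique(s))   # ['a' 'b' 'c'] -- sorted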

If you happen to run this stage again, perhaps watch GPU memory usage with nvidia-smi -l 1 to the point of crash.

You may also check the memory usage of these thingies:
df.memory_usage().sum(), df[n].memory_usage(), and df[n].unique().memory_usage().

Since you only have 42 categorical columns and are already using df[n].unique() in the loop, the stuff that chain() chains together should take very little memory.

[Note: df[n].unique().memory_usage() only works because of a cudf… errr, shall we say feature, not a bug? It currently returns a cudf.Series instead of the np.ndarray that pandas and numpy would return.]
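
If it helps, a throwaway way to log those numbers per daily file (just an illustration; swap pd for cudf as needed):

for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)
    print(x, 'df MB:', df.memory_usage(deep=True).sum() / 1e6)
    for n in cat_names:
        print('  ', n, 'col MB:', df[n].memory_usage(deep=True) / 1e6,
              'n_unique:', len(df[n].unique()))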

By the way, do you mean it crashed inside the loop while reading from parquet file, or in the chain(...) call?


A couple minutes writing code before bed. Servers put in the serious effort while I was snoozing. :wink:

Revisited the cudf+chain version and it ran out of GPU memory after 13 months. The error given just referenced read_parquet rather than the usual memory errors. I only have 42 categorical columns, but I do have millions of categories in one of them, so memory could add up. The test was flying through the daily files, though.

Apparently I misread, understood now.

One column with millions of values surely is a tough one to categorify…