I’ve got a lot of data in daily files, each with around 3 million rows, 40 categorical and 30 continuous columns. I’m trying to build profiles on several million entities so that I can score new transactions for how likely they are for a given individual.
I have access to a large server with plenty of RAM and two V100s, but even so there is no way to fit more than a few days' worth of data in memory at once, so I loop through the data a day at a time.
The key challenge is consistent categorization. I solved this with a preprocessing step that builds an ordered dict of categories by looping through all the files before training. Initialize catdict from a single day's categories, then loop through your list of files (octall):
import pickle
import pandas as pd

temp = {}
olddict = {}
for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)
    print(x)
    for n in cat_names:
        olddict[n] = catdict[n]
        temp[n] = list(df[n].unique())
        # append only categories not already seen, preserving existing order
        catdict[n] = catdict[n] + list(set(temp[n]) - set(olddict[n]))
pickle.dump(catdict, open("/shared/rb/data/fs/cats.dict", "wb"))
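On toy data (a hypothetical 'merchant' column), the set-difference step above appends only unseen values and leaves everything already collected in place:

```python
catdict = {'merchant': ['amazon', 'target']}   # categories seen so far
new_day = ['target', 'costco', 'target']       # one day's column values

# same set-difference append as in the loop above
catdict['merchant'] = catdict['merchant'] + list(set(new_day) - set(catdict['merchant']))

print(catdict['merchant'])   # ['amazon', 'target', 'costco']
```

Because the new values are appended rather than merged, categories collected from earlier days keep their positions, which is what makes the codes stable across files.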
This took about an hour to process a year of data. Using cuDF would probably cut that time down considerably.
I converted the result to an OrderedDict afterwards.
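The conversion itself is one line; a minimal sketch, using a small stand-in for the pickled dict (on Python 3.7+ plain dicts already preserve insertion order, but an OrderedDict makes the intent explicit):

```python
from collections import OrderedDict

catdict = {'merchant': ['amazon', 'target'], 'city': ['nyc']}  # stand-in for cats.dict
catdict = OrderedDict(catdict)  # key iteration order is now explicit and fixed

print(list(catdict.keys()))  # ['merchant', 'city']
```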
Once that is ready, override Categorify to rely on the dict. CategoricalDtype simplifies this:
categories = pickle.load(open("/shared/rb/data/fs/dict/cats.dict", "rb"))
# one CategoricalDtype per column, built from that column's category list
cat_types = {n: CategoricalDtype(categories=categories[n], ordered=True)
             for n in cat_names}

class Categorify(TabularProc):
    def apply_train(self, df: DataFrame):
        #df = cudf.from_pandas(df)
        for n in cat_names:
            df[n] = df[n].astype(cat_types[n])
        #df = df.to_pandas()
    def apply_test(self, df: DataFrame):
        self.apply_train(df)
cuDF does speed this up, but my model is huge and I would crash on larger files, so I am sticking with pandas. Uncomment the conversion lines if you can get away with it.
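Why the fixed dtype matters: with a shared CategoricalDtype, a column's integer codes are identical across days, and values missing from the dict fall out as -1 instead of silently shifting the mapping. A toy illustration with made-up categories:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['amazon', 'costco', 'target'], ordered=True)

day1 = pd.Series(['amazon', 'target']).astype(cat_type)
day2 = pd.Series(['target', 'walmart']).astype(cat_type)  # 'walmart' not in the dict

print(day1.cat.codes.tolist())  # [0, 2]
print(day2.cat.codes.tolist())  # [2, -1]  -> unseen value maps to NaN / code -1
```

If each day's file were categorized independently instead, 'target' could be code 1 one day and code 2 the next, which would scramble the embeddings between training passes.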
Training: I initialize with one standard cycle, save the Learner, and then hit the loop.
for x in octall:
    df = pd.read_parquet('../data/fsmodel/' + x)[cat_names + cont_names + ['TARGET']]
    procs = [Categorify]
    xx = int(len(df) * .04)
    valid_idx = range(len(df) - xx, len(df))
    dep_var = 'TARGET'
    data = (TabularList.from_df(df, path='fs3', cat_names=cat_names,
                                cont_names=cont_names, procs=procs)
            .split_by_idx(valid_idx=valid_idx)
            .label_from_df(cols=dep_var, label_cls=FloatList)
            .databunch(bs=1024*32))
    learn.data = data          # point the saved Learner at the new day's data
    learn.load('fs3')
    data.show_batch(1)
    learn.unfreeze()
    lr = 0.0002
    #learn.fit_one_cycle(3)
    learn.fit(1, lr)
    learn.save('fs3')
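The split in the loop holds out the most recent 4% of each day's rows as validation, which is the natural choice for time-ordered transaction data. On a toy row count:

```python
n = 1000                       # stand-in for len(df)
xx = int(n * .04)              # 40 rows held out
valid_idx = range(n - xx, n)   # indices 960..999, i.e. the tail of the file

print(len(valid_idx))  # 40
```

A random split would leak later behavior into training, so validating on the tail gives a more honest read on how the profiles score genuinely new transactions.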