I’ve been playing around with this myself, but I can’t figure out how to properly subclass the TabularList. What I have done instead is subclass the torch Dataset so that it serves up an entire partition of a dask dataframe as each batch. Preprocessing the dataframe takes time, but this should work with out-of-memory dataframes.
import numpy as np
import dask
from torch import tensor
from torch.utils.data import Dataset

class DaskPartDataset(Dataset):
    def __init__(self, df, target_col, cat_names):
        self.df = df
        self.target = target_col
        self.cat_names = cat_names
        self.cats = {}
        # Process categories: make them known, store the mappings, encode as int codes
        for n in self.cat_names + [self.target]:
            self.df[n] = self.df[n].astype('category').cat.as_known()
            self.cats[n] = self.df[n].cat.categories
            self.df[n] = self.df[n].cat.codes.astype(np.int64)
        # Process continuous: compute global stats in a single dask pass
        self.cont_names = list(set(self.df.columns) - set(self.cat_names) - set([self.target]))
        medians, means, stds = {}, {}, {}
        for n in self.cont_names:
            medians[n] = self.df[n].quantile(0.5)
            means[n] = self.df[n].mean()
            stds[n] = self.df[n].std()
            self.df[n] = self.df[n].astype(np.float32)
        self.medians, self.means, self.stds = dask.compute(medians, means, stds)

    def __len__(self):
        return self.df.npartitions

    def __getitem__(self, i):
        df = self.df.get_partition(i).compute()
        # Fill NaNs and normalize each continuous column using the global stats
        for n in self.cont_names:
            df[n] = df[n].fillna(self.medians[n])
            df[n] = (df.loc[:, n] - self.means[n]) / (1e-7 + self.stds[n])
        x_cont = df[self.cont_names].values
        x_cat = df[self.cat_names].values
        y = df[self.target].values
        return [tensor(x_cat), tensor(x_cont)], tensor(y)
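For reference, the per-partition continuous processing above is just "fill with the global median, then standardize". A minimal standalone sketch (numpy only; the function name is mine):

```python
import numpy as np

def normalize_column(col, median, mean, std):
    # Fill NaNs with the precomputed global median, then standardize
    col = np.where(np.isnan(col), median, col)
    return (col - mean) / (1e-7 + std)

x = np.array([1.0, np.nan, 3.0], dtype=np.float32)
print(normalize_column(x, median=2.0, mean=2.0, std=1.0))  # approximately [-1, 0, 1]
```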
So e.g. for the ADULT_SAMPLE dataset, I’m loading my dask dataframe, df (I know this is a tiny example that doesn’t need dask, but I’m using it to prototype).
df = dd.read_csv(path/'adult.csv', blocksize=1e6)
training_set = DaskPartDataset(df, 'salary', cat_names)
training_set.c = 2
training_set.classes = ['>=50k', '<50k']
I’m then creating my data bunch as:
data = TabularDataBunch.create(training_set, Valid_Set, bs=1, num_workers=0)
and manually setting
data.get_emb_szs = lambda a: [(len(training_set.cats[n]), 15) for n in cat_names]
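I suspect the embedding sizes should really be derived from the stored category maps rather than hard-coded; a sketch of what I mean (the helper name is made up, and the +1 is my guess at leaving room for an unknown/NaN code):

```python
def emb_szs_from_cats(cats, cat_names, width=15):
    # One row per known category, plus one spare slot for unseen/NaN codes
    return [(len(cats[n]) + 1, width) for n in cat_names]

print(emb_szs_from_cats({'workclass': ['Private', 'State-gov']}, ['workclass']))
# -> [(3, 15)]
```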
This seems to create the model and learner correctly, but calls to fit aren’t working: I’m getting an error in the forward pass of the embeddings.
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch-cpu_1549632688322/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:191
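One thing I suspect (not yet verified against my data): pandas encodes missing categorical values as code -1, which is out of range for nn.Embedding, so any NaN left in a categorical column would trigger exactly this kind of error:

```python
import pandas as pd

s = pd.Series(['a', 'b', None]).astype('category')
print(s.cat.codes.tolist())  # [0, 1, -1] -- the -1 would index out of bounds in an embedding
```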
Does anyone have any insights?