I use `TabularList.from_df` to construct my own tabular DataBunch. If the dtype of `train_df` and `test_df` is float64, it runs in about 15 minutes, but if I change the dtype of `train_df` and `test_df` to np.float32, it takes several hours to construct the DataBunch. The code is as follows:
`train_df`: 2,010,000 rows × 1,140 columns
```python
since = time.time()
procs = [Categorify, Normalize]
test = TabularList.from_df(test_df, cat_names=cat_cols, cont_names=cont_cols)
data = (TabularList.from_df(train_df, cat_names=cat_cols, cont_names=cont_cols, procs=procs)
        .split_by_idx(val_idx)
        .label_from_df(cols='label')
        .add_test(test)
        .databunch())
end = time.time()
print("elapsed time: ", end - since)
```
I have the same problem. You can isolate it by running the lines from the data block separately.
For me, `label_from_df` is the line that runs super slowly; I think it is the line that triggers the preprocessing.
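Running each step under its own timer makes the slow line obvious. A minimal helper for that (the `timed` context manager is my own illustrative code, not part of fastai; the commented usage assumes the fastai v1 names from the snippet above):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step):
    # Print how long the wrapped block takes, to isolate the slow step.
    start = time.time()
    yield
    print(f"{step}: {time.time() - start:.2f}s")

# Usage with the datablock steps from the snippet (names assumed):
# with timed("from_df"):
#     lst = TabularList.from_df(train_df, cat_names=cat_cols,
#                               cont_names=cont_cols, procs=procs)
# with timed("split_by_idx"):
#     sd = lst.split_by_idx(val_idx)
# with timed("label_from_df"):
#     ll = sd.label_from_df(cols='label')
```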
The architecture in v1 isn't accessing things in DataFrames smartly, so there is nothing we can do to speed this up AFAIK. v2 will read batches directly from the processed DataFrames and will thus be much quicker.
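The gap between the two access patterns is easy to demonstrate. The sketch below is not fastai code; it just contrasts row-by-row `.iloc` reads (roughly what v1's item-wise pipeline does) with one bulk array read (roughly what batch-wise access in v2 buys):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20_000, 10))

# Per-row access: one .iloc call per row, as an item-wise pipeline would do.
start = time.time()
rows = [df.iloc[i].values for i in range(len(df))]
per_row = time.time() - start

# Bulk access: one read of the same data as a single NumPy array.
start = time.time()
arr = df.values
bulk = time.time() - start

print(f"per-row: {per_row:.3f}s, bulk: {bulk:.4f}s")
```

On typical hardware the bulk read is orders of magnitude faster, which is why reading batches directly from the processed DataFrame helps so much.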