I have a structured dataset of around 100 GB, and I am using a DNN for classification. Because the dataset is so large, I cannot load all of it into memory for training, so I’ll be reading the data in batches to train the model.
Now, the input to the network should be normalized, and for that I need the training dataset's mean and SD. I have read many articles on normalization, and all of them assume that the data fits into memory and conveniently calculate the mean and SD for each feature. But for most real-world datasets that is not the case.
So, how should one go about normalizing input features with the training dataset's mean and SD while loading data in batches and training the model?
You could likely just use an estimated mean/SD from a subset. If you just call `normalize` without providing statistics, fastai will collect them from a few batches for you. This may well not be suitable for tabular data, though, where higher variance and potential trends across the data could cause issues. But you could likely take a few batches spread across the whole dataset and get a reasonable estimate.
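To make the "few batches across the whole dataset" idea concrete, here is a minimal sketch in plain Python. It is not fastai's code: `load_batch` is a hypothetical callable that returns batch `i` as a list of rows, and the helper names are made up for illustration. With NumPy the per-feature loops would become array operations.

```python
import math


def spread_batch_indices(n_batches, n_sample):
    """Pick n_sample batch indices spread evenly over range(n_batches)."""
    if n_sample >= n_batches:
        return list(range(n_batches))
    step = n_batches / n_sample
    # Take the batch in the middle of each of the n_sample equal slices.
    return [int(i * step + step / 2) for i in range(n_sample)]


def estimate_stats(load_batch, n_batches, n_sample):
    """Estimate per-feature mean and SD from a subset of batches.

    load_batch(i) is a hypothetical loader returning batch i as a list of
    rows, where each row is a list of numeric feature values.
    """
    rows = []
    for i in spread_batch_indices(n_batches, n_sample):
        rows.extend(load_batch(i))
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    # Population SD (divide by n) computed on the sampled rows only.
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(n_features)
    ]
    return means, sds
```

Sampling evenly over the batch index only helps with trends if the batch order follows the order of the data on disk; if the data is sorted by some feature, a random choice of batches may be the safer option.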
Or, if you wanted to calculate the mean/SD of the full data, you would need to use one of the algorithms for collecting running statistics. Here’s my implementation of one, which seems to work reasonably well. I implemented it for image/audio data, so just a few channels, but it is a generic algorithm and a pretty generic implementation. I think setting `n_dims` to 1 and providing `(column,row)` shaped data should work for tabular. It returns as many channels of statistics as you provide, and allows collecting statistics across multiple dimensions, flattening the rightmost
It seems to be pretty numerically stable, better than others I tried. I adapted the code from Apache Spark.
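For readers who want to see what a running-statistics pass can look like, below is a small sketch of the batch-wise merge update usually attributed to Chan et al. (a generalization of Welford's algorithm). It is not the implementation referred to above and has no `n_dims` option; the class name `RunningStats` and its methods are made up for illustration, and it assumes each batch is a list of rows of numeric features.

```python
import math


class RunningStats:
    """Per-feature running mean and SD, updated one batch at a time."""

    def __init__(self, n_features):
        self.n = 0                       # rows seen so far
        self.mean = [0.0] * n_features   # running mean per feature
        self.m2 = [0.0] * n_features     # running sum of squared deviations

    def update(self, batch):
        """Merge the statistics of one batch (a list of rows) into the totals."""
        b_n = len(batch)
        if b_n == 0:
            return
        total = self.n + b_n
        for j in range(len(self.mean)):
            col = [row[j] for row in batch]
            b_mean = sum(col) / b_n
            b_m2 = sum((x - b_mean) ** 2 for x in col)
            delta = b_mean - self.mean[j]
            # Combine the two partitions without revisiting earlier rows.
            self.mean[j] += delta * b_n / total
            self.m2[j] += b_m2 + delta * delta * self.n * b_n / total
        self.n = total

    def std(self):
        """Population SD per feature (divides by n)."""
        if self.n == 0:
            raise ValueError("no data seen yet")
        return [math.sqrt(m2 / self.n) for m2 in self.m2]

    def normalize(self, batch, eps=1e-8):
        """Standardize a batch using the statistics collected so far."""
        sd = self.std()
        return [
            [(x - m) / (s + eps) for x, m, s in zip(row, self.mean, sd)]
            for row in batch
        ]
```

The intended use is two passes over the data: a first pass that calls `update` on every training batch, then the training pass, which calls `normalize` on each batch with the now fixed mean and SD. Only three numbers per feature are held in memory, regardless of dataset size.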