I’m not sure if this is intended, but the export.pkl is about 471 MB, which is somewhat prohibitive for deployment in certain applications.
The model itself from SaveModelCallback is only 131 KB, and I’m only looking to use the Learner to apply the same transforms/processing (Normalize, FillMissing, Categorify).
Is there a reason this is so large? I’ve also confirmed
When you export a model (without the optimizer state), you basically need to save all the weights to disk. You can do a quick ballpark estimate of the expected file size from the number of model parameters (assuming float32 = 4 bytes per parameter), and it’s likely to be several hundred megabytes.
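A quick sketch of that ballpark using plain PyTorch (the toy architecture here is just an illustration, not the model from this thread):

```python
import torch.nn as nn

# Toy architecture purely for illustration.
model = nn.Sequential(nn.Linear(1000, 512), nn.ReLU(), nn.Linear(512, 10))

n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params:,} params ≈ {n_params * 4 / 1e6:.1f} MB (float32, 4 bytes each)')
```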
Saying that it’s “prohibitive” for deployment in certain applications may be true for your use case, but that means you likely cannot use neural networks at all (or you have to use architectures designed to be as lightweight as possible, which usually also costs accuracy). Another option is to figure out what exactly prevents you from deploying this model and try to solve that problem.
There are options available right now, @orendar and @jasonho28. What I would recommend is doing a torch.save() of the weights and exporting the TabularPandas object instead; I would expect this to reduce the size. As a result, during inference you’d go TabularPandas -> DataLoader rather than just a plain DL:
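A minimal sketch of that workflow, assuming the standard fastai v2 tabular API (`learn`, `to`, and `test_df` are placeholders for your fitted Learner, fitted TabularPandas, and raw inference DataFrame):

```python
import pickle
import torch
from fastai.tabular.all import *

# Training side: persist only the weights plus the fitted TabularPandas,
# which carries the Normalize/FillMissing/Categorify state.
torch.save(learn.model.state_dict(), 'model.pth')
with open('to.pkl', 'wb') as f:
    # new_empty() drops the rows, keeping only the fitted procs
    # (assumption: available in your fastai version).
    pickle.dump(to.new_empty(), f)

# Inference side: go TabularPandas -> DataLoader rather than a plain DL.
with open('to.pkl', 'rb') as f:
    to = pickle.load(f)
to_new = to.new(test_df)   # attach the new raw rows
to_new.process()           # applies the stored procs to them
dl = TabDataLoader(to_new, bs=64)

# Rebuild the model with the same config used in training, then:
# model.load_state_dict(torch.load('model.pth'))
```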
See the very bottom for a full usage example; the library is wwf (Walk with fastai).
@orendar @jasonho28 @pierreguillou over the weekend Jeremy and I solved this issue; it was due to log_args plus some extraneous references inside of ReadTabBatch. Happy to report that my export.pkl is a calm, cool 142.5 KB.
I’m curious how this is even possible, as I don’t use tabular much: how many parameters are in the model? I don’t think I’ve ever seen a model weigh less than 100 MB in any of the libraries I’ve used.
fastai 2.2.0 still has the data leakage issue.
My naive screening showed similar patterns to what you previously mentioned in [ReadTabBatch](https://github.com/fastai/fastai/pull/2948). Many tabular transforms store intermediate data that gets accidentally exported.
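If you want to screen for this yourself, one rough approach is to pickle each fitted proc separately and compare sizes (the `dls.train_ds.procs` path is an assumption about where the procs live in your fastai version):

```python
import pickle
from fastai.tabular.all import load_learner

learn = load_learner('export.pkl')
# Pickle each tabular proc on its own to spot the one holding
# intermediate data from the training set.
for proc in learn.dls.train_ds.procs:
    print(type(proc).__name__, len(pickle.dumps(proc)), 'bytes')
```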
Ah, I see why that’s a thing. I have an interim PR that actually fixes that; it’s an inconsistency oversight on my part. Will let you know when that gets merged.
In the interim, this can fix it for folks:
```python
from fastai.tabular.all import *

@patch
def setups(self: FillMissing, dsets):
    missing = pd.isnull(dsets.conts).any()
    # Store only the computed na_dict; explicitly exclude `dsets` so the
    # training data doesn't get pickled into export.pkl.
    store_attr(but='dsets',
               na_dict={n: self.fill_strategy(dsets[n], self.fill_vals[n])
                        for n in missing[missing].keys()})
    # Keep only the strategy's name rather than a function reference,
    # which also keeps the pickle small.
    self.fill_strategy = self.fill_strategy.__name__
```
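With the patch applied before you build the TabularPandas/Learner, a normal export should come out small. A minimal sketch, where `df`, `cat_names`, `cont_names`, `'target'`, and `splits` are placeholders for your own pipeline:

```python
# Placeholders: df, cat_names, cont_names, 'target', splits come from
# your own data preparation.
to = TabularPandas(df, procs=[FillMissing, Categorify, Normalize],
                   cat_names=cat_names, cont_names=cont_names,
                   y_names='target', splits=splits)
learn = tabular_learner(to.dataloaders(bs=64))
learn.fit_one_cycle(1)
learn.export('export.pkl')  # should now be KBs, not hundreds of MBs
```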