Load_empty doesn't load classes

Hey there,

I’ve trained a text classifier successfully using the RNNLearner, and I’m also able to read it back and give predictions on new data. I’m using the same TextDataBunch I used for training in order to recreate the classifier (the same preprocessing, vocabulary and classes).

The problem is that loading the original TextDataBunch is very heavy both in load time and memory, so it’s not going to scale. I’m trying to use TextDataBunch.export together with TextDataBunch.load_empty. This doesn’t work for me at the moment because it seems that the exported pkl file doesn’t contain the classes used. This is my code for loading:

data_clas = TextClasDataBunch.load_empty(path=path, fname=data_clas_fname)
classifier = text_classifier_learner(data_clas, drop_mult=0.5)
classifier.load(model_saved_name)

This is the end of the traceback of the error I’m getting:

def DataLoader___getattr__(dl, k:str)->Any: return getattr(dl.dataset, k)
  File "/usr/lib/python3.7/site-packages/fastai/data_block.py", line 522, in c
    def c(self): return self.y.c
  File "/usr/lib/python3.7/site-packages/fastai/data_block.py", line 302, in c
    def c(self): return len(self.classes)
TypeError: object of type 'NoneType' has no len()

This error made me realize that the classes are not loaded, so I loaded them manually to the object like so:

data_clas = TextClasDataBunch.load_empty(path=path, fname=data_clas_fname)
classes = open(f'{data_clas_path}/classes.txt').read().splitlines()
data_clas.single_ds.y.classes = classes
data_clas.y.classes = classes

This works but it’s very ugly. Can anyone tell me if I’m missing something here? If not and this is a missing feature in the package, I’d be happy to fix it with a PR but I’m going to need some guidance on where to change it.
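For anyone who needs the same stopgap, the workaround above can be wrapped in a small helper. This is only a sketch: restore_classes is a hypothetical name, not part of fastai, and it assumes the loaded object exposes single_ds.y.classes and y.classes exactly as in the snippet above, and that classes.txt holds one class name per line.

```python
from pathlib import Path


def restore_classes(data, classes_file):
    """Hypothetical helper (not a fastai API): reattach class names to an empty DataBunch.

    Assumes `data` exposes `single_ds.y` and `y`, as in the manual workaround,
    and that `classes_file` lists one class name per line.
    """
    classes = Path(classes_file).read_text().splitlines()
    if not classes:
        raise ValueError(f"no classes found in {classes_file}")
    # Set the classes on both label lists touched by the manual workaround.
    data.single_ds.y.classes = classes
    data.y.classes = classes
    return data
```

It would be called right after load_empty and before building the learner, e.g. restore_classes(data_clas, f'{data_clas_path}/classes.txt').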

Which version of the library are you using? Are you sure you exported the data_clas object and not the data_lm? It should normally work without having to open the classes.

Thanks for the answer, Sylvain.
I’m using v1.0.39, and yes, I exported data_clas, not data_lm. But even data_lm should have a single constant class 0 (if I understood the code correctly).

Have you tested the function with TextDataBunch, or just with ImageDataBunch? According to the type hints, LabelList.databunch returns an ImageDataBunch:

def databunch(self, path:PathOrStr=None, **kwargs)->'ImageDataBunch':

It was tested with everything in the tutorial for inference (now we use Learner.export in master, which saves the empty DataBunch with the whole model).

I couldn’t find any reference to Learner.export in either the docs or the code.

It’s in master (future v1.0.40): https://github.com/fastai/fastai/blob/770219f880e2bc79d455b9b4a6d875ceefb9e634/fastai/basic_train.py#L204

Note that we didn’t change anything in DataBunch.export, and in my case, running:

data_clas.export()
empty_data = TextClasDataBunch.load_empty(path)
empty_data.classes

gives me the correct classes without any problem.

I’m on v1.0.54 and still have this problem.
I don’t see the classes anywhere in the exported state:

state['data']

{'x_cls': fastai.text.data.TextList,
 'x_proc': [<fastai.text.data.TokenizeProcessor at 0x7f21119644a8>,
  <fastai.text.data.NumericalizeProcessor at 0x7f21119647f0>],
 'y_cls': fastai.data_block.CategoryList,
 'y_proc': [],
 'tfms': None,
 'tfm_y': False,
 'tfmargs': {},
 'tfms_y': None,
 'tfmargs_y': {}}

It might be related to how data_clas was generated. A data_clas built with .from_ids loses the classes on export, while one built with .from_folder is fine:

data_clas = (TextList.from_folder(path, vocab=vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas = TextClasDataBunch.from_ids(
    '', vocab, 
    train_ids=trn_clas, 
    valid_ids=val_clas,
    train_lbls=trn_labels, 
    valid_lbls=val_labels, 
    classes=classes,
    bs=bs,
    test_ids=tst_clas
)
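A quick way to tell whether an exported state still carries the classes is sketched here, under assumptions. It presumes the classes travel inside a label processor listed under y_proc, which would be consistent with the empty y_proc in the state above coming back with no classes. find_classes_in_state is a hypothetical helper, not a fastai function.

```python
def find_classes_in_state(data_state):
    """Hypothetical helper (not a fastai API).

    Returns the first non-empty `classes` found on a processor in
    data_state['y_proc'], or None if no processor carries any.
    That classes are stored on a label processor is an assumption
    about the export format, not a documented guarantee.
    """
    for proc in data_state.get('y_proc') or []:
        classes = getattr(proc, 'classes', None)
        if classes is not None and len(classes) > 0:
            return list(classes)
    return None
```

With the state printed above, find_classes_in_state(state['data']) returns None because y_proc is an empty list.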

I’ve created a pull request to fix this.
