Multilabel classification with ULMFiT in Fastai v1

Thank you!

Now I get another error:

FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  res[x] = 1.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-126d84ffee9a> in <module>()
      1 multilabel_classifier.load_encoder('lm_encoder')
      2 multilabel_classifier.freeze()
----> 3 multilabel_classifier.fit_one_cycle(1, 1e-2, moms = (0.8,0.7))

~/.local/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     19     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     20                                         pct_start=pct_start, **kwargs))
---> 21     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     22 
     23 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if hasattr(data,'valid_dl') and data.valid_dl is not None and data.valid_ds is not None:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                        cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     47     with torch.no_grad():
     48         val_losses,nums = [],[]
---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
     50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
     51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

~/.local/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/.local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     68     def __iter__(self):
     69         "Process and returns items from `DataLoader`."
---> 70         for b in self.dl:
     71             #y = b[1][0] if is_listy(b[1]) else b[1] # XXX: Why is this line here?
     72             yield self.proc_batch(b)

~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in <listcomp>(.0)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

~/.local/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
    490     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
    491         if isinstance(try_int(idxs), int):
--> 492             if self.item is None: x,y = self.x[idxs],self.y[idxs]
    493             else:                 x,y = self.item   ,0
    494             if self.tfms:

~/.local/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
     90 
     91     def __getitem__(self,idxs:int)->Any:
---> 92         if isinstance(try_int(idxs), int): return self.get(idxs)
     93         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
     94 

~/.local/lib/python3.6/site-packages/fastai/data_block.py in get(self, i)
    330         o = self.items[i]
    331         if o is None: return None
--> 332         return MultiCategory(one_hot(o, self.c), [self.classes[p] for p in o], o)
    333 
    334     def analyze_pred(self, pred, thresh:float=0.5):

~/.local/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
    330         o = self.items[i]
    331         if o is None: return None
--> 332         return MultiCategory(one_hot(o, self.c), [self.classes[p] for p in o], o)
    333 
    334     def analyze_pred(self, pred, thresh:float=0.5):

TypeError: list indices must be integers or slices, not NoneType

I've already made sure that there are no classes in valid_df that are not present in train_df. What can I try to fix this?

Update: after installing the "fresh" developer version of fastai directly from GitHub, i.e. 1.0.36.post1, it looks like it works. Thank you!
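For reference, that "no unseen classes in valid_df" check can be scripted roughly like this (the label column name 'rm' is taken from the code shared further down in this thread, and each row's entry is assumed to already be a list of labels):

train_labels = set(l for row in train_df['rm'] for l in row)
valid_labels = set(l for row in valid_df['rm'] for l in row)
print("Labels only in valid_df:", valid_labels - train_labels)  # should be an empty set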

Hi @annanana, can you share your working code please?

from fastai import *
from fastai.text import *
from fastai.callbacks.tracker import EarlyStoppingCallback
from fastai.callbacks.tracker import SaveModelCallback
from fastai.callbacks.tracker import ReduceLROnPlateauCallback
import pandas as pd
from sklearn.model_selection import train_test_split  # needed for the LM train/valid split below

import fastai
print("Used fastai and torch version:")
print(fastai.__version__, torch.__version__)

torch.cuda.set_device(0) # torch.cuda.empty_cache()
path = Path("ULMFiT_Small_Corpus_Only")
print("\n")

print("=================================")
print("Reading in Train/valid/test data:")
print("==================================")
train_df = pd.read_json('01_corpus/train.jsonl', lines=True)
valid_df = pd.read_json('01_corpus/valid.jsonl', lines=True)
test_df = pd.read_json('01_corpus/test.jsonl', lines=True)
entire_small_corpus = pd.read_json('01_corpus/entire_corpus.jsonl', lines=True)

print(str(train_df.shape)+ ", " + str(valid_df.shape) + ", " + 
      str(test_df.shape) + ", " + str(entire_small_corpus.shape))
train_df.head()

torch.cuda.get_device_name(0)

torch.cuda.get_device_properties(0)

lm_train, lm_valid = train_test_split(entire_small_corpus, 
                                      test_size = 0.05, random_state = 0)
print("Number of rows in train: " + str(len(lm_train)) + 
      ", valid: " + str(len(lm_valid)))

data_lm_small = TextLMDataBunch.from_df(path, lm_train, lm_valid, 
                                        bs=16, text_cols = ['text', 'fulltext'], 
                                  max_vocab = 60000, min_freq = 2)
data_lm_small.save('data_lm_small')
print(f"Language model vocab size: {len(data_lm_small.vocab.itos)}.")
print("data_lm saved to: " + str(path))

save_model = partial(SaveModelCallback, monitor='val_loss', every='improvement', name='best_lm_small')
early_stop = partial(EarlyStoppingCallback, monitor='val_loss', min_delta=0.01, patience=2)

# bptt = 70, emb_sz = 400, nh = 1150, nl = 3, drop_mult = 1, wd = 0, 
lm_learner = language_model_learner(data_lm_small, pretrained_model=URLs.WT103_1, drop_mult=1,
                               callback_fns = [save_model, early_stop])

print("=================================")
print("Summary of the model's structure:")
print(lm_learner.model) # summary
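# The LM fine-tuning that produces the 'best_lm_small' checkpoint (via SaveModelCallback)
# would run here, e.g. something like (epochs and learning rate are only illustrative):
# lm_learner.fit_one_cycle(10, 1e-2, moms=(0.8, 0.7))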

lm_learner.load('best_lm_small')

lm_learner.save_encoder('best_lm_small_encoder')

classifier_data_1 = TextClasDataBunch.from_df(path, train_df, valid_df, test_df, bs=16, 
                                            max_vocab = 60000, min_freq = 2, 
                                            vocab = data_lm_small.train_ds.vocab,
                                            text_cols = ['text', 'fulltext'], 
                                              label_cols ='rm')
classifier_data_1.save('classifier_data_1')
print(f"Classifier vocab size: {len(classifier_data_1.vocab.itos)}.")
print("classifier_data_1 saved")

print("Define the training loop for the target task, i.e. Classification:")
save_1 = partial(SaveModelCallback, monitor='accuracy_thresh', 
                 every='improvement', name='best_accuracy_classifier_1')

early_stop = partial(EarlyStoppingCallback, monitor='val_loss', min_delta=0.01, patience=3)
lr_schedule = partial(ReduceLROnPlateauCallback, monitor='val_loss', patience=1, factor=1.5, min_delta=0.1) 

# bptt = 70, emb_sz = 400, nh = 1150, nl = 3, drop_mult = 1, wd = 0, 
classifier_1 = text_classifier_learner(classifier_data_1, 
                                    callback_fns = [save_1, early_stop, lr_schedule],
                                    metrics = [accuracy_thresh, fbeta])
print("\n")
print("classifier_1 defined. Model's summary:")
print("\n")
print(classifier_1.model) # summary

classifier_1.load_encoder('best_lm_small_encoder')
classifier_1.freeze()
classifier_1.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7)) # 1	0.324927	0.296019	0.881143	0.563900

classifier_1.save('classifier_last_layer_tuned')
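The snippet above stops after tuning only the new classifier head. In the usual ULMFiT recipe the next step is gradual unfreezing with discriminative learning rates; a rough sketch (the learning rates and epoch counts here are only illustrative, not from the original post):

classifier_1.freeze_to(-2)
classifier_1.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))

classifier_1.freeze_to(-3)
classifier_1.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

classifier_1.unfreeze()
classifier_1.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3), moms=(0.8, 0.7))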

Thank you so much @annanana

Hi @annanana, I'm facing an issue while doing inference.

Can you please help me resolve this?

@sgugger and @annanana
I'm facing a different issue with version 1.0.39.dev0.

I had to pass accuracy_thresh as an iterable; essentially,

learn = text_classifier_learner(classifier_data_1, metrics=[accuracy_thresh])

and learn.metrics looks right I think -

[<function fastai.metrics.accuracy_thresh(y_pred: torch.Tensor, y_true: torch.Tensor, thresh: float = 0.5, sigmoid: bool = True) -> <function NewType.<locals>.new_type at 0x7f59c66aebf8>>]

and the error I get when calling fit_one_cycle(…) is

RuntimeError: The size of tensor a (31) must match the size of tensor b (16) at non-singleton dimension 1

My labels aren't one-hot encoded, just a column of strings. I have 31 labels overall, and the last layer of my net also has 31 output features, but I don't understand what's causing my tensor length to be 16.
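(One way to see where the 16 comes from is to inspect a raw target batch; this is only a debugging sketch, assuming the databunch variable is still classifier_data_1:)

xb, yb = classifier_data_1.one_batch(DatasetType.Valid)
print(yb.shape)  # (16,) = one category index per item; (16, 31) = proper multi-label targets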

SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(5987, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(5987, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.4)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=31, bias=True)
    )
  )
)

Full stacktrace is

RuntimeError   Traceback (most recent call last)
<ipython-input-74-8ef94082412b> in <module>
      1 learn.freeze()
----> 2 learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))

~/fastai/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/fastai/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if not data.empty_val:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                        cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/fastai/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     52             if not is_listy(yb): yb = [yb]
     53             nums.append(yb[0].shape[0])
---> 54             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     55             if n_batch and (len(nums)>=n_batch): break
     56         nums = np.array(nums, dtype=np.float32)

~/fastai/fastai/callback.py in on_batch_end(self, loss)
    237         "Handle end of processing one batch with `loss`."
    238         self.state_dict['last_loss'] = loss
--> 239         stop = np.any(self('batch_end', not self.state_dict['train']))
    240         if self.state_dict['train']:
    241             self.state_dict['iteration'] += 1

~/fastai/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    185     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    186         "Call through to all of the `CallbakHandler` functions."
--> 187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 

~/fastai/fastai/callback.py in <listcomp>(.0)
    185     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    186         "Call through to all of the `CallbakHandler` functions."
--> 187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 

~/fastai/fastai/callback.py in on_batch_end(self, last_output, last_target, **kwargs)
    272         if not is_listy(last_target): last_target=[last_target]
    273         self.count += last_target[0].size(0)
--> 274         self.val += last_target[0].size(0) * self.func(last_output, *last_target).detach().cpu()
    275 
    276     def on_epoch_end(self, **kwargs):

~/fastai/fastai/metrics.py in accuracy_thresh(y_pred, y_true, thresh, sigmoid)
     20     "Compute accuracy when `y_pred` and `y_true` are the same size."
     21     if sigmoid: y_pred = y_pred.sigmoid()
---> 22     return ((y_pred>thresh)==y_true.byte()).float().mean()
     23 
     24 def dice(input:FloatTensor, targs:LongTensor, iou:bool=False)->Rank0Tensor:

RuntimeError: The size of tensor a (31) must match the size of tensor b (16) at non-singleton dimension 1

Can you provide information about what your data looks like? Maybe you could try to create your data bunch from a dataframe or from files using from_df or other helper functions, instead of loading empty data dumps?

Maybe you could also share what your data looks like. In my case, comma-separated strings didn't work, but a list of labels worked. For example, this is my df:

If you have data in tidy format, you can use these helper functions:

# Collect every label attached to a given observation
def get_all_labels_for_observation(observation):
    return labeled_df[labeled_df['Observation'] == observation].label.values

# Add a 'Labels' column holding the full list of labels for that row's observation
def add_label_list_column(row):
    row['Labels'] = get_all_labels_for_observation(row['Observation'])
    return row

labeled_df = labeled_df.apply(add_label_list_column, axis=1)

and then just drop duplicates. This way you go from 1 observation per row to a multilabel format, like this:
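(In code, the drop-duplicates step could look something like this, keeping one row per observation; the 'Observation' column name follows the helper above:)

labeled_df = labeled_df.drop_duplicates(subset=['Observation']).reset_index(drop=True)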

Wow. So I had 1 category per row; I just wrapped it up into a list and it works now.
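(In code, that wrapping can be as simple as the following; the column name 'label' is just a placeholder:)

train_df['label'] = train_df['label'].apply(lambda l: [l])
valid_df['label'] = valid_df['label'].apply(lambda l: [l])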

I don't understand why though. Why would it expect a list of categories instead of just a category per row?

Anyway, thanks for all the help @annanana

Actually, @annanana, when you ran learn.predict, what did you get? I just get a tuple out, which is (MultiCategory, a tensor of all zeroes, and a tensor of what I think are the text embeddings).

learn.predict() is only for the LM, for example: print(lm_learner.predict('Your initial text', 100, temperature=1.1, min_p=0.001)). This way you get the new sequence generated by the LM.

For the classifier, use the get_preds method, like so: classifier_output = classifier.get_preds(ds_type=DatasetType.Test, with_loss=False)
Then you get:

y_pred = classifier_output[0].numpy() #.astype(int) 
y_true = classifier_output[1].numpy()
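(To turn those probabilities into actual label lists you can threshold them, e.g. at 0.5; a sketch, assuming the databunch exposes the class names as classifier_data_1.classes:)

pred_labels = [[classifier_data_1.classes[i] for i, p in enumerate(row) if p > 0.5]
               for row in y_pred]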

Hi @sgugger, I am using fastai 1.0.40, and am trying to create a multi-label text classifier with 26 labels. However, I haven't been able to get MultiCategoryList to work. See screenshot for example of data, TextClasDataBunch use, and result. Could you advise? Thanks!

You need to tell the library that your tags are lists of labels by passing label_delim=' ' (think it's space from what I see in your screenshot, but the value may change depending on your dataset).
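(A sketch of what that call can look like; the paths, column names and batch size here are placeholders, not from the screenshot:)

data_clas = TextClasDataBunch.from_df(path, train_df, valid_df,
                                      vocab=data_lm.vocab,
                                      text_cols='text',    # column holding the documents
                                      label_cols='tags',   # column holding e.g. "label_a label_b"
                                      label_delim=' ',     # split each row's tags into multiple labels
                                      bs=32)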

That was it! I see how the field is inherited from TextDataBunch.from_df now. Thanks again!

@annanana Hi, how do I print the same for classification?
I.e. print some examples from the val set in the form input_text true_label predicted_label.

Hi, I am trying to implement ULMFiT for Kaggle's toxic comments dataset. I got it to run, but it feels like I am doing something wrong since the predictions aren't that good. Could it be that the dataset is heavily biased towards the class 'none', so the accuracy looks good even though the model actually isn't?

pdf in git

Does "RNNLearner" not have the method "language_model" any more? Should we use
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
instead? Thank you.