Multilabel classification with ULMFiT in Fastai v1

Thank you!

Now I get another error:

FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  res[x] = 1.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-126d84ffee9a> in <module>()
      1 multilabel_classifier.load_encoder('lm_encoder')
      2 multilabel_classifier.freeze()
----> 3 multilabel_classifier.fit_one_cycle(1, 1e-2, moms = (0.8,0.7))

~/.local/lib/python3.6/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     19     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     20                                         pct_start=pct_start, **kwargs))
---> 21     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     22 
     23 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if hasattr(data,'valid_dl') and data.valid_dl is not None and data.valid_ds is not None:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                        cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/.local/lib/python3.6/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     47     with torch.no_grad():
     48         val_losses,nums = [],[]
---> 49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
     50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
     51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))

~/.local/lib/python3.6/site-packages/fastprogress/fastprogress.py in __iter__(self)
     63         self.update(0)
     64         try:
---> 65             for i,o in enumerate(self._gen):
     66                 yield o
     67                 if self.auto_update: self.update(i+1)

~/.local/lib/python3.6/site-packages/fastai/basic_data.py in __iter__(self)
     68     def __iter__(self):
     69         "Process and returns items from `DataLoader`."
---> 70         for b in self.dl:
     71             #y = b[1][0] if is_listy(b[1]) else b[1] # XXX: Why is this line here?
     72             yield self.proc_batch(b)

~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

~/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py in <listcomp>(.0)
    613         if self.num_workers == 0:  # same-process loading
    614             indices = next(self.sample_iter)  # may raise StopIteration
--> 615             batch = self.collate_fn([self.dataset[i] for i in indices])
    616             if self.pin_memory:
    617                 batch = pin_memory_batch(batch)

~/.local/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
    490     def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
    491         if isinstance(try_int(idxs), int):
--> 492             if self.item is None: x,y = self.x[idxs],self.y[idxs]
    493             else:                 x,y = self.item   ,0
    494             if self.tfms:

~/.local/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
     90 
     91     def __getitem__(self,idxs:int)->Any:
---> 92         if isinstance(try_int(idxs), int): return self.get(idxs)
     93         else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
     94 

~/.local/lib/python3.6/site-packages/fastai/data_block.py in get(self, i)
    330         o = self.items[i]
    331         if o is None: return None
--> 332         return MultiCategory(one_hot(o, self.c), [self.classes[p] for p in o], o)
    333 
    334     def analyze_pred(self, pred, thresh:float=0.5):

~/.local/lib/python3.6/site-packages/fastai/data_block.py in <listcomp>(.0)
    330         o = self.items[i]
    331         if o is None: return None
--> 332         return MultiCategory(one_hot(o, self.c), [self.classes[p] for p in o], o)
    333 
    334     def analyze_pred(self, pred, thresh:float=0.5):

TypeError: list indices must be integers or slices, not NoneType

I’ve already made sure that there are no classes in the valid_df that are not present in train_df. What can I try to fix it?
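
The last frame of the traceback fails on `self.classes[p] for p in o`, so at least one label index in `o` is `None`. A check along the following lines can confirm whether every validation label also occurs in the training labels. It is only a minimal sketch, assuming each entry is a list of label strings; `find_unseen_labels` is a hypothetical helper, not part of fastai.

```python
def find_unseen_labels(train_labels, valid_labels):
    """Return validation labels that never occur in the training labels.

    Both arguments are iterables of label lists, e.g. [['a', 'b'], ['c']].
    """
    seen = {label for labels in train_labels for label in labels}
    candidates = {label for labels in valid_labels for label in labels}
    return sorted(candidates - seen)


# Toy data, made up for illustration.
train_labels = [["sports", "politics"], ["tech"]]
valid_labels = [["tech"], ["sports", "health"]]
unseen = find_unseen_labels(train_labels, valid_labels)
print(unseen)
```

With DataFrames, the label columns of train_df and valid_df could be passed in directly, assuming each cell holds a list of labels.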

Update: after installing a “fresh” developer version of fastai directly from GitHub, i.e. 1.0.36.post1, it looks like it works. Thank you!


Hi @annanana, Can you share your working code please??

from fastai import *
from fastai.text import *
from fastai.callbacks.tracker import EarlyStoppingCallback
from fastai.callbacks.tracker import SaveModelCallback
from fastai.callbacks.tracker import ReduceLROnPlateauCallback
import pandas as pd
from sklearn.model_selection import train_test_split  # used below to split the LM corpus

import fastai
print("Used fastai and torch version:")
print(fastai.__version__, torch.__version__)

torch.cuda.set_device(0) # torch.cuda.empty_cache()
path = Path("ULMFiT_Small_Corpus_Only")
print("\n")

print("=================================")
print("Reading in Train/valid/test data:")
print("==================================")
train_df = pd.read_json('01_corpus/train.jsonl', lines=True)
valid_df = pd.read_json('01_corpus/valid.jsonl', lines=True)
test_df = pd.read_json('01_corpus/test.jsonl', lines=True)
entire_small_corpus = pd.read_json('01_corpus/entire_corpus.jsonl', lines=True)

print(str(train_df.shape)+ ", " + str(valid_df.shape) + ", " + 
      str(test_df.shape) + ", " + str(entire_small_corpus.shape))
train_df.head()

torch.cuda.get_device_name(0)

torch.cuda.get_device_properties(0)

lm_train, lm_valid = train_test_split(entire_small_corpus, 
                                      test_size = 0.05, random_state = 0)
print("Number of rows in train: " + str(len(lm_train)) + 
      ", valid: " + str(len(lm_valid)))

data_lm_small = TextLMDataBunch.from_df(path, lm_train, lm_valid, 
                                        bs=16, text_cols = ['text', 'fulltext'], 
                                  max_vocab = 60000, min_freq = 2)
data_lm_small.save('data_lm_small')
print(f"Language model vocab size: {len(data_lm_small.vocab.itos)}.")
print("data_lm saved to: " + str(path))

save_model = partial(SaveModelCallback, monitor='val_loss', every='improvement', name='best_lm_small')
early_stop = partial(EarlyStoppingCallback, monitor='val_loss', min_delta=0.01, patience=2)

# bptt = 70, emb_sz = 400, nh = 1150, nl = 3, drop_mult = 1, wd = 0, 
lm_learner = language_model_learner(data_lm_small, pretrained_model=URLs.WT103_1, drop_mult=1,
                               callback_fns = [save_model, early_stop])

print("=================================")
print("Summary of the model's structure:")
print(lm_learner.model) # summary

lm_learner.load('best_lm_small')

lm_learner.save_encoder('best_lm_small_encoder')

classifier_data_1 = TextClasDataBunch.from_df(path, train_df, valid_df, test_df, bs=16, 
                                            max_vocab = 60000, min_freq = 2, 
                                            vocab = data_lm_small.train_ds.vocab,
                                            text_cols = ['text', 'fulltext'], 
                                              label_cols ='rm')
classifier_data_1.save('classifier_data_1')
print(f"Classifier vocab size: {len(classifier_data_1.vocab.itos)}.")
print("classifier_data_1 saved")

print("Define the training loop for the target task, i.e. Classification:")
save_1 = partial(SaveModelCallback, monitor='accuracy_thresh', 
                 every='improvement', name='best_accuracy_classifier_1')

early_stop = partial(EarlyStoppingCallback, monitor='val_loss', min_delta=0.01, patience=3)
lr_schedule = partial(ReduceLROnPlateauCallback, monitor='val_loss', patience=1, factor=1.5, min_delta=0.1) 

# bptt = 70, emb_sz = 400, nh = 1150, nl = 3, drop_mult = 1, wd = 0, 
classifier_1 = text_classifier_learner(classifier_data_1, 
                                    callback_fns = [save_1, early_stop, lr_schedule],
                                    metrics = [accuracy_thresh, fbeta])
print("\n")
print("classifier_1 defined. Model's summary:")
print("\n")
print(classifier_1.model) # summary

classifier_1.load_encoder('best_lm_small_encoder')
classifier_1.freeze()
classifier_1.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7)) # 1	0.324927	0.296019	0.881143	0.563900

classifier_1.save('classifier_last_layer_tuned')


Thank you so much @annanana

Hi @annanana, I’m facing an issue while doing inference.

Can you please help me resolve it?

@sgugger and @annanana
I’m facing a different issue with version 1.0.39.dev0.

I had to pass accuracy_thresh as an iterable, essentially:

learn = text_classifier_learner(classifier_data_1, metrics=[accuracy_thresh])

and learn.metrics looks right I think -

[<function fastai.metrics.accuracy_thresh(y_pred: torch.Tensor, y_true: torch.Tensor, thresh: float = 0.5, sigmoid: bool = True) -> <function NewType.<locals>.new_type at 0x7f59c66aebf8>>]

and the error I get when calling fit_one_cycle(…) is

RuntimeError: The size of tensor a (31) must match the size of tensor b (16) at non-singleton dimension 1

My labels aren’t one-hot encoded, they are just a column of strings. I have 31 of them overall, and the last layer of my net also has 31 output features, but I don’t understand what’s causing my tensor length to be 16.

SequentialRNN(
  (0): MultiBatchRNNCore(
    (encoder): Embedding(5987, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(5987, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.4)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_features=31, bias=True)
    )
  )
)

Full stacktrace is

RuntimeError   Traceback (most recent call last)
<ipython-input-74-8ef94082412b> in <module>
      1 learn.freeze()
----> 2 learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))

~/fastai/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, wd, callbacks, **kwargs)
     20     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor,
     21                                         pct_start=pct_start, **kwargs))
---> 22     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     23 
     24 def lr_find(learn:Learner, start_lr:Floats=1e-7, end_lr:Floats=10, num_it:int=100, stop_div:bool=True, **kwargs:Any):

~/fastai/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

~/fastai/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     87             if not data.empty_val:
     88                 val_loss = validate(model, data.valid_dl, loss_func=loss_func,
---> 89                                        cb_handler=cb_handler, pbar=pbar)
     90             else: val_loss=None
     91             if cb_handler.on_epoch_end(val_loss): break

~/fastai/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     52             if not is_listy(yb): yb = [yb]
     53             nums.append(yb[0].shape[0])
---> 54             if cb_handler and cb_handler.on_batch_end(val_losses[-1]): break
     55             if n_batch and (len(nums)>=n_batch): break
     56         nums = np.array(nums, dtype=np.float32)

~/fastai/fastai/callback.py in on_batch_end(self, loss)
    237         "Handle end of processing one batch with `loss`."
    238         self.state_dict['last_loss'] = loss
--> 239         stop = np.any(self('batch_end', not self.state_dict['train']))
    240         if self.state_dict['train']:
    241             self.state_dict['iteration'] += 1

~/fastai/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    185     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    186         "Call through to all of the `CallbakHandler` functions."
--> 187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 

~/fastai/fastai/callback.py in <listcomp>(.0)
    185     def __call__(self, cb_name, call_mets=True, **kwargs)->None:
    186         "Call through to all of the `CallbakHandler` functions."
--> 187         if call_mets: [getattr(met, f'on_{cb_name}')(**self.state_dict, **kwargs) for met in self.metrics]
    188         return [getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs) for cb in self.callbacks]
    189 

~/fastai/fastai/callback.py in on_batch_end(self, last_output, last_target, **kwargs)
    272         if not is_listy(last_target): last_target=[last_target]
    273         self.count += last_target[0].size(0)
--> 274         self.val += last_target[0].size(0) * self.func(last_output, *last_target).detach().cpu()
    275 
    276     def on_epoch_end(self, **kwargs):

~/fastai/fastai/metrics.py in accuracy_thresh(y_pred, y_true, thresh, sigmoid)
     20     "Compute accuracy when `y_pred` and `y_true` are the same size."
     21     if sigmoid: y_pred = y_pred.sigmoid()
---> 22     return ((y_pred>thresh)==y_true.byte()).float().mean()
     23 
     24 def dice(input:FloatTensor, targs:LongTensor, iou:bool=False)->Rank0Tensor:

RuntimeError: The size of tensor a (31) must match the size of tensor b (16) at non-singleton dimension 1

Can you provide information on what your data looks like? Maybe you could try to create your data bunch from a dataframe or from files, using from_df or other helper functions, instead of loading empty data dumps?

Maybe you could also share what your data looks like. For example, in my case, comma-separated strings didn’t work, but a list of labels worked. For example, this is my df:

If you have data in tidy format, you can use this helper function:

def get_all_labels_for_observation(observation):
    return labeled_df[labeled_df['Observation'] == observation].label.values

def add_label_list_column(row):
    row['Labels'] = get_all_labels_for_observation(row['Observation'])
    return row

labeled_df = labeled_df.apply(add_label_list_column, axis=1)

and then just drop duplicates. This way you go from 1 observation per row to a multilabel format, like this:
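
An alternative to the row-wise `apply` is a single `groupby`. The following is a sketch, assuming the column names `Observation` and `label` from the helper above and a toy DataFrame made up for illustration:

```python
import pandas as pd

# Toy tidy data: one (observation, label) pair per row. Values are made up.
labeled_df = pd.DataFrame({
    "Observation": ["doc1", "doc1", "doc2", "doc3", "doc3"],
    "label": ["a", "b", "a", "b", "c"],
})

# Collect all labels of each observation into one list per observation.
multilabel_df = (
    labeled_df.groupby("Observation", sort=False)["label"]
    .apply(list)
    .reset_index(name="Labels")
)
print(multilabel_df)
```

This yields one row per observation directly, so no separate drop-duplicates step is needed. Any other columns (such as the text) are not carried along and would have to be merged back in.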


Wow. So I had 1 category per row, I just wrapped it up into a list and it works now.

I don’t understand why though. Why would it expect a list of categories instead of just a category per row?
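
One plausible reading of the error, offered as a guess rather than a confirmed diagnosis: with one string per row, the targets arrive as a vector of class indices of shape (batch_size,), while the model output has shape (batch_size, 31). accuracy_thresh compares the two elementwise, which requires broadcast-compatible shapes. Assuming a batch size of 16, the trailing dimensions are 31 and 16, and those cannot be broadcast. The hypothetical helper below only illustrates the broadcasting rule; it is not fastai or PyTorch code.

```python
def can_broadcast(shape_a, shape_b):
    """Check the standard broadcasting rule used by NumPy and PyTorch.

    Shapes are aligned from the right; each pair of sizes must be equal
    or one of them must be 1.
    """
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


pred_shape = (16, 31)        # model output: batch of 16, 31 classes (assumed)
single_label_shape = (16,)   # one class index per row
multi_hot_shape = (16, 31)   # one 0/1 entry per class per row

print(can_broadcast(pred_shape, single_label_shape))
print(can_broadcast(pred_shape, multi_hot_shape))
```

Under this reading, wrapping each category in a list makes the targets multi-hot, so their shape matches the model output.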

Anyway, thanks for all the help @annanana

Actually, @annanana when you ran learn.predict, what did you get? I just get a tuple out, which is (Multicategory, a tensor of all zeroes, tensor of what I think are the text embeddings)

learn.predict() is only for the LM, for example: print(lm_learner.predict('Your initial text', 100, temperature = 1.1, min_p = 0.001)). This way you get the new sequence generated by the LM.

For the classifier, use the get_preds method, like so: classifier_output = classifier.get_preds(ds_type=DatasetType.Test, with_loss=False)
Then you get:

y_pred = classifier_output[0].numpy() #.astype(int) 
y_true = classifier_output[1].numpy()

Hi @sgugger, I am using fastai 1.0.40, and am trying to create a multi-label text classifier with 26 labels. However, I haven’t been able to get MultiCategoryList to work. See screenshot for example of data, TextClasDataBunch use, and result. Could you advise? Thanks!


You need to tell the library that your tags are lists of labels by passing label_delim=' ' (think it’s space from what I see in your screenshot, but the value may change depending on your dataset).
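
Conceptually, the delimiter turns each label string into a list of labels. The sketch below mimics that step in plain Python for illustration only; `split_labels` is a hypothetical helper and does not reproduce fastai's internals, and the label names are made up.

```python
def split_labels(label_string, delim=" "):
    """Split a delimited label string into a list, dropping empty pieces."""
    return [part for part in label_string.split(delim) if part]


print(split_labels("toxic obscene insult"))
print(split_labels("a,b", delim=","))
```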


That was it! I see how the field is inherited from TextDataBunch.from_df now. Thanks again!

@annanana Hi, how do I print the same for classification?
I would like to print some examples from the validation set in the form: input_text, true_label, predicted_label.
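
One way to do this, as a sketch: threshold each prediction row and map the positions back to class names. It assumes `y_pred` holds per-class probabilities and `y_true` holds multi-hot targets, as in the get_preds output above, and that `texts` and `classes` are lists you already have. All names and toy values here are placeholders.

```python
def decode(row, classes, thresh=0.5):
    """Return the class names whose score in `row` exceeds `thresh`."""
    return [name for name, score in zip(classes, row) if score > thresh]


# Placeholder data standing in for the real validation set.
classes = ["toxic", "obscene", "insult"]
texts = ["first example text", "second example text"]
y_true = [[1, 0, 1], [0, 0, 0]]
y_pred = [[0.9, 0.2, 0.7], [0.1, 0.6, 0.3]]

rows = [
    (text, decode(t, classes), decode(p, classes))
    for text, t, p in zip(texts, y_true, y_pred)
]
for text, true_labels, pred_labels in rows:
    print(f"{text}\t{true_labels}\t{pred_labels}")
```

The same function works on rows of NumPy arrays, since it only iterates over the values.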

Hi, I am trying to implement ULMFiT for Kaggle’s toxic comments dataset. I got it to run, but it feels like I am doing something wrong, since the predictions aren’t that good. Could it be that the dataset is heavily biased towards the class “none”, so that the accuracy looks good when it actually is not?
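
A toy calculation shows why thresholded accuracy can look good on imbalanced multilabel data. The numbers are made up and the helper functions are written for illustration, not taken from fastai: a model that predicts no label at all still scores high elementwise accuracy while its F1 is zero.

```python
def elementwise_accuracy(y_true, y_pred):
    """Fraction of (sample, class) cells where prediction equals target."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in pairs) / len(pairs)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, class) cells."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# 10 samples, 6 classes; only the first sample carries any label.
y_true = [[1, 1, 0, 0, 0, 0]] + [[0] * 6 for _ in range(9)]
y_pred = [[0] * 6 for _ in range(10)]  # always predict "no label"

acc = elementwise_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, micro-F1={f1:.3f}")
```

Tracking an F-score or per-class recall next to accuracy makes this kind of degenerate solution easier to spot.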

pdf in git

Does “RNNLearner” no longer have the method “language_model”? Should we use this instead:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
Thank you.