Hey
I’ve been trying to use ULMFiT for multi-label classification. I collected the labels for each row into a comma-separated list and fed them into a TextClasDataBunch like so:
data_clas = TextClasDataBunch.from_csv('./', 'collected_combined_data.csv', valid_pct=0.3, vocab=data.vocab, text_cols='comment', label_cols='mot', label_delim=',', bs=16)
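One thing I checked while debugging (just a guess on my part): whether whitespace or casing around the delimiter could be inflating the class count, since `label_delim=','` does a plain split and would treat `'class2'` and `' class2'` as different classes. A minimal sketch with made-up labels:

```python
# Hypothetical label strings: stray spaces after the delimiter create
# "new" classes, which might explain 21 real labels ballooning to 82.
rows = ["class1,class2", "class1, class2", "CLASS1,class3"]

classes = set()
for row in rows:
    classes.update(row.split(","))  # the same split label_delim=',' performs

print(sorted(classes))
# note 'class2' and ' class2' are counted as distinct classes
```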
The model mostly looks right, except that the final layer is 50x82. I don’t understand where the 82 comes from, since I only have 21 classes; ideally it should just take the sigmoid over a 50x21 final layer.
My preds and y_true from model.validate are the same shape (3765x82), but 1. how did it arrive at 82 in the first place, and 2. how do I interpret the preds to recover the original labels? Here’s the full learner:
RNNLearner(data=TextClasDataBunch;
Train: LabelList (8782 items)
x: TextList
y: MultiCategoryList
['class1'; 'class2'],['class3']...
Path: .;
Valid: LabelList (3765 items)
x: TextList
<text>
y: MultiCategoryList
<classes in the same format>
Path: .;
Test: None, model=SequentialRNN(
(0): MultiBatchEncoder(
(module): AWD_LSTM(
(encoder): Embedding(4616, 400, padding_idx=1)
(encoder_dp): EmbeddingDropout(
(emb): Embedding(4616, 400, padding_idx=1)
)
(rnns): ModuleList(
(0): WeightDropout(
(module): LSTM(400, 1152, batch_first=True)
)
(1): WeightDropout(
(module): LSTM(1152, 1152, batch_first=True)
)
(2): WeightDropout(
(module): LSTM(1152, 400, batch_first=True)
)
)
(input_dp): RNNDropout()
(hidden_dps): ModuleList(
(0): RNNDropout()
(1): RNNDropout()
(2): RNNDropout()
)
)
)
(1): PoolingLinearClassifier(
(layers): Sequential(
(0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Dropout(p=0.2)
(2): Linear(in_features=1200, out_features=50, bias=True)
(3): ReLU(inplace)
(4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Dropout(p=0.1)
(6): Linear(in_features=50, out_features=82, bias=True)
)
)
), opt_func=functools.partial(<class 'torch.optim.adam.Adam'>, betas=(0.9, 0.99)), loss_func=FlattenedLoss of BCEWithLogitsLoss(), metrics=[], true_wd=True, bn_wd=True, wd=0.01, train_bn=True, path=PosixPath('.'), model_dir='models', callback_fns=[functools.partial(<class 'fastai.basic_train.Recorder'>, add_time=True, silent=False)], callbacks=[RNNTrainer
learn: ...
alpha: 2.0
beta: 1.0], layer_groups=[Sequential(
(0): Embedding(4616, 400, padding_idx=1)
(1): EmbeddingDropout(
(emb): Embedding(4616, 400, padding_idx=1)
)
), Sequential(
(0): WeightDropout(
(module): LSTM(400, 1152, batch_first=True)
)
(1): RNNDropout()
), Sequential(
(0): WeightDropout(
(module): LSTM(1152, 1152, batch_first=True)
)
(1): RNNDropout()
), Sequential(
(0): WeightDropout(
(module): LSTM(1152, 400, batch_first=True)
)
(1): RNNDropout()
), Sequential(
(0): PoolingLinearClassifier(
(layers): Sequential(
(0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Dropout(p=0.2)
(2): Linear(in_features=1200, out_features=50, bias=True)
(3): ReLU(inplace)
(4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Dropout(p=0.1)
(6): Linear(in_features=50, out_features=82, bias=True)
)
)
)], add_time=True, silent=None)
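For question 2, here’s what I was planning to try once the class count is sorted out: threshold the sigmoid outputs and map each column back to its class name via the databunch’s class list (in fastai v1 that order lives in `data_clas.classes`, if I understand correctly). A minimal sketch with stand-in values:

```python
# Stand-ins: classes would come from data_clas.classes, and preds from
# the (already sigmoid-ed) first element of learn.get_preds().
classes = ["class1", "class2", "class3"]
preds = [[0.9, 0.1, 0.7],
         [0.2, 0.8, 0.3]]

threshold = 0.5  # arbitrary cutoff; would need tuning per label
decoded = [[c for c, p in zip(classes, row) if p > threshold]
           for row in preds]
print(decoded)  # [['class1', 'class3'], ['class2']]
```

Does that look like the right way to go from the 3765x82 preds matrix back to label strings?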