Confusion about output layer activation function for multi-label classification

I am doing multi-label classification on tabular data (implemented using the multi-label helper in the fast.ai documentation) and the model is performing great. But when I inspected the model more closely, I noticed that the output layer is just a linear transform rather than a Sigmoid activation function. More specifically, the architecture is

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(25, 10)
    (1): Embedding(9, 5)
    (2): Embedding(33, 11)
    (3): Embedding(32, 11)
    (4): Embedding(8, 5)
    (5): Embedding(207, 32)
    (6): Embedding(3, 3)
    (7): Embedding(3, 3)
    (8): Embedding(3, 3)
    (9): Embedding(3, 3)
    (10): Embedding(3, 3)
    (11): Embedding(3, 3)
    (12): Embedding(5, 4)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(314, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(410, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=410, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=4, bias=True)
    )
  )
)
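
For context, the learner is set up roughly along these lines (a sketch only: the column names, procs and batch size are placeholders, not my actual dataset):

from fastai.tabular.all import *

# Multi-label tabular setup (placeholder names); the four y_names columns
# hold 0/1 indicators, hence MultiCategoryBlock(encoded=True)
y_names = ['label_a', 'label_b', 'label_c', 'label_d']
dls = TabularDataLoaders.from_df(
    df, procs=[Categorify, FillMissing, Normalize],
    cat_names=cat_cols, cont_names=cont_cols,
    y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names),
    bs=64)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy_multi)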

I noticed that Jason Brownlee wrote here that for multi-label classification the activation function of the output layer should be nn.Sigmoid(). I tried replacing the last layer with a LinBnDrop(act=nn.Sigmoid()) layer, but then I get no improvement in validation loss and worse performance overall.
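
The replacement looked roughly like this (a sketch; the layer sizes are taken from the architecture printout above, the exact call may have differed slightly):

import torch.nn as nn
from fastai.tabular.all import *   # LinBnDrop comes from fastai.layers

# Swap the final LinBnDrop (plain Linear 100 -> 4) for one that ends in a Sigmoid
learn.model.layers[2] = LinBnDrop(100, 4, bn=False, act=nn.Sigmoid())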

Why does tabular_model implement a linear output layer automatically for multi-label classification? And why am I not seeing any improvement when switching the last layer to use sigmoid as its activation?

Inspecting the model's loss function, it seems to utilise

FlattenedLoss of BCEWithLogitsLoss

which combines a Sigmoid layer and BCELoss (binary cross-entropy loss) in one numerically stable module. So a sigmoid activation is applied to the output of the last layer after all, just inside the loss function rather than inside the model. That also answers the second question: adding an explicit nn.Sigmoid() to the model means the outputs are squashed through a sigmoid twice (once in the model and once again in BCEWithLogitsLoss), which is why training gets worse rather than better.
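
A quick standalone check (plain PyTorch, not tied to the model above) confirms the equivalence and shows what goes wrong with a double sigmoid:

import torch
import torch.nn as nn

torch.manual_seed(0)
logits  = torch.randn(8, 4)                      # raw linear-layer outputs: 8 rows, 4 labels
targets = torch.randint(0, 2, (8, 4)).float()    # multi-hot targets

bce_logits = nn.BCEWithLogitsLoss()              # sigmoid + BCE in one numerically stable op
bce        = nn.BCELoss()

print(bce_logits(logits, targets))               # same value ...
print(bce(torch.sigmoid(logits), targets))       # ... as sigmoid followed by BCELoss

# With an explicit Sigmoid in the model, the loss applies sigmoid a second time:
print(bce_logits(torch.sigmoid(logits), targets))  # different, badly scaled loss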