Tips to improve multi-label text classification?

Hi everyone,

we are trying to do multi-label text classification on a legal dataset: the documents are legal texts, and each one is classified with labels from a legal term taxonomy (EuroVoc).

The problem is quite hard: we have 17,519 documents and 3,391 classes in the dataset, and the class distribution is very imbalanced.
On average a document has 5.6 labels, ranging from 2 to 10. Some labels appear on hundreds of documents, others on only a few.
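
For reference, this is roughly how we compute those statistics. A minimal sketch: the two-row `df` is a toy stand-in for our real data, which has one row per document and a list of EuroVoc labels in a `labels` column.

```python
import pandas as pd
from collections import Counter

# Toy stand-in for the real data: one row per document,
# each with a list of EuroVoc labels.
df = pd.DataFrame({"labels": [["trade", "customs"],
                              ["trade", "fisheries", "tariff"]]})

label_counts = Counter(l for labels in df["labels"] for l in labels)
n_per_doc = df["labels"].apply(len)

print(f"docs={len(df)}  classes={len(label_counts)}")
print(f"labels/doc: mean={n_per_doc.mean():.1f} "
      f"min={n_per_doc.min()} max={n_per_doc.max()}")
print("most common:", label_counts.most_common(3))
```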

We tried two versions, one with preprocessing and one without. Preprocessing includes lowercasing, replacing dates with a placeholder token, etc. The preprocessed version works better.
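
Roughly, the preprocessing looks like this. A sketch only: the `xxdate` token name and the date regexes are illustrative, not our exact rules.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase and replace dates with a placeholder (illustrative rules)."""
    text = text.lower()
    # "xxdate" is a hypothetical token name; any consistent marker works.
    text = re.sub(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b", " xxdate ", text)
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", " xxdate ", text)
    # Collapse repeated whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Regulation of 12.03.2019 amending Directive 2009/138/EC"))
```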

We have trained a classifier with the standard fastAI V3 pipeline:
https://github.com/gwohlgen/colab/blob/master/basic_version_forum.ipynb
The code should be easy to follow; it uses the basic steps from the V3 course.
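
In outline, the notebook does roughly this (a sketch using the fastai v1 API; `path`, the `text`/`labels` column names, and the hyperparameters are placeholders, not our exact settings):

```python
from fastai.text import *

# Assumed: `path` is a working directory, and train_df / valid_df are
# DataFrames with a "text" column and a space-delimited "labels" column.

# Step 1: fine-tune a language model on the legal corpus (ULMFiT).
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df, text_cols="text")
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder("ft_enc")

# Step 2: train the multi-label classifier on the fine-tuned encoder.
data_clas = TextClasDataBunch.from_df(
    path, train_df, valid_df, vocab=data_lm.vocab,
    text_cols="text", label_cols="labels", label_delim=" ")
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("ft_enc")
learn.fit_one_cycle(4, 1e-2)
```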

Our results, with a micro-F1 of 0.55, seem not so bad given the difficulty of the dataset. In detail:

F1 (micro): 0.55
P  (micro): 0.58
R  (micro): 0.52

F1 (macro): 0.14
P  (macro): 0.16
R  (macro): 0.15
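
These numbers can be reproduced with scikit-learn's averaging options; a sketch with toy binarized label matrices (`y_true`/`y_pred` would be our real (n_docs, n_classes) indicator matrices):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy binary indicator matrices, shape (n_docs, n_classes).
y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# Micro-averaging pools all decisions; macro-averaging scores each
# class separately and averages, so rare classes count equally.
for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```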

However, a group using an SVM classifier and heavy preprocessing reported an F1 of 0.61 back in 2012; unfortunately, they don't provide many details (and no code) on what exactly they did.

Anyway, my question is: what would you try to improve the results in such a situation?

One problem we see is the class imbalance, which leads to low macro-F1 (and is probably also a problem during training). We tried to use class weights in training, but that doesn't seem to be supported yet in fastAI, see here.
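
One workaround might be to hand the learner a weighted loss directly, since PyTorch's BCE loss accepts per-class positive weights. A sketch, assuming inverse-frequency weighting (just one possible scheme) and toy per-class counts:

```python
import torch
import torch.nn as nn

# `label_counts` = number of positive documents per class (toy values here);
# rare classes get a larger positive weight (inverse-frequency style).
n_docs = 17519
label_counts = torch.tensor([500.0, 40.0, 3.0])
pos_weight = (n_docs - label_counts) / label_counts

weighted_bce = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# With a fastai v1 learner one could then try: learn.loss_func = weighted_bce

# Quick check with dummy logits/targets of shape (batch, n_classes):
logits = torch.randn(2, 3)
targets = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
print(weighted_bce(logits, targets))
```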

What else could we try?
Hope to get some feedback, because IMHO this is quite an interesting topic for many.

Best, Gerhard