To get a baseline, I would try the simple approach proposed by Jeremy in Lesson 9: use multi-category classification with a sigmoid loss function (MultiCategoryBlock with BCEWithLogitsLossFlat) instead of softmax (CategoryBlock with CrossEntropyLossFlat).
This is discussed in "Handle data that belongs to classes not seen in training or testing".
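
To illustrate why the sigmoid setup helps here, this is a rough sketch in plain Python (not fastai code). The labels, logits and the 0.5 threshold are made-up values for illustration only. Softmax probabilities always sum to 1, so some class is always picked, while independent sigmoids can all stay below the threshold, which can be read as "none of the known classes".

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the outputs sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    # Each class gets an independent probability; they need not sum to 1.
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def predict_softmax(logits, labels):
    # Single-label decision: always returns exactly one class.
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

def predict_sigmoid(logits, labels, thresh=0.5):
    # Multi-label decision: returns every class above the threshold,
    # which may be an empty list.
    probs = sigmoid(logits)
    return [lbl for lbl, p in zip(labels, probs) if p > thresh]

# Hypothetical labels and logits, chosen only for this illustration.
labels = ["cat", "dog", "horse"]
unseen_logits = [-2.0, -1.5, -3.0]    # the model is unsure about every class
confident_logits = [4.0, -2.0, -3.0]  # the model is confident about "cat"

print(predict_softmax(unseen_logits, labels))     # still forced to pick a class
print(predict_sigmoid(unseen_logits, labels))     # can come back empty
print(predict_sigmoid(confident_logits, labels))  # confident case still works
```

An empty list from the sigmoid version is what you would treat as "not one of my training classes". The threshold is something you would need to tune on your own validation data.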