NLP-like embedding instead of MultiCategoryBlock

I am training a tabular model for health prediction, and we have what we call a disease block of about 10-15 diseases, each with a disease severity as an ordinal. For example, one of the diseases in the block might be stroke, and you could have 0=none, 1=mild, 2=medium, 3=severe. I was originally thinking of modelling this with a MultiCategoryBlock, but that does not seem to deal with the ordinality. Plus, a multi-category encoding of a 15-disease block has 2^15 combinations. That's huge.

So I thought that rather than do it that way, I could use an NLP-style vocab of 15 diseases, so that one patient might have something like this (kind of like term frequency, but with ordinals):

301011121000000.

Before I go off and write something, I was wondering if there is built-in support for this?
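For what it's worth, here is one way the "vocab of severities" idea could be realized directly in PyTorch (just a sketch; the 15 diseases and 4 severity levels come from the post above, and the embedding width of 8 is an arbitrary choice). The trick is to give every (disease, severity) pair its own index into a single `nn.Embedding`, so severity 2 of disease 5 gets a different learned vector than severity 2 of disease 6:

```python
import torch
import torch.nn as nn

N_DISEASES = 15   # size of the "disease vocab" from the post
N_SEVERITIES = 4  # 0=none, 1=mild, 2=medium, 3=severe

# One shared table: index = disease_idx * N_SEVERITIES + severity,
# so each (disease, severity) pair gets its own learned vector.
emb = nn.Embedding(N_DISEASES * N_SEVERITIES, 8)

# A patient like "301011121000000" is a vector of 15 severity codes.
severities = torch.tensor([[3, 0, 1, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 0]])
disease_idx = torch.arange(N_DISEASES).unsqueeze(0)    # [[0, 1, ..., 14]]
vectors = emb(disease_idx * N_SEVERITIES + severities)  # (1, 15, 8)

# Flatten (or sum/mean-pool) to feed a downstream tabular head.
features = vectors.flatten(1)                           # (1, 120)
```

Whether flattening or pooling is better probably depends on how much the diseases interact; flattening keeps a fixed slot per disease.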

Why not treat your problem as a regression problem? I'd give it a try using RegressionBlock(n_out=15) and set y_range=(-0.5,3.5) in the learner. Then treat your diseases in the input as floats / continuous values.
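To make the y_range suggestion concrete: fastai implements y_range as a scaled sigmoid on the final activation, so a minimal sketch of the same constraint in plain PyTorch (the (-0.5, 3.5) range is from the suggestion above) would be:

```python
import torch

def sigmoid_range(x, lo, hi):
    # Squash raw activations into (lo, hi); rounding the result
    # recovers an ordinal prediction in {0, 1, 2, 3}.
    return torch.sigmoid(x) * (hi - lo) + lo

raw = torch.randn(2, 15) * 5           # raw network outputs, 15 diseases
preds = sigmoid_range(raw, -0.5, 3.5)  # every value now in [-0.5, 3.5]
rounded = preds.round().clamp(0, 3)    # back to ordinal severity codes
```

The -0.5/3.5 endpoints are chosen so each integer severity sits in the middle of an equal-width interval, which makes plain rounding a sensible decode step.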

That's the same thing as an embedding, this is true, and then put the inputs in! Lovely! But I think PyTorch embeddings can be directly used too, right? I think I just have to write a custom block. But I was thinking this might be a common enough situation…
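In case it helps the discussion, a custom block along those lines might look like this (again just a sketch with made-up sizes; one small embedding table per disease is one design choice, a single severity table shared across all diseases would be another):

```python
import torch
import torch.nn as nn

class DiseaseBlock(nn.Module):
    """Embed each disease's ordinal severity, then concat into one feature vector."""
    def __init__(self, n_diseases=15, n_severities=4, emb_dim=4):
        super().__init__()
        # One small embedding table per disease.
        self.embs = nn.ModuleList(
            nn.Embedding(n_severities, emb_dim) for _ in range(n_diseases)
        )

    def forward(self, sev):              # sev: (batch, n_diseases) integer codes
        out = [e(sev[:, i]) for i, e in enumerate(self.embs)]
        return torch.cat(out, dim=1)     # (batch, n_diseases * emb_dim)

block = DiseaseBlock()
x = torch.tensor([[3, 0, 1, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 0]])
feats = block(x)                         # (1, 60), ready for a tabular head
```

The per-disease tables let the model learn that "severe" means something different for stroke than for, say, a skin condition, at the cost of more parameters.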

Hm, maybe I don't get your use case right, but I don't see why embeddings should help or what such a block would look like. It just could be me who doesn't get it :). Can you explain a bit more what your idea behind the embedding approach is?

The advantage of treating the problem as a regression task, IMHO, is that you could just use standard fastai and train a model in three lines of code with a couple of minutes of effort. That way you can quickly see if your model is able to predict your diseases. From this baseline you could try to improve your model and, more importantly, your whole pipeline.