I have several features. The most important features are shown in the figure.
My question:
What would be the best way to find the correct label? It is a multi-class categorization problem (~40 different labels).
Challenge:
The feature “Description” can be written in many different ways. The sign of “Amount” and the “Date” can be good indicators of whether a transaction is a salary or a purchase.
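As a rough illustration of how the sign of “Amount” and the “Date” could be turned into explicit columns, here is a minimal sketch. The function name, the feature names, and the "day 25 or later" cutoff are assumptions made up for this example, not something taken from the data in the figure.

```python
from datetime import date


def engineer_features(amount: float, booking_date: date) -> dict:
    """Turn a raw amount and date into a few simple numeric features."""
    return {
        # Positive amounts are incoming money (e.g. a salary).
        "is_credit": 1 if amount > 0 else 0,
        "abs_amount": abs(amount),
        "day_of_month": booking_date.day,
        # Monday is 0, Sunday is 6.
        "day_of_week": booking_date.weekday(),
        # Hypothetical cutoff: many salaries arrive late in the month.
        "is_late_in_month": 1 if booking_date.day >= 25 else 0,
    }


# Two invented transactions, for illustration only.
salary = engineer_features(2500.0, date(2019, 6, 28))
purchase = engineer_features(-12.99, date(2019, 6, 3))
print(salary)
print(purchase)
```

Features like these can sit next to whatever is extracted from “Description”, so the model does not have to rediscover the sign or the calendar position on its own.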
How would you tackle this problem? Do you have any good links? Is there a similar problem that was solved using the fastai library?
In the “Description” we have a lot of text information. Do we need NLP text classification here? If yes, how can we also take the other features like “Date” and “Amount” into account? Otherwise we lose information.
Few ideas (sent from my phone, so apologies for any brevity/spelling):
You could try building two networks - one NLP to do a preliminary classification based on just the free text, then using that as input into a regular FC NN along with your other features.
Is there any correlation between the ID and label? It appears not, but just checking.
I only have the features shown in the figure above. That means I have the (sometimes cryptic) description of the purchase given by the retailer (Vodafone, Ueber, Restaurant,…), the date and the amount.
Is there any correlation between ID and label?
No, there is no correlation!
You said I should try building two networks:
(1) Multi-label text classification (using fastai)
(2) FC NN
Let’s start with (1). I have some fundamental questions. Is sentiment analysis with multiple labels (positive, neutral, negative, very negative) some kind of multi-label text classification? If yes, I would have a good starting point with the following fastai videos:
Features
Gotcha – in that case, I don’t have any other thoughts on engineering something else out of them.
Sentiment Analysis
I’m no expert here (in fact, consider me an eager apprentice at best), but I’d suggest starting with lesson 4 of the Practical Deep Learning for Coders series. If you’re looking at diving in, I might start there, then start digesting Rachel’s amazing NLP course. ULMFiT was SOTA when released and still does amazingly well at classification. Not sure how it’ll do predicting 40+ labels, but worth a try.
FC NN
I haven’t done it personally, but the pipeline should be: do the NLP classification, append the prediction to the original data frame, drop the description column, and train the new network on the result. Others might have a more logical approach, but that’s what I got.
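To make the steps above a bit more concrete, here is a minimal sketch of that two-stage pipeline. It is not fastai code: a scikit-learn TF-IDF plus logistic regression model stands in for the NLP classifier, and scikit-learn's `MLPClassifier` stands in for the FC NN. The toy rows, the label names, and the extra `is_credit` and `day_of_month` columns are all invented for illustration. Using out-of-fold predictions via `cross_val_predict` is my own assumption about how to avoid leaking labels into the second stage; it wasn't part of the suggestion above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy transactions, invented purely for illustration.
df = pd.DataFrame({
    "Description": [
        "VODAFONE GMBH MOBILE", "Vodafone invoice june",
        "vodafone prepaid topup", "VODAFONE DSL contract",
        "SALARY ACME CORP", "acme corp salary june",
        "Salary payment ACME", "ACME payroll salary",
        "RESTAURANT ROMA", "Pizza restaurant napoli",
        "restaurant sushi bar", "Burger RESTAURANT city",
    ],
    "Amount": [-29.99, -35.0, -15.0, -39.99,
               2500.0, 2500.0, 2550.0, 2500.0,
               -42.5, -18.9, -55.0, -12.4],
    "Date": pd.to_datetime([
        "2019-06-03", "2019-07-03", "2019-07-15", "2019-08-03",
        "2019-06-28", "2019-07-29", "2019-08-28", "2019-09-27",
        "2019-06-08", "2019-06-21", "2019-07-12", "2019-08-17",
    ]),
    "Label": ["telecom"] * 4 + ["salary"] * 4 + ["dining"] * 4,
})

# Stage 1: a text-only classifier (stand-in for the NLP network).
text_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
# Out-of-fold class probabilities: each row is scored by a model
# that did not see that row during training.
text_proba = cross_val_predict(
    text_clf, df["Description"], df["Label"], cv=2, method="predict_proba"
)
classes = sorted(df["Label"].unique())  # same order as the probability columns
proba_df = pd.DataFrame(
    text_proba, columns=[f"text_p_{c}" for c in classes], index=df.index
)

# Stage 2: append the text predictions, drop the free-text column,
# and add simple numeric features from Amount and Date.
stage2 = pd.concat([df.drop(columns=["Description", "Label"]), proba_df], axis=1)
stage2["is_credit"] = (stage2["Amount"] > 0).astype(int)
stage2["day_of_month"] = stage2["Date"].dt.day
stage2 = stage2.drop(columns=["Date"])

# A small fully connected network on the combined features.
final_clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
final_clf.fit(stage2, df["Label"])

# Predicting on the training rows only shows the API;
# a real evaluation needs held-out data.
predictions = final_clf.predict(stage2)
print(stage2.columns.tolist())
print(predictions)
```

Passing the full probability vector rather than only the top predicted label lets the second model see how confident the text model was, which may matter with ~40 labels. For new transactions, the text model would first need to be refit on all training descriptions and then used to produce the same probability columns.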
Hopefully this is helpful – please keep me posted on how it goes!