I have a dataset that contains purchasing data. It looks similar to this:
I have several features. The most important features are shown in the figure.
What would be the best way to find the correct label? It is a multi-class classification problem (~40 different labels).
The feature “Description” can be written in many different ways. The sign of “Amount” and the “Date” can be good indicators of whether a transaction is a salary or a purchase.
How would you tackle this problem? Do you have any good links? Is there a similar problem that was solved using fastai library?
In “Description” we have a lot of text information. Do we need NLP text classification here? If so, how can we additionally take the other features like “Date” and “Amount” into account? Otherwise we lose information.
Thank you very much!
First, welcome to the fastai community!
A few ideas (sent from my phone, so apologies for any brevity/spelling):
- You could try building two networks: one NLP model to do a preliminary classification based on just the free text, then use that prediction as input to a regular fully connected (FC) neural network along with your other features.
- Is there any correlation between the ID and label? It appears not, but just checking.
- What are the other features?
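To make the two-network idea above concrete, here is a toy sketch using scikit-learn rather than fastai (all data, column names, and the choice of classifiers are made up for illustration): stage 1 classifies the free text alone, and stage 2 combines stage 1's class probabilities with the numeric features.

```python
# Toy illustration of the two-stage idea (not fastai code; data is invented).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

descriptions = ["salary acme corp", "uber trip", "restaurant dinner", "salary acme corp"]
amounts      = [3000.0, -12.5, -45.0, 3100.0]
labels       = ["income", "transport", "food", "income"]

# Stage 1: text-only classifier on the description field
vec = TfidfVectorizer()
X_text = vec.fit_transform(descriptions)
text_clf = MultinomialNB().fit(X_text, labels)
text_probs = text_clf.predict_proba(X_text)  # one probability per class per row

# Stage 2: concatenate the text probabilities with the other features
X_combined = np.hstack([text_probs, np.array(amounts).reshape(-1, 1)])
final_clf = LogisticRegression(max_iter=1000).fit(X_combined, labels)
preds = final_clf.predict(X_combined)
```

In a real setup stage 1 would be an NLP model (e.g. a fastai text classifier) and stage 2 a tabular/FC network, but the wiring is the same: the text model's output becomes an extra feature for the second model.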
Thank you very much for your response!
What are the other features?
- I only have the features shown in the figure above. That is, I have the (sometimes cryptic) description of the purchase given by the retailer (Vodafone, Uber, a restaurant, …), the date, and the amount.
Is there any correlation between ID and label?
- No, there is no correlation!
You said I should try building two networks:
1. Multi-class text classification (using fastai)
2. FC NN
Let’s start with (1). I have some fundamental questions. Is sentiment analysis with multiple classes (positive, neutral, negative, very negative) a kind of multi-class text classification? If so, I would have a good starting point with the following fastai videos:
Sentiment Classification with Naive Bayes (NLP video 4)
Sentiment Classification with Naive Bayes & Logistic Regression, contd. (NLP video 5)
Naive Bayes works in every language, so this would be a good start in my opinion. Maybe ULMFiT (video 10 in this playlist) is another good approach?
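To see why Naive Bayes is language-agnostic, here is a hand-rolled multinomial Naive Bayes on word counts (my own toy example, not from the videos): it only needs tokens and counts, so any language works. The labels and texts below are invented.

```python
# Minimal multinomial Naive Bayes with Laplace smoothing (toy data).
import math
from collections import Counter, defaultdict

train = [
    ("salary payment acme", "income"),
    ("monthly salary", "income"),
    ("uber trip downtown", "transport"),
    ("taxi uber airport", "transport"),
]

# Collect the tokens seen per class
class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: math.log(sum(1 for _, l in train if l == c) / len(train))
          for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def predict(text):
    # Score each class: log prior + smoothed log likelihood of each token
    scores = {}
    for c in class_docs:
        denom = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
        scores[c] = priors[c] + sum(
            math.log((counts[c][w] + 1) / denom) for w in text.split()
        )
    return max(scores, key=scores.get)
```

The classifier never looks inside a word, only at its counts per class, which is why it transfers across languages.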
Do you have any further useful links that could guide me through the first step?
Now, let’s quickly discuss the second point (2):
I am not sure how to connect the result/model from (1) to the second step. Do you have any advice on this?
Hope you can help me!
Gotcha – in that case, I don’t have any other thoughts on engineering something else out of them.
I’m no expert here (in fact, consider me an eager apprentice at best), but I’d suggest starting with lesson 4 of the Practical Deep Learning for Coders series. If you’re looking at diving in, I might start there, then start digesting Rachel’s amazing NLP course. ULMFIT was SOTA when released and still does amazingly well at classification. Not sure how it’ll do predicting 40+ labels, but worth a try.
I haven’t done it personally, but the pipeline should be: do the NLP classification, append the prediction to the original data frame, drop the description column, and train the new network on that. Others might have a more logical approach, but that’s what I’ve got.
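The data-frame plumbing described above might look roughly like this pandas sketch (column names, dates, and the stand-in predictions are all made up; in practice the `text_pred` column would come from the step-1 NLP model):

```python
# Rough sketch of the append-prediction / drop-description step (toy data).
import pandas as pd

df = pd.DataFrame({
    "Description": ["uber trip", "salary acme", "restaurant"],
    "Amount": [-12.5, 3000.0, -45.0],
    "Date": pd.to_datetime(["2023-01-03", "2023-01-31", "2023-02-14"]),
})

# Pretend step 1 produced a text-based prediction per row
df["text_pred"] = ["transport", "income", "food"]  # stand-in for NLP output

# Drop the free text and derive simple features for the second network
df = df.drop(columns=["Description"])
df["day_of_month"] = df["Date"].dt.day
df["is_credit"] = df["Amount"] > 0
```

The resulting frame has only tabular columns, so it can feed a regular FC/tabular model directly.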
Hopefully this is helpful – please keep me posted on how it goes!
Thank you very much for your help. I will definitely keep you updated once I have access to the data and have started trying some of your suggestions.