Background: This app currently uses TensorFlow, but I'm in the process of switching everything to fastai and implementing lots of the things I learned in the latest versions of this class. TL;DR: this is a multi-class (really, multi-label) text classification problem.
Plan/ideas for improvements:
- I was previously using softmax, but as Jeremy repeated many times, you should only use softmax if you are sure the labels are mutually exclusive, and mine, in theory, are not (even though they are artificially so in the dataset).
- Since the labels are not necessarily mutually exclusive, I'm going to try the label-smoothing technique that was introduced in the course.
- I'm going to start from the weights of the pre-trained WikiText language model.
- I want to give "few-shot learning" a go: using the representations from the language model to do a nearest-neighbor lookup and find similar issues, so I can predict labels even when there aren't many labeled examples to begin with. If anyone has any ideas here, please let me know.
- While I'm at it, I can take a shot at detecting duplicate issues.
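To make the softmax/label-smoothing point concrete, here's a minimal sketch of what I have in mind: sigmoid + BCE so each label is scored independently, with the 0/1 multi-hot targets smoothed toward the middle. The function name and the epsilon value are just placeholders, not anything from fastai:

```python
import torch
import torch.nn.functional as F

def smoothed_bce_loss(logits, targets, eps=0.1):
    # Smooth the hard 0/1 multi-hot targets: 1 -> 1 - eps/2, 0 -> eps/2
    smoothed = targets * (1 - eps) + 0.5 * eps
    # Sigmoid + BCE treats each label independently, so labels
    # don't have to be mutually exclusive (unlike softmax)
    return F.binary_cross_entropy_with_logits(logits, smoothed)

# One example with three (hypothetical) labels
logits = torch.tensor([[2.0, -1.0, 0.5]])
targets = torch.tensor([[1.0, 0.0, 1.0]])  # multi-hot ground truth
loss = smoothed_bce_loss(logits, targets)
```

With `eps=0.1` the targets become 0.95/0.05 instead of 1/0, which should keep the model from getting over-confident on labels that were artificially made exclusive in the dataset.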
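For the few-shot / duplicate-detection idea, assuming I can get a fixed-length vector per issue out of the LM encoder (e.g. mean-pooled hidden states), the lookup itself is simple cosine-similarity nearest neighbors. The embeddings below are toy 4-d vectors standing in for real encoder output:

```python
import torch
import torch.nn.functional as F

def nearest_issues(query_vec, issue_vecs, k=3):
    """Return indices of the k most similar issues by cosine similarity.

    query_vec:  (d,) embedding of the new issue
    issue_vecs: (n, d) embeddings of the known, labeled issues
    """
    sims = F.cosine_similarity(query_vec.unsqueeze(0), issue_vecs)
    return sims.topk(min(k, issue_vecs.size(0))).indices

# Toy example with hypothetical 4-dim issue embeddings
issues = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0, 0.0]])
query = torch.tensor([1.0, 0.0, 0.0, 0.0])
top = nearest_issues(query, issues, k=2)  # indices of the 2 closest issues
```

Labels could then be borrowed from the neighbors (e.g. majority vote), and a high similarity threshold on the top hit would double as a duplicate detector.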
I have two text fields, the issue title and the issue body. I could just concatenate them and insert field markers like xxTitle and xxBody, kind of like what's done in the course. Or I could train two separate encoders with a shared vocabulary, one per field, and try to merge them at the end. Any opinion on what to try first? I might just try both to see what happens.
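The concatenation option is trivially simple, which is part of its appeal. A sketch, with the marker tokens lowercased to match fastai's xx-prefixed special-token convention (the exact token names here are my own invention):

```python
def concat_fields(title, body, title_tok="xxtitle", body_tok="xxbody"):
    # A single encoder sees one stream of text; the marker tokens
    # let it learn where the title ends and the body begins
    return f"{title_tok} {title} {body_tok} {body}"

text = concat_fields("App crashes on startup", "Steps to reproduce: ...")
# -> "xxtitle App crashes on startup xxbody Steps to reproduce: ..."
```

One thing to make sure of: the marker tokens need to survive tokenization as single tokens (i.e. be added to the vocab as special tokens), otherwise the encoder can't use them as field boundaries.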
Are there any other features people can think of? For example, I thought about adding a "repository embedding", but I wasn't convinced this was a great idea because most of the repos that will install this app are unseen at training time. Right now, the features are the issue title and the issue body.
Are there any other creative things that people can think of that I’m missing?
The repo I'm working in is 100% open source and is here in case anyone is curious. I thought I should throw this out here because this is the smartest community of people that I know of. Thanks!