Advice on a Text Classification Problem with fast.ai

Background: This app currently uses TensorFlow :man_facepalming:, but I’m in the process of switching everything to fastai and implementing lots of things I learned in the latest versions of this class. TL;DR: this is a multi-class (really, multi-label) text classification problem.

Plan/ideas for improvements:

  • I was previously using softmax, but as Jeremy repeated many times, you should only use it if you are sure the labels are mutually exclusive. In theory, mine are not (even though they are artificially so in the dataset).
  • Since the labels are not necessarily mutually exclusive, I’m going to try the label-smoothing technique that was introduced.
  • I’m going to start from the weights of the pre-trained language model (WikiText-103).
  • I want to have a go at “few shot learning” by using the representations of the language model to see if I can do a nearest-neighbor lookup to find similar issues and predict labels, even if there are not that many labels to begin with. If anyone has any ideas here, please let me know.
  • While I’m at it, I can take a shot at detecting duplicate issues.
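To make the softmax versus multi-label point concrete, here is a minimal pure-Python sketch. It assumes a per-label sigmoid with binary cross-entropy as the multi-label replacement for softmax, and it adapts label smoothing to multi-hot targets by pulling each 0/1 target toward 0.5. That adaptation is an assumption for illustration, not necessarily the exact variant shown in the course. The label names and logits are made up.

```python
import math

def softmax(logits):
    """Probabilities that compete with each other and always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))

def smooth_targets(targets, eps=0.1):
    """Soften multi-hot targets: 1 becomes 1 - eps/2 and 0 becomes eps/2."""
    return [t * (1.0 - eps) + eps / 2.0 for t in targets]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, in a numerically stable form."""
    losses = [
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical labels and model outputs for one issue.
labels = ["bug", "feature", "question"]
logits = [2.0, 1.5, -3.0]
targets = [1.0, 0.0, 0.0]  # the dataset records only "bug"

print("softmax:", [round(p, 3) for p in softmax(logits)])
print("sigmoid:", [round(sigmoid(z), 3) for z in logits])
print("smoothed targets:", smooth_targets(targets))
print("loss:", round(bce_with_logits(logits, smooth_targets(targets)), 4))
```

With softmax, a high score for "bug" necessarily lowers the probability of "feature". With sigmoids, both can be above 0.5 at once, which is what you want if an issue can carry more than one label.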

Questions:

  • I have two text fields, issue title and issue body. I could just concatenate them and place field markers like xxTitle and xxBody, kind of like what is done in the course. Or I could train two separate encoders with a shared vocabulary and try to merge them at the end. Any opinion on what to try first? I might just try both to see what happens.

  • Are there any other features people can think of? For example, I thought about adding a “repository embedding” but wasn’t convinced this was a great idea, because most of the repos that will install this app are unseen. Right now, the features are the issue title and the issue body.

  • Are there any other creative things that people can think of that I’m missing?

The repo I’m working in is 100% open source and is here, in case anyone is curious. I thought I should throw this out here because this is the smartest community of people that I know of. Thanks :slight_smile:


We have had some success in adding metadata to ULMFiT in the first manner you describe (forum post here), so I think that, for your first question, concatenating them together with field markers would work well.

As for adding other features, I can’t think of any that wouldn’t require you to merge a structured data model with the text model. For example, I could see things like issue body length and whether the issue body contains code being useful indicators, but I don’t know if it’s worth the effort.
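A rough sketch of what those two indicators could look like, in plain Python. The function name, the feature names, and the code-detection heuristic (a Markdown code fence or a line indented by four spaces or a tab) are all assumptions for illustration; real issue bodies would likely need a more careful check.

```python
import re

# Heuristic markers of code in a Markdown issue body (assumed, not exhaustive).
FENCE_RE = re.compile(r"```")
INDENTED_RE = re.compile(r"^(?: {4}|\t)\S", re.MULTILINE)

def issue_features(title, body):
    """Return simple structured features for one issue."""
    title = title or ""
    body = body or ""
    return {
        "title_len_chars": len(title),
        "body_len_chars": len(body),
        "body_len_words": len(body.split()),
        "has_code": bool(FENCE_RE.search(body) or INDENTED_RE.search(body)),
    }

# Hypothetical issues.
print(issue_features("Crash on save", "Steps:\n```python\nsave()\n```"))
print(issue_features("Install question", "How do I install?"))
```

Features like these would still have to be fed into a tabular model, or concatenated with the text encoder's output, which is the merging effort mentioned above.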


In fastai we add xxfld followed by a contiguous integer for each field. It seems to work fine. @sgugger how is that functionality surfaced at the moment?


The automatic way applies when the texts for your fields are in separate columns of a dataframe: you pass mark_fields=True to the factory method you’re using, or to your TokenizeProcessor in the data block API.

Another way is to define a rule to do this and add it to your TokenizeProcessor.
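To show what the marked text ends up looking like, here is a plain-Python sketch of the joining step. It only imitates the format described above (xxfld followed by the field number) and is not the library's own code; the function name `join_fields` is made up, and the exact spacing fastai produces may differ.

```python
BOS, FLD = "xxbos", "xxfld"  # special tokens as named in this thread / fastai v1

def join_fields(fields, mark_fields=True):
    """Join several text fields into one string.

    With mark_fields=True, each field is prefixed by 'xxfld <n>',
    where n counts the fields starting from 1.
    """
    parts = [BOS]
    for i, text in enumerate(fields, start=1):
        if mark_fields:
            parts += [FLD, str(i)]
        parts.append(text)
    return " ".join(parts)

# Hypothetical issue title and body.
title, body = "Crash on save", "Steps to reproduce"
print(join_fields([title, body]))
print(join_fields([title, body], mark_fields=False))
```

With fastai v1 installed, you would not write this yourself. Something like `TextClasDataBunch.from_df(path, train_df, valid_df, text_cols=["title", "body"], label_cols="labels", mark_fields=True)` should do it, where the column names are placeholders for your own dataframe.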


Great, thanks! This is very helpful!