I have successfully fine-tuned the pretrained WikiText-103 model for English on a custom dataset and used it to create a custom classifier with good accuracy. The training is similar to the IMDB classifier (course-nlp/review-nlp-transfer.ipynb at master · fastai/course-nlp · GitHub).
There are also some good examples of training language models for new languages from scratch. For example, course-nlp/nn-turkish.ipynb at master · fastai/course-nlp · GitHub shows how to train a Turkish LM and then use it to classify some data.
So I think this could be a valid approach for classifying code samples in a similar fashion. I have looked but didn't find any pre-trained language models for source code. (If I missed any, pointers would be greatly appreciated!) My current plan is to create a language model from a large dataset of code and then fine-tune a classifier on some sample code snippets in a specific domain, probably starting with just a few thousand files for the LM and working up if the approach seems promising.
Does this approach seem reasonable? Does anything jump out as things to watch out for or things to avoid?
I plan to try a couple of different tokenizers, since I have seen a few papers that describe different tokenization strategies being useful (including breaking variable names into subpieces to avoid "out of vocabulary" (OOV) issues). I'm not sure whether a language-specific parser or tokenizer would also help.
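To illustrate the identifier-splitting idea: a simple regex can break snake_case and camelCase names into subtokens before they hit the vocabulary. This is just a sketch of the general technique, not from any particular paper; the function name and regex are my own:

```python
import re

def split_identifier(name):
    """Split a snake_case / camelCase identifier into subtokens.

    e.g. 'maxRetryCount_total' -> ['max', 'Retry', 'Count', 'total']
    """
    parts = []
    for chunk in name.split("_"):
        # [A-Z]+(?![a-z]) keeps acronym runs like 'HTTP' together;
        # the other alternatives pick up normal words and digit runs.
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", chunk)
    return [p for p in parts if p]
```

Each rare identifier then maps to several common subtokens, which should shrink the vocabulary considerably; a learned subword scheme (e.g. BPE/SentencePiece) is the more data-driven alternative to a hand-written rule like this.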
FYI to anyone else searching on the topic of machine learning on source code: I have found this site to be super useful - https://ml4code.github.io/