I spent my weekend working on improving Stack Roboflow. I updated the display of the generated code so it looks more natural (removing a bunch of whitespace noise left over from the tokenizer).
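For the curious, the cleanup is basically a handful of regex passes that re-attach punctuation and contractions to the surrounding words. This isn't the exact code, just a minimal sketch of the idea:

```python
import re

def detokenize(text):
    """Collapse the extra whitespace a word-level tokenizer leaves behind."""
    text = re.sub(r"\s+([.,:;!?%)\]])", r"\1", text)             # no space before closing punctuation
    text = re.sub(r"([(\[$])\s+", r"\1", text)                   # no space after opening brackets
    text = re.sub(r"\s+'\s*(s|t|re|ve|ll|d|m)\b", r"'\1", text)  # re-attach contractions
    return re.sub(r" {2,}", " ", text)                           # squeeze any remaining runs of spaces

print(detokenize("What ' s the best way to do this ( in Python ) ?"))
# -> What's the best way to do this (in Python)?
```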
I also linked it up with Elasticsearch (the search engine is live here for data exploration!) so I could start understanding what it's outputting. One of the most interesting things I found was that certain terms from the training data are over-represented and others are under-represented in the language model's output.
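For reference, here's roughly how you can pull term counts out of an index like this with a terms aggregation. The index and field names below are placeholders, not my actual setup, and a terms aggregation on a text field like this requires fielddata to be enabled (or a keyword sub-field):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="questions",  # placeholder index name
    size=0,             # we only want the aggregation, not the hits themselves
    aggs={
        "top_terms": {
            "terms": {"field": "body", "size": 100}  # 100 most frequent terms in the body field
        }
    },
)

for bucket in resp["aggregations"]["top_terms"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```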
There's a slight bias towards under-sampling a term vs. over-sampling it:
After digging in a little bit, it seems that terms which are common in both the wikitext dataset and my own training set tend to be over-sampled, while ones that appear primarily in my dataset are under-sampled. My hypothesis is that this has to do with transfer learning.
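The comparison itself is simple: for each term, take its relative frequency in the generated output divided by its relative frequency in the training set. Something along these lines (not my actual script, and the min_count cutoff is an arbitrary choice to filter out noisy rare terms):

```python
from collections import Counter

def sampling_ratios(generated_tokens, training_tokens, min_count=50):
    """Relative frequency of each term in the generated text divided by its
    relative frequency in the training set: >1 means over-sampled, <1 under-sampled."""
    gen, train = Counter(generated_tokens), Counter(training_tokens)
    gen_total, train_total = sum(gen.values()), sum(train.values())
    ratios = {}
    for term, count in train.items():
        if count < min_count:
            continue  # skip rare terms, where the ratio is mostly noise
        ratios[term] = (gen[term] / gen_total) / (count / train_total)
    return ratios

# ratios = sampling_ratios(generated_corpus.split(), training_corpus.split())
# sorted(ratios.items(), key=lambda kv: kv[1])[:20]   # most under-sampled terms
```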
For example, the most over-sampled terms (weighted by frequency of occurrence) are:
And the most under-sampled are:
I noted some more details about my findings in this Twitter thread.