Brendan,
I think the performance of pillow-simd is limited by read times from the hard disk.
I have seen this happen with my CUDA device as well (it is starved of data at times in spite of having multiple num_workers in play).
Jeremy is using an SSD for his data, so the read times are super fast and hence pillow-simd is effective.
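One way to check whether the disk or the decoding is the bottleneck is to time the two stages separately. The following is only a rough sketch: the function name `time_read_vs_decode` and the `decode` callback are made up for illustration and are not part of pillow-simd or fastai.

```python
import time
from pathlib import Path

def time_read_vs_decode(paths, decode):
    """Return (read_seconds, decode_seconds) summed over all files.

    `decode` is any callable taking the raw bytes of one file
    (hypothetical hook, e.g. an image decoder).
    """
    read_s = decode_s = 0.0
    for p in paths:
        t0 = time.perf_counter()
        raw = Path(p).read_bytes()   # disk (or OS cache) read only
        t1 = time.perf_counter()
        decode(raw)                  # CPU work only
        t2 = time.perf_counter()
        read_s += t1 - t0
        decode_s += t2 - t1
    return read_s, decode_s
```

With Pillow installed, `decode` could be something like `lambda b: Image.open(io.BytesIO(b)).convert('RGB')`. If the read time dominates, a faster decoder will not help much. Keep in mind that a second run over the same files may be served from the OS cache and look much faster than a cold read.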
I think it would be interesting to try this idea on mathematics! I don’t think this has been done yet, but if I am wrong I would love to read about it. A language model that is capable of completing maths could be useful for many things, for example classifying correct and incorrect reasoning in human solutions to math questions, accounting problems, and the like. I am sure there would be many other creative applications of such a model, and to me it seems extremely similar to creating any other language model.
Perhaps the same could be done with programming languages. It could be used for refactoring code, making it more readable, or even generating documentation for the code. It might even be possible to identify code that is vulnerable to exploitation or contains bugs, and maybe even fix it.
When converting a language model to a classifier model, it seems that only the weights for tokens in the imdb corpus are loaded:
new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i,w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r>=0 else row_m
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))
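To make the remapping above concrete, here is a minimal self-contained sketch on toy data. The vocabularies and weight values are invented for illustration, and it assumes that `stoi2` returns -1 for tokens missing from the pretrained vocabulary and that `row_m` is the mean of the pretrained embedding rows (the `r>=0` check suggests this, but neither definition appears in the snippet).

```python
import collections
import numpy as np

# Toy stand-ins (hypothetical values): 4 pretrained tokens, 3-dim embeddings.
itos2 = ['_unk_', 'the', 'movie', 'wiki']          # pretrained vocabulary
enc_wgts = np.arange(12, dtype=np.float32).reshape(4, 3)
row_m = enc_wgts.mean(0)                           # assumed: mean embedding row

itos = ['_unk_', 'the', 'movie', 'imdb']           # target corpus vocabulary
# Assumed: lookups of unseen tokens return -1.
stoi2 = collections.defaultdict(lambda: -1, {w: i for i, w in enumerate(itos2)})

vs, em_sz = len(itos), enc_wgts.shape[1]
new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i, w in enumerate(itos):
    r = stoi2[w]                                   # -1 if w has no pretrained row
    new_w[i] = enc_wgts[r] if r >= 0 else row_m    # copy the row or fall back to the mean
```

In this toy case 'imdb' gets the mean row, while the pretrained row for 'wiki' is simply dropped because 'wiki' is not in the target vocabulary.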
Presumably the wiki model has the larger vocabulary, and this is done for the sake of efficiency. But in the subsequent conversion to a classifier, the itos from the imdb LM is used:
itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)
If the corpus of your classifier has tokens that are not in the LM data, you’d lose these tokens as ‘unknown’.
Wouldn’t it be better to either use all the wiki103 vocabulary weights, or combine the LM and classifier corpora for training, whichever is possible?
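How much this matters in practice could be estimated by counting how many classifier tokens fall back to the unknown index. A rough sketch follows; the helper `unknown_rate` and the toy vocabulary are hypothetical, and it assumes index 0 is the unknown token, as the `defaultdict(lambda:0, ...)` above suggests.

```python
import collections

def unknown_rate(tokens, itos, unk_idx=0):
    """Return (fraction of tokens mapped to unk_idx, Counter of those tokens)."""
    stoi = collections.defaultdict(lambda: unk_idx,
                                   {v: k for k, v in enumerate(itos)})
    # Use .get so lookups do not add missing tokens to the dict.
    missing = collections.Counter(t for t in tokens
                                  if stoi.get(t, unk_idx) == unk_idx)
    rate = sum(missing.values()) / len(tokens) if tokens else 0.0
    return rate, missing

# Toy example with an invented LM vocabulary and one classifier sentence.
itos = ['_unk_', 'the', 'movie', 'was', 'great']
tokens = 'the movie was breathtakingly great'.split()
rate, missing = unknown_rate(tokens, itos)
```

Here one token out of five ('breathtakingly') is lost as unknown. On a real corpus, the most frequent entries in `missing` would show whether the lost tokens are likely to matter for classification.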
On fine-tuning the LM, it is said: “We first tune the last embedding layer so that the missing tokens initialized with mean weights get tuned properly. So we freeze everything except the last layer.”
In the code this is done with the following line:
learner.freeze_to(-1)
As I understand it, learner.freeze_to(-1) unfreezes only the topmost layer, which is not the embedding layer. The embedding layer is the bottommost (first) layer, so I would expect to see learner.freeze_to(0).
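To make the two readings easier to compare, here is a toy model of the semantics I am assuming for freeze_to(n): freeze every layer group, then unfreeze the groups from index n onward. This is a plain-Python illustration, not the library's code, and the group names are invented.

```python
def freeze_to(layer_groups, n):
    """Toy version of the assumed semantics: freeze all, then unfreeze groups[n:]."""
    for g in layer_groups:
        g['trainable'] = False
    for g in layer_groups[n:]:
        g['trainable'] = True
    return [g['name'] for g in layer_groups if g['trainable']]

# Hypothetical layer groups, ordered from bottom (input) to top (output).
groups = [{'name': name, 'trainable': True}
          for name in ['embedding', 'rnn_1', 'rnn_2', 'decoder']]

last_only = freeze_to(groups, -1)   # only the last group is trainable
everything = freeze_to(groups, 0)   # all groups are trainable
```

Under this assumption, freeze_to(0) would unfreeze the whole model rather than just the embedding layer. Also note that the weight-loading snippet earlier in the thread puts the same matrix into both '0.encoder.weight' and '1.decoder.weight'. If those weights are tied in the model (an assumption, not confirmed here), training the last group would update the embeddings as well.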