I am trying to test whether some adjustments to pre-trained word vectors will improve their usefulness in predictive models. @jeremy, @sebastianruder and all other experts - what best-practice tips do you have for testing for small differences in the outcomes of deep learning models?
I’ve been experimenting a bit more with word vectors following the discussion on this forum thread:
There are quite clear vectors for transforming a word to initial caps, and to all caps, in addition to the spelling and pretentiousness vectors already mentioned. All of these vectors share some features, which appears to be because they all relate to word frequency to some degree (bad spellings, all-caps forms and more pretentious adjectives are all less frequent than good spellings, etc.). I’ve checked this by running a linear regression with the embedding matrix as the features and log(word index number) as the y values, and the resulting coefficients share similar features with these vectors. Also, as an indication of how well frequency information is encoded in the embeddings, the R² of this regression is 0.75 on the entire embedding matrix and 0.92 on the top 250,000 words.
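In case it helps, here’s a minimal sketch of that frequency check, assuming the embeddings sit in a NumPy array whose rows are ordered by frequency rank (the random matrix below is only a stand-in for real pre-trained vectors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for a real embedding matrix whose rows are ordered by word frequency
# rank (most frequent word first), e.g. as loaded from a GloVe/fastText file.
vocab_size, dim = 250_000, 300
emb = np.random.randn(vocab_size, dim).astype(np.float32)

# Regress log(word index number) on the embedding components.
y = np.log(np.arange(1, vocab_size + 1))
reg = LinearRegression().fit(emb, y)
print("R^2:", reg.score(emb, y))  # ~0.75-0.92 with real embeddings; ~0 with this random stand-in

# The (normalised) coefficient vector then acts like a "frequency direction" in embedding space.
freq_direction = reg.coef_ / np.linalg.norm(reg.coef_)
```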
Based on a hypothesis that this information about word frequency, spelling, capitalisation, etc. could distract a model from the actual meaning of words, I’ve been trying to test whether removing these aspects of the word embeddings will improve a model’s performance. My approach to removal is simple: subtract from each word vector its projection onto the relevant transformation vector, then move on to the next transformation vector and repeat, as sketched below.
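For concreteness, this is roughly what that removal step looks like in code (a sketch only - `directions` is assumed to be a list of the transformation vectors found above, and the names in the usage comment are just placeholders):

```python
import numpy as np

def remove_directions(emb, directions):
    """Subtract from each word vector its projection onto each transformation vector in turn."""
    adjusted = emb.copy()
    for d in directions:
        d = d / np.linalg.norm(d)              # unit vector for this direction
        adjusted -= np.outer(adjusted @ d, d)  # remove each row's component along d
    return adjusted

# e.g. adjusted_emb = remove_directions(emb, [freq_direction, caps_direction, spelling_direction])
```

One thing worth noting about this sequential version: if the transformation vectors aren’t orthogonal, subtracting a later direction can re-introduce a small component of an earlier one, so the order of removal matters slightly.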
So far I’ve tried these adjusted embeddings on the IMDB and SST datasets using a fairly standard BiLSTM model. Keeping everything else constant, the adjusted embeddings seem to give very slightly higher accuracy on the test sets.
I’ve tried about 10 different sets of hyper-parameters, training each model for the same number of epochs and checking test set accuracy after each epoch. The results vary, with the improvement in accuracy ranging from about 0 to 1% (e.g. test set accuracy might increase from 88.2% to 88.5%), but so far the results have never been worse with the adjusted embeddings.
So, this all looks interesting. My question is how I should go about investigating the relationship more thoroughly - are there any specific things I should or should not be doing; any datasets it would be worth testing this on; specific model architectures and hyper-parameters I should use; etc.?