Research methodology: best practice when testing for a possible relationship

I am trying to test whether some adjustments to pre-trained word vectors will improve their usefulness in predictive models. @jeremy and @sebastianruder and all other experts - what best practice tips do you have for testing for small differences in outcomes of deep learning models?

I’ve been experimenting a bit more with word vectors following the discussion on this forum thread:

There are quite clear vectors for transforming a word to initial caps and to all caps, in addition to the spelling and pretentiousness vectors already mentioned. All these vectors share some components, which appears to be because they all relate to word frequency to some degree (bad spellings, all-caps forms and more pretentious adjectives are all less frequent than their plain counterparts). I’ve checked this by running a linear regression with the embedding matrix as features and log(word index number) as the target, and the resulting coefficients share similar components with these vectors. As an indication of how well frequency information is encoded in the embeddings, the R² is 0.75 on the entire embedding matrix and 0.92 on the top 250,000 words.
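For concreteness, the frequency probe looks roughly like this (a sketch only; the file name, variable names and shapes are placeholders, not my actual code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# emb: pre-trained embedding matrix with rows ordered by frequency rank
emb = np.load("embeddings.npy")            # shape (vocab_size, emb_dim), placeholder file
y = np.log(np.arange(1, len(emb) + 1))     # log(word index number)

probe = LinearRegression().fit(emb, y)
print("R^2, full vocabulary:", probe.score(emb, y))

top = 250_000
probe_top = LinearRegression().fit(emb[:top], y[:top])
print("R^2, top 250k words:", probe_top.score(emb[:top], y[:top]))
```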

Based on a hypothesis that this information about word frequency, spelling, capitalisation, etc could distract a model from the actual meaning of words, I’ve been trying to test whether removing these aspects of the word embeddings will improve a model’s performance. My approach for removal is simple: subtract from each word vector its projection onto the relevant transformation vector; move on to the next transformation vector and repeat.
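In code, the removal step is roughly the following (the transformation-vector names are placeholders for whatever directions are being removed):

```python
import numpy as np

def remove_direction(vectors, direction):
    """Subtract from each row its projection onto `direction`."""
    d = direction / np.linalg.norm(direction)
    return vectors - np.outer(vectors @ d, d)

# freq_vec, caps_vec, spelling_vec stand in for the transformation vectors
adjusted = emb.copy()
for v in (freq_vec, caps_vec, spelling_vec):
    adjusted = remove_direction(adjusted, v)
```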

So far I’ve tried these adjusted embeddings on the IMDB and SST datasets using a fairly standard BiLSTM model. Keeping everything else constant, the version with adjusted embeddings seems to give very slightly higher accuracy on the test sets.

I’ve tried about 10 different sets of hyper-parameters, training each model for the same number of epochs and checking test set accuracy after each epoch. The results vary, with the improvement in accuracy ranging from roughly 0 to 1 percentage point (e.g. test set accuracy might increase from 88.2% to 88.5%), but so far the adjusted embeddings have never performed worse.

So, this all looks interesting. My question is how I should go about investigating the relationship more thoroughly - are there any specific things I should / should not be doing; any datasets that it would be worth testing this out on; specific model architectures and hyper-parameters that I should use; etc?

1 Like

In order to test for small differences, I would do multiple runs with the same hyperparameters but different seeds and test for statistical significance, e.g. using Student’s t-test. That’s still not done that often in ML but should be the norm IMO. I think it’s particularly important when the improvement is consistent but small, as in your case.
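For example, something along these lines (a minimal sketch; the accuracy numbers are just placeholders):

```python
from scipy import stats

# test accuracies from N runs of each configuration, one value per seed
baseline = [88.2, 88.0, 88.4, 88.1, 88.3]   # original embeddings
adjusted = [88.5, 88.3, 88.6, 88.4, 88.6]   # adjusted embeddings

# paired test if runs share seeds/splits; otherwise use stats.ttest_ind
t, p = stats.ttest_rel(adjusted, baseline)
print(f"t = {t:.3f}, p = {p:.4f}")
```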

In addition to using the adjusted embeddings in place of the original pre-trained ones and letting the model fine-tune them, I would also compare the two while keeping the embeddings fixed, to further control for the effect of fine-tuning.
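In PyTorch, for instance, both settings can be set up roughly like this (`weights` just stands in for whichever embedding matrix you load; the random tensor below is only a placeholder):

```python
import torch
import torch.nn as nn

weights = torch.randn(50_000, 300)  # placeholder for the real embedding matrix

emb_finetuned = nn.Embedding.from_pretrained(weights, freeze=False)  # updated during training
emb_fixed     = nn.Embedding.from_pretrained(weights, freeze=True)   # kept fixed
```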

Finally, it would be interesting to test the embeddings on a more syntactic task, for instance part-of-speech tagging on the Penn Treebank. For more inspiration, have a look at Section 5 of this paper; they do a nice job of quantifying what their embeddings capture.

4 Likes

Two more thoughts:

  • If you’re talking about IMDb accuracy, 88% is pretty low. Apparently good ideas often don’t actually help when used with stronger models - so be sure to use a strong baseline (>93% accuracy for IMDb).
  • If an improvement is so marginal as to be almost imperceptible, it’s probably best to spend your time thinking about how to find a stronger effect (e.g. is there another dataset it’s more suited to? some tweaks that could help?) rather than trying to measure the small effect.
1 Like