Repeated documents / sentences in Language Modelling

Hi @sebastianruder @jeremy,

I am doing a POC on replicating the DeViSE idea that you shared in Part 2, for e-commerce queries and product data / images.

I have gotten pretty good perplexity by using ULMFiT and doing transfer learning.

Even though I am seeing pretty good perplexity and I am able to predict the next word for certain queries, the model is not doing a good job on some frequently occurring queries. I suspect the reason is the way I have selected my training data, but I am a little confused about what the right thing to do is.

The way I have trained the language model is as follows.

  1. Identified distinct queries from the search logs, along with their counts.
  2. Kept a query only if it occurred at least k times.
  3. Trained the language model on the top ~3 million queries. This gives a vocab of ~20,000 words, which is still not too bad for the softmax.
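
To make steps 1–2 concrete, here is a rough sketch of what I am doing (function and parameter names like `build_training_queries` and `min_count` are just placeholders for illustration):

```python
from collections import Counter

def build_training_queries(raw_queries, min_count=5):
    """Deduplicate raw search-log queries and keep those seen at least min_count times."""
    counts = Counter(q.strip().lower() for q in raw_queries)
    kept = [(q, c) for q, c in counts.items() if c >= min_count]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most frequent first
    return kept  # list of (query, count) pairs
```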

The main part I am confused about is that I am currently throwing away the query counts just to speed up training.

If I just repeat each query according to its count, e.g. the query “tvs” has been searched, let’s say, 100K times, then the training data will be bloated and will take a lot of computation time. I can do the preprocessing, such as numericalizing the queries, only once, but each epoch will still take a long time.

So can you please suggest how I should deal with repetitions in language modeling? Am I throwing away valuable statistical information by only considering the distinct queries? I tried reading up online, but could not find good resources.

Thanks,
Abilash.

Hey,

I don’t have much experience with repetitions for training language models, as for most domains it is possible to get large amounts of non-repetitive unlabelled data fairly easily. For language modeling, I think the same considerations apply as for training an ML model in general, e.g. you want your training data not to be overly skewed, the train distribution should be similar to your test distribution, etc.

I think your overall strategy is sound. You don’t want to show the model the same example 100k times. However, as the model is currently struggling with some of the frequent queries, I think it makes sense to show more frequent examples more often during training. I don’t think you want to do this linearly; instead, upsampling based on some log frequency might make sense, e.g. if one query is 100 times more frequent, you could show it 2x as often during training. I would try different distributions to see what works best.
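
For instance, the log-frequency upsampling could look roughly like this (just a sketch; the function name, base, and cap are arbitrary choices, not a recommendation, and the counts are made up):

```python
import math

# Toy example of query -> search count (made-up numbers, just for illustration)
query_counts = {"tvs": 100_000, "iphone case": 4_000, "usb c cable": 250}

def repetitions(count, base=10, cap=20):
    """Number of times to show a query during training, based on log frequency.

    With base=10, a query searched 100x more often is shown ~2x more often.
    'cap' guards against extremely frequent queries blowing up the data size.
    """
    return min(cap, max(1, round(math.log(count, base))))

training_queries = []
for query, count in query_counts.items():
    training_queries.extend([query] * repetitions(count))
```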

Hi @sebastianruder, thanks a lot for your reply and the suggestions.

As of now, I am training another model using upsampling similar to the strategy you mentioned, but slightly different: I am dividing each query’s count by the minimum threshold and rounding it off, so that the relative magnitude of frequent vs. less frequent queries is preserved. That gave me a manageable training data size. I will also try the log-frequency-based sampling.
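
Concretely, what I mean is something like this (rough sketch, names are just placeholders):

```python
def repetitions_by_min(count, min_count):
    """Repeat a query count/min_count times (rounded), so relative frequencies are preserved."""
    return max(1, round(count / min_count))

# e.g. with a minimum threshold of 50:
repetitions_by_min(100_000, 50)   # -> 2000 repetitions
repetitions_by_min(60, 50)        # -> 1 repetition
```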

In traditional statistical n-gram language modelling, since it is all about the frequencies of n-grams and their corresponding probabilities, we can simply multiply a query’s n-gram counts by the query’s frequency. But I was confused about what to do in the neural language model.
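
For the count-based case I mean something along these lines (illustrative sketch only, with a hypothetical `query_counts` dict mapping queries to their search frequencies):

```python
from collections import Counter

def weighted_bigram_counts(query_counts):
    """Accumulate bigram counts, weighting each query's n-grams by the query's search frequency."""
    counts = Counter()
    for query, freq in query_counts.items():
        tokens = ["<s>"] + query.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += freq  # fold the query frequency into the count instead of repeating the query
    return counts
```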

Thanks again for the suggestions. I will update with the results.

–Abilash.

That’s another good approach and should also work.

I’m looking forward to hearing about the results!

Instead of repeating the queries multiple times, could you pass in observation weights, higher for more frequently occurring queries and lower for less frequently occurring ones, to differentiate the influence of the different queries in the loss function?
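
Roughly something like this in PyTorch (just a sketch; the shapes and names are assumptions, not the actual fastai/ULMFiT loss function):

```python
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, example_weights):
    """Cross-entropy loss where each query contributes according to its observation weight.

    logits:          (batch, seq_len, vocab_size) model outputs
    targets:         (batch, seq_len) next-word token ids
    example_weights: (batch,) weights derived e.g. from query counts
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy expects (batch, vocab, seq_len)
        targets,
        reduction="none",
    )                                              # -> (batch, seq_len)
    per_example = per_token.mean(dim=1)            # average over positions in the query
    return (per_example * example_weights).mean()  # weight each query, then average
```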

That is a great suggestion, Patrick. I will also give your idea of a custom loss function with weighted training examples a try.

Something to keep in mind is that the gradients will scale with the size of the observation weights, since PyTorch’s .backward() averages over the number of observations, not over the weights. So if your weights are on average much larger than 1.0, you might need to decrease the learning rate to prevent overly large update steps.
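
One simple way to avoid that (a sketch of the idea only, with made-up values) is to normalize the weights so their mean is about 1.0 before using them in the loss:

```python
import torch

# Rescale observation weights so they average to ~1.0, which keeps gradient
# magnitudes (and thus the usable learning rate) comparable to unweighted training.
raw_weights = torch.tensor([1.0, 3.0, 40.0, 120.0])  # e.g. derived from query counts
weights = raw_weights / raw_weights.mean()
```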