I am doing a POC replicating the DeViSE idea that you shared in Part 2, applied to e-commerce queries and product data / images.
I have gotten pretty good perplexity by using ULMFiT and doing transfer learning.
Even though the perplexity looks good and I can predict the next word for certain queries, the model does a poor job on some frequently occurring queries. I suspect the reason is the way I selected my training data, but I am a little confused about the right thing to do.
The way I trained the language model is as follows:
- Identified distinct queries from the search logs with their counts.
- Kept only queries that occurred at least k times.
- Trained the language model on roughly the top 3 million queries. This gives a vocab of ~20,000 words, which is still manageable for the softmax.
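To make the setup concrete, the selection steps above can be sketched roughly like this (the function name, the threshold k, and the toy log are my own placeholders, not code from my pipeline):

```python
from collections import Counter

def select_training_queries(search_log, k=5, top_n=3_000_000):
    """Deduplicate raw search-log queries, drop rare ones, keep the top N.

    search_log: iterable of raw query strings (one per search event).
    k: minimum occurrence count for a query to be kept.
    top_n: cap on how many distinct queries to train on.
    """
    counts = Counter(q.strip().lower() for q in search_log)
    # Keep only queries seen at least k times, most frequent first.
    frequent = [(q, c) for q, c in counts.most_common() if c >= k]
    return frequent[:top_n]

# Toy example: "tvs" appears 3 times, the others once each.
log = ["tvs", "tvs", "tvs", "iphone case", "rare gadget xyz"]
print(select_training_queries(log, k=2))  # [('tvs', 3)]
```

Note that the counts are computed here but then only used for filtering; that is exactly the information I am unsure whether to keep.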
The main part I am confused about: I am currently throwing away the query counts just to speed up training.
If I instead repeat each query in proportion to its count (e.g. if "tvs" was searched 100K times, include it 100K times), the training data will be hugely bloated and will take a lot of computation time. I can do the preprocessing, like numericalizing the queries, only once, but each epoch will take much longer.
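For scale, here is a sketch (my own illustration, with made-up counts) of materializing the repeats versus keeping one copy per distinct query and sampling in proportion to the counts, which is one standard way to preserve the frequency statistics without bloating the stored dataset:

```python
import random

# Hypothetical (query, count) pairs from the search logs.
queries = [("tvs", 100_000), ("iphone case", 40_000), ("hdmi cable", 5_000)]

# Option A: materialize the repeats.
# The dataset size becomes the sum of all counts.
expanded_size = sum(c for _, c in queries)
print(expanded_size)  # 145000

# Option B: store each distinct query once and draw training examples
# with probability proportional to the counts, so frequent queries are
# still seen more often per epoch.
texts = [q for q, _ in queries]
weights = [c for _, c in queries]
batch = random.choices(texts, weights=weights, k=4)
print(batch)  # e.g. a batch dominated by the high-count queries
```

I have not tried option B end-to-end in my pipeline; I am mentioning it only to show the trade-off I am asking about.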
So can you please suggest how I should deal with repetitions in language modeling? Am I throwing away valuable statistical information by training only on distinct queries? I tried reading online but could not find good resources.