As far as I understand, using Transfer Learning for NLP has two main benefits:
- general aspects of e.g. the English language do not have to be learned from scratch
- far less data is necessary to train a model for a specific task
Especially regarding the second point, I am asking myself whether it makes sense to use something like BERT, ULMFit, or XLNet if a large dataset is available. I want to perform binary text classification and have about 2 million samples.
If I fine-tune a Language Model (LM) on this many samples, does that not destroy the weights the LM has already learned during pre-training?
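For context, here is a minimal sketch of the kind of fine-tuning setup I have in mind (using the Hugging Face transformers library; the model name, learning rate, and toy batch are only illustrative, not my actual pipeline):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained BERT with a freshly initialized binary classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Small learning rate, so fine-tuning only nudges the pre-trained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step on a toy batch (my real data has ~2M samples)
texts = ["great product, would buy again", "terrible, broke after one day"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

My worry is about what happens to the pre-trained encoder weights when this loop runs over millions of samples.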
Thank you in advance!