In the notebook “Iterate like a grandmaster”, Jeremy suggested the following as a further improvement: “Before submitting a model, retrain it on the full dataset, rather than just the 75% training subset we’ve used here.” I wonder: how would I actually do that?
So far, my understanding is that we train on only a subset (the training data) so that we can see how well we are doing on the validation set and adjust our hyperparameters accordingly. Maybe to ask the question a little differently: when training on 100% of the data, what would be the validation set?
There would no longer be a validation set - you would train the model on the entirety of the dataset and submit it directly to Kaggle. If you are satisfied with the model’s performance when trained on a smaller portion of the data and validated on the rest, chances are that, all else being equal, it’d do even better once exposed to the full dataset. Alternatively, you may regard Kaggle’s public test set as your validation set, but beware that a high score on the public leaderboard does not necessarily translate into a high score on the private leaderboard.
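In code terms, the only change is that the train/validation split disappears. A minimal sketch, assuming the `tok_ds` dataset and `seed` from the lesson notebook:

```python
# What we did so far: hold out 25% of the data to validate on.
dds = tok_ds.train_test_split(0.25, seed=seed)

# Final run before submitting: every row is training data;
# there is nothing left over to validate on.
train_data = tok_ds
```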
@BobMcDear Thanks for clarifying and also thanks to @Redevil for starting the discussion on how to actually implement using 100% of the training data in code.
I am currently working on lesson 4, so I am using the Hugging Face Trainer (the counterpart of the Learner in fastai). I was able to show the model the full training set after a few “normal” training epochs by doing the following, and it actually improved the result significantly:
```python
if train_all:
    # reuse the model that was trained before on 75% of the training data
    model = trainer.model
    # create a new test split, since the Trainer will require an eval set
    dds = tok_ds.train_test_split(0.25, seed=seed)
    # one more epoch with all the data
    epochs = 1
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                             lr_scheduler_type='cosine', fp16=True,
                             evaluation_strategy='epoch',
                             per_device_train_batch_size=bs,
                             per_device_eval_batch_size=bs*2,
                             num_train_epochs=epochs, weight_decay=0.01,
                             report_to='none', gradient_accumulation_steps=1,
                             save_strategy='no')  # the documented way to skip checkpoints
    # pass the full tokenized dataset as training data and the new split as
    # validation data; note the eval rows are now also part of the training
    # data, so the eval metric is optimistic and only a sanity check
    trainer = Trainer(model, args,
                      train_dataset=tok_ds,
                      eval_dataset=dds['test'],
                      tokenizer=tokz)
    trainer.train()
```
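As an aside, it should also be possible to skip the throwaway validation split entirely by disabling evaluation. A sketch (untested, same assumed variables as above):

```python
# Variant: disable evaluation entirely, so no dummy validation
# split is needed for the final epoch on the full dataset.
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                         lr_scheduler_type='cosine', fp16=True,
                         evaluation_strategy='no',  # Trainer then needs no eval_dataset
                         save_strategy='no',
                         per_device_train_batch_size=bs,
                         num_train_epochs=1, weight_decay=0.01,
                         report_to='none')

trainer = Trainer(model, args, train_dataset=tok_ds, tokenizer=tokz)
trainer.train()
```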
Question: Is this the intended way of training on the full training set, i.e. to first do the training “as usual” and then run another epoch with the full training data? Or would you suggest a different approach?
Hi @Redevil
how would I actually do this in code (i.e. resetting the weights)? Would I go back to the pre-trained model I started with? Wouldn’t I lose what was learned in the first round of training? What information gets carried over into the training session with the full dataset?
Sorry for asking so many questions…
Thanks,
Christian
You can create a new notebook, using the same network architecture and hyperparameters.
Yes, but you are going to re-train the network with the same hyperparameters, so the network learns the same “information” from the former training set, plus the “information” coming from the former validation set.
No information is carried over; that happens in transfer learning and fine-tuning instead.
You can use fastai’s set_seed() function to obtain reproducible results: the same network, with the same hyperparameters, will always give you the same results.
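For instance, a minimal sketch of this “retrain from scratch” approach, assuming the variable names from the lesson notebook (`model_nm`, `tok_ds`, `tokz`, `bs`, `lr`, `epochs`):

```python
# Start again from the pretrained checkpoint, so nothing from the 75% run
# carries over, and train on 100% of the data with the same hyperparameters.
from fastai.torch_core import set_seed
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

set_seed(42, reproducible=True)  # same seed + same hyperparams -> same results

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
                         lr_scheduler_type='cosine', fp16=True,
                         evaluation_strategy='no',  # no validation set this time
                         save_strategy='no',
                         per_device_train_batch_size=bs,
                         num_train_epochs=epochs, weight_decay=0.01,
                         report_to='none')

trainer = Trainer(model, args, train_dataset=tok_ds, tokenizer=tokz)
trainer.train()
```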