In the notebook “Iterate like a grandmaster”, Jeremy suggested for further improvement: “Before submitting a model, retrain it on the full dataset, rather than just the 75% training subset we’ve used here.” I wonder, how would I actually do that?
So far, my understanding is that we train on only a subset (the training data) so that we can see how well we are doing on the validation set and adjust our hyperparameters accordingly. Maybe to ask the question a little differently: when training on 100% of the data, what would be the validation set?
Thanks for your thoughts on this topic
There would no longer be a validation set - you would train the model on the entirety of the dataset and submit it directly to Kaggle. If you are satisfied with the model’s performance when trained on a smaller portion of the data and validated on the rest, chances are, holding all else equal, it’d do even better once exposed to the full dataset. Alternatively, you may regard Kaggle’s public test set as your validation set, but beware that a high score on the public leaderboard does not necessarily translate into a high score on the private leaderboard.
I hope I’m not going off topic, but I would like to know how to tell the `DataLoader` not to use any validation set.
Are you using convenience methods such as `ImageDataLoaders.from_folder` to construct your `DataLoaders`? If so, you could set `valid_pct` to 0.
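For intuition, here is a rough sketch of what `valid_pct` does. `random_split` is a simplified stand-in for fastai’s `RandomSplitter`, not the library code itself: the first `valid_pct` fraction of the shuffled indices becomes the validation split, so `valid_pct=0` leaves it empty:

```python
import random

def random_split(n_items, valid_pct, seed=42):
    # Simplified stand-in for fastai's RandomSplitter
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # (train indices, valid indices)

train, valid = random_split(100, valid_pct=0)
assert len(train) == 100 and len(valid) == 0  # everything is training data
```

Note that fastai will still construct a (now empty) validation `DataLoader`, so validation metrics printed during training will have nothing to evaluate.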
I am using the `DataBlock` API:

```python
dset = DataBlock(blocks=(ImageBlock(), CategoryBlock),
                 ...)
dls = dset.dataloaders(df, bs=BS)
```
I have also tried setting the entire dataframe’s `is_valid` column to `True`, but that is not permitted.
You can simply omit the `splitter` parameter in the `DataBlock`.
I tried removing the `splitter`, but it still takes a portion of the dataset for validation.
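That is likely because, when `splitter` is omitted, `DataBlock` falls back to a default random split (20% validation in fastai, if I recall correctly). One workaround is to pass your own splitter that sends every item to the training set. `no_split` below is a hypothetical helper, not a fastai function:

```python
def no_split(items):
    # A fastai splitter returns (train indices, valid indices);
    # here every index goes to training and the validation split is empty.
    return list(range(len(items))), []

# Sanity check, independent of fastai:
train_idx, valid_idx = no_split(['a', 'b', 'c', 'd'])
assert train_idx == [0, 1, 2, 3] and valid_idx == []
```

You could then pass it as `DataBlock(..., splitter=no_split)`; as with `valid_pct=0`, anything that evaluates on the (empty) validation set may need care.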
@BobMcDear Thanks for clarifying, and thanks also to @Redevil for starting the discussion on how to actually use 100% of the training data in code.
I am currently working on lesson 4 and therefore use the Hugging Face `Trainer` (the counterpart of the fastai `Learner`). I could successfully show the model the full training set after a few epochs of “normal” training rounds by doing the following (and it actually improved the result significantly):
```python
# get the model which was trained before on 75% of the training data
model = trainer.model

# create a new test set, since one will be required by the Trainer
dds = tok_ds.train_test_split(0.25, seed=seed)

# one more round with all the data
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1,
    lr_scheduler_type='cosine', fp16=True, evaluation_strategy="epoch",
    per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none',
    gradient_accumulation_steps=1)

# pass the full dataset as training data and the test split as validation data
trainer = Trainer(model, args, train_dataset=tok_ds, eval_dataset=dds['test'])
trainer.train()
```
Question: Is this the intended way of training on the full training set, i.e. to first do the training “as usual” and then run another epoch with the full training data? Or would you suggest a different approach?
When you train the model on the full dataset, you have to “reset” the weights from the previous training run.
The strategy should be this:
- Tune the hyperparameters by exploiting the validation set, i.e. training on the “partial” dataset.
- Use the hyperparameters found in step 1 to train the “raw” model on the full dataset.

For the second step, you could also try decreasing the learning rate a bit.
How would I actually do this in code (i.e. resetting the weights)? Would I go back to the pretrained model I started with? Wouldn’t I lose the learnings from the first round of training? Which information gets carried over into the training session with the full dataset?
Sorry for asking so many questions…
You can create a new notebook, using the same network architecture and hyperparameters.
Yes, but you are going to retrain the network with the same hyperparameters, so the network learns the same “information” from the former training set, plus the “information” coming from the former validation set.
No information is carried over; that happens in transfer learning and fine-tuning instead.
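To see concretely what carries over and what doesn’t, here is a toy sketch in plain Python; the names `pretrained` and `train` and all the numbers are made up, standing in for a Hugging Face checkpoint and `Trainer`:

```python
# Toy illustration: "resetting" means starting again from the same
# pretrained checkpoint, not from the weights learned on the 75% split.
pretrained = {"w": 0.5}            # stand-in for a pretrained checkpoint

def train(weights, data, lr):
    w = dict(weights)              # copy, so the checkpoint is never mutated
    for x in data:                 # stand-in for gradient steps
        w["w"] += lr * x
    return w

# Step 1: train on the 75% split, using the validation set to pick hyperparams
model_75 = train(pretrained, data=[1, 2, 3], lr=0.01)
best_lr = 0.01                     # chosen thanks to the validation set

# Step 2: a fresh model from the *pretrained* weights, now on the full data
model_full = train(pretrained, data=[1, 2, 3, 4], lr=best_lr)

# Only the hyperparameters (best_lr) carry over, not model_75's weights
assert model_full["w"] != model_75["w"]
```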
** You can use fastai’s `set_seed()` function to make your results reproducible: the same network with the same hyperparameters will always give you the same results.
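As a minimal illustration of the idea, here is a stand-in `set_seed` that only seeds Python’s `random` module (fastai’s real `set_seed` additionally seeds NumPy and PyTorch):

```python
import random

def set_seed(seed):
    # Stand-in for fastai's set_seed(), which also seeds numpy and torch
    random.seed(seed)

set_seed(42)
run1 = [random.random() for _ in range(3)]
set_seed(42)
run2 = [random.random() for _ in range(3)]
assert run1 == run2  # identical seeds give identical "training" randomness
```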