K-Folds vs One Model?

Sure @mgloria! Again, it’s unfinished and I need to fix a few things, which I will get to in a few hours, but essentially you would do the following:

val_loss, acc, final_results = StratifiedFit(train_data, test_data, n_folds=10, epochs=5, callback_fns=[EarlyStoppingCallback])

I need to adjust a few things, but that’s the gist. If we wanted a list we’d adjust accordingly, which I will do as well.
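A rough sketch of what a StratifiedFit-style helper could do under the hood, using scikit-learn’s StratifiedKFold; `make_learner` here is a hypothetical stand-in for building each fold’s DataLoaders and Learner:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def stratified_fit(items, labels, make_learner, n_folds=10, epochs=5):
        "Train one model per stratified fold and average the results."
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
        val_losses, accs = [], []
        for train_idx, val_idx in skf.split(items, labels):
            learn = make_learner(train_idx, val_idx)  # hypothetical: builds this fold's DataLoaders + Learner
            learn.fit_one_cycle(epochs)
            loss, acc = learn.validate()              # assumes a single metric (e.g. accuracy)
            val_losses.append(loss); accs.append(acc)
        return np.mean(val_losses), np.mean(accs), list(zip(val_losses, accs))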

For reproducibility, there is a small snippet here on the forum to get exact reproducibility, but we want that variance, do we not? Variance is real and can help sometimes.
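For reference, the usual seeding snippet is along these lines (a sketch, assuming PyTorch with CUDA; exact reproducibility also needs `num_workers=0` in the DataLoaders):

    import random
    import numpy as np
    import torch

    def random_seed(seed_value, use_cuda=True):
        random.seed(seed_value)        # Python RNG
        np.random.seed(seed_value)     # NumPy RNG
        torch.manual_seed(seed_value)  # PyTorch CPU RNG
        if use_cuda:
            torch.cuda.manual_seed_all(seed_value)     # PyTorch GPU RNGs
            torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
            torch.backends.cudnn.benchmark = False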

@muellerzr do you think this could be included in the fastai v2 version? I was just revisiting this thread (and your fantastic code) today and thought it may also be useful to many others.

@mgloria I’m not the biggest fan of having EarlyStopping (training can vary, and I want that variance), and this thread exists for people to find if they want to do so :slight_smile: If there’s more demand for it, I’ll make a notebook with it.

Sure! I meant mostly the K-folds approach. I think it is a great thing to have in fastai. I believe that the default split methods are not stratified (correct me if I am wrong), which can be a problem for imbalanced datasets with many classes…
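For context, fastai’s default RandomSplitter is indeed not stratified. A minimal sketch of one workaround, assuming a list of labels parallel to your items: build stratified validation indices with scikit-learn and pass them to fastai’s IndexSplitter.

    from sklearn.model_selection import train_test_split
    from fastai.data.transforms import IndexSplitter

    def stratified_splitter(labels, valid_pct=0.2, seed=42):
        "Build a fastai splitter whose validation set preserves the class balance."
        idx = list(range(len(labels)))
        _, valid_idx = train_test_split(idx, test_size=valid_pct,
                                        stratify=labels, random_state=seed)
        return IndexSplitter(valid_idx)  # drop-in replacement for RandomSplitter in a DataBlock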

Oh! @mgloria Sure :slight_smile: I have a notebook detailing this! (Go look at the Practical Deep Learning for Coders 2.0 thread I made :slight_smile: ) here: A Guided Walk-through of 2.0 (Like Practical Deep Learning for Coders)

That specific notebook is here https://github.com/muellerzr/Practical-Deep-Learning-for-Coders-2.0/blob/master/03b_kfold.ipynb


Awesome!!!


Hi, everyone! Thank you for the discussion @mgloria and @muellerzr, it was extremely useful.
I am implementing time series cross-validation, and would like to know if you have any advice for me.
I am currently using the TimeSeriesSplit from scikit-learn and the following code:

    import torch
    from torch.utils.data import TensorDataset
    from sklearn.model_selection import TimeSeriesSplit
    from fastai.data.core import DataLoaders
    from fastai.learner import Learner
    from fastai.callback.tracker import EarlyStoppingCallback, SaveModelCallback

    tscv = TimeSeriesSplit(n_splits=5)

    for train_index, val_index in tscv.split(X_train_val):

        # data split based on the time series split
        X_train, X_val = X_train_val[train_index], X_train_val[val_index]
        y_train, y_val = y_train_val[train_index], y_train_val[val_index]

        # make the data ready for the model
        train_ds = TensorDataset(torch.tensor(X_train).float(), torch.tensor(y_train).unsqueeze(1).float())
        valid_ds = TensorDataset(torch.tensor(X_val).float(), torch.tensor(y_val).unsqueeze(1).float())

        # wrap the datasets in fastai DataLoaders
        dls = DataLoaders.from_dsets(train_ds, valid_ds, bs=MODEL_ARGS['batch_size'])

        # re-initialize the model every fold so earlier folds don't leak into later ones
        model = model_builder(model_args=MODEL_ARGS)
        model, device = set_device_to_train(model)

        CBS_ = [
            EarlyStoppingCallback(patience=PATIENCE),
            SaveModelCallback(fname=f'{TARGET}_model'),
        ]

        # Train the model
        learn = Learner(dls,
                        model,
                        loss_func=loss_function_builder(MODEL_ARGS['loss_function']),
                        opt_func=optimizer_builder(MODEL_ARGS['optimizer']),
                        metrics=metrics_builder(MODEL_ARGS['metrics']),
                        # Later we can check for more callbacks if needed
                        cbs=CBS_
                        )

        learn.fit_one_cycle(MODEL_ARGS['epochs'])  # consider running learn.lr_find() first to pick a learning rate

I would like to know how I can adapt this code to pick the best model based on the average validation loss and save it with the callback that I have defined in the CBS_ variable.
Is there any straightforward solution for it?

Thank you so much in advance!
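One possible direction (a sketch, untested, reusing the helpers from the post above; `make_fold_dataloaders` is a hypothetical stand-in for the TensorDataset/DataLoaders code already shown): give SaveModelCallback a distinct fname per fold, record each fold’s validation loss, use the mean to judge the overall setup, and reload the strongest fold’s checkpoint afterwards. Note that the saved weights always come from one fold; the average loss scores the configuration, not a single model.

    fold_losses = []
    for fold, (train_index, val_index) in enumerate(tscv.split(X_train_val)):
        dls = make_fold_dataloaders(train_index, val_index)  # hypothetical: the dataset code above
        model, device = set_device_to_train(model_builder(model_args=MODEL_ARGS))
        learn = Learner(dls, model,
                        loss_func=loss_function_builder(MODEL_ARGS['loss_function']),
                        opt_func=optimizer_builder(MODEL_ARGS['optimizer']),
                        metrics=metrics_builder(MODEL_ARGS['metrics']),
                        cbs=[EarlyStoppingCallback(patience=PATIENCE),
                             # one checkpoint per fold so folds don't overwrite each other
                             SaveModelCallback(fname=f'{TARGET}_model_fold{fold}')])
        learn.fit_one_cycle(MODEL_ARGS['epochs'])
        fold_losses.append(learn.validate()[0])  # validation loss of this fold's best epoch

    mean_val_loss = sum(fold_losses) / len(fold_losses)  # scores the overall setup
    best_fold = fold_losses.index(min(fold_losses))
    learn.load(f'{TARGET}_model_fold{best_fold}')        # reload the strongest fold's weights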