Tuning hyper-parameters with Bayesian optimization

Hi guys, I found this awesome library, scikit-optimize, which helps a lot with Bayesian optimization, and I made some notebooks while trying it out here.
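
If you haven't used it before, here is a minimal sketch of the kind of call scikit-optimize exposes; the objective function and search space below are just toy placeholders, not anything from my notebooks.

from skopt import gp_minimize
from skopt.space import Real, Integer

# Toy objective: scikit-optimize *minimizes*, so return a loss
# (or the negative of a metric you want to maximize).
def objective(params):
    lr, n_layers = params
    # ...train a model with these hyper-parameters and return the validation loss...
    return (lr - 0.01) ** 2 + 0.001 * n_layers  # placeholder loss

search_space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="lr"),
    Integer(1, 5, name="n_layers"),
]

result = gp_minimize(objective, search_space, n_calls=20, random_state=42)
print(result.x, result.fun)  # best hyper-parameters and best loss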


Hi, can we replicate the same for tuning hyper-parameters in the fastai library?

Absolutely! I’ve played around with it myself. This is the library:

So long as you set the hyperparameters to your choosing and wrap the training loop you want to run inside a function it can call, you’re off to the races. If you’re confused I can share a notebook.

Hey, please share the notebook where you did this with the fastai library. I’m new to the library. Thanks in advance!

Sure, give me a moment to clean up the notebook.


@msrdinesh here is a notebook. It is very basic; I just use a tabular problem as an example. Anything you want changed/optimized, you do within the fit_with function. If, say, you wanted to mess with image sizes, make the databunch inside fit_with and have the image size as a hyperparameter. Let me know if you have any questions!
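
If you don’t want to open the notebook, the rough pattern looks something like the sketch below. I’m assuming the bayes_opt BayesianOptimization package and fastai v1’s tabular API here (swap in scikit-optimize if that’s what you’re using), and the DataFrame, bounds, and epoch count are all placeholders rather than the notebook’s actual values.

from bayes_opt import BayesianOptimization
from fastai.tabular import *  # fastai v1 tabular API
import numpy as np
import pandas as pd

# Placeholder data just so the sketch runs end to end; use your own DataBunch.
df = pd.DataFrame({'cat': ['a', 'b'] * 100,
                   'cont': np.random.randn(200),
                   'target': [0, 1] * 100})
data = (TabularList.from_df(df, cat_names=['cat'], cont_names=['cont'],
                            procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.2, seed=42)
        .label_from_df(cols='target')
        .databunch(bs=32))

def fit_with(lr, wd, dp):
    # Anything you want optimized happens in here: build the learner (or the
    # databunch) from the sampled hyper-parameters, train, and return a score.
    learn = tabular_learner(data, layers=[200, 100], ps=dp, wd=wd, metrics=accuracy)
    learn.fit_one_cycle(3, max_lr=lr)
    return float(learn.validate()[1])  # bayes_opt maximizes the returned value

pbounds = {'lr': (1e-5, 1e-1), 'wd': (4e-4, 0.4), 'dp': (0.01, 0.5)}
optim = BayesianOptimization(f=fit_with, pbounds=pbounds, random_state=1, verbose=2)
optim.maximize(init_points=2, n_iter=10)
print(optim.max)  # best score and the hyper-parameters that produced it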


Thank you @muellerzr.

Hey Zachary, thank you for this. I had a few issues getting started but it seems to be working now. One thing I wonder about: have you explored the nature of the variance with regard to hyperparam optimization? We would prefer not to do full training, so the two weapons at our disposal for cutting down on time are:

  1. use_partial_data to only use a subset of the data and speed things up (see the sketch just after this list)
  2. train fewer epochs and hope the relationship between the hyperparams and accuracy holds.
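
For concreteness, option 1 might look roughly like this in fastai v1, with use_partial_data applied to the ItemList before the split; the DataFrame and column names are just the placeholders from the sketch earlier in the thread.

from fastai.tabular import *  # fastai v1; `df` is the placeholder DataFrame from the earlier sketch

data = (TabularList.from_df(df, cat_names=['cat'], cont_names=['cont'],
                            procs=[Categorify, Normalize])
        .use_partial_data(sample_pct=0.2, seed=42)   # keep ~20% of the rows
        .split_by_rand_pct(valid_pct=0.2, seed=42)
        .label_from_df(cols='target')
        .databunch(bs=32))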

Obviously there must be a point where your results are pure noise/variance, like 1% of data for 1 epoch, but do you have any sense of what amount of data and epochs you need to get to get consistent results that actually correlate with higher accuracy in full training?

I wrote a small grid search library and one function it had was a variance test. It would run a single set of params n times using the decided upon amounts of data and epochs, and would then show you the range of results that came back. If the accuracy was more or less the same for every run, you were good to go, if there were 5% swings in accuracy, you probably weren’t going to learn anything and had to increase data % and epochs. Any ideas for this re: bayesian optimization?

I’d imagine a good size would be a random subset of around 10% or so (if you have enough data for that to be reasonable). So long as it’s representative, I believe you should be okay. I haven’t tested using a subset myself for this, but this is done with permutation importance, for instance, where we measure against a small-ish subset of the data; I imagine the same idea should be applicable here.

Also consider how long it normally takes to train a pretty decent model. For tabular it’s usually a few epochs at most; for images, depending on the problem, it can sometimes be ~10-15 or so. I’m not 100% sure how many you normally do for audio, however. It’s something I need to start looking into myself for my research projects.

On the second note, that can be arranged: just wrap it in a loop for, say, five runs; if the results were consistent, spit out x, otherwise either redo it or show a default value, if that makes sense? Let me know if it doesn’t! (Or if I missed the mark!!!)

Yes, that makes sense, thank you. In code, I could just do:

accs = [fit_with(param1, param2, ...) for i in range(10)]
where param1/param2…etc. are all just reasonable constant values for my parameters. Then I’ll have a list of 10 accuracies and can decide whether the variation in results is low enough to justify creating and running the optimizer with that % of my dataset and # of epochs.
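
To summarize the spread, something simple like this should do (just numpy):

import numpy as np

# Summarize the spread across the constant-argument runs.
accs = np.array(accs)
print(f"Min: {accs.min()}, Max: {accs.max()}, Mean: {accs.mean()}, Std: {accs.std()}")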

Correct! Let me know how it goes @MadeUpMasters 🙂

Interestingly enough, I implemented this and found pretty substantial variation in results, but then I remembered I wasn’t seeding to keep things deterministic/reproducible, so I went back, did the same run with seeding, and found much tighter grouping. If you don’t seed, the validation set will be different for each training run and will add tons of noise to your hyperparam optimization. Here is a recipe and results for anyone doing this. First, add this code to make all the libraries seed properly.

# To have reproducible results with fastai you must also set num_workers=1 in
# your databunch and seed=seed in split_by_rand_pct
import random
import numpy as np
import torch

seed = 42
# python RNG
random.seed(seed)
# numpy RNG
np.random.seed(seed)
# pytorch RNGs
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
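
And, as the comment at the top of that cell says, the other two pieces go in the data block call itself. With the same placeholder pipeline as the earlier sketches, that looks roughly like:

data = (TabularList.from_df(df, cat_names=['cat'], cont_names=['cont'],
                            procs=[Categorify, Normalize])
        .split_by_rand_pct(valid_pct=0.2, seed=seed)  # seeded train/valid split
        .label_from_df(cols='target')
        .databunch(bs=32, num_workers=1))             # single worker for reproducibility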

I’m not sure whether proper seeding implies identical results on each run, but that’s not what I got. My first run, with no seeding, yielded the following over 10 runs of 10 epochs with 20% of my data.

Results of 10 runs with constant arguments:
Min: 0.5343915224075317, Max: 0.6005290746688843, Mean: 0.564550256729126, Std: 0.021498664550225415

[0.5343915224075317,
 0.5820105671882629,
 0.5449735522270203,
 0.5608465671539307,
 0.5846560597419739,
 0.5661375522613525,
 0.5846560597419739,
 0.5476190447807312,
 0.5396825671195984,
 0.6005290746688843]

Then I ran it again with the seed cell, seeded random train/valid split, and num_workers=1 and got these results…

Min: 0.5343915224075317, Max: 0.5687830448150635, Mean: 0.5526455044746399, Std: 0.012263835528481164
[0.5608465671539307,
 0.5423280596733093,
 0.5608465671539307,
 0.5608465671539307,
 0.5396825671195984,
 0.5343915224075317,
 0.5370370149612427,
 0.5687830448150635,
 0.5555555820465088,
 0.5661375522613525]

This second grouping is probably tight enough that I can start Bayesian Optimization with that data_pct and number of epochs and get results that aren’t just noise.

I’ll keep playing with this and eventually post a full nb here for doing this in fastai, probably as part of a tutorial for our fastai audio module.


Hi, is there any hyper-parameter optimizer that works better than Bayesian optimization?

I tried training on the Titanic dataset with fastai defaults, Bayesian tuning, and XGBoost. The Bayesian-tuned hyper-params and the fastai default params gave almost the same accuracy (82%), but XGBoost gave a higher accuracy (83%).

Thanks,

Hi @muellerzr and anyone else interested. I have stumbled on an error implementing your code in the context of my segmentation problem. I wondered whether you could possibly give me a push in the right direction?

I made a new question here:

Thanks!