I would suggest avoiding black-box hyperparam tuning wherever possible. I’ve only used it once in my life, and even then it wasn’t really a win. Instead, use the techniques we’ve learned to try carefully selected choices.
I use the technique mentioned in Leslie’s report of using the val loss to check for early overfitting. But after reading various earlier papers that used black-box tuning, and seeing it on Kaggle and in various random mentions, I thought maybe it was that good. Thanks for clearing up the doubt.
I use hyperparameter tuning a lot for all those pesky regularization parameters in gradient boosted trees.
Hyperopt allows you to tune over model architectures, which is nice, but I find it tends not to do as well as BayesianOptimization for true hyperparameters (depth of trees, column sampling, L1 and L2 regularization, etc.). However, BayesianOptimization is not as flexible as Hyperopt: it doesn’t have a nice way to handle integer-based hyperparameters (such as tree depth), and it doesn’t allow priors on the parameter space other than the uniform distribution.
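A common workaround for the integer problem is to let the optimizer propose floats and round them inside the objective. Here is a minimal sketch of that idea. Everything in it is an assumption for illustration: `toy_score` is a made-up stand-in for a cross-validated score, the bounds are arbitrary, and `random_search` is a plain random search standing in for the real optimizer so the snippet runs with only the standard library.

```python
import random

def toy_score(max_depth: int, colsample: float) -> float:
    """Hypothetical stand-in for a CV score; peaks at depth 6, colsample 0.8."""
    return -((max_depth - 6) ** 2) - 10.0 * (colsample - 0.8) ** 2

def objective(max_depth: float, colsample: float) -> float:
    # The optimizer only proposes floats, so round tree depth to an int here.
    return toy_score(int(round(max_depth)), colsample)

def random_search(fn, bounds, n_iter, seed=0):
    """Stand-in optimizer (NOT Bayesian): uniform sampling over the bounds."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search bounds, all continuous and uniform.
bounds = {"max_depth": (2.0, 12.0), "colsample": (0.3, 1.0)}
best_params, best_score = random_search(objective, bounds, n_iter=200)

# Round once more when reading the result back out.
best_depth = int(round(best_params["max_depth"]))
```

The downside is that the optimizer sees a step-shaped objective along the rounded dimension, so it can waste evaluations on floats that map to the same integer.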
BayesianOptimization is also there for hyperparameter tuning. But one thing that keeps me from using these libraries is that with cyclic learning, and monitoring val loss for a few iterations, you can set most of the hyperparameter values quickly, and they are often reliable even if you make a mistake (maybe due to superconvergence). As also mentioned in Leslie’s report, most of these hyperparameters are tied to each other, so if I set one hyperparameter badly I can compensate by setting some other hyperparam accordingly.
Another practice that I have started is to change most of my hyperparams during training after some cycles. So I use fit_one_cycle with the model frozen and do some cycles. Then, when I unfreeze the model, I set new hyperparameter values. Sometimes I like to keep doing this: after 5 cycles, for example, I look again for new values of some params, like lr, mom, wd.
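To make the per-stage idea concrete, here is a rough sketch of what the learning rate looks like when each stage gets its own cycle and its own hyperparameters. This is not fastai's implementation: `one_cycle_lr` is a simplified cosine one-cycle schedule written for illustration, and the stage values (`max_lr`, `wd`, `mom`) and the defaults `pct_start`, `div`, `final_div` are all assumed numbers.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div=25.0, final_div=1e4):
    """Cosine warm-up from max_lr/div to max_lr, then cosine decay to max_lr/final_div."""
    warm = max(1, int(total_steps * pct_start))
    if step < warm:
        start, end = max_lr / div, max_lr
        frac = step / warm
    else:
        start, end = max_lr, max_lr / final_div
        frac = (step - warm) / max(1, total_steps - warm - 1)
    frac = min(frac, 1.0)
    # Cosine interpolation: frac=0 gives start, frac=1 gives end.
    return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

# Hypothetical stages: new hyperparameter values are picked for each one.
stages = [
    {"name": "frozen", "steps": 100, "max_lr": 1e-2, "wd": 1e-2, "mom": 0.9},
    {"name": "unfrozen", "steps": 100, "max_lr": 1e-3, "wd": 1e-1, "mom": 0.9},
]

schedule = []
for stage in stages:
    for step in range(stage["steps"]):
        lr = one_cycle_lr(step, stage["steps"], stage["max_lr"])
        # wd is held constant within a stage in this sketch.
        schedule.append((stage["name"], lr, stage["wd"]))
```

Each stage restarts the warm-up from a low learning rate, which is the point where new values for lr, mom and wd can be swapped in.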
The best I came across recently was a combination of an evolutionary / genetic algorithm and BayesianOptimization. Essentially, you create n seed instances, let the optimizer run over each, and those with the best score according to a fitness function make it into the next round. It’s a bit like evolution on steroids. It converges on hyperparameters reasonably fast, and even if it has not found the global max/min, it is usually not far away. The resulting accuracy is very good, although you never really know how or why you ended up with the results.
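The loop described above can be sketched roughly like this. To keep it runnable with only the standard library, the per-seed optimizer here is a simple hill climber, not BayesianOptimization, and the fitness function, bounds, and population sizes are all made up for illustration.

```python
import random

# Hypothetical search space.
BOUNDS = {"lr": (0.0, 1.0), "wd": (0.0, 0.5)}

def fitness(params):
    """Hypothetical fitness, higher is better; optimum at lr=0.1, wd=0.01."""
    return -((params["lr"] - 0.1) ** 2) - ((params["wd"] - 0.01) ** 2)

def random_params(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def local_optimize(params, rng, n_iter=20, scale=0.05):
    """Stand-in for the per-seed optimizer: keep perturbations that improve fitness."""
    best, best_fit = dict(params), fitness(params)
    for _ in range(n_iter):
        cand = {
            k: min(hi, max(lo, best[k] + rng.gauss(0, scale * (hi - lo))))
            for k, (lo, hi) in BOUNDS.items()
        }
        cand_fit = fitness(cand)
        if cand_fit > best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit

def evolve(n_seeds=20, n_rounds=5, keep=5, seed=0):
    rng = random.Random(seed)
    population = [random_params(rng) for _ in range(n_seeds)]
    history = []
    for _ in range(n_rounds):
        # Run the optimizer over every seed, then rank by fitness.
        scored = [local_optimize(p, rng) for p in population]
        scored.sort(key=lambda pf: pf[1], reverse=True)
        survivors = scored[:keep]
        history.append(survivors[0][1])
        # Survivors go to the next round; fresh random seeds refill the rest.
        fresh = [random_params(rng) for _ in range(n_seeds - keep)]
        population = [p for p, _ in survivors] + fresh
    return survivors[0][0], survivors[0][1], history

best_params, best_fit, history = evolve()
```

Because survivors are carried over and the local step never accepts a worse point, the best fitness per round in `history` can only stay flat or improve. Refilling with fresh random seeds is one possible design choice; mutating or crossing over the survivors would be the more classically genetic option.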
I thought genetic algorithms had to go through many iterations before they got good. Doesn’t the genetic approach become a serious problem when you have large models? Most of the time when I have seen people using genetic algos, they use more than 50 candidates. I don’t know much about BayesianOptimization, so maybe that is what makes genetic algos fast.
I’ve been interested in this topic and have been researching it for a while for an ongoing study. One method that looks promising is https://github.com/dragonfly/dragonfly but it’s very new. I don’t think there’s a de facto standard at this point, although most libraries I’ve seen use some form of Bayesian optimization.
@kushaj Yes, usually, genetic algorithms take forever and a day.
BayesianOptimization cuts the time short to a manageable amount,
but only if your dataset and model are of a reasonable size. In practice, 30 to 50 seeds and about 50 iterations are sufficient.
If you have dozens of gigabytes of data or more, I am afraid Jeremy is spot on: you had better stick to the tweaks discussed in the course.