Hey Zachary, thank you for this. I had a few issues getting started, but it seems to be working now. One thing I wonder about: have you explored the nature of the variance with regard to hyperparam optimization? We would prefer not to do full training, so the two weapons at our disposal for cutting down on time are:
- `use_partial_data` to only use a subset of the data and speed things up
- train fewer epochs and hope the relationship between the hyperparams and accuracy holds.
Obviously there must be a point where your results are pure noise/variance, like 1% of the data for 1 epoch, but do you have any sense of how much data and how many epochs you need to get consistent results that actually correlate with higher accuracy in full training?
I wrote a small grid search library, and one function it had was a variance test. It would run a single set of params n times using the chosen amounts of data and epochs, and then show you the range of results that came back. If the accuracy was more or less the same for every run, you were good to go; if there were 5% swings in accuracy, you probably weren't going to learn anything and had to increase the data % and epochs. Any ideas for this re: Bayesian optimization?
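
To make the variance test concrete, here is a minimal sketch of the idea, not the code from my library. Everything named here is an assumption for illustration: `train_and_eval` is a hypothetical callable you would supply (it takes params, a data fraction, an epoch count and a seed, and returns an accuracy), and `max_spread` is an arbitrary threshold. `fake_train_and_eval` is a toy stand-in whose noise shrinks as the budget grows, just so the sketch runs on its own.

```python
import random
import statistics


def variance_test(train_and_eval, params, n_runs=5,
                  data_fraction=0.1, epochs=1, max_spread=0.01):
    """Run one fixed set of params n_runs times and summarize the spread.

    train_and_eval is a hypothetical user-supplied callable:
        train_and_eval(params, data_fraction, epochs, seed) -> accuracy
    Only the seed changes between runs, so any difference in accuracy
    is run-to-run noise at this data/epoch budget.
    """
    accuracies = [
        train_and_eval(params, data_fraction, epochs, seed)
        for seed in range(n_runs)
    ]
    spread = max(accuracies) - min(accuracies)
    return {
        "accuracies": accuracies,
        "min": min(accuracies),
        "max": max(accuracies),
        "mean": statistics.mean(accuracies),
        # stdev needs at least two samples
        "stdev": statistics.stdev(accuracies) if n_runs > 1 else 0.0,
        "spread": spread,
        # "stable" means the budget is probably large enough to compare params
        "stable": spread <= max_spread,
    }


def fake_train_and_eval(params, data_fraction, epochs, seed):
    """Toy stand-in for real training (an assumption, not a real model).

    Noise gets smaller as data_fraction * epochs grows.
    """
    rng = random.Random(seed)
    noise_scale = 0.05 / (1 + 100 * data_fraction * epochs)
    return 0.80 + rng.uniform(-noise_scale, noise_scale)


if __name__ == "__main__":
    for frac, ep in [(0.01, 1), (0.1, 2), (1.0, 10)]:
        result = variance_test(fake_train_and_eval, {"lr": 1e-3},
                               data_fraction=frac, epochs=ep)
        print(f"data={frac:.0%} epochs={ep} "
              f"spread={result['spread']:.4f} stable={result['stable']}")
```

If the spread comes back above the threshold, you would bump the data fraction or epochs and rerun before trusting any comparison between hyperparams at that budget.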