Rebuilding SOTA architectures with Self-Tuning Hyperparameter Schedules

Hi all,

I’m new to the forums.

I’d like to contribute by rebuilding state-of-the-art architectures, combining them with best practices from the Bayesian ML literature on how to tune networks, plus any other tweaks that make these architectures easier to train and more expressive (e.g. better activation functions, better initialization of residual/dense blocks, etc.). I want these models to be so easy to use, yet so efficient, that a person with no AI background could apply them to practical problems.

Like, no AI background. Barely even coding skills.

If there are hyperparameters, they should tune themselves. Activation functions should be as expressive as possible, and weight initializations should be sound enough for training at very large learning rates (perhaps even one-cycle training).

Eventually, I want these models to be compact enough for embedded devices, and ideally simple enough that they can be trained from scratch to SOTA performance.

As a proof of concept, I propose we benchmark on well-studied 2D image classification datasets (e.g. CIFAR, ImageNet).

Step 1. Modify DenseNets to use MPELU activations, and add the biases and multipliers and scale the Dense-block initializations according to Fixup init. I’ve attached a sketch of what a Dense block that synthesizes these three ideas would look like, and a rough code sketch follows it. As a sanity check: if all three of these are SOTA on their own, would they complement each other (e.g. enough to allow one-cycle training)?

Do they obviate the need for batch norm? If not, we can add BN before/after each MPELU; BN does not degrade MPELU’s performance the way it does ELU’s.

DenseNet,MPELU,FixupInit_Diagram.pdf (399.5 KB)
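
To make Step 1 concrete, here is a minimal PyTorch sketch of one layer of such a Dense block, under my own assumptions: MPELU as a channel-wise parametric ELU with learnable alpha and beta, plus Fixup-style scalar biases, a multiplier, and depth-scaled initialization in place of batch norm. The exact scaling exponent and bias placement for dense connectivity are guesses, not the diagram’s final design; a BN layer could be inserted after the MPELU if it turns out to be needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPELU(nn.Module):
    """MPELU: x if x > 0, else alpha * (exp(beta * x) - 1),
    with channel-wise learnable alpha and beta."""
    def __init__(self, channels, alpha=1.0, beta=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1, 1), alpha))
        self.beta = nn.Parameter(torch.full((1, channels, 1, 1), beta))

    def forward(self, x):
        neg = self.alpha * (torch.exp(self.beta * x.clamp(max=0.0)) - 1)
        return F.relu(x) + neg

class FixupDenseLayer(nn.Module):
    """One dense layer without batch norm: scalar biases and a multiplier
    around the conv (Fixup-style), conv init scaled down with depth,
    then MPELU, then concatenation onto the input (dense connectivity)."""
    def __init__(self, in_ch, growth_rate, total_layers):
        super().__init__()
        self.bias_in = nn.Parameter(torch.zeros(1))
        self.bias_out = nn.Parameter(torch.zeros(1))
        self.multiplier = nn.Parameter(torch.ones(1))
        self.conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3,
                              padding=1, bias=False)
        self.act = MPELU(growth_rate)
        nn.init.kaiming_normal_(self.conv.weight)
        # Assumption: reuse Fixup's idea of shrinking the init with depth;
        # the right exponent for dense connectivity is an open question.
        with torch.no_grad():
            self.conv.weight.mul_(total_layers ** -0.5)

    def forward(self, x):
        out = self.conv(x + self.bias_in)
        out = self.act(self.multiplier * out + self.bias_out)
        return torch.cat([x, out], dim=1)
```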

Step 2. Use drop-in replacements for the conv2d and linear layers, with “hyper” counterparts (see Appendix G of this paper) that tune themselves over the course of training the network. This roughly doubles the FLOPs per forward and backward pass, but DenseNets have half or fewer parameters than comparably accurate ResNets (the CondenseNet or memory-efficient versions compress even further).
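
I haven’t worked through Appendix G carefully enough to reproduce its exact parameterization, but the drop-in idea might look roughly like the sketch below; the class name, the per-channel response, and the way the hyperparameters modulate the layer are all my assumptions. A self-tuning linear layer would follow the same pattern.

```python
import torch
import torch.nn as nn

class HyperConv2d(nn.Module):
    """Sketch of a 'hyper' conv layer: a base conv plus a perturbation conv
    whose output is scaled per channel by a learned response to the
    current hyperparameter vector."""
    def __init__(self, in_ch, out_ch, kernel_size, n_hparams, **kwargs):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)
        self.delta = nn.Conv2d(in_ch, out_ch, kernel_size, bias=False, **kwargs)
        self.response = nn.Linear(n_hparams, out_ch)  # hparams -> per-channel scale

    def forward(self, x, hparams):
        scale = self.response(hparams).view(1, -1, 1, 1)
        return self.base(x) + scale * self.delta(x)

# usage: the same hyperparameter vector is fed to every hyper layer
layer = HyperConv2d(3, 16, kernel_size=3, n_hparams=2, padding=1)
y = layer(torch.randn(8, 3, 32, 32), torch.tensor([0.5, 0.1]))
```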

The authors report that hyperparameter schedules consistently outperform fixed values in practice. Their technique can produce schedules for discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities.
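
As an illustration of how a dropout probability can be made tunable at all (my rough sketch, not the paper’s code): one common trick is to keep the probability as a logit and relax the Bernoulli mask so gradients flow through it.

```python
import torch

# The dropout probability is kept as an unconstrained logit so a
# hyper-optimizer can take gradients through it; the soft mask is a
# relaxed ("concrete") Bernoulli.
drop_logit = torch.zeros(1, requires_grad=True)

def relaxed_dropout(x, temperature=0.1):
    p = torch.sigmoid(drop_logit)                    # current drop probability
    u = torch.rand_like(x).clamp(1e-6, 1 - 1e-6)     # uniform noise
    keep = torch.sigmoid((torch.log(u) - torch.log(1 - u)
                          + torch.log(1 - p) - torch.log(p)) / temperature)
    return x * keep / (1 - p)                        # inverted-dropout scaling
```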

I do not know if this self-tuning hyperparameter method would be compatible with one-cycle training. The authors do note that hyperparameters equilibrate at a much faster rate than weights, often reaching the same schedule regardless of their initial value. Perhaps they would equilibrate fast enough for use with very large learning rates on the weights?

Step 3. Investigate how to regularize and compress the models. As a first pass, we could look into targeted dropout, which allows fine-grained control over the sparsity of the network (rough sketch below).
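
My understanding of targeted (weight) dropout, sketched with parameter names I made up: rank each unit’s weights by magnitude, treat the lowest-magnitude fraction as candidates, and drop only those with some probability, so the network learns to tolerate pruning exactly where we intend to prune.

```python
import torch

def targeted_weight_dropout(weight, targ_frac=0.5, drop_rate=0.5, training=True):
    """For each output unit, mark the `targ_frac` fraction of its weights
    with the smallest magnitude as candidates, then drop each candidate
    with probability `drop_rate`."""
    if not training or targ_frac <= 0:
        return weight
    w = weight.reshape(weight.shape[0], -1)            # (units, fan_in)
    k = max(1, int(targ_frac * w.shape[1]))
    thresh = w.abs().kthvalue(k, dim=1, keepdim=True).values
    candidates = w.abs() <= thresh                     # low-magnitude weights
    drop = candidates & (torch.rand_like(w) < drop_rate)
    return (w * (~drop).float()).reshape(weight.shape)
```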

I’m open to suggestions. This is not exhaustive, just some of the essentials I want to apply to the proof of concept. I want to benchmark on 2D images first, then apply the same approach to rebuilding best practices into other architectures (e.g. for 3D object detection, language, and translation).

Thanks for any comments or critiques!

-Tyler

P.S. When S4TF becomes practical to use (in a year or two), maybe we can port these over to the new fastai library there! People could actually deploy these directly into applications, like mobile devices or embedded systems. Sounds fun, right?


@sgugger does this seem feasible to you? Specifically, how would I implement a DenseBlock as shown in this diagram?

DenseNet,MPELU,FixupInit_Diagram.pdf (399.5 KB)

The initializations are derived from the MPELU paper, and the biases/multipliers between the cascading connections are inspired by the Fixup article.

Do you have any resources related to hyperparameter schedules that are not only focused on learning rate?

“Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions” schedules hyperparameters, both discrete and continuous, but not the learning rate.

Sorry for the delay.