Type-I Anderson acceleration applied to gradient descent, which has global convergence results for non-smooth fixed-point iterations, has been reported to converge more effectively than Adam. We could then take advantage of large batches in the SGD routine and stabilize convergence by adding structured covariance noise, such as diagonal Fisher noise. We could stabilize SGD more generally by adding logarithmically scaled momentum terms to the gradient updates. For the hyperparameters, including dropout probabilities, data augmentation settings, and discrete hyperparameters, we could try a schedule that is adapted during training, using an approximation of how the hyperparameters vary according to a rank-one affine transformation of the weights. This might avoid some of the guesswork involved in manually running a grid search to select a valid set of hyperparameter values. Finally, we could prune the network in a Kronecker-factored eigenbasis, which allows weights to be culled more easily than with a modified dropout whose probabilities ramp up during training.
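To make the first idea concrete, here is a minimal sketch of type-I Anderson acceleration with a memory of one previous iterate, treating a gradient descent step as the fixed-point map g(w) = w - lr * grad(w). The function names (`anderson_type1_m1`, `gd_step`) and the toy quadratic loss are assumptions for illustration only. The sketch leaves out the safeguarding, regularization, and restart logic that a globally convergent variant would need, and it is not tied to any library.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def anderson_type1_m1(g, x0, iters=50, tol=1e-10):
    """Memory-1 type-I Anderson acceleration for the fixed-point map g (illustrative)."""
    x_prev = list(x0)
    # Residual of the fixed-point map: f(x) = g(x) - x
    f_prev = [gi - xi for gi, xi in zip(g(x_prev), x_prev)]
    # The first step is a plain fixed-point step, since there is no history yet
    x = [xi + fi for xi, fi in zip(x_prev, f_prev)]
    for _ in range(iters):
        f = [gi - xi for gi, xi in zip(g(x), x)]
        if math.sqrt(dot(f, f)) < tol:
            break
        dx = [a - b for a, b in zip(x, x_prev)]   # change in iterate
        df = [a - b for a, b in zip(f, f_prev)]   # change in residual
        denom = dot(dx, df)
        # Type-I coefficient: gamma = (dx . f) / (dx . df);
        # fall back to a plain step (gamma = 0) if the denominator vanishes
        gamma = dot(dx, f) / denom if denom != 0.0 else 0.0
        x_prev, f_prev = x, f
        x = [xi + fi - (dxi + dfi) * gamma
             for xi, fi, dxi, dfi in zip(x, f, dx, df)]
    return x

def gd_step(w, lr=0.1):
    """One gradient descent step on the toy loss (w - 3)^2 (hypothetical example)."""
    grad = [2.0 * (w[0] - 3.0)]
    return [w[0] - lr * grad[0]]

w_star = anderson_type1_m1(gd_step, [0.0])
# w_star[0] is close to 3.0, the minimizer of the toy loss
```

In one dimension this reduces to the secant method applied to the residual, so an affine map such as the step above is solved after a single accelerated update. With a longer memory the scalar `gamma` becomes the solution of a small linear system, which is where the stability safeguards start to matter.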
If anyone is interested in implementing these in the fastai library, feel free. The chapter on callbacks and modifying the training loop in the textbook published by Howard & Gugger in 2020 may be a relevant starting point.
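As a rough picture of how a scheduled hyperparameter could hook into a training loop through a callback, here is a library-free sketch that ramps a dropout probability linearly over training. The `RampDropout` class, the `fit` function, and the `state` dictionary are all hypothetical and are not fastai's API. A real implementation would subclass the library's own callback type and use its actual event names.

```python
class RampDropout:
    """Hypothetical callback: linearly ramps a dropout probability over training."""

    def __init__(self, p_start, p_end):
        self.p_start, self.p_end = p_start, p_end

    def before_batch(self, state):
        # state["pct_train"] is the fraction of training completed, in [0, 1)
        t = state["pct_train"]
        state["dropout_p"] = self.p_start + t * (self.p_end - self.p_start)


def fit(n_batches, callbacks):
    """Toy training loop that only runs callbacks and records the schedule."""
    state = {"pct_train": 0.0, "dropout_p": None}
    history = []
    for i in range(n_batches):
        state["pct_train"] = i / n_batches
        for cb in callbacks:
            cb.before_batch(state)
        # A real loop would run the forward and backward pass here,
        # applying state["dropout_p"] to the model's dropout layers.
        history.append(state["dropout_p"])
    return history


history = fit(4, [RampDropout(0.0, 0.4)])
```

The same hook could carry any of the other ideas above, for example injecting noise into gradients after the backward pass, provided the loop exposes an event at that point.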