Lesson 10 Discussion & Wiki (2019)

(base) serge@gpu:~/annotations/fastai_docs$ git config --get remote.origin.url | sed 's|^.*//||; s/.*@//; s/[^:/]\+[:/]//; s/.git$//'
fastai/fastai_docs

and branch:

git branch | sed -n '/\* /s///p'

That’s why this lesson needs pytorch-nightly - see the first post of this topic for details.

pytorch 1.0.x’s torch.var doesn’t accept a tuple of dims, only a single int:
https://pytorch.org/docs/stable/torch.html#torch.var

pytorch 1.1.x’s (aka pytorch-nightly at the moment) does:
https://pytorch.org/docs/master/torch.html#torch.var
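
If you’re stuck on 1.0.x, a minimal workaround sketch (the tensor shape here is just for illustration) is to build the variance from sums over the desired dims, which sum() does accept as a tuple:

import torch

x = torch.randn(64, 3, 28, 28)   # illustrative: bs x channels x h x w
dims = (0, 2, 3)
n = x.numel() / x.shape[1]       # elements reduced per channel

# biased variance over several dims, without calling torch.var with a tuple
m = x.sum(dims, keepdim=True) / n
v = (x * x).sum(dims, keepdim=True) / n - m * m
# on pytorch-nightly this matches x.var(dims, keepdim=True, unbiased=False)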

Oh, I see Jeremy already explained it here:

4 Likes

Thanks for the quick response.

1 Like

I can’t figure out the purpose of this code in RunningBatchNorm.forward (07_batchnorm.ipynb)

        if self.step<100:
            sums = sums / self.dbias
            sqrs = sqrs / self.dbias
            c    = c    / self.dbias
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

The divisions by self.dbias cancel out in sums/c and sqrs/c (their relative proportion doesn’t change), and those debiased values aren’t used anywhere else.

That also renders self.dbias useless, other than as something to model mom1 after:
self.mom1 = self.dbias.new_tensor(mom1)
But why do we need self.mom1 at all? It’s a temporary calculation.
And self.step is no longer needed either.
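
A quick toy check of that cancellation (made-up scalar values, just to show that dividing both the sums and the count by dbias leaves the ratio unchanged):

import torch

sums, c, dbias = torch.tensor(10.), torch.tensor(4.), torch.tensor(0.7)
means_debiased = (sums / dbias) / (c / dbias)
means_plain    = sums / c
print(torch.allclose(means_debiased, means_plain))  # True: the dbias factors cancel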

Here is a cleaned up version:

import math
import torch
import torch.nn as nn
from torch import tensor

class RunningBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.register_buffer('sqrs', torch.zeros(1,nf,1,1))
        self.register_buffer('batch', tensor(0.))
        self.register_buffer('count', tensor(0.))

    def update_stats(self, x):
        bs,nc,*_ = x.shape
        self.sums.detach_()
        self.sqrs.detach_()
        dims = (0,2,3)
        s = x.sum(dims, keepdim=True)
        ss = (x*x).sum(dims, keepdim=True)
        c = self.count.new_tensor(x.numel()/nc)
        mom1 = 1 - (1-self.mom)/math.sqrt(bs-1)
        self.sums.lerp_(s, mom1)
        self.sqrs.lerp_(ss, mom1)
        self.count.lerp_(c, mom1)
        self.batch += bs

    def forward(self, x):
        if self.training: self.update_stats(x)
        means = self.sums/self.count
        vars = (self.sqrs/self.count).sub_(means*means)
        if bool(self.batch < 20): vars.clamp_min_(0.01)
        x = (x-means).div_((vars.add_(self.eps)).sqrt())
        return x.mul_(self.mults).add_(self.adds)
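
A quick sanity check that the module runs in both modes (a minimal usage sketch with made-up sizes, not from the notebook):

rbn = RunningBatchNorm(8)              # 8 channels, chosen arbitrarily
xb = torch.randn(32, 8, 14, 14)        # bs x nf x h x w
rbn.train(); _ = rbn(xb)               # training mode: updates the running stats
rbn.eval();  out = rbn(xb)             # eval mode: normalizes with the stored stats
print(out.shape)                       # torch.Size([32, 8, 14, 14])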

I’m getting exactly the same accuracy results with this version, but that doesn’t mean it will hold in the general case.

Unless the intention was to save the debiased temporaries in forward - in that case this would be needed instead:

        if self.step<100:
            sums.div_(self.dbias)
            sqrs.div_(self.dbias)
            c   .div_(self.dbias)
        means = sums/c
        vars = (sqrs/c).sub_(means*means)

Since after doing:

sums = self.sums
sums = sums / self.dbias

sums is no longer an alias of self.sums.

Just had to clarify for myself when a = b stops being an alias in PyTorch.

import torch

def dump(a, b, note): print(f"{note}\na={a}\nb={b}")
a = torch.ones(5)
b = a
dump(a, b, "init")

b = b + 1
dump(a, b, "+ new var")

b = a
b += 1
dump(a, b, "self referring +")

b = a
b.add_(1)
dump(a, b, "add_")

gives:

init
a=tensor([1., 1., 1., 1., 1.])
b=tensor([1., 1., 1., 1., 1.])
+ new var
a=tensor([1., 1., 1., 1., 1.])
b=tensor([2., 2., 2., 2., 2.])
self referring +
a=tensor([2., 2., 2., 2., 2.])
b=tensor([2., 2., 2., 2., 2.])
add_
a=tensor([3., 3., 3., 3., 3.])
b=tensor([3., 3., 3., 3., 3.])

So b = b + 1 does not affect a, whereas the in-place += and add_(1) do.
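
A quick way to check aliasing directly is to compare the storage pointers (small sketch):

import torch

a = torch.ones(5)
b = a
print(a.data_ptr() == b.data_ptr())   # True: b aliases the same storage
b = b + 1
print(a.data_ptr() == b.data_ptr())   # False: b now points at a new tensor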

1 Like

We can think of batch size as a parameter that influences the search for a minimum of the loss function during gradient descent. Since an optimization step is usually taken at the end of each batch, a smaller batch size means more optimization steps per epoch. With a batch size equal to the number of samples in the training set there would be only one optimization step per epoch, because there would be only one batch - but that step would take all of the available training data, and therefore all of the information it contains, into account.

However, considering all of the information at once might not be a good idea: in the multidimensional loss surface there can be crevices and folds leading down to deeper valleys, and these would be missed with larger batch sizes. Reducing the batch size allows gradient descent to explore those tight spaces. On the other hand, a smaller batch reduces the amount of information considered in each step, so some steps will head in the wrong direction. Balancing the precision of each gradient descent step against the ability to follow the curvature of the search space is what we adjust when tuning the batch size hyperparameter.
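
As a concrete illustration of the “more steps per epoch with smaller batches” point (the dataset size here is made up):

import math

n_train = 50_000                      # hypothetical training-set size
for bs in (50_000, 512, 64):
    steps = math.ceil(n_train / bs)   # one optimizer step per mini-batch
    print(f"bs={bs:>6} -> {steps:>4} steps per epoch")
# bs= 50000 ->    1 steps per epoch
# bs=   512 ->   98 steps per epoch
# bs=    64 ->  782 steps per epoch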

3 Likes

Good point - you’re absolutely right. Both the count and the running stats are biased. But they’re biased in the same way, so you can use them directly and the two sets of bias end up canceling out! :slight_smile:

I’ve added an additional section to the notebook pointing this out now.

4 Likes

Please see my additional notes here - perhaps you did mean to save those debiased calculations?

Also, in the newly added section, step is not needed (in 2 places: __init__ plus update_stats).

Oh, and one more tweak: you placed the modified version after another snippet that changed bs to 32, so it’s no longer testing the same thing. You probably need to add:

data = DataBunch(*get_dls(train_ds, valid_ds, 2), c)

to compare apples to apples.

Thanks.

In the course, neither Layer Norm nor Instance Norm could train the network properly. I guess this may be because an appropriate learning rate wasn’t used, so I ran the following experiments with different learning rates.
1. Layer norm

learn,run = get_learn_run(nfs, data, 0.8, conv_ln, cbs=cbfs)
%time run.fit(1, learn)
-----------------------------------
train: [nan, tensor(0.1259, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]

The result is very bad - maybe the learning rate is too large (0.8).
So we decrease the lr (0.8 -> 0.1), and the network trains normally.

train: [0.581599375, tensor(0.8228, device='cuda:0')]
valid: [0.18959957275390624, tensor(0.9433, device='cuda:0')]

Moreover, if we use a one-cycle learning rate schedule within a single epoch, the result is very good - up to 0.97!

sched = combine_scheds([0.3,0.7], [sched_lin(5e-2,0.8), sched_lin(0.8,1e-2)])
--------------------------------
train: [0.5033986328125, tensor(0.8385, device='cuda:0')]
valid: [0.09999825439453125, tensor(0.9699, device='cuda:0')]

The results show that Layer Norm can train the network to a good result, provided a small learning rate is used.

2. Instance norm
A learning rate of 0.1 is used for Instance Norm in the course. But whether I use a smaller lr (1e-2), a larger lr (0.9), or a one-cycle fit, the network can’t train properly.

------------------------------------------------
train: [nan, tensor(0.0986, device='cuda:0')]
valid: [nan, tensor(0.0991, device='cuda:0')]
1 Like

Right. Layer norm is OK, but less stable than batchnorm (at larger batch sizes), so you have to train at lower learning rates (e.g. if you use lr warmup then batchnorm can go to an even higher lr). More importantly, it has problems at inference time, as discussed.
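
For example, a warmup-then-decay schedule in the style of the course’s annealing helpers (assuming the combine_scheds/sched_cos functions from the earlier notebooks; the lr values are made up):

# warm up over the first 30% of training, then anneal back down
sched = combine_scheds([0.3, 0.7], [sched_cos(0.1, 1.0), sched_cos(1.0, 0.1)])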

2 Likes

In notebook “07_batchnorm”, in LayerNorm, shouldn’t it be var instead of std?

def forward(self, x):
    m = x.mean((1,2,3), keepdim=True)
    v = x.std ((1,2,3), keepdim=True)
    x = (x-m) / ((v+self.eps).sqrt())
    return x*self.mult + self.add

Also, I see that torch.var() does not take a tuple as dim - how can I get around this error?

2 Likes

I guess you are right about using the variance instead of the std (in the paper they have std^2).
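
A sketch of the corrected forward (assuming pytorch-nightly, where var accepts a tuple of dims):

def forward(self, x):
    m = x.mean((1,2,3), keepdim=True)
    v = x.var ((1,2,3), keepdim=True)
    x = (x-m) / ((v+self.eps).sqrt())
    return x*self.mult + self.add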

For your torch.var() tuple error see:

1 Like

Well spotted! Fixed now. (It didn’t change the outcome of the training FYI.)

1 Like

No, there’s no need - since the numerator and denominator are always debiased in the same way, they always cancel out.

Fixed - thanks.

That’s intentional - it can be compared to the “What can we do in a single epoch?” section.

1 Like

Pretend you’re doing multi-category classification: if you feed MultiCategory a one-element array as your label, you’ll probably be fine :wink:

https://docs.fast.ai/core.html#MultiCategory

Those cyclic spikes in the activation plots…

I think they are the validation set activations that are getting appended to the training activations. data.train_dl has length 98 minibatches. The spikes occur every 108 minibatches.

Was this already obvious to everyone but me?

3 Likes

Nope - not obvious enough to stop me from making the mistake of leaving the validation set in the pic in the code!
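
One way to keep the validation activations out of the plots is to record stats only while the module is in training mode - a minimal sketch using a plain PyTorch forward hook (not the notebook’s Hooks class):

import torch.nn as nn

stats = ([], [])                          # (means, stds) collected across training batches

def append_stats(mod, inp, outp):
    if not mod.training: return           # skip validation/inference batches
    means, stds = stats
    means.append(outp.data.mean().item())
    stds.append(outp.data.std().item())

layer = nn.Conv2d(1, 8, 3)                # example layer; register on the layers you care about
handle = layer.register_forward_hook(append_stats)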

1 Like

Mystery solved then. For a while I was looking for some weird chaotic dynamic.

You may want to check out this pull request and this thread :slight_smile:

And that’s exactly what __constants__ does (it tells the JIT that these are fixed numbers rather than values that can vary). I guess that’s part of:
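
As a small aside, a minimal sketch of __constants__ with the ScriptModule API of this PyTorch generation (hypothetical module, just to illustrate the idea):

import torch

class Scale(torch.jit.ScriptModule):
    __constants__ = ['factor']            # tell the JIT this attribute is a fixed value

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    @torch.jit.script_method
    def forward(self, x):
        return x * self.factor            # factor is baked into the compiled graph

m = Scale(2.0)
print(m(torch.ones(3)))                   # tensor([2., 2., 2.])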

Best regards

Thomas

1 Like