How does epoch work?


I just finished watching the first video lecture and I don’t quite understand the concept of an epoch. I know we need a test set and a validation set, and that there are various cross-validation techniques like k-fold CV. Does an epoch work the same way?

So if I specify epoch=1, it’d use the whole dataset, and training and validation would be performed on the same set? If I have epoch=2, it’d divide the dataset into two halves, one for training and another for testing? And similarly, with epoch=3, the data is divided into 3 parts, 2 parts for training and 1 for validation?

Can someone validate my assumption? I may be totally wrong and would like to get clarification on it.


An epoch is just one full pass through the dataset.

Here is an answer with a bit more background: Difficulty in understanding Epochs


Thanks. Much clearer. I just watched the first video; where do I specify the batch size? And let’s say my batch size is 5 with 100 observations and I do 3 epochs, is it right to assume that 40 of the observations will not be part of training?

Epochs consist of batches. If you have a dataset that contains 100 images and you set the batch size to 4, it will be broken down into 25 batches, each containing 4 images.

With the same dataset, if you set the batch size to 20, it will break the 100 images into 5 batches, each containing 20 images.

If you opt to train for 3 epochs, regardless of the batch size, during training you will run through the entire dataset 3 times, presenting all the images in it to the model. The difference is that with bs = 4 it will have to look at 25 batches per epoch (though each batch will be smaller), and with bs = 20 it will see 5 batches per epoch.
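To make the arithmetic above concrete, here is a minimal sketch in plain Python. It is not how any particular library implements batching; `iterate_batches` and `count_passes` are hypothetical helper names, and the "dataset" is just a list of indices standing in for the 100 images. It counts how many batches (weight updates) happen and how often each item is seen.

```python
def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of the dataset; the last one may be smaller."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


def count_passes(n_items, batch_size, n_epochs):
    """Return (number of batches processed, times each item was seen)."""
    dataset = list(range(n_items))  # stand-in for 100 images
    seen = [0] * n_items
    n_updates = 0
    for _ in range(n_epochs):
        for batch in iterate_batches(dataset, batch_size):
            n_updates += 1  # in real training: one weight update per batch
            for idx in batch:
                seen[idx] += 1
    return n_updates, seen


updates_bs4, seen_bs4 = count_passes(100, batch_size=4, n_epochs=3)
updates_bs20, seen_bs20 = count_passes(100, batch_size=20, n_epochs=3)

print(updates_bs4, updates_bs20)          # 75 15  (25 * 3 and 5 * 3)
print(set(seen_bs4), set(seen_bs20))      # {3} {3}: every item seen 3 times
```

The batch size only changes how the pass is chopped up. Every observation still takes part in training once per epoch, so nothing is left out.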


Thanks a lot. My presumption was wrong. So if we’re looking at everything in each epoch, what role does batch size play?

We need a batch size that will fit our hardware. For example, depending on the amount of RAM in your GPU, you might need to lower the batch size.

Models nowadays are so complex that, for most datasets, it is impossible to evaluate a derivative on all of the training examples at once. And whenever it is possible, you should definitely give it a shot to see what results you get! (For instance, this can be done with the MNIST dataset and a very simple linear model, maybe even something slightly more complex, but I can’t remember.)

There is more to the interplay between batch size and training set size, though. Quite often the images in a dataset will be similar to one another. Say you have 25000 images of cats and dogs. If you take a random sample of 64, on average you will get 32 images of cats and 32 images of dogs. Chances are that based on the 32 images of cats you can say something about all the cat images in the dataset: they are a representative sample and capture useful information that can be extrapolated to all cat images in the dataset.

By looking at a sample of images we can perform weight updates more often than if we were to consider the entire dataset. Since our derivative is only meaningful locally, one could argue that taking smaller steps more often should help us get around the weight space more efficiently than if we were to take better-directed but infrequent large jumps. I am not sure that is the case, though, as there was a paper published stating that you can get similar results with larger batches and an increased learning rate…
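A toy sketch of the "more frequent updates" intuition, under loud assumptions: a one-parameter model `y = w * x` fitted to synthetic data whose true slope is 3, plain gradient descent, and the same learning rate for both runs. None of this comes from the course; `grad` and `train_one_epoch` are made-up names, and a toy like this says nothing definitive about real networks.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data: y = 3 * x plus a little noise (an assumption for the demo).
xs = [rng.uniform(-1, 1) for _ in range(1000)]
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]


def grad(w, samples):
    """Gradient of the mean squared error of y = w * x over the samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def train_one_epoch(w, batch_size, lr=0.1):
    """One pass over the data; one weight update per batch."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_updates = 0
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        w -= lr * grad(w, batch)  # the batch gradient estimates the full one
        n_updates += 1
    return w, n_updates


w_full, updates_full = train_one_epoch(0.0, batch_size=len(data))
w_mini, updates_mini = train_one_epoch(0.0, batch_size=50)

print(updates_full, updates_mini)  # 1 20
print(w_full, w_mini)              # w_mini ends up closer to the true slope 3
```

With the learning rate held fixed, the 20 small updates move `w` much further towards 3 in one epoch than the single full-batch update does. Raising the learning rate for the full-batch run would narrow that gap, which is in line with the caveat about larger batches and larger learning rates.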

Please take what I wrote above with a grain of salt - this is just some intuition I built up that lets me reason about things but is not necessarily 100% accurate.

The short and somewhat boring answer to the question of why we look at things in batches, and one that is 100% accurate, is that for most of the problems we want to work on, there is no other way due to the limitations of the hardware we have. This is also the space that has been explored to the greatest extent - I wonder what training on a TPU, or whatever the beasts from Google are called, would be like with much greater learning rates. If our batch sizes were much larger, and hence the derivatives more reflective of what they would be if calculated on the entire training set, would this reduce the value of adaptive learning methods?

Just some more or less accurate musings of mine. The answer still holds though: we need to look at things in batches since quite often we don’t have any other way.

This is something I might be completely hallucinating, but I also think I ran into some statements at some point suggesting that for some NLP problems a smaller batch size works better. This might be completely incorrect though.