Hi Dan, for #1 you will usually be better off increasing the number of images in your training set. That almost always helps things along. Other things you may want to try are adjusting the number of layers or nodes in your neural network, augmenting the data (as discussed in what others have posted above) and increasing the dropout rate. I’m sure all of these are covered somewhere in Part 1 and/or 2, but I confess I am an old-school Andrew Ng/Coursera deep learner, and am less familiar with the fast.ai course progression.
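To make the augmentation and dropout ideas concrete, here's a minimal NumPy sketch (not fast.ai's actual API, and the array shapes are made up): a flip augmentation gives you a "new" training image with the same label, and inverted dropout randomly zeroes activations while rescaling the survivors.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_flip(image):
    """One simple augmentation: mirror the image left-to-right.
    The label stays the same, so the training set effectively doubles."""
    return image[:, ::-1]

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out a random fraction of activations and
    rescale the survivors so the expected total activation is unchanged."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

image = rng.random((8, 8))
flipped = augment_flip(image)      # flipping twice recovers the original

acts = np.ones(10_000)
dropped = dropout(acts, rate=0.5)  # mean stays close to 1.0
```

In a real fast.ai or PyTorch pipeline you'd use the built-in transforms and dropout layers rather than hand-rolling these, but the mechanics are the same.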
The more data you add and the more complex you make your model, the more epochs you will probably need, because the model will converge more slowly. But you are more likely to get better results. This is very different from just running more epochs without changing anything else (your second suggestion). If you do that, the model will appear to keep getting better and better on your training set, but eventually it may overfit, and your real-world results will actually get worse. This is the difference between training loss and validation loss (the latter is what I would call your "K parameter").
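That training/validation split is exactly what early stopping exploits. Here's a toy sketch (plain Python, with synthetic loss numbers I made up): the training loss falls forever, but you stop at the epoch where the validation loss bottoms out.

```python
def best_epoch(val_losses, patience=3):
    """Pick the epoch to stop at: the last epoch whose validation loss
    improved, giving up after `patience` epochs without improvement."""
    best, best_i, waited = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i, best

# Synthetic curves: training loss keeps falling, but validation loss
# bottoms out at epoch 4 and then climbs -- classic overfitting.
train = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18, 0.12, 0.08]
valid = [1.1, 0.8, 0.6, 0.50, 0.45, 0.48, 0.55, 0.65]
epoch, loss = best_epoch(valid)  # epoch 4, loss 0.45
```

If you'd judged by the training curve alone, you'd have kept going to epoch 7 and shipped a worse model.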
For #2, if your purpose is to simultaneously compare results based on two different training sets, I think you'd have to run that on two machines (or one machine with 2 GPUs, using a different run on each GPU). There's another concept called data parallelism, where you spread a single training set over multiple GPUs, but that doesn't sound like what you have in mind (and I'm not very familiar with it).
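For what it's worth, the core trick behind data parallelism is simple to sketch: each device computes the gradient on its own shard of the batch, and averaging those per-shard gradients recovers the full-batch gradient. Here's a NumPy toy with a linear model and two simulated "GPUs" (everything here is made up for illustration; real frameworks handle this for you, e.g. PyTorch's DistributedDataParallel):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_mse(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# One batch of 64 examples and two simulated "GPUs".
X = rng.random((64, 3))
y = rng.random(64)
w = rng.random(3)

# Data parallelism: each device gets half the batch, computes its own
# gradient, and the results are averaged (equal-sized shards assumed).
shards = np.split(np.arange(64), 2)
grads = [grad_mse(w, X[idx], y[idx]) for idx in shards]
parallel_grad = np.mean(grads, axis=0)

full_grad = grad_mse(w, X, y)  # matches the single-device gradient
```

With equal-sized shards, the mean of the per-shard gradients is exactly the full-batch gradient, which is why splitting one training set across GPUs gives the same update as one big GPU, just faster.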