Machine Learning to Automate Learning

Discussion brought over from a conversation with @stas in Developer Chat about how to use ML to automate tasks such as the selection of batch size, architecture, learning rate, etc.

In this context gc.collect() is only useful for cleaning up the mess left after deleting problematic objects that don’t clean up after themselves (learn is one of those). But the key is torch.cuda.empty_cache() if you are relying on nvidia-smi for visual memory monitoring.
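A minimal sketch of that cleanup sequence, assuming learn is the object being discarded:

```python
import gc
import torch

del learn                   # drop the object that holds references to GPU tensors
gc.collect()                # collect anything stuck in reference cycles
torch.cuda.empty_cache()    # return cached blocks to the driver so nvidia-smi reflects real usage
```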

I think that’s great. I was also experimenting with batch sizes to take full advantage of available GPU RAM, but ideally, we would have the Learner or a peripheral do that dynamically, which I think you may already be working on in ipyexperiments. :+1:

It should be trivial to automate that - just catch OOM in fit() and reduce bs and try again till it works.
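A minimal sketch of that pattern around a single fit call, assuming learn was built with the current bs (a fuller retry loop appears later in the thread):

```python
import gc
import torch

try:
    learn.fit_one_cycle(1)
except RuntimeError as e:
    if 'out of memory' not in str(e):
        raise                           # re-raise anything that isn't an OOM
    del learn
    gc.collect()
    torch.cuda.empty_cache()            # release cached blocks before retrying
    # rebuild the DataBunch/Learner with a smaller bs and call fit again
```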

The problem is that bs is not the only hyper-parameter that affects the memory footprint. And the user needs to have control over what they choose to increase/decrease, because the outcome is heavily impacted by an intelligent choice - a process which cannot yet be fully automated.

Eventually, libraries like fastai will have machine learning built into their decision-making process, so that they can make such intelligent choices, but we aren't quite there yet. We are building ML components, but we aren't using them to make better ML components (yet).

In the future, fastai will learn your behavior as you tweak hyper-parameters and re-run the training, and will try to anticipate your choices for you. And of course, it could gather the intelligence from all fastai users so that the community can share it, and new users wouldn't have to train their fastai install - they could use a pre-trained fastai.


I see

Amazing

@stas now that you bring this up it is so clear. What are the smallest incremental steps that we could implement where the library could learn from user behavior? Can/should we start a new thread for those ideas?

Probably it’s a good idea, @bfarzin

I think this would be an unsupervised learning problem.

Let’s use vision as an example.

The space to be searched will include the main hyperparameters/settings:

  • lr, bs, wd, mom, etc.
  • image size
  • transforms

So we will need to define a loss function which would include:

  • fitting the given free GPU memory
  • training time
  • improvement of metrics (short/long term)
  • underfitting/overfitting
  • freezing/unfreezing layers
  • etc.

All the initial values and ranges to try can be designed using the knowledge we already have (e.g. forums, lessons, personal experience).
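As a toy sketch of what such a composite objective could look like (none of this is fastai code; the weights are arbitrary, and Trial/run_trial are made-up placeholders for an actual short training run):

```python
from collections import namedtuple
from itertools import product

# made-up container for what a single short training run would report
Trial = namedtuple('Trial', 'peak_mem mem_budget train_time metric train_loss valid_loss')

def score(t):
    if t.peak_mem > t.mem_budget:          # hard constraint: must fit in the free GPU RAM
        return float('inf')
    overfit_gap = max(0.0, t.valid_loss - t.train_loss)
    return ((1 - t.metric)                 # reward metric improvement
            + 0.1 * t.train_time / 60      # penalize training time (minutes)
            + 0.5 * overfit_gap)           # penalize overfitting

search_space = list(product((1e-3, 3e-3), (32, 64), (128, 224)))   # lr, bs, image size
# best = min((run_trial(lr, bs, size) for lr, bs, size in search_space), key=score)
```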

And once we have all that in place, then we can probably replace the fit/fit/unfreeze/fit sequence with just:

learn.do_the_magic()

and it will churn for a while and return the best sequence to run to reproduce the best outcome. Or perhaps a few of them for different definitions of the best outcome: fastest, “bestest”, slimmest.

Now, once a user has trained this model through the process of running it, we need to think about how to re-use this trained model for a variety of datasets. And then how we can gather the pre-trained models from different users, so that we have a community pre-trained model which could be run right away and then "unfrozen" to fine-tune for each unique situation.

Reads like a piece of sci-fi, but why not - all the tech we have nowadays started as sci-fi.


That will be the function call! I love it.
A long time ago I used Spearmint to select hyperparams from a range of possible values and find the best loss possible. That was more about selecting architecture params (how many layers, how many nodes per layer, etc.) than about the fitting params themselves.

I will bite on the vision idea. It is as good a place to start as any. Seems like you already have an objective function that will work with memory size. What are all the params that would affect GPU memory for, say, dog breeds? Then we could build a net that predicts that and tries to get it right the first time?

It’s very possible that the initial approach can start with a good old simple search algorithm. The ML approach would only become beneficial once there is enough experimental data to rely on.

In addition to my notes earlier, I think it'd speed things up a lot if the do_the_magic function were to receive a hand-crafted dataset subset, so that it could do the initial selection of parameters much faster. That way it has enough representative train/valid data to experiment with first, and can then refine/validate the best choices against the larger dataset.
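A hedged sketch of that subset idea (the path and the 10% fraction are just illustrative):

```python
import random
from pathlib import Path

path = Path('data/train')                                    # illustrative location
all_files = sorted(path.rglob('*.jpg'))
random.seed(42)
subset = random.sample(all_files, max(1, len(all_files) // 10))
# build the DataBunch/Learner from `subset`, run the parameter search on it,
# then re-validate the few best configurations against `all_files`
```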

I will bite on the vision idea. It is as good a place to start as any. Seems like you already have an objective function that will work with memory size. What are all the params that would affect GPU memory for, say, dog breeds? Then we could build a net that predicts that and tries to get it right the first time?

Same as any other image categorization problem. But I understand that you're posing a concrete problem, so dog breeds it is.

And then, are you suggesting that the first goal is not to optimize for the best outcome, but just for the optimal parameters that fit into the available RAM and allow training to complete?

In this scenario the main parameters would be bs, image size and the model choice (its size). I think it is enough to run a single batch through the model to get all the allocations in place - there is no need to run a whole epoch, and subsequent epochs do not consume more GPU RAM. Also note that the very first batch of the first epoch always consumes more RAM than all the subsequent runs, but the pytorch allocator can scale down if less memory is available - it is optimized for speed, so it does not free temporary allocations when there is spare GPU RAM, but when there isn't, it runs slower and fits into less memory. Therefore this first-batch memory allocation shouldn't be used as the actual memory needed (it will require slightly more RAM than the second batch will use). Moreover, due to GPU RAM fragmentation it's almost never the case that all of the reported free memory is actually 100% usable.

So one approach is to measure how much memory is needed to run with the given parameters, in which case you'd want to measure the 2nd batch and not the first (or the difference in used memory before starting training and after batch 2). You could inject some profiling inside fit() using GPUMemTrace https://github.com/fastai/fastai/blob/master/fastai/utils/mem.py#L127, or use the callback metric PeakMemMetric https://github.com/fastai/fastai/blob/master/fastai/callbacks/mem.py#L9
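A sketch of that measuring approach using plain torch.cuda counters rather than the fastai helpers linked above; learn.data.train_dl, learn.model and learn.loss_func are standard fastai v1 attributes, the rest is illustrative and assumes a reasonably recent pytorch:

```python
import torch

def peak_mem_of_second_batch(learn):
    it = iter(learn.data.train_dl)
    xb, yb = next(it)                               # 1st batch: run and discard, it over-allocates
    learn.loss_func(learn.model(xb), yb).backward()
    learn.model.zero_grad()
    torch.cuda.reset_max_memory_allocated()         # start measuring from here
    xb, yb = next(it)                               # 2nd batch: the realistic number
    learn.loss_func(learn.model(xb), yb).backward()
    learn.model.zero_grad()
    return torch.cuda.max_memory_allocated()        # peak bytes used by the 2nd batch
```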

The other approach is to start big, and hit OOM via try/catch, reduce and then re-run until OOM is no more.

Note that we are working on making fastai very slim: @sgugger added learn.purge yesterday, which can basically reset memory usage to 0 plus the model size at any time during training (i.e. between trainings). Since CUDA allocates some 0.5GB per process for its context, you end up with 0.5GB + model size used, but otherwise the rest of GPU RAM should be free (assuming you didn't create your own vars that consume GPU RAM).
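A minimal usage sketch based on that description (the epoch counts are arbitrary):

```python
learn.fit_one_cycle(4)    # train the head
learn.purge()             # drop everything except the CUDA context (~0.5GB) and the model
learn.unfreeze()
learn.fit_one_cycle(2)    # continue fine-tuning from a near-minimal memory baseline
```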

Did I miss anything? I don’t think optimizer and metrics params will make any noticeable impact, but I could be wrong - I haven’t looked into these.


I want to start there with this particular problem and see if I can understand all the mem usage constraints. It could be that some things are linear (2x bs = 2x memory) and others are not.

Something about the try/catch method appeals to me. You try the biggest setup you want, then scale down until it fits. do_the_magic() would return a few options: triples of params that fit on your GPU, and you pick the setup you want to run out of those options.

Let me try to play with these things and see what headway I can make. I am also open to other ideas/directions suggested or recommended by the community.


This is great. I agree that at some point we should be able to just set memory and time constraints and optimize for a validation metric. I read a paper a couple of weeks ago with nice comparisons between random, grid and Bayesian optimization. Will try to find it and attach it here.

Just run learn.do_the_magic(), leave, enjoy the rest of the day, and come back to find yourself in the top 10% of a Kaggle competition :smiley:

For a competitive result I believe error analysis will make the real difference, so inference modules can also play a significant role in this iterative ML dev process.

I would most probably be happy with a flow like this:

First, collect baseline data:

  • Do the magic
  • Run inference, analyze as a human. Make changes to the data, loss function, task, etc. (sometimes it helps to approach a problem from a different angle rather than the most obvious way, e.g. the DS Bowl 2018 winners) - this might be the part which gets automated last, as it involves creativity
  • Do the magic
  • Repeat

This is a very exciting time, and I agree this shouldn't be very far off :smiley:



I have a simple notebook (prototype) that catches OOM and then reduces the batch size. In the example, it iterates over different resnet architectures.

I’m also using ipyexperiments to test out different outputs as I reduce the available memory.

I am looking for thoughts/feedback on how this is set up and how it can be improved. The next step would be to scale the image size up/down.

I think with a few hours and some range of params we could build a table that gets you pretty close to the max image size/batch size for a given arch/memory available. I think that is probably a smooth surface.


Looking through your output, you've got 128GB of RAM - I wasn't prepared for that - I've now adjusted ipyexperiments to dynamically calculate the report column width and added commas in large numbers. So it should look prettier/more readable.


What’s the status here? Now that StopAfterNBatches is in the library, it would be great to have automatic bs detection.
Your notebook, @bfarzin, does not exist anymore.
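A hedged sketch of what that could look like, combining StopAfterNBatches with the OOM try/catch idea from earlier in the thread; make_learner(bs) is a hypothetical factory that rebuilds the DataBunch/Learner at the given batch size, and the import path assumes fastai v1:

```python
import gc
import torch
from fastai.callbacks.misc import StopAfterNBatches

def find_bs(make_learner, bs=512, min_bs=4):
    while bs >= min_bs:
        learn = make_learner(bs)                    # hypothetical factory
        try:
            # run only a couple of batches - enough to trigger the real allocations
            learn.fit(1, callbacks=[StopAfterNBatches(n_batches=2)])
            return bs                               # this bs survives without OOM
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise
            del learn
            gc.collect()
            torch.cuda.empty_cache()
            bs //= 2
    raise RuntimeError(f"no batch size >= {min_bs} fits in GPU memory")
```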