Struggles with completing lesson 1 on Google Colab due to its GPU memory limitation

I got inspired by Manikanta’s “Fast.ai Lesson 1 on Google Colab (Free GPU)”

and for a few days now I have been trying to get the first lesson’s notebook to run there, unsuccessfully so far.

Either things fail due to lack of memory, or other errors crop up. Even with sz=60 and bs=16 I am still unable to complete the run.

I tried a few forks of the code base and notebooks people posted here in the forums, to no avail. The hardware/software is identical for anybody running on a Google Colab GPU, so I’m unsure why some people reported success in several threads here.

Do any of you have a lesson 1 v2 notebook that you could share that you know that completes on google colab?

Thank you!

2 Likes

Hey,

Did you see the thread Colaboratory and Fastai?

I posted there extensively about solving the memory issues. Essentially, for some odd reason, some people on Google Colab hit a problem when the code compiles operations for the GPU: there is an almost ‘double compiling’ thing happening. Check it out on the thread above; I shared a notebook which I forked and edited, and it seems to be working for some people, while others have a new issue which has yet to be sorted out. Still, take a look, maybe it’ll fix your problem! :slight_smile:

Yes, Jacob, I read that thread and I tried to run the notebook you kindly shared, but it fails early on, in the very first training call. I thought perhaps there had been some changes in the main version of fastai, so I forked the current version on GitHub, applied the changes you suggested to remove to_gpu(), and it fails just the same.
That’s why I’m asking for help :frowning:

running your branch:

!pip uninstall -y fastai
!pip install git+https://github.com/jfpettit/fastai

this section fails:

arch=resnet34
#data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=32)
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=16)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)

Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /content/.torch/models/resnet34-333f7ec4.pth

100%|██████████| 87306240/87306240 [00:03<00:00, 25361894.91it/s]

0%| | 0/1438 [00:00<?, ?it/s]


RuntimeError Traceback (most recent call last)
in ()
2 #data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=32)
3 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=16)
----> 4 learn = ConvLearner.pretrained(arch, data, precompute=True)
5 learn.fit(0.01, 3)

/usr/local/lib/python3.6/dist-packages/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, **kwargs)
97 def pretrained(cls, f, data, ps=None, xtra_fc=None, xtra_cut=0, **kwargs):
98 models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg, ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut)
---> 99 return cls(data, models, **kwargs)
100
101 @property

/usr/local/lib/python3.6/dist-packages/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
90 elif self.metrics is None:
91 self.metrics = [accuracy_thresh(0.5)] if self.data.is_multi else [accuracy]
---> 92 if precompute: self.save_fc1()
93 self.freeze()
94 self.precompute = precompute

/usr/local/lib/python3.6/dist-packages/fastai/conv_learner.py in save_fc1(self)
142 m=self.models.top_model
143 if len(self.activations[0])!=len(self.data.trn_ds):
---> 144 predict_to_bcolz(m, self.data.fix_dl, act)
145 if len(self.activations[1])!=len(self.data.val_ds):
146 predict_to_bcolz(m, self.data.val_dl, val_act)

/usr/local/lib/python3.6/dist-packages/fastai/model.py in predict_to_bcolz(m, gen, arr, workers)
12 m.eval()
13 for x,*_ in tqdm(gen):
---> 14 y = to_np(m(VV(x)).data)
15 with lock:
16 arr.append(y)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
323 for hook in self._forward_pre_hooks.values():
324 hook(self, input)
---> 325 result = self.forward(*input, **kwargs)
326 for hook in self._forward_hooks.values():
327 hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
65 def forward(self, input):
66 for module in self._modules.values():
---> 67 input = module(input)
68 return input
69

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
323 for hook in self._forward_pre_hooks.values():
324 hook(self, input)
---> 325 result = self.forward(*input, **kwargs)
326 for hook in self._forward_hooks.values():
327 hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in forward(self, input)
275 def forward(self, input):
276 return F.conv2d(input, self.weight, self.bias, self.stride,
---> 277 self.padding, self.dilation, self.groups)
278
279

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in conv2d(input, weight, bias, stride, padding, dilation, groups)
88 _pair(0), groups, torch.backends.cudnn.benchmark,
89 torch.backends.cudnn.deterministic, torch.backends.cudnn.enabled)
---> 90 return f(input, weight, bias)
91
92

RuntimeError: Input type (CUDAFloatTensor) and weight type (CPUFloatTensor) should be the same

The error is very clear…

You have actually ended up with tensors on both the GPU and the CPU…

We can’t have them split like that… (either use CUDA for both, or CPU for both)…
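
For what it’s worth, here is a minimal illustration of that kind of mismatch (a sketch in current PyTorch style, assuming a CUDA runtime; the exact wording of the error differs between PyTorch versions):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)        # the weights live on the CPU
x = torch.randn(1, 4).cuda()   # the input lives on the GPU

# this forward pass raises a device-mismatch RuntimeError, the same class of
# error as above; calling model.cuda() first makes it work
model(x)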

OK, but I haven’t made any changes to the code in the lesson 1 notebook (only reduced sz and bs). This problem doesn’t happen when I run fast.ai’s code base. It does happen when I run jfpettit’s fork with the two to_gpu() calls removed. Perhaps those removed to_gpu() calls are actually the culprit, but according to jfpettit he had to remove them to make it run on Colab.
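
For context, this is roughly what the removed helper does (a sketch from memory, not the exact fastai source): it moves the model to the GPU only when CUDA is available, which appears to be exactly the step my failing run is missing.

import torch

def to_gpu(module):
    # move the model's weights to the GPU when one is available; if this step
    # is skipped, the pretrained resnet34 weights stay on the CPU while the
    # input batches are CUDA tensors, producing the error above
    return module.cuda() if torch.cuda.is_available() else module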

I used the lesson 1 notebook from https://towardsdatascience.com/fast-ai-lesson-1-on-google-colab-free-gpu-d2af89f53604 which had tweaks in the setup to make it run on Colab, but otherwise I believe it’s identical to the unaltered lesson 1 notebook.

All I am trying to do is complete a run of the unmodified notebook from lesson 1.

I have been working on coding a memory footprint debug function so that it’ll be easier to get reasonable sample/batch sizes to match the given hardware limitations.

General RAM was easy, but GPU RAM seems to be an issue.

On Google Colab I found /opt/bin/nvidia-smi (needed by GPUtil), but the GPU always seems to start out with 95% of its memory used. Any idea why? Here is a quickly made memory debug function:

# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: Colab currently exposes only one GPU, and even that isn't guaranteed
gpu = GPUs[0]

def printm():
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " I Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

So when run before anything else in the notebook I get:

Gen RAM Free: 11.6 GB  | Proc size: 666.0 MB
GPU RAM Free: 566MB | Used: 10873MB | Util  95% | Total 11439MB

I think restarting the runtime gets you pretty much the same instance; only once in many re-connects did I see GPU memory utilization of 0% at the beginning of the run. Most of the time I get 95% used before anything has happened.

Either this utility doesn’t report the right numbers, or I got tied to a GPU instance that doesn’t clear its memory, or Google found a way to share one GPU between multiple users, giving only 5% to each user - in which case only 566MB of GPU RAM is practically available for the runs?
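
One way to sanity-check GPUtil’s numbers is to ask nvidia-smi directly (the --query-gpu fields below are standard nvidia-smi options; the symlink from the cell above needs to be in place):

# ask the driver itself how much memory is total/used/free on the card
!nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv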

Just before the notebook crashes while running:

learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

I get:

lr=np.array([1e-4,1e-3,1e-2])
printm()

Gen RAM Free: 7.3 GB  | Proc size: 5.2 GB
GPU RAM Free: 168MB | Used: 11271MB | Util:  99% | Total: 11439MB

So clearly there is almost no GPU memory left, and so it crashes with a CUDA out-of-memory error.

At the very least now I can see numerically why the notebook can’t complete the run.
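
One thing that might be worth trying between sections when memory is this tight - hedged, since it only returns PyTorch’s own cached blocks and can’t reclaim whatever the rest of the shared GPU is holding (and it assumes a PyTorch version that has empty_cache()):

import gc
import torch

del learn                  # drop the learner and its precomputed activations
gc.collect()               # let Python reclaim the host-side objects
torch.cuda.empty_cache()   # hand PyTorch's cached GPU blocks back to the driver
printm()                   # re-check with the debug function above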

Also, I discovered that if that output is correct, the GPU we get has only ~11GB, not the 24GB the Tesla K80 spec lists - presumably because the K80 is really two 12GB chips and Colab exposes only one of them.

Do you also see 95% GPU memory utilization when you run this little memory dump function on Google Colab? In that case we get about 13GB of general RAM and 566MB of GPU RAM for each instance.

5 Likes

I have exactly the same issue, and it seems that many are getting this: https://stackoverflow.com/questions/48750199/google-colaboratory-misleading-information-about-its-gpu-only-5-ram-available

Did you make any progress with solving this?

Well, that’s my question on Stack Overflow.

No, there is no solving it if Google has set things up to be that way.

Through many attempts, only twice was I granted access to the whole GPU, late in the evening hours.

The bottom line: as of this writing, Google Colab is not suitable for the fast.ai lessons, unless a lesson has only a tiny need for GPU RAM.

Let’s hope this will change in the future. Until then it’s either your home hardware or the paid-by-the-hour services.

2 Likes

Didn’t notice it was your post. Thank you for the response. Let’s hope the situation with Google Colab changes in the future.

Hi Stas,
I have a question: is the limited GPU memory the reason for your initial error “RuntimeError: Input type (CUDAFloatTensor) and weight type (CPUFloatTensor) should be the same”?

Actually, when running your memory debug function, it looks like I have 11 GB free, so I guess the answer is no :sweat_smile: since I am still getting the same error!

I was just able to bypass the problem!

Installing the fastai library from https://github.com/fastai/fastai (not https://github.com/jfpettit/fastai) solves it.
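
In case it helps anyone else, the cell I used mirrors the earlier uninstall/install step, just pointed at the upstream repo:

!pip uninstall -y fastai
!pip install git+https://github.com/fastai/fastai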

This article lists the steps for setting up the fast.ai course on Colab.

It automates the environment setup and the dataset and code download on Colab.

Disclaimer: I am one of the creators of Clouderizer.

2 Likes

@prakashgupta Thank you for the setup option. However, I seem to have some problems with the download of the fastai git repository. It worked for me once and all files were downloaded. Unfortunately, I messed something up, wanted to restart from the beginning, and deleted the Clouderizer project and the Colab notebook. I created both again, but now not all the files from the fastai git repository are copied; all the lesson Jupyter notebooks are missing. Do you have any idea what the problem is?

UPDATE: After I restarted Google Colab with !kill -9 -1, it seems to work again. Otherwise, as a workaround, it should probably also work to clone the fastai repository manually and swap it in for the incomplete one. Anyway… this way everything seems to work now. Just one minor thing: the widgets don’t seem to work inside the Jupyter notebook when I start it via Clouderizer.

1 Like

@prakashgupta, can you please start a separate thread for your project? Otherwise you are hijacking this thread, which has only partial relevance to what you want to discuss/contribute.

The intention of this thread is for us to figure out how to complete the fast.ai lessons, and the problem at the moment is the lack of hardware resources (GPU) provided by Colab, not the lessons’ setup. I had no problems with the direct setup, as explained in the original post.

Thank you.

P.S. I edited the name of the thread to make it more specific about what it covers.

1 Like

@stas I am sorry if my post came across as out of context; that was never the intent.
Using the steps mentioned in the article I posted, one can complete fast.ai lesson 1 on Google Colab without any error. I have personally verified it 3-4 times on different Colab instances. The Clouderizer setup allows us to run the fast.ai lessons in a Jupyter notebook (rather than a Colab notebook), and hence we can use the original fast.ai GitHub notebooks without any changes.

I am not sure the Tesla K80 GPUs available in Colab are limited. I took a dump of nvidia-smi from a Colab instance’s Jupyter terminal and compared it with the GPU usage of the same fast.ai project running on an AWS p2.xlarge spot instance. GPU memory consumption and utilization were pretty similar between the two, and training speed on both machines was very close as well.

1 Like

Glad to hear it worked for you. We can discuss the widget issue in another thread.

updates:

  1. Tried Clouderizer - unfortunately no magic happened - the 566MB limit is still the same problem, failing with a CUDA memory error, and the notebook still needs tweaking to reduce sizes. Thanks to Prakash’s input on Medium, we now know that some people get access to 11GB of GPU RAM (he does in India), others only 0.5GB (I do in Canada).

  2. I did succeed in running the whole of lesson 1 on 566MB by splitting lesson1 into two notebooks. Both have the same setup parts, but the second notebook’s code (after copying the setup) starts with Improving our model.

I used my GPU memory diagnostics function to find a good breaking point.

And I had to use a very small sz=28 (the relevant cell is sketched below). Such tiny images clearly won’t give good learning results, but you can still follow the lesson. I might play some more to find a larger sz that still fits into the limited memory. Part 2’s run definitely has no memory to spare; part 1 has some extra.
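
For reference, the cell in my split notebooks differs from the stock lesson only in the size settings (a sketch of the same call quoted earlier in this thread, with the reduced values):

arch = resnet34
sz = 28    # tiny images so the activations fit into the ~566MB of GPU RAM
bs = 16    # reduced batch size for the same reason
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)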

I placed the working notebooks here if you want to use those:

4 Likes

I met the same issue, and I found that my problem was caused by the code below:

# calling is_gpu_available() initializes TensorFlow's GPU runtime, which is
# what appears to grab nearly all of the GPU memory here
from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf()==True:
  device='/gpu:0'
else:
  device='/cpu:0'

I used your code to check the GPU memory usage and found that the usage is 0% before running the code above, and it becomes 95% after.

It’s not fastai related, but I think my finding may help you find the root cause of your problem: maybe some seemingly innocent code is consuming most of the GPU memory, leaving the training code unable to run fully.
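
If you genuinely need TensorFlow in the same runtime, TF 1.x can at least be told not to reserve most of the card up front (a hedged sketch of the usual allow_growth setting, not something I have verified fixes this particular notebook):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory on demand
sess = tf.Session(config=config)         # instead of reserving most of the card at once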

For my purposes, I simplified your code and got my problem solved. Thanks for the inspiration.

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
try:
  # query the GPU through GPUtil rather than TensorFlow, so that nothing
  # reserves GPU memory just to check whether a GPU exists
  import GPUtil as GPU
  GPUs = GPU.getGPUs()
  device = '/gpu:0' if GPUs else '/cpu:0'
except:
  device='/cpu:0'

print(device)