Most likely the dataset is already downloaded in the video, and Jeremy is probably running it on a private server with a much faster GPU, CPU, RAM, and storage, all of which can make a difference (the GPU being the biggest).
My 1080ti was much faster than a K80. I can't remember exactly by how much, but I believe it was more than 2x faster, so I would expect your 1070ti to be faster as well.
Playing with batch sizes can also make a huge difference. I did some benchmarking on the NLP model here, both on Colab and on my personal machine with a 3090.
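If anyone wants to reproduce that kind of batch-size benchmark, a minimal timing harness looks something like this. It's only a sketch: `fake_step` and its timings are made-up stand-ins for a real training step, just to show why larger batches amortize the fixed per-batch overhead.

```python
import time

def time_epoch(step_fn, n_samples, bs):
    """Time one pass over n_samples processed in batches of size bs."""
    t0 = time.perf_counter()
    for start in range(0, n_samples, bs):
        step_fn(min(bs, n_samples - start))  # last batch may be smaller
    return time.perf_counter() - t0

def fake_step(n):
    # Toy workload: a fixed per-batch overhead (kernel launches, data
    # loading, etc.) plus per-sample compute. Numbers are arbitrary.
    time.sleep(0.0001 + n * 0.000001)

for bs in (16, 64, 256):
    print(f"bs={bs}: {time_epoch(fake_step, 2048, bs):.3f}s")
```

With a real model you'd swap `fake_step` for an actual forward/backward pass; the pattern of larger batches finishing an epoch faster (until you run out of GPU memory) is the same.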
I also ran a benchmark of the 01 intro notebook on my local machine for reference. I'm not sure how long the dataset downloads took, but you can see the training times; it was definitely several minutes including the download time. On Colab you have to re-download the datasets and models each time your notebook instance is released, which makes it take longer than on a dedicated machine with all of that pre-downloaded. In my case I did not have the models or datasets pre-downloaded on my local machine for several of the models. If you're using a dedicated instance on AWS that you turn on and off each time you use it, you should not have to re-download the models and datasets, which saves time.
Actually, I may have been recalling something I looked at a while back, so maybe the 1070ti is slightly faster for the first notebook (not counting download times). I also didn't see the sentiment step in your PDF; that is what seems to take 12 minutes per epoch on the 1070ti. Everything else in that notebook is pretty fast. I am using the fastai container, so maybe my notebooks are not the most recent.
EDIT: Actually, I do see the text data loader step in the PDF (step 15). In your output it takes 1:18 min per epoch; on my 1070ti (bs=24) it takes about 12 minutes per epoch. So I'm guessing the PDF you posted contains results from your 3090 setup? That's impressive. Maybe I'll be able to afford a 3090 once the 40 series comes out. OTOH, I'll probably need a whole new box, because the Dell T3600 just doesn't have the juice to run a 3090 tbh.
Yep. I built it in, I think, 2017 and upgraded from 1080ti's to a 3090 last year. The power consumption on the 3090 is ~420W, which is pretty crazy. I bought a 1000W supply when I built the machine; it drives one card just fine, but probably not two.
Going to a card with Tensor Cores and switching to fp16 makes a huge difference. I was close to maxing out GPU RAM with a batch size of 384 and 28s epochs using fp16, versus a max batch size of ~176 and 60s epochs using standard fp32 on the same card. I attached a PDF to my post on the other thread with a bunch of different tests. It looks like the fp16 call got cut off at the end of the learner line, but you can still see the dot at the end of the learner-creation line to tell where I applied it and where I didn't.
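As a toy illustration of why halving precision roughly doubles the batch size that fits: if activation memory dominates and scales linearly with batch size (a big simplification), then storing values in 2 bytes instead of 4 about doubles the max batch. All the numbers below are hypothetical, not measurements from any real model.

```python
# Back-of-envelope model: max batch size is GPU memory divided by
# per-sample activation memory. Assumes activations dominate and scale
# linearly with batch size (ignores weights, optimizer state, overhead).
BYTES_FP32 = 4
BYTES_FP16 = 2

def max_batch(gpu_bytes, per_sample_floats, bytes_per_float):
    """Largest batch whose activations fit in gpu_bytes (toy model)."""
    return gpu_bytes // (per_sample_floats * bytes_per_float)

# Hypothetical numbers: a 24 GB card, ~34M activation values per sample.
gpu = 24 * 1024**3
per_sample = 34_000_000

print(max_batch(gpu, per_sample, BYTES_FP32))  # 189 in fp32
print(max_batch(gpu, per_sample, BYTES_FP16))  # 378 in fp16
```

That roughly matches the ~176 vs 384 jump I saw, and the Tensor Cores explain the speedup on top of the memory savings.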
Yeah, and the 3090 Ti is pushing half a kilowatt for a 7-10% bump. Off topic, but I really wish Apple would assign a few devs full time to helping the PyTorch team port it to their M-series hardware. They'd probably see a significant bump in sales if PyTorch became usable on their silicon. The PyTorch team seems to be struggling with having to port it to the M-series chips.