4 hours to fit vgg16?


(Prasad Chalasani) #1

I used batch_size=64 and path points to the FULL data-set (not sample).

I did
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size*2)
vgg.finetune(batches)
vgg.fit(batches, val_batches, nb_epoch=1)

and it took 4 hours to fit the model, reaching 97% accuracy, is this normal?

My nvidia-smi is below. This is on a p2.xlarge instance

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.55                 Driver Version: 367.55                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   60C    P0    62W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

#2

Something is wrong. Usually it takes around 500 seconds to fit an epoch.
It’s weird that you have 100% GPU util but none of your GPU RAM is being used.

Did you see any error or warning message while importing keras or theano?


(Prasad Chalasani) #3

I didn’t see any errors when importing those libs


(Prasad Chalasani) #4

If it helps, this is the output that was printed:

   32/23000 [..............................] - ETA: 12233s - loss: 2.5792 - acc: 0.3750
   64/23000 [..............................] - ETA: 12087s - loss: 2.2562 - acc: 0.4531
   96/23000 [..............................] - ETA: 12015s - loss: 1.8609 - acc: 0.5417
  128/23000 [..............................] - ETA: 11935s - loss: 1.6583 - acc: 0.5938
  160/23000 [..............................] - ETA: 11924s - loss: 1.4924 - acc: 0.6188
  192/23000 [..............................] - ETA: 11896s - loss: 1.2756 - acc: 0.6667
  224/23000 [..............................] - ETA: 11871s - loss: 1.1411 - acc: 0.6964
  256/23000 [..............................] - ETA: 11859s - loss: 1.0644 - acc: 0.7188
  288/23000 [..............................] - ETA: 11829s - loss: 0.9665 - acc: 0.7465
  320/23000 [..............................] - ETA: 11802s - loss: 0.8959 - acc: 0.7625
  352/23000 [..............................] - ETA: 11788s - loss: 0.8356 - acc: 0.7756
  384/23000 [..............................] - ETA: 11789s - loss: 0.7719 - acc: 0.7917
  416/23000 [..............................] - ETA: 11764s - loss: 0.7140 - acc: 0.8077
  ...
22848/23000 [============================>.] - ETA: 78s - loss: 0.1297 - acc: 0.9680
22880/23000 [============================>.] - ETA: 61s - loss: 0.1296 - acc: 0.9680
22912/23000 [============================>.] - ETA: 45s - loss: 0.1296 - acc: 0.9680
22944/23000 [============================>.] - ETA: 28s - loss: 0.1295 - acc: 0.9680
22976/23000 [============================>.] - ETA: 12s - loss: 0.1297 - acc: 0.9680
23000/23000 [==============================] - 12864s - loss: 0.1298 - acc: 0.9680 - val_loss: 0.0783 - val_acc: 0.9825
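For scale, that final line can be turned into a per-batch time and compared against the ~500 s/epoch figure mentioned above. A rough sanity check, assuming the 23000 samples and the batch size of 32 visible in the log:

```python
# Rough sanity check of the epoch timing reported in the log above.
samples = 23000
batch_size = 32          # visible from the 32/64/96 increments in the log
epoch_seconds = 12864    # total time reported for the epoch

batches_per_epoch = samples / batch_size           # ~719 batches
sec_per_batch = epoch_seconds / batches_per_epoch

# A healthy K80 run (~500 s/epoch, per the reply above) would instead be:
healthy_sec_per_batch = 500 / batches_per_epoch

print(round(sec_per_batch, 1))          # ~17.9 s per batch
print(round(healthy_sec_per_batch, 2))  # ~0.7 s per batch
```

That is roughly a 25x slowdown per batch, which is consistent with the model running on the CPU rather than the GPU.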

(Matthew Kleinsmith) #5

@pchalasani

I used batch_size=64


32/23000 […] - ETA: 12233s - loss: 2.5792 - acc: 0.3750
64/23000 […] - ETA: 12087s - loss: 2.2562 - acc: 0.4531

It looks like your batch_size is 32.
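The batch size can be read straight off the progress bar: Keras advances the sample counter by one batch per update, so the difference between consecutive counts is the effective batch size. A minimal sketch, using log lines copied from the output above:

```python
import re

# First few progress-bar lines from the training output above.
log_lines = [
    "   32/23000 [....] - ETA: 12233s - loss: 2.5792 - acc: 0.3750",
    "   64/23000 [....] - ETA: 12087s - loss: 2.2562 - acc: 0.4531",
    "   96/23000 [....] - ETA: 12015s - loss: 1.8609 - acc: 0.5417",
]

# The sample counter is the number before the '/'.
counts = [int(re.match(r"\s*(\d+)/", line).group(1)) for line in log_lines]

# Keras increments the counter by one batch per update.
batch_size = counts[1] - counts[0]
print(batch_size)  # 32, not the 64 that was requested
```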


(Prasad Chalasani) #6

True, but I don’t think that explains the 4-hour fitting time, does it?


(Matthew Kleinsmith) #7

I agree, but it might be worth seeing how much time it takes off.

If you haven’t already, try restarting the instance and training Vgg before doing other things with the instance.


(Angel) #8

You should see in the last line of the nvidia-smi output that a python process is running and using the GPU.
When you first import theano/keras, are you receiving a message about the use of the GPU? Something like:
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)
Using Theano backend.

Otherwise you are running on the CPU, hence the long training time.
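If no such banner appears, one common cause (assuming a standard Theano install, as on the course AMI) is that the device was never set, in which case Theano silently defaults to the CPU. The device can be forced in `~/.theanorc`:

```ini
[global]
device = gpu
floatX = float32
```

With this in place, importing theano should print the `Using gpu device 0: ...` banner shown above.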


(Nathaniel) #9

I got an error and the fitting was very slow like yours initially. My error was related to the NVCC compiler, and I fixed it by following a Stack Overflow answer:

The image I cloned from this course has cuda-8.0, so I added the following to ~/.theanorc:

[cuda]
root = /usr/local/cuda-8.0

The error seems fixed after this. When training epoch 1 (at xxx/23000), the ETA shown is about 700s-800s for the full 23000 samples.

Hope this helps.


(Matthew Kleinsmith) #10

Yeah, the “No running processes found” is strange if Prasad is currently training. I think the CPU might be training the model. For certain training tasks I’ve seen a CPU take 20 times longer than a GPU, and 500s x 20 == 10000s, which is close to 12000s.

Although the batch_size equaling 32 might mess with my numerology.


(Prasad Chalasani) #11

Thank you all for being so helpful!

Ok, I killed that instance and started a fresh instance (via DataBricks), but found the nvidia-smi output to be at 98% GPU utilization right away!

So I’m not sure what’s going on.

And of course I repeated the training (with batch_size = 64) and it’s still super-slow, unsurprising given the 98% GPU utilization while doing “nothing”. I wonder if some Spark process is soaking up the CPU.

I don’t think @Nathaniel’s fix would help; it doesn’t seem related. The root problem seems to be that the GPU is occupied to begin with.

A great lesson in just how much the GPU helps speed things up :slight_smile:


(Prasad Chalasani) #12

Well I finally decided to let go of my attachment to a certain way of doing things and do it exactly like @howard’s setup video says: basically run setup_p2.sh from my Mac to spin up the pre-configured AMI. It works like a charm (and Vgg16 trains in ~8 mins not 4 hours) except that I had to do a couple of things:

  • My AWS account already has a certain number of VPCs, and I had to request a limit increase via the AWS console page
  • Once I ssh’d into the instance, it didn’t seem to have the `unzip` command, so I tried `sudo apt install unzip` and got a strange error: `E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.` I ran the suggested command, which took about 5 minutes.
  • Then I did `sudo apt update` and `sudo apt install zip unzip` to get those to work
  • Then I did `git clone ...` to pull in the notebooks from the github repo
  • I downloaded the data using `wget http://www.platform.ai/files/dogscats.zip` and ran `unzip` to extract it
  • Ran `jupyter notebook` and went to the usual `...:8888` address, and it all works fine

(Jeremy Howard) #13

What dataset? Can you put your notebook in a gist for us to look at?


(Prasad Chalasani) #14

Not sure I understood what you meant @jeremy

The dataset is the full dogscats dataset (not sample)

Is that what you were asking about?


(Jeremy Howard) #15

Yes it is. We use a lot of different datasets in the course, so it’s important to mention which one you are having problems with.

4 hours is very slow. If you can put all your code into a github gist we could take a look.


(Prasad Chalasani) #16

I mentioned in my comment that I used this dataset:

http://www.platform.ai/files/dogscats.zip

And incidentally the 4 hours training time was on a cluster spun up via the DataBricks, and a notebook in that environment, so my issue may have been specific for that environment, and may be hard for others to replicate.

The code is identical to the one in lesson1.ipynb, I am not doing anything different.

In any case as I said in my latest comment, when I instead follow your setup and use Jupyter notebooks, everything works fine.


(Jeremy Howard) #17

Sorry somehow didn’t notice the rest of the thread - only saw the first post!


(Phong) #18

I get the same ~4 hours ETA for training. I am using Paperspace at $0.65/hour with a P5000 GPU, which I think is powerful enough. What could be causing the slowness? I am using the lesson 1 notebook to train dogs vs cats.