Platform: GCP ✅

Hi @jeremy
Are there any easy settings to use GCP straight from the browser, without installing the SDK?
I managed to update .jupyter/jupyter_notebook_config.py because when launching “jupyter notebook --ip=0.0.0.0 --port=5000 --no-browser” I got the error “set NotebookApp.allow_remote_access to disable the check”. I can now reach the file tree with the “tutorials” directory, but then I get a Server Error.
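For reference, the config line that error message points at (presumably what needed adding to .jupyter/jupyter_notebook_config.py):

c = get_config()  # the config object Jupyter provides when loading this file
c.NotebookApp.allow_remote_access = True  # the setting named in the error message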
Cheers


Hi All

I successfully got the fastai setup running on Google Cloud. The fastai notes on ‘Server Setup’ were really helpful in getting this up and running.
Now I am able to run lesson 1 on my GCP instance :).

I am seeing the following in my Google Cloud billing:

‘Storage PD Capacity’ is continuously increasing day by day. Even though the instance is stopped, the increase continues, and so does the billing.

I just ran lesson 1 on the instance and stopped it. Over the next 3 days, PD storage almost doubled, from 17 GB to 34 GB.

Can you please help me understand this better and fix it? How can I stop PD storage from increasing?
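In case it helps, the disks and snapshots billed to a project can be listed with the SDK (a sketch; names and sizes will differ):

gcloud compute disks list --format="table(name,sizeGb,type,users)"   # any unexpected second disk?
gcloud compute snapshots list   # snapshots are billed separately and can also grow over time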

Thanks
Gopi

I followed the instructions for the Google Cloud setup, and when I run exec -l $SHELL I get this error: bash: /anaconda2/etc/profile.d/conda.sh: No such file or directory

Does anyone know what went wrong?

FYI, I recently installed a new version of Anaconda, and when I run gcloud init I am able to choose configurations:

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
 [3] Switch to and re-initialize existing configuration: [my-project]
Please enter your numeric choice: 

Am I good to go, or should I resolve that exec -l $SHELL error before I move further?

Thanks in advance!
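In case someone else hits the conda.sh error above: it usually means ~/.bashrc still points at the old install path. A likely fix, assuming the new Anaconda landed in ~/anaconda3 (adjust the path to your actual install):

source ~/anaconda3/etc/profile.d/conda.sh   # load conda into the current shell
conda init bash                             # rewrite the stale block in ~/.bashrc
exec -l $SHELL                              # restart the shell; the error should be gone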


I have set up on GCP using the instructions at https://course.fast.ai/start_gcp.html.

I actually want to do the NLP course, so I also cloned the GitHub repo for that without any problem.

My question might seem silly, but how can I be sure I am using the GCP GPU rather than my local one?

I am using the Jupyter environment at localhost:8080/tree, and the data file paths used in the notebook are those on the GCP machine, but training is slow and seems about the same as on my local GPU.

How would I check which GPU is being used?

In case anyone else struggles with this, I was able to use the environment checker utils to check this out.
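For anyone without the fastai utils handy, a minimal check in a notebook cell does the same (any GPU the remote notebook server sees is the GCP one):

import torch
print(torch.cuda.is_available())      # True means a CUDA device is visible
print(torch.cuda.get_device_name(0))  # e.g. 'Tesla P4' rather than your local card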

Does anybody know how to stop the machine from within the Jupyter notebook, e.g. when the training cell above finishes running? See my full question here.
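(One approach that is sometimes suggested, assuming the notebook user has passwordless sudo, as the default jupyter user on the Deep Learning VM images does: make the last cell halt the OS. Halting the OS moves the instance to the stopped state, so compute billing ends, though disk billing continues.)

# final cell of the notebook -- only runs once the cells above have finished
!sudo shutdown -h now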

After several failures with Salamander, I am trying to set up GCP. I have followed the fastai instructions and have successfully increased my GPU quota (acknowledged by Google).

I am getting the following error and have no idea what to do. Can anyone help me out? I don’t really know much about bash, so I am just following the instructions blindly:

“Dark Caldron” is the name that was assigned by Google when I created my account. Should I try creating another?

Still working through setting up GCP.

Ran

gcloud compute ssh --zone us-west1-b jupyter@my-fastai-instance --ssh-flag="-L 8080:localhost:8080"

and got this error:

ERROR: (gcloud.compute.ssh) Could not fetch resource:

  • Insufficient Permission: Request had insufficient authentication scopes.
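(That scopes error usually means the local gcloud credentials are stale or were created without the needed scopes; re-authenticating is the usual first thing to try:)

gcloud auth login                        # re-run the browser auth flow
gcloud config set project MY-PROJECT-ID  # MY-PROJECT-ID is a placeholder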

From the terminal, run nvidia-smi
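To keep it refreshing while a cell trains (optional convenience):

watch -n 1 nvidia-smi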


I’m getting the following error when trying to establish the ssh tunnel with

gcloud compute ssh jupyter@fastai-server -- -L 8080:localhost:8080

bind: Permission denied
channel_setup_fwd_listener_tcpip: cannot listen to port: 8080
Could not request local forwarding.

I’ve also run netstat -lep --tcp to check that nothing else is listening on my local 8080, and nothing is.

This happened when I was restarting an instance and I also tried creating another one from scratch and got the same error. I think I’ve followed all the set up instructions correctly, but can anyone help me figure out where I’ve gone wrong?

Not sure if this is relevant, but when I go to localhost:8080/tree I get the following returned:

{
path: "$",
error: "resource does not exist",
code: "not-found"
}

SOLVED
Rookie error - leaving this here in case it’s useful to anyone else. I had a Docker container running that was listening on local 8080, which for some reason didn’t show up when I ran the netstat command. Once I stopped the container, I was able to establish the tunnel.
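For anyone debugging the same thing, two checks that would have caught it (assuming Docker; note that Docker’s iptables-based port forwarding doesn’t always show up as an ordinary listening socket, which is likely why netstat missed it):

sudo lsof -i :8080   # shows whatever currently holds the port
docker ps            # lists running containers and their port mappings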


Jupyter notebook really slow

I have set up my Google Cloud compute instance as per the instructions here:

https://course.fast.ai/start_gcp.html

However, the notebook is running really slowly compared to the times shown in the notebook. I am running lesson 3, and after updating the model to take the full 256x256 image size, it takes nearly 7 minutes to fit one epoch.

During training I ran nvidia-smi and it showed this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   85C    P0    59W /  75W |   7597MiB /  7611MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2951      C   /opt/anaconda3/bin/python                   7587MiB |
+-----------------------------------------------------------------------------+

I also ran import torch; print(torch.cuda.is_available()) and this returned True.

Why is this still so slow? Am I missing something, or is this just the setup on Google Cloud?
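One observation on the nvidia-smi output above: the card is a Tesla P4, which is considerably slower than the P100/V100 cards many published notebook timings come from, and its ~7.6 GB of memory is nearly full. A quick way to confirm what you actually got, plus a standard PyTorch flag that sometimes helps when every batch has the same image size (both are plain PyTorch, nothing fastai-specific):

import torch
print(torch.cuda.get_device_name(0))        # which card the instance really has
print(torch.cuda.get_device_properties(0))  # memory, compute capability, etc.

torch.backends.cudnn.benchmark = True  # let cuDNN autotune kernels for fixed-size inputs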

Hi everyone,

I’ve just begun this course today, and I’m having trouble setting up GCP… I followed the guide carefully. However, I still have two problems during step 3 (creating an instance):

  1. I have this error message: `ERROR: (gcloud.compute.instances.create) Could not fetch resource: Quota ‘GPUS_ALL_REGIONS’ exceeded. Limit: 0.0 globally.` I did increase the quota as advised (up to 32!), but this message is still displayed any time I try to create an instance (see the quota check at the end of this post).
  2. I have this warning message: `WARNING: Some requests generated warnings: Disk size: ‘200 GB’ is larger than image size: ‘30 GB’. You might need to resize the root repartition manually if the operating system does not support automatic resizing. See https://cloud.google.com/compute/docs/disks/add-persistent-disk#resize_pd for details.` However, when I replace 200GB with 30GB in the code suggested for step 3, I get another warning telling me it may be insufficient.

Does anyone have any idea? I’ve been trying since yesterday, and this is getting a bit annoying…

Thank you in advance!
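If it helps with the first error: a quota increase only counts if it was granted on the project gcloud is actually using. A way to check both (a sketch; output format may vary):

gcloud config get-value project                                     # the project you're creating instances in
gcloud compute project-info describe | grep -C 1 GPUS_ALL_REGIONS   # the limit should now be > 0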


I’m getting the same warning!

    Disk size: ‘200 GB’ is larger than image size: ‘30 GB’. You might need to resize the root repartition manually if the operating system does not support automatic resizing. See https://cloud.google.com/compute/docs/disks/add-persistent-disk#resize_pd for details.

Does anyone know what I should do? Thanks!


Hi,

I’m having issues connecting to the Jupyter notebook after successfully connecting to my GCP instance. I’m following the ‘Returning to GCP’ document at https://course.fast.ai/update_gcp.html. I ran gcloud compute ssh --zone=us-west1-b jupyter@my-fastai-instance-p100 -- -L 8080:localhost:8080. I was then able to successfully update the course repo and fastai library. But when I try to open http://localhost:8080/tree in a browser, I get Connection refused. I’m on a Mac. Could someone point out what I’m missing?
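A likely first check (an assumption: since the repo update worked, the SSH tunnel itself seems fine): confirm from inside the SSH session that a notebook server is actually listening on the instance’s port 8080:

jupyter notebook list                  # should list a server on port 8080
curl -sI localhost:8080 | head -n 1    # an HTTP status line means something is listening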

Hi. Like @Shubhajit, I’m running into SSH problems with GCP.
When I run the nn-vietnamese.ipynb notebook from Rachel’s NLP course on my GCP instance, everything goes fine until the training of the learner: after some time (different each time), the connection to my instance is broken by GCP (the instance keeps running) and I get the following error message in my Ubuntu terminal (I’m using Windows 10):

Connection reset by xx.xx.xxx.xxx port 22
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

I did a lot of web searching and tried the solutions from Troubleshooting SSH and this post on I’m Coder, but without success.

What’s more, I turned off the Preemptible option.

Any ideas? How can you train an NLP model (which takes time) if GCP drops the SSH connection?
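One mitigation worth trying first (an assumption, not a guaranteed fix): have the SSH client send keepalives so the tunnel isn’t dropped as idle while a long cell runs:

gcloud compute ssh --zone=$ZONE jupyter@$INSTANCE_NAME -- -L 8080:localhost:8080 -o ServerAliveInterval=30 -o ServerAliveCountMax=120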

@pierreguillou I found a temporary workaround here: changing the network (mobile hotspot -> public wifi) helped most of the time. I don’t know why, but it works.
At first I thought my ISP was blocking port 22, but I later discovered it wasn’t.
This is frustrating!
I would really appreciate it if someone from the GCP team would look at it.

Hi, I tried setting up GCP as per the tutorial (I used Google Cloud Shell instead of an Ubuntu terminal).

After I ran

gcloud compute ssh --zone=$ZONE jupyter@$INSTANCE_NAME -- -L 8080:localhost:8080

I got the following error:

ssh: connect to host port 22: Connection refused
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

How can I go about fixing this issue?
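One possible cause (an observation, not a certain diagnosis): the error reads connect to host port 22 with no host name in between, which is what appears when the shell variables are empty, and a fresh Cloud Shell session doesn’t keep exports from a previous one. Worth verifying first:

echo "ZONE=$ZONE INSTANCE_NAME=$INSTANCE_NAME"   # both should be non-empty
export ZONE=us-west1-b                           # example values, not necessarily yours
export INSTANCE_NAME=my-fastai-instance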


Hello @Shubhajit. Thank you for your answer, but in my case I use my home Internet connection (not a cellular or public wifi connection). What your experience suggests, though, is that the loss of the SSH connection comes from the ISP, not from GCP. If that is the case, it would mean there is no way to use a cloud GPU to train large DL models such as a language model, at least in the conventional way (i.e., from a home computer):

gcloud compute ssh --zone=ZONE jupyter@MY-FASTAI-INSTANCE -- -L 8080:localhost:8080

If true, it would be better to launch the connection to the GCP instance not from my home computer’s terminal but from an online terminal (to keep my ISP out of the path to my instance).

Does that idea make sense? Is it possible? Would Cloud Shell on GCP allow that?


The answer from Jeremy: launch your notebook in a tmux session on your online GPU platform.
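A minimal sketch of what that looks like, run inside the SSH session on the instance (session name is arbitrary):

tmux new -s train     # open a session that survives SSH drops
# launch the notebook or training script here, then detach with Ctrl-b d
tmux attach -t train  # reattach after reconnecting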

I’m not sure if this is the right place or if I should open a new thread. Anyway, the pricing section in

https://course.fast.ai/start_gcp.html

is either wrong or outdated (meaning that GCP became vastly more expensive, and the guide needs to be updated). As a matter of fact, the price of the standard compute option (80 hours of homework, plus 2 hours of working through each of the 7 lessons, plus 2 months of storage) is estimated to be:

  • Standard Compute + Storage: (80 + 2×7) × $0.38 + $9.6 × 2 = $54.92

Using the official GCP calculator we get instead about $56.45 per month (the calculator screenshot is omitted here). Since the course duration is 2 months (in the above scenario), the total cost will be $112.90. The main error in the https://course.fast.ai/start_gcp.html estimate is the cost of storage per month, which is $40.80, not $9.60.

This is the command I used to build the instance:

gcloud compute instances create $INSTANCE_NAME \
  --zone=$ZONE \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --accelerator="type=nvidia-tesla-p4,count=1" \
  --machine-type=$INSTANCE_TYPE \
  --boot-disk-size=200GB \
  --metadata="install-nvidia-driver=True" \
  --preemptible

and the pricing I got was consistent with the calculator’s estimate ($8.90 for just 3 days of storage).