GCP model training

shub.chat · November 12, 2018, 10:13am

For the last 2 days I have not been able to complete a single epoch on GCP. The time per epoch suddenly doubled and keeps increasing. Has anyone else observed the same?

timbo72 · November 12, 2018, 10:36am

This seems to be an ongoing issue for many people. I’ve been completely unable to provision a machine with a GPU since Friday. Good thing I spent some time on Thursday migrating all my work from Paperspace.

shub.chat · November 12, 2018, 10:38am

I am able to get an instance, but I have had no luck training even a single epoch.

jboy · November 13, 2018, 7:56am

On Sat & Sun, I was unable to either restart my existing GCP GPU instance or start up a new one. I tried in US West & Australia.

Apparently this was a worldwide problem for GCP. They reported it as a Kubernetes problem, but according to a thread on Hacker News, other services were also having problems.

Yesterday I was able to restart my instance & start up a new one, so I’m guessing it’s all working now?

That’s rough about the timing after your Thursday migration, @timbo72.

shub.chat · November 13, 2018, 8:13am

All looks fine now. I have finally been able to get epochs going.

timbo72 · November 13, 2018, 8:51am

Bah! I had it going for an hour or so at lunch today but now I’m down again.
I started a VM in Singapore (asia-southeast1-c) with a backup in LA (us-west2-b), and I’m being very careful to push everything to GitHub so I can fall back to Paperspace.

maral · November 13, 2018, 1:35pm