GCP model training

For the last 2 days I haven't been able to run a single epoch on GCP; the time per epoch has suddenly doubled and keeps increasing. Has anyone else observed the same?

This seems to be an ongoing issue for many people. I've been completely unable to provision a machine with a GPU since Friday. Good thing I spent some time on Thursday migrating all my work from Paperspace.

I'm able to get an instance, but no luck training any epochs.

On Sat & Sun, I was unable to either restart my existing GCP GPU instance or start up a new one. I tried in US West & Australia.

Apparently this was a worldwide problem for GCP. They reported it as a Kubernetes problem, but according to a thread on Hacker News, other services were also having problems.

Yesterday I was able to restart my instance & start up a new one, so I’m guessing it’s all working now?

That’s rough about the timing after your Thursday migration, @timbo72.

All looks to be fine now. I've finally been able to get epochs going.

Bah! I had it going for an hour or so at lunch today, but now I'm down again.
I started a VM in Singapore (asia-southeast1-c) with a backup in LA (us-west2-b), and I'm being very careful to push everything to GitHub so I can fall back to Paperspace.
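In case it helps anyone automating that kind of zone failover, here's a minimal sketch using the google-api-python-client library. It assumes a stopped instance with the same name already exists in both zones, that application default credentials are set up (e.g. via `gcloud auth application-default login`), and the project ID and instance name below are placeholders you'd swap for your own; the zones are just the two from my post.

```python
from googleapiclient import discovery

PROJECT = "my-project"    # placeholder: your GCP project ID
INSTANCE = "fastai-gpu"   # placeholder: instance name (same name in both zones)
ZONES = ["asia-southeast1-c", "us-west2-b"]  # primary (Singapore), backup (LA)

# Uses application default credentials to talk to the Compute Engine API.
compute = discovery.build("compute", "v1")

def try_start(zone: str) -> bool:
    """Try to start the stopped instance in the given zone; return True on success."""
    try:
        compute.instances().start(
            project=PROJECT, zone=zone, instance=INSTANCE
        ).execute()
        print(f"Started {INSTANCE} in {zone}")
        return True
    except Exception as exc:  # e.g. no GPU capacity available in that zone
        print(f"{zone} failed: {exc}")
        return False

for zone in ZONES:
    if try_start(zone):
        break
else:
    print("No capacity in either zone; time to fall back to Paperspace.")
```

The start call only kicks off the operation, so the instance may take another minute or so to actually boot before you can SSH in.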