Issues with deployment on Google App Engine example

#3

Have you managed to deploy an app on Google Cloud? My deployment crashes at:

Updating service [default] (this may take several minutes)...failed.

When I looked at the logs, I found that:

_check_disk_space.sh: Free disk space (792 MB) is lower than threshold value 954 MB. Reporting instance permanently unhealthy. Note: different instances will have a different threshold value.

0 Likes

(Arkar Aung) #4

Hey @endrju, I was able to deploy on Google App Engine successfully. From the looks of your error, I'd guess it has something to do with the App Engine instance itself.

I had one hiccup when deploying on Google App Engine: I re-ran gcloud app deploy a second time and it worked. You could also try deleting the app from App Engine and deploying again.

0 Likes

(Akshai Rajendran) #5

Hey @ark_aung, I'm currently trying to deploy on Google App Engine and having difficulty installing fastai due to what appears to be a memory error. I'm assuming your deployment involved installing the fastai module, right? If so, how were you able to do that?

0 Likes

(Arkar Aung) #6

Hey @arajendran,

Do you mind attaching the memory error message that you are getting?

0 Likes

(Akshai Rajendran) #7

@ark_aung Thanks for the reply. When building with pip I just see "Killed" during the download/build of torch. When deploying I see "OSError: cannot allocate memory" in the error log. I've switched from the Standard environment to Flexible and that seems to have solved the memory issue, though I'm still working through a few other errors. Since the runtime is custom, however, I'm currently trying to get logging to work.

0 Likes

(Arkar Aung) #8

When you created your instance, did you make sure you chose the right image?

Here, IMAGE_FAMILY makes sure the instance image includes everything we need for fastai and PyTorch.

export IMAGE_FAMILY="pytorch-latest-gpu" # or "pytorch-latest-cpu" for non-GPU instances
export ZONE="us-west2-b" # budget: "us-west1-b"
export INSTANCE_NAME="my-fastai-instance"
export INSTANCE_TYPE="n1-highmem-8" # budget: "n1-highmem-4"

# budget: 'type=nvidia-tesla-k80,count=1'
gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-p4,count=1" \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=200GB \
        --metadata="install-nvidia-driver=True" \
        --preemptible

Ref: https://course.fast.ai/start_gcp.html

0 Likes

(Akshai Rajendran) #9

Well, I'm deploying this on Google App Engine, which is separate from the Google Compute Engine instance where I actually train the models, if I understand it correctly. I was able to get it to work after making some changes to the code provided in the deployment guide repo. I'm deploying via Flask, which is the main reason I didn't just follow the guide straight through. For future reference, if it helps anyone, the primary change was switching the Dockerfile to the following (the server is run from main.py, located in the app sub-folder):

FROM python:3.6-slim-stretch

RUN apt update

RUN apt install -y python3-dev gcc

ADD requirements.txt requirements.txt

RUN pip install -r requirements.txt

COPY app app/

EXPOSE 8080

CMD ["gunicorn", "-b", ":8080", "--chdir", "app/", "main:app"]
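
If it helps anyone following along: a Dockerfile alone isn't enough for App Engine; the app.yaml next to it has to declare the Flexible custom runtime. A minimal sketch (runtime and env are the standard GAE Flexible fields; the resource numbers are guesses to tune, not from the original post):

```yaml
# app.yaml for a custom-Dockerfile deployment on App Engine Flexible
runtime: custom
env: flex

# fastai + torch are memory- and disk-hungry; starting-point values only
resources:
  cpu: 2
  memory_gb: 4
  disk_size_gb: 20
```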
1 Like

(Joaquin Maya) #10

Hi, I’m trying to deploy an app using google app engine and I am having this error

0 Likes

(David) #11

Did you ever get it to work?

0 Likes

(Derek) #12

Hey - where did you actually make this change in the code? I tried to change the response from learn.predict to a string but it still doesn't seem to be working; maybe I'm doing something incorrectly.

0 Likes

(Arkar Aung) #13

This is the change that I made: JSONResponse({'result': str(learn.predict(img)[0])}). Perhaps you did this and things are still not working. Show me your code and your error.
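
The reason the str(...) wrapper (or the .obj attribute) is needed: JSONResponse serializes its payload with the json module, which cannot handle the Category-like object that learn.predict returns. A stdlib-only sketch, using a hypothetical stand-in for that return value:

```python
import json

# Hypothetical stand-in for fastai's prediction result: a Category-like
# object that is not JSON-serializable by itself.
class Category:
    def __init__(self, obj):
        self.obj = obj

    def __str__(self):
        return str(self.obj)

pred = Category("teddy")  # learn.predict(img)[0] returns something like this

# Serializing the raw object fails...
try:
    json.dumps({"result": pred})
except TypeError:
    print("raw Category object is not JSON serializable")

# ...while str(pred) (or pred.obj, when it is a plain string) works:
print(json.dumps({"result": str(pred)}))  # {"result": "teddy"}
```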

0 Likes

(Derek) #14

That’s exactly what I did, but I just realized I didn’t restart the server when testing (womp). Just restarted the server and it’s all working now, thanks for offering your help though.

0 Likes

(Amit) #15

Hi,

Kind of late to this, but my issue is pretty much the same as the one mentioned here. So far I have:

a) changed tfms to ds_tfms
b) changed learner to cnn_learner
c) updated the return statement in analyze to return JSONResponse({'result': learn.predict(img)[0].obj})

Now the error is:

Traceback (most recent call last):
File "app/server.py", line 36, in <module>
learn = loop.run_until_complete(asyncio.gather(*tasks))[0]
File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete

I have shut down and restored the GAE instance before the last change, wiping out the git directory and re-cloning after every update.

I am now going to try the Dockerfile change suggested above.

Kindly help.

0 Likes

(Amit) #16

Made a bit of progress.

No more build errors.

Now it's complaining about a timeout:

Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
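
For reference, the setting the error message points at lives under readiness_check in app.yaml (Flexible environment). A sketch using the documented field names; the values here are guesses to tune:

```yaml
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 600  # default is 300 seconds
```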

I added it with a value of 600 but still get the same error.

Will keep checking…

Any ideas?

0 Likes

(Amit) #17

Hi,

Still no joy; I tried with a clean project, enabled billing, and ran the steps again.

The error is still the same timeout:

Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

Please help.

0 Likes

Problem deploying app using gcloud app engine
(Yev) #20

Using this repo:

 https://github.com/pankymathur/google-app-engine

and changing server.py inputs:
a) changed tfms to ds_tfms
b) updated the return statement in analyze to return JSONResponse({'result': learn.predict(img)[0].obj})
c) put in my own Dropbox link for the model
d) changed the prediction classes to my model's prediction classes

I then figured out that my "size mismatch for 0.4.0.conv1.weight" error was due to loading resnet34 instead of resnet50 (which I trained), so I changed the model download from resnet34 to resnet50, but now I'm facing another issue:

Step 7/9 : RUN python app/server.py
---> Running in e16ff31a58db
Traceback (most recent call last):
File "app/server.py", line 9, in <module>
from fastai.vision import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/__init__.py", line 5, in <module>
from .data import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 4, in <module>
from .transform import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/transform.py", line 233, in <module>
_solve_func = getattr(torch, 'solve', torch.gesv)
AttributeError: module 'torch' has no attribute 'gesv'
The command '/bin/sh -c python app/server.py' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1

Then I changed the Dockerfile per arajendran's post to:

FROM python:3.6-slim-stretch
RUN apt update
RUN apt install -y python3-dev gcc
ADD requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY app app/
EXPOSE 8080
CMD ["gunicorn", "-b", ":8080", "--chdir", "app/", "main:app"]

but the app fails to deploy, due to error:

ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

I looked into this error:

  1. https://stackoverflow.com/questions/46127236/are-updated-health-checks-causing-app-engine-deployment-to-fail

  2. https://issuetracker.google.com/issues/65500706

So I am not sure if the app actually deploys with the Docker code specified above.
I think we have to add this line to the Dockerfile:
RUN python app/server.py

But when we do, we get the following error, which I think is the main error now:
Step 7/9 : RUN python app/server.py
---> Running in a9d5c8b2540e
Traceback (most recent call last):
File "app/server.py", line 8, in <module>
from fastai.vision import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/__init__.py", line 5, in <module>
from .data import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/data.py", line 4, in <module>
from .transform import *
File "/usr/local/lib/python3.6/site-packages/fastai/vision/transform.py", line 233, in <module>
_solve_func = getattr(torch, 'solve', torch.gesv)
AttributeError: module 'torch' has no attribute 'gesv'
The command '/bin/sh -c python app/server.py' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1

1 Like

(Arkar Aung) #21

Your problem is at this line:
_solve_func = getattr(torch, 'solve', torch.gesv)

torch.gesv was deprecated and has since been removed. Try changing it to torch.solve.

From: https://github.com/pytorch/pytorch/releases

Removed              Use Instead
btrifact             lu
btrifact_with_info   lu with get_infos=True
btrisolve            lu_solve
btriunpack           lu_unpack
gesv                 solve
pstrf                cholesky
potrf                cholesky
potri                cholesky_inverse
potrs                cholesky_solve
trtrs                triangular_solve
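
Incidentally, the reason that line crashes even on a torch build that does have solve: the default argument of getattr is evaluated eagerly, so torch.gesv is looked up regardless. A stdlib-only sketch, with a hypothetical stand-in for the torch module:

```python
import types

# Hypothetical stand-in for torch >= 1.2: has solve, no longer has gesv.
fake_torch = types.SimpleNamespace(solve=lambda A, B: "solved")

# The fastai 1.0.55 line fails, because the *default* argument
# (fake_torch.gesv) is evaluated before getattr ever runs.
try:
    _solve_func = getattr(fake_torch, "solve", fake_torch.gesv)
except AttributeError:
    print("AttributeError: no attribute 'gesv'")

# A lazy fallback only touches gesv when solve is missing:
_solve_func = getattr(fake_torch, "solve", None) or getattr(fake_torch, "gesv", None)
print(_solve_func is fake_torch.solve)  # True
```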
2 Likes

(Andres) #22

Thanks, I also had the same problem.
If I manually make the change, the error disappears.
However, I'm using fastai from Anaconda, and the latest version there is 1.0.55, which does not have the change in fastai/vision/transform.py (1.0.56 does).
Not sure if 1.0.56 is available on App Engine (I didn't try yet).

1 Like

(Yev) #23

First of all thank you for replying!
So are you saying that I should go to the source code and change this line:
_solve_func = getattr(torch, 'solve', torch.gesv)

manually? I think I would face the same issue Andres is facing; see his answer below mine.

I tried another path but faced a further error along the way; maybe you can help.
So I went to my VM's JupyterHub and checked the installed packages with !pip list.
Then I transformed my requirements.txt file to mimic what !pip list returned,
so I know I have a perfect match between what I am deploying and what I trained on.

So I ended up with the following requirements file:

'''
numpy==1.16.2
torchvision==0.3.0
https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
fastai==1.0.55
starlette==0.11.4
uvicorn==0.3.32
python-multipart
aiofiles==0.4.0
aiohttp==3.5.4
'''

And here is my current Dockerfile:

'''
FROM python:3.6-slim-stretch

RUN apt update

RUN apt install -y python3-dev gcc

ADD requirements.txt requirements.txt

RUN pip install -r requirements.txt

COPY app app/

RUN python app/server.py

EXPOSE 8080

CMD ["python", "app/server.py", "serve"]

#CMD ["gunicorn", "-b", ":8080", "--chdir", "app/", "main:app"]

#RUN python app/server.py
#gcloud app update --no-split-health-check

'''

However, when I tried to launch I got the following error:

'''
Step 5/9 : RUN pip install -r requirements.txt
---> Running in 0132b685c40c
ERROR: torch-1.1.0-cp37-cp37m-linux_x86_64.whl is not a supported wheel on this platform.
WARNING: You are using pip version 19.2.1, however version 19.2.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
'''

That was due to

https://download.pytorch.org/whl/cpu/torch-1.1.0-cp37-cp37m-linux_x86_64.whl

being a wheel built for Python 3.7 (cp37), while the Dockerfile specifies Python 3.6.

So I changed it to:
https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-linux_x86_64.whl

And now I'm facing this error:

from torchvision import _C
ImportError: libcudart.so.9.0: cannot open shared object file: No such file or directory
The command '/bin/sh -c python app/server.py' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: exit status 1
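
For what it's worth, that libcudart error usually means the torchvision==0.3.0 pin resolved to the default CUDA-linked wheel, while the torch wheel above is CPU-only. A sketch of a matching CPU-only pair for requirements.txt; the second URL is an assumption following pytorch.org's CPU wheel naming scheme, so verify it exists before relying on it:

```
https://download.pytorch.org/whl/cpu/torch-1.1.0-cp36-cp36m-linux_x86_64.whl
https://download.pytorch.org/whl/cpu/torchvision-0.3.0-cp36-cp36m-linux_x86_64.whl
```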

Looked into it, and it has to do with CUDA versioning…
Looks like this will never end.
I am giving up for now.

PS.
The app did deploy on Render, so that's an easier way out.

Here is my repo if it helps anyone here:

0 Likes

(Andres) #24

The timeout error for me is solved by increasing the requested disk space in the app.yaml file, like this:

resources:
    disk_size_gb: 12

12 GB seems to be enough for me.

0 Likes