Productionizing models thread

How do you handle “continuous learning” in production? E.g., calculating model drift and retraining the model. Is there any reference architecture you can suggest?

Thanks,
Hari

1 Like

How do you run them periodically?

Ideally, I’d like to have something that periodically pings my web application for new content to run multiple models against. If there is new content, it would grab all of it, run the model(s), and then return all the predicted labels back to the web application in a single call.

Also, I’m trying to reduce my AWS costs here. It’s not like this thing is going to be getting new content throughout the day. It may be only 5-20x a day, so I’d hate to have something running in AWS, consuming resources, when it doesn’t need to most of the time.

And also, how do you get things running in SageMaker to run models against a “batch” of data? Most of what I read just shows examples of single-item inference.

Thanks

Let me try to separate your post into distinct questions:

  1. How do you run them periodically?
  2. How do you get things running in SageMaker to run models against a “batch” of data?
  3. Reduce cost.

From the above, I’ll try to answer them one by one.

  1. We use a Python notebook running in Databricks, which has a scheduler feature. Compiled results are then sent to S3. Maybe you can try using cron/Luigi/Airflow?
  2. Technically we are still doing single-item inference. It’s just that from the Databricks side (as mentioned in #1), we process the entire set in one long run (as in, we make many single calls to the API). Since it’s run periodically, I call it batch. ¯\_(ツ)_/¯ I guess you’re perhaps actually referring to SageMaker batch transform (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html); see the sketch after this list. We haven’t tried that yet.
  3. If I’m really serious about saving cost, I’d probably just host my model on EC2/Beanstalk instead of SageMaker - the latter’s instances are usually double the price.
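To illustrate #2, here is a rough sketch of kicking off such a batch transform job with boto3; the model name, bucket paths, and instance type are all placeholders (and, again, we haven’t actually tried this ourselves):

    import boto3

    # Sketch only: assumes a SageMaker model named 'my-fastai-model' already exists
    sm = boto3.client('sagemaker')
    sm.create_transform_job(
        TransformJobName='nightly-batch-001',
        ModelName='my-fastai-model',
        TransformInput={
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/new-content/',
            }},
            'ContentType': 'application/json',
            'SplitType': 'Line',  # one JSON record per line
        },
        TransformOutput={'S3OutputPath': 's3://my-bucket/predictions/'},
        TransformResources={'InstanceType': 'ml.m5.large', 'InstanceCount': 1},
    )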

Yah, I’m actually looking at using ECS (Elastic Container Service) and EI (Elastic Inference).

ECS allows me to deploy via Docker images, in which I can run a cron job that periodically queries my web application for new content to do inference on.
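Roughly, the cron job inside the container would do something like this; the web-app endpoints and payload shape below are made up for illustration:

    import requests

    WEB_APP = 'https://example.com/api'  # placeholder base URL

    def predict(item):
        # Stand-in for your actual model call(s)
        raise NotImplementedError

    def poll_once():
        # Ask the web app for content that still needs labels
        items = requests.get(f'{WEB_APP}/unlabeled').json()  # endpoint is made up
        if not items:
            return
        results = [{'id': it['id'], 'label': predict(it)} for it in items]
        # Send all predicted labels back in a single call
        requests.post(f'{WEB_APP}/labels', json=results)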

EI, not yet available for PyTorch (as far as I can tell), basically allows you to attach a GPU to a cheaper EC2 instance running your Docker image, only when you need it.

Thanks again for the info … really helpful to see how other folks are utilizing AWS for production.

1 Like

Thanks for the information on EI. Never really explored that option. Would be great to explore once we have a requirement for batch inference on GPU. :+1:

Edit: Fixed, found the solution. The pickled learner was performing preprocessing (scaling by 8x) on the inputs when .predict was called, and I had to match that exactly when calling .trace() and when running .forward() on the new module. The nightly version of PyTorch indeed lets this work on DynamicUnet.
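For anyone hitting the same thing, a minimal sketch of what “matching the preprocessing” can look like: wrap the model so the traced graph applies the same scaling that .predict() did. The scale value of 8.0, and whether you multiply or divide, depend on your own transforms:

    import torch as pt

    class ScaledModel(pt.nn.Module):
        """Apply the learner's input scaling inside the traced graph."""
        def __init__(self, model, scale=8.0):  # 8.0 mirrors the "scaling by 8x" above
            super().__init__()
            self.model = model
            self.scale = scale

        def forward(self, x):
            return self.model(x * self.scale)

    # jit_model = pt.jit.trace(ScaledModel(learner.model), dummy_img)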

I’m trying to convert a DynamicUnet (https://docs.fast.ai/vision.models.unet.html#DynamicUnet) for production. I’ve tried using TorchScript tracing, but it ends up using over 40 GB of memory (when tracing is done on the CPU! The GPU just crashes), while normally predicting with it uses next to none. Has anyone encountered any fixes for this?

I’m running it like this:

    import os
    import torch as pt
    from fastai.vision import load_learner  # fastai v1

    # model_fn / output_model_fn are paths defined elsewhere in the script
    learner = load_learner(os.path.dirname(model_fn), os.path.basename(model_fn))
    dummy_img = pt.ones(1, 3, 3264, 2448).cuda()  # full-resolution dummy input
    jit_model = pt.jit.trace(learner.model, dummy_img)
    pt.jit.save(jit_model, output_model_fn)

Does anyone have any ideas on why DynamicUnet in particular is blowing up? Calling learner.predict is working fine and uses very little memory.

I’ve been using the fastai nightly, as the current 1.2.0 has issues with hooks or something (since solved), though you can also manually edit out one line to fix the hooks issue.

I’m wondering if the DynamicUnet naturally splits the input into tiles to run, and somehow torch.jit.trace isn’t doing the same.

More discussion I’ve found on this:



Using TorchScript crashed on both CPU and GPU? For me it works fine on GPU, but no matter how small I set my inputs for the unet, it still crashes on CPU.

Hi everyone,

I was telling a friend of mine, who is a great SW engineer, how easy it has become to create state-of-the-art models thanks to fastai, but how much I still struggle with productionizing models (especially those that are not classification). One thing leading to another, we decided to work together in our spare time to help make fastai model deployment easier.

In order to focus on the right issues, we wanted to start with a poll. What are your main bottlenecks today when deploying your models?

  • It takes me too long
  • I struggle with models that are not classification (e.g. img2img)
  • Deploy models that run locally on mobile devices
  • Deploy/use your model as a web API
  • Managing the model once it is deployed (getting analytics, alerts, cost control, etc…)
  • Others (please explain what)


1 Like

I’ve been playing with the fast.ai libraries and SageMaker (AWS) for the past few months, and found that I was quickly able to build a collaborative learner model able to make meaningful recommendations to our clients. In order to bring this to production at minimal cost, I was imagining deploying the model to Lambda, where it could provide the predictions, while maintaining the model weekly via SageMaker.

But now I wonder:

  1. Is there guidance, or would someone be willing to provide guidance, for migrating a collaborative learner model to PyTorch for Lambda? I only see guidance for classification models.
  2. Definite newbie question: I presume I need/want to migrate my Jupyter notebook to Python code. Is there guidance, or would someone be willing to provide guidance, for that?
  3. Is this implementation overly complex? That is, would it be relatively similar in cost to spin up SageMaker to make the recommendations as needed (say, weekly) and then bring it down again?

I see the answer to #2 is provided here: https://course.fast.ai/deployment_amzn_sagemaker.html.
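On #1, a rough sketch of the fastai v1 export/load path for a collab learner; the paths and column names are placeholders, and whether predict() accepts a raw row like this can vary by fastai version, so treat this as a starting point rather than verified guidance:

    import pandas as pd
    from fastai.basic_train import load_learner  # fastai v1

    # In the training notebook, after fitting:
    #   learn.export('collab_export.pkl')  # serializes model + data processing

    learn = load_learner('model_dir', 'collab_export.pkl')  # placeholder paths
    row = pd.Series({'userId': 42, 'movieId': 17})  # must match training column names
    pred = learn.predict(row)[0]  # predicted rating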

1 Like

How can we do this for unet_learner?

    from pathlib import Path
    import numpy as np
    from fastai.vision import *  # SegmentationLabelList, ImageList, open_mask, unet_learner, ...

    data_path = Path('/home/data/')
    codes = np.loadtxt(data_path/"codes.txt", dtype=str)
    label = data_path/"mask"
    image = data_path/"input"

    get_y_fn = lambda x: label/f'{x.stem}.jpg'

    size = (512, 512)  # src_size//2
    print(size)
    bs = 8

    class SegLabelListCustom(SegmentationLabelList):
        def open(self, fn): return open_mask(fn, div=True)

    class SegItemListCustom(ImageList):
        _label_cls = SegLabelListCustom

    src = (SegItemListCustom.from_folder(image)
           .split_by_rand_pct(0.2)
           .label_from_func(get_y_fn, classes=codes))

    data = (src.transform(get_transforms(), size=size, tfm_y=True)
            .databunch(bs=bs)
            .normalize(imagenet_stats))

    name2id = {v: k for k, v in enumerate(codes)}
    print(name2id)
    # void_code = name2id['Void']

    def acc_camvid(input, target):
        target = target.squeeze(1)
        # mask = target != void_code
        return (input.argmax(dim=1) == target).float().mean()

    metrics = acc_camvid
    # metrics = accuracy

    wd = 1e-2

    learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)

    learn.load("./models/stage-2-big")

    img = open_image("test.jpg")

    prediction = learn.predict(img)[0]

    prediction  # how to convert this to a numpy image?

I have to do all this to load the learner for prediction in the Flask application, because load_learner() from the .pkl file is not working. Please provide a more elegant way of doing this for a single image. After getting the prediction, I am not able to convert it to numpy or OpenCV image format, and if I try to save the image, it saves a blank image. Please help asap!
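For what it’s worth, a minimal sketch of one way to do that conversion, assuming fastai v1 semantics (predict() on a segmentation learner returns an ImageSegment whose .data is a 1xHxW tensor of class indices):

    import cv2
    import numpy as np

    pred = learn.predict(img)[0]  # ImageSegment
    mask = pred.data.squeeze().cpu().numpy().astype(np.uint8)  # HxW label map
    # The masks above were opened with div=True, so values are 0/1;
    # rescale before saving or the file will look blank.
    cv2.imwrite('mask.png', mask * 255)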

My issue is that I never came to data science (and from there, here) via a programming background. I do know a lot of data science programming, but zero webstack or cloud. All the deployment guides seem to assume you ‘know’ webstack and cloud and work with them regularly. I don’t mind at all learning these things, but I don’t want to learn them in their entirety - just “webstack/cloud for deep learning.” Tools to ‘go faster’ won’t help me, because, as I learned when setting up Google Cloud, there are many things that always ‘go wrong’ or have changed, etc., and these roadblocks will be worse with an efficiency tool. Rather, I need to know the high-level hoops I have to jump through, and a list of the things I need to understand to try to jump through them. Thanks.

After scouring the web and forums for two days trying to figure out how to do this, I have now discovered fastai has guides for putting models into production, which I never noticed before. Doh!

Hi Michael !

Thanks a lot for your reply. I completely understand and share your perspective, as I also have a maths background but almost zero experience with web/cloud dev.

The fastai tutorials you’ve just found are incredibly helpful for putting your first model in production, so hopefully they will help you solve some of your problems. But in my experience, the deployment process is still tedious (especially if you want to use AWS Lambda or Azure Functions), and a lot of more advanced stuff is not covered by the tutorials (for example, deploying a web API, deploying on mobile, etc.).

Thanks,
Seb

I created a complete example of deploying a fastai model in a C++ application. There are some parts of the process on the forum, but I think it would be useful to have all of it in one place.

2 Likes

Have you found a solution? I’m also facing the same problem.

I did not. I installed fastai and the other packages on my customer’s machine. To make it easier, I created an environment.yml file.

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
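A minimal sketch of what such an environment.yml can look like; the channels and pins below are illustrative, not the actual file:

    name: fastai-deploy
    channels:
      - pytorch
      - fastai
      - defaults
    dependencies:
      - python=3.7
      - pytorch
      - fastai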

All,

I’m attempting to follow the AWS Lambda guide: https://course.fast.ai/deployment_aws_lambda.html. Upon initial training, I have 3 models and data totaling about 800 MB compressed and 1 GB uncompressed.

One of the steps is to use TorchScript to simplify/streamline the model. The example they give is a CNN. I’m attempting to deploy a variant of the IMDB ULMFiT implementation they worked with in lessons 3 and 4. Clearly, the deployment guide is set up to trace a CNN.

I’m new to fast.ai, and discovered that TorchScript existed yesterday when I started trying to follow the Lambda guide described above. I’m still trying to get a sense of TorchScript… the “gentle” guides are still over my head at this point. I’m honestly surprised that TorchScript didn’t make the cut to be included in the 12 lessons of the fast.ai course; that feels like a bit of an oversight.

Can somebody point me in a good direction to begin understanding how to modify the CNN script for an RNN? A code sample would be amazing, but failing that, documentation or training material would be great.

Thank you!

(error included below)

    jit_model = torch.jit.trace(learn.model.float(), trace_input)
    model_file = 'resnet50_jit.pth'
    output_path = str(path_img/f'models/{model_file}')
    torch.jit.save(jit_model, output_path)

    7 frames
    /usr/local/lib/python3.6/dist-packages/fastai/text/learner.py in forward(self, input)
        258     def forward(self, input:LongTensor)->Tuple[Tensor,Tensor]:
    --> 259         bs,sl = input.size()
        260         self.reset()
        261         raw_outputs,outputs,masks = [],[],[]

    ValueError: too many values to unpack (expected 2)
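The unpacking error hints that the traced example input still has an image shape: the text model’s forward does bs,sl = input.size(), so it expects a 2-D (batch_size, sequence_length) LongTensor of token ids, not a (1, 3, H, W) image tensor. A minimal sketch of tracing with a text-shaped dummy input; the sequence length of 70 is an arbitrary assumption, and tracing ULMFiT may still hit other snags (e.g. the hidden-state reset):

    import torch

    # Dummy input shaped like a tokenized text batch, not an image
    trace_input = torch.zeros(1, 70, dtype=torch.long)  # assumed sequence length
    learn.model.reset()  # clear the RNN hidden state before tracing
    learn.model.eval()
    jit_model = torch.jit.trace(learn.model, trace_input)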

Ran into this recently and found a solution from this blog post:

You’ll need to specify fastprogress as a hiddenimport in your spec file, specify the location of your hook files, and then create a new hook file with the following content:

    from PyInstaller.utils.hooks import copy_metadata
    datas = copy_metadata('fastprogress')

Be sure to name it hook-fastprogress.py
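For reference, the spec-file side of that might look roughly like this; the entry-point name and hooks directory are placeholders, and the other Analysis arguments are omitted:

    # Excerpt from a hypothetical PyInstaller .spec file
    a = Analysis(
        ['app.py'],                      # your entry script (placeholder)
        hiddenimports=['fastprogress'],  # force-include fastprogress
        hookspath=['./hooks'],           # folder containing hook-fastprogress.py
    )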

For more information check out PyInstaller’s documentation regarding hooks and hidden imports.

1 Like

All,

I am looking for best practices for measuring concept drift. Can somebody share how you handle concept drift when dealing with text and images?

Any good papers you can suggest? Thanks.
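Not text- or image-specific, but one simple, widely used starting point is the Population Stability Index computed over a model’s output scores (or over summary statistics of embeddings). A minimal sketch; the binning scheme and the thresholds quoted in the comment are just common conventions:

    import numpy as np

    def psi(expected, actual, bins=10):
        """Population Stability Index between a reference sample
        (e.g. training-time scores) and current production scores."""
        cuts = np.percentile(expected, np.linspace(0, 100, bins + 1))
        cuts[0], cuts[-1] = -np.inf, np.inf  # catch out-of-range values
        e = np.histogram(expected, cuts)[0] / len(expected)
        a = np.histogram(actual, cuts)[0] / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
        return float(np.sum((a - e) * np.log(a / e)))

    # Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.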