How do you handle “continuous learning” in production? Specifically, how do you calculate model drift and retrain the model? Is there a reference architecture you can suggest?
thanks
Hari
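One possible way to put a number on drift, sketched under assumptions: the Population Stability Index (PSI) compares the distribution of a model score (or a single feature) at training time with the same quantity in production. PSI is one common choice, not something proposed in this thread, and the function names, bin count, and `eps` smoothing below are illustrative. A threshold for triggering retraining would have to be chosen per application.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values in each bin; out-of-range values go to the first or last bin."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = 0
        for i in range(n_bins):
            if v >= edges[i]:
                idx = i  # keep the last bin whose lower edge is <= v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference: Sequence[float],
                               current: Sequence[float],
                               n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a reference sample and a current sample of the same quantity."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins derived from the reference (training-time) sample.
    edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)]
    ref_frac = histogram_fractions(reference, edges)
    cur_frac = histogram_fractions(current, edges)
    psi = 0.0
    for r, c in zip(ref_frac, cur_frac):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) for empty bins
        psi += (c - r) * math.log(c / r)
    return psi


# Illustrative data: production scores shifted upward relative to training scores.
training_scores = [i / 100 for i in range(100)]
production_scores = [s + 0.3 for s in training_scores]
print(population_stability_index(training_scores, training_scores))
print(population_stability_index(training_scores, production_scores))
```

A scheduled job could compute this on recent predictions and kick off retraining when the value stays above the chosen threshold.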
How do you run them periodically?
Ideally, I’d like to have something that periodically pings my web application for new content to run multiple models against. If there is new content, it would grab all of it, run the model(s), and then return all the predicted labels back to the web application in a single call.
Also, I’m trying to reduce my AWS costs here. It’s not like this thing is going to be getting new content throughout the day. It may be only 5-20x a day, so I’d hate to have something running in AWS, consuming resources, when it doesn’t need to most of the time.
And also, how do you have things running in SageMaker to get and run models against a “batch” of data? In most of what I read, I just see examples of using it for single-item inference.
Thanks
Let me try to separate your response into distinct items:
From the above, I’ll try to answer them one by one.
Yah, I’m actually looking at using ECS (Elastic Container Service) and EI (Elastic Inference).
ECS allows me to deploy via Docker images, in which I can run a cron job that periodically queries my web application for new content to do inference on.
EI, which is not available for PyTorch yet (insofar as I can tell), basically allows you to attach a GPU to the cheaper EC2 instance running your Docker image only when you need it.
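A minimal sketch of the polling job described above, assuming the web application offers one call that returns unprocessed items and one call that accepts all labels at once. The names `run_once`, `fetch_new_items`, and `post_labels`, and the item fields `id` and `text`, are hypothetical; the two callables stand in for whatever HTTP client code talks to the application.

```python
from typing import Callable, Dict, List, Sequence

Item = Dict[str, str]                          # hypothetical shape: {"id": ..., "text": ...}
Model = Callable[[Sequence[str]], List[str]]   # takes a batch of inputs, returns one label each


def run_once(fetch_new_items: Callable[[], List[Item]],
             models: Dict[str, Model],
             post_labels: Callable[[Dict[str, Dict[str, str]]], None]) -> int:
    """Fetch all new items, run every model on the whole batch, send results back once."""
    items = fetch_new_items()
    if not items:
        return 0  # nothing new: exit without loading the GPU or calling back
    texts = [item["text"] for item in items]
    results: Dict[str, Dict[str, str]] = {item["id"]: {} for item in items}
    for name, model in models.items():
        labels = model(texts)  # one batched call per model
        for item, label in zip(items, labels):
            results[item["id"]][name] = label
    post_labels(results)  # single call back to the web application
    return len(items)


# Stand-ins for the real fetch, models, and callback.
sent = []
demo_models = {"length": lambda batch: ["long" if len(t) > 5 else "short" for t in batch]}
processed = run_once(lambda: [{"id": "1", "text": "hello world"}], demo_models, sent.append)
print(processed, sent)
```

If the job is started on a schedule and exits after each pass, nothing has to keep running between the handful of daily runs.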
Thanks again for the info … really helpful to see how other folks are using AWS in production.
Thanks for the information on EI. I never really explored that option. It would be great to explore once we have a requirement for batch inference on GPU.
Edit: Fixed, found the solution. The pickled learner was performing preprocessing (scaling by 8x) on the inputs when .predict was called, and I have to match that exactly when calling .trace() and when running .forward() on the new module. The nightly version of pytorch indeed lets this work on DynamicUnet.
I’m trying to convert a DynamicUnet (https://docs.fast.ai/vision.models.unet.html#DynamicUnet) for production. I’ve tried using TorchScript tracing, but it ends up using over 40 GB of memory (when tracing is done on the CPU; on the GPU it just crashes), whereas normal prediction uses next to none. Has anyone encountered this and found a fix?
I’m running it like this:
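```python
import os
import torch as pt
from fastai.basic_train import load_learner  # fastai v1

# model_fn: path to the pickled learner; output_model_fn: where to save the traced model
learner = load_learner(os.path.dirname(model_fn), os.path.basename(model_fn))
dummy_img = pt.ones(1, 3, 3264, 2448).cuda()
jit_model = pt.jit.trace(learner.model, dummy_img)
pt.jit.save(jit_model, output_model_fn)
```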
Does anyone have any ideas on why DynamicUnet in particular is blowing up? Calling learner.predict is working fine and uses very little memory.
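A rough piece of arithmetic that is consistent with the fix noted in the edit at the top of this post, under one loud assumption: that “scaling by 8x” means each spatial side of the image is reduced by a factor of 8 before the model sees it. In that case `.predict` works on 64 times fewer pixels than the full-size dummy input used for tracing, and the feature maps of a convolutional network grow roughly in proportion to the pixel count. The helper name `tensor_bytes` is illustrative.

```python
def tensor_bytes(batch: int, channels: int, height: int, width: int,
                 bytes_per_value: int = 4) -> int:
    """Size in bytes of a dense float32 tensor with the given shape."""
    return batch * channels * height * width * bytes_per_value


scale = 8  # assumption: each side is downscaled by 8 during preprocessing

full_bytes = tensor_bytes(1, 3, 3264, 2448)                     # dummy input used for tracing
small_bytes = tensor_bytes(1, 3, 3264 // scale, 2448 // scale)  # what .predict would see

print(f"full-size input : {full_bytes / 1e6:.1f} MB")
print(f"downscaled input: {small_bytes / 1e6:.1f} MB")
print(f"pixel ratio     : {full_bytes // small_bytes}x")
```

The input itself is small either way; the point is the ratio, which would apply to every intermediate activation as well.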
I’ve been using the nightly of FastAI, as the current 1.2.0 has an issue with hooks that has since been fixed, though you can also manually edit out one line to work around it.
I’m wondering if the DynamicUnet naturally splits the input into tiles to run and somehow torch.jit.trace isn’t doing the same.
More discussion I’ve found on this:
Did TorchScript crash on both CPU and GPU for you? For me it works fine on GPU, but no matter how small I make the inputs to the unet, it still crashes on CPU.
Hi everyone,
I was telling a friend of mine, who is a great SW engineer, how easy it has become to create state-of-the-art models thanks to fastai, but how much I still struggle with productionizing models (especially those that are not classification). One thing led to another, and we decided to work together in our spare time to help make fastai model deployment easier.
In order to focus on the right issues, we wanted to start with a poll. What are your main bottlenecks today when deploying your models?
I’ve been playing with the fast.ai libraries and SageMaker (AWS) for the past few months, and found that I was quickly able to build a collaborative learner model that makes meaningful recommendations to our clients. In order to bring this to production at minimal cost, I was imagining deploying the model to Lambda, where it could provide the predictions, while maintaining the model weekly via SageMaker.
But now I wonder:
I see the answer to #2 is provided here: https://course.fast.ai/deployment_amzn_sagemaker.html.
How can we do it for unet_learner?
```python
from fastai.vision import *  # fastai v1; also brings in np, Path, open_mask, etc.

data_path = Path('/home/data/')
codes = np.loadtxt(data_path/"codes.txt", dtype=str)
label = data_path/"mask"
image = data_path/"input"
get_y_fn = lambda x: label/f'{x.stem}.jpg'
size = (512, 512)  # src_size//2
print(size)
bs = 8

class SegLabelListCustom(SegmentationLabelList):
    def open(self, fn): return open_mask(fn, div=True)

class SegItemListCustom(ImageList):
    _label_cls = SegLabelListCustom

src = (SegItemListCustom.from_folder(image)
       .split_by_rand_pct(0.2)
       .label_from_func(get_y_fn, classes=codes))
data = (src.transform(get_transforms(), size=size, tfm_y=True)
        .databunch(bs=bs)
        .normalize(imagenet_stats))

name2id = {v: k for k, v in enumerate(codes)}
print(name2id)
# void_code = name2id['Void']

def acc_camvid(input, target):
    target = target.squeeze(1)
    # mask = target != void_code
    return (input.argmax(dim=1) == target).float().mean()

metrics = acc_camvid
# metrics = accuracy
wd = 1e-2
learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)
learn.load("./models/stage-2-big")
img = open_image("test.jpg")
prediction = learn.predict(img)[0]
prediction  # how to convert it to numpy image?
```
I have to do all this to load the learner for prediction in the Flask application because load_learner() from the .pkl file is not working. Please suggest a more elegant way of doing this for a single image. Also, after getting the prediction, I am not able to convert it to a numpy or OpenCV image format, and if I try to save the image, it is saved as a blank image. Please help asap!!
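A possible explanation for the blank image, offered as a sketch rather than a confirmed diagnosis: a segmentation prediction holds class indices (0, 1, 2, ...), which look black when written out as an 8-bit image. The helper below, whose name `mask_to_uint8_image` is hypothetical, stretches the indices over 0-255 using plain numpy. It assumes the mask has already been pulled out of the prediction as an array of shape (1, H, W) or (H, W); in fastai v1 that would typically be something like `prediction.data.numpy()`, which is an assumption about the setup above.

```python
import numpy as np


def mask_to_uint8_image(mask, n_classes: int) -> np.ndarray:
    """Turn a class-index mask into a visible 8-bit grayscale image."""
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    mask = np.asarray(mask)
    if mask.ndim == 3 and mask.shape[0] == 1:
        mask = mask[0]  # drop the leading channel axis: (1, H, W) -> (H, W)
    scale = 255 // (n_classes - 1)  # spread class indices across the 0-255 range
    return (mask * scale).astype(np.uint8)


# Illustrative 2-class mask with a leading channel axis.
example_mask = np.array([[[0, 1, 1],
                          [0, 0, 1]]])
image = mask_to_uint8_image(example_mask, n_classes=2)
print(image.shape, image.dtype)
print(image)
```

The resulting array is an ordinary (H, W) uint8 image, so it can be handed to OpenCV, for example with `cv2.imwrite`.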
My issue is that I never came to data science (and from there to here) via a programming background. I know a lot of data science programming but zero web stack or cloud. All the deployment guides seem to assume you ‘know’ web stack and cloud and work with them regularly. I don’t mind learning these things at all, but I don’t want to learn them in their entirety, just “web stack/cloud for deep learning.” Tools to ‘go faster’ won’t help me because, as I learned when setting up Google Cloud, there are always things that ‘go wrong’ or have changed, and these roadblocks will be worse with an efficiency tool. Rather, I need to know the high-level hoops I have to jump through and a list of the things I need to understand or try in order to jump through them. Thanks.
After scouring the web and forums for two days trying to figure out how to do this, I have now discovered that fastai has guides for putting models into production, which I had never noticed before. Doh!
Hi Michael!
Thanks a lot for your reply. I completely understand and share your perspective, as I also have a maths background but almost zero experience with web/cloud dev.
The fastai tutorials you’ve just found are incredibly helpful for putting your first model in production, so hopefully they will help you solve some of your problems. But in my experience, the deployment process is still tedious (especially if you want to use AWS Lambda or Azure Functions), and many more advanced topics are not covered by the tutorials (for example, deploying a web API, deploying on mobile, etc.).
Thanks,
Seb
I created a complete example of deploying a fastai model in a C++ application. Some parts of the process are already covered in the forum, but I think it would be useful to have all of it in one place.
Have you found the solution? I’m also facing the same problem.
I did not. I installed fastai and other packages on my customer’s machine. To make it easier, I created an environment.yml file.
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
All,
I’m attempting to follow the aws_lambda guide. https://course.fast.ai/deployment_aws_lambda.html Upon initial training, I have 3 models and data totaling about 800 megs compressed, and 1GB uncompressed.
One of the steps is to use TorchScript to simplify/streamline the model. The example they give is with a CNN. I’m attempting to deploy a variant of the IMDB ULMFiT implementation that they worked with in lessons 3 and 4. Clearly, the deployment guide is set up to trace a CNN.
I’m new to fast.ai, and only discovered that TorchScript existed yesterday when I started trying to follow the Lambda guide described above. I’m still trying to get a sense of TorchScript… the “gentle” guides are still over my head at this point. I’m honestly surprised that TorchScript didn’t make the cut to be included in the 12 lessons of the fast.ai course. That feels like a bit of an oversight.
Can somebody point me in a good direction to begin understanding how to modify the CNN script for an RNN? A code sample would be amazing, but failing that, documentation/training would be great.
Thank you!
(error included below)
```
      2 jit_model = torch.jit.trace(learn.model.float(), trace_input)
      3 model_file='resnet50_jit.pth'
      4 output_path = str(path_img/f'models/{model_file}')
      5 torch.jit.save(jit_model, output_path)

7 frames
/usr/local/lib/python3.6/dist-packages/fastai/text/learner.py in forward(self, input)
    257
    258     def forward(self, input:LongTensor)->Tuple[Tensor,Tensor]:
--> 259         bs,sl = input.size()
    260         self.reset()
    261         raw_outputs,outputs,masks = [],[],[]

ValueError: too many values to unpack (expected 2)
```
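A small illustration of what the traceback is saying, under the assumption that `trace_input` was built the way the CNN guide builds it, as a 4-D image batch (its definition is not shown in the post). The text model's `forward` unpacks `input.size()` into exactly two values, batch size and sequence length, and a `torch.Size` unpacks like a plain tuple, so the sketch uses tuples to stay dependency-free. The concrete shapes and the helper name are hypothetical.

```python
def split_batch_and_seq_len(shape):
    """Mimics `bs, sl = input.size()` from the traceback."""
    bs, sl = shape
    return bs, sl


image_shape = (1, 3, 224, 224)  # hypothetical CNN-style dummy input: batch, channels, height, width
token_shape = (1, 20)           # hypothetical text input: batch, sequence length

print(split_batch_and_seq_len(token_shape))

try:
    split_batch_and_seq_len(image_shape)
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)
```

So the dummy input for a text model would need to be a 2-D tensor of integer token ids rather than an image-shaped float tensor. Whether tracing then succeeds for the rest of the RNN is a separate question that this sketch does not answer.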
Ran into this recently and found a solution from this blog post:
You’ll need to specify fastprogress as a hidden import (hiddenimports) in your spec file, specify the location of your hook files, and then create a new hook file with the following content:
```python
from PyInstaller.utils.hooks import copy_metadata

datas = copy_metadata('fastprogress')
```
Be sure to name it hook-fastprogress.py.
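For the spec-file side of this, a hedged sketch of how the relevant arguments might look. `hiddenimports` and `hookspath` are arguments of PyInstaller's `Analysis`; the entry script name `app.py` and the folder name `hooks` are hypothetical placeholders, and the other arguments a generated spec file normally contains are omitted here.

```python
# Fragment of a PyInstaller .spec file. `Analysis` is supplied by PyInstaller
# when it executes the spec, so this fragment is not runnable on its own.
a = Analysis(
    ['app.py'],                      # hypothetical entry script
    hiddenimports=['fastprogress'],  # force PyInstaller to bundle the package
    hookspath=['hooks'],             # hypothetical folder containing hook-fastprogress.py
)
```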
For more information check out PyInstaller’s documentation regarding hooks and hidden imports.
All,
I am looking for best practices for measuring concept drift. Can somebody share how you handle concept drift when dealing with text and images?
Are there any good papers you can suggest? Thanks.