Inference performance for camera trap photos 🐘 on RPi4 - Fast.ai vs PyTorch

Hi, I’m developing a smart camera trap that can recognise humans and different types of animals. It is going to be used in anti-poaching and biodiversity projects. I already have a properly trained fast.ai model, and now I want to run it on the Raspberry Pi 4. I’ve created two working solutions for inference (Python code below). Here are my findings:

  • Inference with Fast.ai: 12 seconds per image
  • Inference with PyTorch: 1.5 seconds per image

These results use exactly the same model. As you can see, there is a huge difference in performance. I’m probably doing some things wrong in the PyTorch code; for example, the transformation and normalisation values are just samples I grabbed from the internet. Still, can anybody explain why fast.ai inference is so much slower, and can you help me find the best tradeoff between inference speed and accuracy? The 🐘s and 🦏s thank you for your help 🙂

Fast.ai sample code:

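from fastai.vision import *
import sys
import time
from PIL import ImageFile

# Load the exported learner from the current directory
learn = load_learner('', 'camera_trap_model.pkl')

# Open the image given on the command line and resize it
image = open_image(sys.argv[1])
image = image.resize(400)

# Time four consecutive predictions on the same image
for _ in range(4):
    start = time.time()
    res = learn.predict(image)
    print(res, time.time() - start)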

Output on Rpi4

(Category Elephant_African, tensor(10), tensor([...])) 12.383612632751465
(Category Elephant_African, tensor(10), tensor([...])) 12.633440971374512
(Category Elephant_African, tensor(10), tensor([...])) 12.447107553482056
(Category Elephant_African, tensor(10), tensor([...])) 12.01491665840149

PyTorch sample code:

import torch
import sys
from torchvision import models
from PIL import Image
from pprint import pprint
import time

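# The exported learner is a pickled dict; the network itself is stored under 'model'
state = torch.load('camera_trap_model.pkl', map_location='cpu')
model = state.pop('model')
model.eval()  # inference mode: disables dropout, uses running batch-norm statistics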

classes = ['Bird',
 'Blank',
 'Buffalo_African',
 'Cat_Golden',
 'Chevrotain_Water',
 'Chimpanzee',
 'Civet_African_Palm',
 'Duiker_Blue',
 'Duiker_Red',
 'Duiker_Yellow_Backed',
 'Elephant_African',
 'Genet',
 'Gorilla',
 'Guineafowl_Black',
 'Guineafowl_Crested',
 'Hog_Red_River',
 'Human',
 'Leopard_African',
 'Mandrillus',
 'Mongoose',
 'Mongoose_Black_Footed',
 'Monkey',
 'Pangolin',
 'Porcupine_Brush_Tailed',
 'Rail_Nkulengu',
 'Rat_Giant',
 'Rodent',
 'Squirrel']

from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )])

start = time.time()
image = Image.open(sys.argv[1])
print("loaded image", time.time() - start)

start = time.time()
img_t = transform(image)
print("transformed image", time.time() - start)

start = time.time()
batch_t = torch.unsqueeze(img_t, 0)
print("unsqueezed image", time.time() - start)

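for x in range(4):
    start = time.time()
    # Forward pass; wrapping this call in torch.no_grad() would skip autograd bookkeeping
    out = model(batch_t)
    # Index of the highest-scoring class
    _, index = torch.max(out, 1)
    # Class probabilities as percentages
    percentage = torch.nn.functional.softmax(out, dim=1)[0] * 100
    print(classes[index[0]], percentage[index[0]].item(), time.time() - start)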

Output on Rpi4

loaded image 0.016829729080200195
transformed image 0.39249253273010254
unsqueezed image 0.00021958351135253906
Elephant_African 93.3268051147461 1.7025566101074219
Elephant_African 93.3268051147461 1.545377254486084
Elephant_African 93.3268051147461 1.6150736808776855
Elephant_African 93.3268051147461 1.4821553230285645

That’s weird. It may be that load_learner is loading the model onto the GPU. Have you set defaults.device to ‘cpu’ and checked that the model is loaded on the CPU and not on the GPU?

I don’t remember whether you can set defaults.device='cpu' directly or whether you need to import something first.

I just tried adding defaults.device = torch.device('cpu'), but it does not make a difference; it still takes 12 seconds. Additionally, I remember Jeremy saying that inference uses the CPU by default.

When I do print(torch.cuda.get_device_name(0)) I get the following error: "Torch not compiled with CUDA enabled". So I would assume using the GPU on the RPi is not even possible, right?
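
For what it's worth, a gentler check than get_device_name(0), which raises on a CPU-only build, is torch.cuda.is_available(). A minimal sketch follows; the helper name describe_device is made up for illustration, and the import guard is only there so the snippet also runs on a machine without PyTorch:

```python
# Hypothetical helper (not part of fastai or PyTorch): report which device
# inference would run on, without raising on a CPU-only build.
try:
    import torch
except ImportError:  # lets the snippet run even where PyTorch is absent
    torch = None


def describe_device():
    """Return 'cuda' if a CUDA device is usable, otherwise 'cpu'."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(describe_device())
```

On a Raspberry Pi this should report the CPU, which matches the error message above.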

Yes, the Raspberry Pi GPU is not an NVIDIA GPU and is therefore not CUDA enabled.

It looks like you are running the fastai model on a 400x400 image (image = image.resize(400)) and the PyTorch model on a 224x224 image, according to the transforms in the code. Try comparing fastai performance on a 224x224 image.

I just tried using resize(224), this does not make a difference. I even removed the resize operation and I still get the same results.

My assumption is that (regardless of the resize) fast.ai transforms the image to match the input layer size before feeding it to the model. Is my assumption correct? Maybe that’s also what is causing the bad performance, because fast.ai is doing all sorts of transformations on the image inside the for loop?
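
One way to test that suspicion would be to time preprocessing and the forward pass separately, as the PyTorch script already does by hand. Here is a rough sketch of a reusable timer; time_stage, fake_preprocess and fake_forward are made-up names, and the two fake stages are stand-ins that would be swapped for the real transform and model calls on the Pi:

```python
import time


def time_stage(fn, *args, repeats=4):
    """Call fn(*args) `repeats` times; return (last_result, list_of_seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        timings.append(time.perf_counter() - start)
    return result, timings


# Stand-in stages so the sketch runs on its own; replace them with the
# real preprocessing and model calls when profiling.
def fake_preprocess(values):
    return [v / 255 for v in values]


def fake_forward(values):
    return sum(values)


pixels = list(range(256))
prepped, t_pre = time_stage(fake_preprocess, pixels)
_, t_fwd = time_stage(fake_forward, prepped)
print("preprocess:", min(t_pre), "forward:", min(t_fwd))
```

If the forward pass dominates, the transformations are probably not the culprit and the input size is the next thing to look at.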

One thing you can do is replace PIL with Pillow-SIMD, which should speed up the Resize step. Also, gradually remove the fastai code and replace it with raw PyTorch (i.e. use fastai to build the DataLoader, but then use raw PyTorch to feed the data to the model and convert the result to the output you want). You should be able to speed things up this way.

While this is in fastai2, I have an example of what I’m generally talking about here: Speeding Up fastai2 Inference - And A Few Things Learned

Yes! I found what the difference is! After ploughing through the fast.ai source code, I found that regardless of the size of the image you pass to the predict function, fast.ai always resizes the input tensor to ([1, 3, 576, 768]). This is the size my model (based on resnet50) was transfer-learned on. In my PyTorch example I used an input tensor of ([1, 3, 224, 224]), which is why it was so much faster.
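
A quick back-of-the-envelope check supports this. Convolution cost grows roughly in proportion to the number of pixels, so the pixel ratio between the two input shapes should be in the same ballpark as the timing ratio. The timings below are rounded from the outputs posted earlier in this thread:

```python
# Input shapes quoted in the thread: (batch, channels, height, width).
fastai_shape = (1, 3, 576, 768)
pytorch_shape = (1, 3, 224, 224)


def pixel_count(shape):
    """Number of spatial positions (height * width) in one image."""
    _, _, height, width = shape
    return height * width


pixel_ratio = pixel_count(fastai_shape) / pixel_count(pytorch_shape)

# Rounded per-image timings from the RPi4 outputs above (seconds).
fastai_seconds = 12.4
pytorch_seconds = 1.6
time_ratio = fastai_seconds / pytorch_seconds

print(f"pixel ratio: {pixel_ratio:.2f}")
print(f"time ratio:  {time_ratio:.2f}")
```

That is about 8.8 times as many pixels against a slowdown of roughly 7.75 times, so the input size alone accounts for most of the gap.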

Now I’m trying to understand how these models work with different input sizes… I assumed a model has a fixed input size (for example 224x224). But apparently I can run both 224x224 and 576x768 through the model and it just works…
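
As far as I understand it, this works because convolutions slide over whatever spatial size they are given, and a pooling layer before the classifier collapses the resulting feature map to a fixed size. The sketch below is plain Python, not the real model: it assumes a ResNet-style backbone that downsamples by a factor of 32 in total, and uses a simple global average as a stand-in for the adaptive pooling in the head.

```python
import math


def feature_map_size(height, width, total_stride=32):
    """Spatial size after a backbone that downsamples by total_stride (assumed 32)."""
    return math.ceil(height / total_stride), math.ceil(width / total_stride)


def global_average_pool(channel):
    """Collapse one channel (a list of rows) of any size to a single number."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


for height, width in [(224, 224), (576, 768)]:
    h, w = feature_map_size(height, width)
    channel = [[1.0] * w for _ in range(h)]  # dummy activations
    print((height, width), "->", (h, w), "->", global_average_pool(channel))
```

Both inputs end up as one number per channel after pooling, so the final linear layers always see the same number of features, whatever the image size.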

Yup! 🙂 You can actually use a method called “Progressive Resizing” (Jeremy has brought it up many times during the course), where you train at, say, 128, then 224, then 448, and it can boost accuracy.

Yes, I’ve heard him say that indeed. Aside from taking Jeremy’s course (I’m halfway through), I’ve watched many videos that explain how CNNs work, and I can follow them. But all these videos talk about a fixed input size.

Now I’m trying to grasp how “Progressive Resizing” works during training and inferencing. Any good material on that?

I believe it’s called the test-time discrepancy. Research exploring this area has only recently started to appear. From what I know, here’s what I can tell you:

TTA (test-time augmentation) can boost accuracy but takes about 5x as long (as it uses the training augmentations).

Progressive resizing: this is done during training, and you then use your final size during inference (i.e. 448 if you finished at 448). However, it is known that you can boost accuracy a little more by using a slightly larger image size (like 512 in this scenario).
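
To make the TTA cost concrete, here is a rough sketch of the general idea: run the model on the original image plus a few augmented views and average the probabilities. This is not fastai's own TTA implementation; tta_predict, toy_predict, flip and darken are all made up for illustration.

```python
def tta_predict(predict, image, augmentations):
    """Average class probabilities over the original image and its augmented views."""
    views = [image] + [augment(image) for augment in augmentations]
    all_probs = [predict(view) for view in views]  # one forward pass per view
    n_classes = len(all_probs[0])
    return [sum(p[c] for p in all_probs) / len(all_probs) for c in range(n_classes)]


# Toy stand-ins so the sketch runs on its own.
def toy_predict(image):
    # Pretend "model": probability of class 1 grows with the mean pixel value.
    p = sum(image) / len(image)
    return [1 - p, p]


def flip(image):
    return image[::-1]


def darken(image):
    return [v * 0.5 for v in image]


print(tta_predict(toy_predict, [0.2, 0.4, 0.6], [flip, darken]))
```

Each extra view costs one more forward pass, which is where the slowdown comes from. On a Pi at over a second per image, that is probably a tradeoff to weigh carefully.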

Hi all,
Just wanted to add an observation from training the mentioned model:

I have tried progressive resizing with 128 -> 256 -> 512 stages (which is a standard setup I have seen and used myself), but found that it worked worse than 512 -> 768 in this case.

Long story short, the dataset at hand is very rich in information: there is a lot of foreground coming from all the vegetation visible in the images, and the animals we are detecting are sometimes rather small in the picture. Hence, resolutions higher than the usual approach suggests give better results.

Hi @tsuijten.

I have been developing a similar system for my final-year project to classify baboons in South Africa for tracking purposes. I have trained my model (Resnet34) in Colab and exported the pickled model to a virtual environment on a Raspberry Pi 4. I’ve installed the fastai library as well as the PyTorch libraries; however, when calling the function load_learner() I get an error. It seems like the function was not installed into my environment. Could you perhaps give me some direction on how you ensured that your libraries were installed correctly? Any other advice would be greatly appreciated.

Thanks a million 🙂
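
One thing that may be worth checking here (a guess, since the error message itself isn't shown) is whether the fastai version on the Pi matches the one used in Colab. In fastai v1, load_learner comes in via fastai.vision, as in the first post, while fastai v2 exposes it through fastai.vision.all, and a learner exported by one major version generally won't load in the other. A small sketch for printing the installed versions, using only the standard library (Python 3.8+); installed_version is a made-up helper name:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("fastai", "torch", "torchvision"):
    print(name, installed_version(name))
```

Running this both in Colab and inside the Pi's virtual environment would show whether the two environments actually match.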