Lesson 11 discussion and wiki

So one thing is probably that PyTorch’s torchvision just uses PIL and that it can be a bit less complex to install. Regarding performance, there are three things to note:

  • Is preprocessing the training bottleneck? Preprocessing happens in the background with PyTorch dataloaders, so unless your model is waiting for the next batch, preprocessing is probably fast enough already. Homography / rotation does cost CPU cycles; cropping not really.
  • There is a SIMD drop-in PIL replacement (Pillow-SIMD) that very likely closes much of the gap.
  • If you really want fast preprocessing, you’d probably look at the GPU (see the sketch below this list). Now that Jeremy has rehabilitated my laziness of just using nearest neighbour, I should really put up my homography transform CUDA kernel (but it’s really trivial to do, so if you always wanted to implement a custom CUDA thing, I can recommend it as a first project). :slight_smile:
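
For the GPU route, before writing any custom kernel, the affine_grid / grid_sample pair that comes up later in this thread already gets you quite far. A rough sketch with a made-up batch, using nearest-neighbour sampling:

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 3, 128, 128, device=device)  # made-up batch of images, already on the GPU

# one small random rotation per image, written as a 2x3 affine matrix
angle = torch.empty(8, device=device).uniform_(-0.3, 0.3)
theta = torch.zeros(8, 2, 3, device=device)
theta[:, 0, 0] = angle.cos(); theta[:, 0, 1] = -angle.sin()
theta[:, 1, 0] = angle.sin(); theta[:, 1, 1] = angle.cos()

grid = F.affine_grid(theta, x.size(), align_corners=False)
out = F.grid_sample(x, grid, mode="nearest", align_corners=False)  # nearest-neighbour warp on the GPU

That only covers affine warps, of course; for a full homography you’d have to build the sampling grid yourself (or write that kernel).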

Best regards

Thomas

4 Likes

The Mixup paper similarly smelled tenchy to me, as it seems to produce a scrambled image by “convex combinations of pairs of examples and their labels” - so I guess this is a similar kind of regularization.
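
For reference, a bare-bones sketch of what those “convex combinations” look like in code (per-batch lambda and one-hot labels assumed; real implementations differ in the details):

import torch

def mixup_batch(x, y_onehot, alpha=0.4):
    # blend the batch with a shuffled copy of itself; the same convex
    # combination is applied to the images and to the (one-hot) labels
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y_onehot + (1 - lam) * y_onehot[idx]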

1 Like

Noise like in dropout is believed to have an effect like “ensemble learning” by breaking the network up into more or less de-correlated subnetworks. It’s not just dropout but also other types/amounts of noise that have this effect. Considering that, it becomes more plausible that fish images without fish could be OK - in small amounts.

Another angle is the mixup/mixin experience, which shows that “borderline images” can help refine the boundary between classes. I.e. a fisherman without a fish is still closer to the fish class than an astronaut is - in some abstract sense :slight_smile:

4 Likes

I posted the edited lesson video to the top post.

2 Likes

Apparently there were incompatibilities between Python’s multiprocessing and OpenCV that made it unreliable.

1 Like

It already loads the data into memory only one batch at a time, so it has some lazy properties that make it possible to train on data larger than your RAM.
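
That laziness is really just the Dataset protocol: nothing is read from disk until __getitem__ is asked for a particular index. A minimal sketch (my own illustration, not the course code):

from PIL import Image
from torch.utils.data import Dataset

class LazyImageList(Dataset):
    # only the file paths live in RAM; images are opened item by item, batch by batch
    def __init__(self, paths, tfm=None):
        self.paths, self.tfm = paths, tfm
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        return self.tfm(img) if self.tfm else img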

3 Likes

I’m not sure if I missed it, but I’m a bit concerned about the timing of the GPU ops. You’d need synchronization before the measurement and at the end of the timed function if you want to use %timeit (which I personally use a lot for quick benchmarks).

Also, I’m not sure I would include the transfer to the GPU in the benchmark: you’ll be transferring your image to the GPU at some point anyway, so it’s not really an overhead that the transformation itself incurs.

Best regards

Thomas

2 Likes

I managed to run this lesson 11 on Kaggle too and got similar results to Jeremy’s notebooks. I compressed and imported the exp folder.

2 Likes

I also think that you should deal with some variable casting.
Something like below:
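
(Roughly this kind of thing, with xb standing in for a batch of byte images:)

import torch

xb = torch.randint(0, 255, (64, 3, 128, 128), dtype=torch.uint8)  # stand-in batch

# cast to float32 and move to the GPU before applying the transforms
xb = xb.float().cuda()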

What would you suggest as the best way to write this for the example we showed?

This is already handled by CudaCallback.

2 Likes

Right - it’s not ideal. But by definition there won’t be many items in this class, so it should be OK. Trying to predict a category we’ve never seen before is always going to be tricky!..

1 Like

Write a simple and readable version. See if it’s fast enough for what you’re doing. If it’s not, use your profiler to find what’s taking the time, and fix that bit. :slight_smile:
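
In a notebook, %prun is usually enough for that last step (preprocess_batch and batch here are just hypothetical stand-ins for your own code):

# show the 10 most expensive calls in one preprocessing pass
%prun -l 10 preprocess_batch(batch)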

3 Likes

And it’s documented here:
https://docs.fast.ai/performance.html#faster-image-processing

5 Likes

Interesting article on NVIDIA DALI: Data Augmentation Library

1 Like

There’s a little starter for using DALI in the course repo BTW. It is just enough to give you a sense of how to get started writing your own data-blocks-style API using DALI. I’ll probably come back to it and flesh it out in the coming weeks.

5 Likes

In sgd_step we say p.data.add_(-lr, p.grad.data). Why do we use two arguments instead of multiplying?
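
(My guess is that the two-argument form fuses the scale and the add into one in-place op, so no intermediate -lr * p.grad.data tensor is allocated - a sketch with a made-up parameter:)

import torch

lr = 0.1
p = torch.randn(5, requires_grad=True)
p.grad = torch.randn(5)

# fused, in-place scaled add: p.data += (-lr) * p.grad.data, no temporary tensor
p.data.add_(p.grad.data, alpha=-lr)   # newer spelling of p.data.add_(-lr, p.grad.data)

# versus multiplying first, which materialises -lr * p.grad.data as a new tensor:
# p.data += -lr * p.grad.data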

I’d probably not use type, and use to(device=..., dtype=...) instead.

Best regards

Thomas

So

%timeit -n 10 grid = F.affine_grid(theta.cuda(), x.size())

would become the (more verbose, unfortunately)

theta_cuda = theta.cuda()
def time_fn():
  grid = F.affine_grid(theta_cuda, x.size())
  torch.cuda.synchronize()

time_fn() # mini warm-up and synchronize
%timeit -n 10 time_fn()

The warm-up is generally a good idea anyway, and it gives us a torch.cuda.synchronize() so that everything queued before our function is done when we start timing.
Then time_fn() itself synchronizes to make sure we don’t read off the time before the kernel has actually finished.

I guess one could make a %cuda_timeit magic to get back to the nice, short way of calling it.
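
Something along these lines might do as a quick hack (untested sketch; it just appends the synchronize to whatever you pass it and hands that to the normal %timeit, without the explicit warm-up):

import torch
from IPython import get_ipython
from IPython.core.magic import register_line_magic

@register_line_magic
def cuda_timeit(line):
    # stop the clock only once the kernels launched by `line` have finished
    get_ipython().run_line_magic("timeit", f"{line}; torch.cuda.synchronize()")

Then %cuda_timeit grid = F.affine_grid(theta_cuda, x.size()) gives back the short calling form.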

4 Likes