[Adv] Return to opencv!

I’ve just pushed a change to fastai that goes back to using opencv, which reverts the change to Pillow for image processing that I made about a week ago. The problems with Pillow were that:

  • It wasn’t nearly as fast as opencv
  • It couldn’t handle arrays with anything other than 1, 3, or 4 channels

Previously, opencv was causing freezes for some people due to a conflict with Python's multiprocessing. So I've now switched to using ThreadPoolExecutor, after an opencv contributor told me on Twitter that he'd removed the need for the GIL in opencv.
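
To illustrate the pattern (a minimal sketch only, not the actual fastai internals): since opencv does its decoding work outside the GIL, plain threads can load images in parallel without the fork-related deadlocks that multiprocessing was hitting.

from concurrent.futures import ThreadPoolExecutor
import cv2

def load_image(path):
    # cv2 reads BGR uint8; convert to RGB float32 in [0, 1]
    im = cv2.imread(path, cv2.IMREAD_COLOR)
    return cv2.cvtColor(im, cv2.COLOR_BGR2RGB).astype('float32') / 255

def load_batch(paths, workers=8):
    # threads, not processes: the heavy work happens inside opencv, not Python
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(load_image, paths))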

So in my testing, this is now fast and reliable! Please let me know if you have any slowdown in data augmentation, or any new freezes or other issues.

This change should help folks in the iceberg competition, since this should in theory now handle 2-channel data augmentation.
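
As a quick sanity check (my own snippet, not part of the library), cv2 operations accept a 2-channel float array directly, which PIL could not represent:

import cv2
import numpy as np

band = np.random.rand(75, 75, 2).astype('float32')  # e.g. band_1 and band_2 stacked
flipped = cv2.flip(band, 1)                          # horizontal flip keeps both channels
resized = cv2.resize(band, (64, 64))
assert flipped.shape == (75, 75, 2) and resized.shape == (64, 64, 2)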

It looks like commit b1ba89663c16f81ae33220319d08a4f87988eb94 breaks the transformations. The ConstantPad class was removed although it's still called in the image_gen function of the Transforms class. ConstantPad depended on PIL's ImageOps, which is probably why it was removed. I've created a pull request with a fix.
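
For reference, the same padding can be done with cv2.copyMakeBorder instead of PIL's ImageOps; this is just a sketch of that approach (the names are mine, not necessarily what the pull request uses):

import cv2

def constant_pad(im, pad, value=0):
    # pad the array equally on all four sides with a constant value
    return cv2.copyMakeBorder(im, pad, pad, pad, pad,
                              borderType=cv2.BORDER_CONSTANT, value=value)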

I’m facing an issue with the LR finder in the lesson 2 problem. Any insight on it?

I’m not sure if this is related to the update since I didn’t test these two models (VGG16 & VGG19) before this most recent update, but I’m posting here just in case it might be related.

Just tested the update and it’s definitely much faster than it was with PIL, so thanks for that!!

@jamesrequa - Is the error only with these two models? Does it work well if you change it to resnet34?

Restarting the AWS instance helped.

Yep, only those two (VGG16 and VGG19) that I’ve seen so far. I haven’t tried all the available models, but I’ve tried a handful of others like Resnet34, ResNeXt, Inception, etc., all without this issue.

@ramesh
UPDATE: Actually, now I can see the same error happening with all versions of DenseNet, which I know worked fine before the update.

Although I haven’t done a git pull to get the latest updates yet, I get the same error with learn.fit using DN201. All the other models seem to run fine.

Context: I’m trying to use the VGG16 architecture on the Iceberg challenge. The Iceberg challenge has 75x75 ‘images’, but AFAIK the vgg16/vgg16_bn in fastai are actually fully-convolutional networks, so a 75x75 input image should hopefully be OK.
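
As a rough check of that claim (using torchvision's vgg16 conv stack as a stand-in, since I'm not 100% sure how the fastai wrapper defines it), a 75x75 input still leaves a 2x2 feature map after the five pooling stages:

import torch
from torchvision.models import vgg16

feats = vgg16().features                 # conv/pool stack only, no FC head
out = feats(torch.randn(1, 3, 75, 75))
print(out.shape)                         # torch.Size([1, 512, 2, 2])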

Environment: Just ran git pull and conda env update on an Amazon p2 instance.

Code:

%reload_ext autoreload
%autoreload 2

import torch
from fastai.imports import *

from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

from PIL import ImageFile

import seaborn as sns
sns.set_style('whitegrid')
colors = plt.get_cmap('tab10').colors

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

path = "/home/akomma/datasets/iceberg/composites/"
! ls {path}

def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_top_down, max_zoom=1.00)
    data = ImageClassifierData.from_paths(path,
                                          test_name='test',
                                          bs=bs,
                                          tfms=tfms)
    return data

arch = vgg16 # vgg16_bn, vgg19, vgg_bn, dn121, dn161, dn169, dn201,
sz = 75
bs = 32

data = get_data(sz, bs)
learn = ConvLearner.pretrained(arch, data, precompute=False)

lr = 1e-2
learn.fit(lr, 5)

Stack Trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-b79876ba01d1> in <module>()
     10 
     11 lr = 1e-2
---> 12 learn.fit(lr, 5)

~/lab_kaggle/iceberg/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
     96     def fit(self, lrs, n_cycle, wds=None, **kwargs):
     97         self.sched = None
---> 98         layer_opt = self.get_layer_opt(lrs, wds)
     99         self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    100 

~/lab_kaggle/iceberg/fastai/learner.py in get_layer_opt(self, lrs, wds)
     92 
     93     def get_layer_opt(self, lrs, wds):
---> 94         return LayerOptimizer(self.opt_fn, self.get_layer_groups(), lrs, wds)
     95 
     96     def fit(self, lrs, n_cycle, wds=None, **kwargs):

~/lab_kaggle/iceberg/fastai/layer_optimizer.py in __init__(self, opt_fn, layer_groups, lrs, wds)
     15         if len(wds)==1: wds=wds*len(layer_groups)
     16         self.layer_groups,self.lrs,self.wds = layer_groups,lrs,wds
---> 17         self.opt = opt_fn(self.opt_params())
     18 
     19     def opt_params(self):

~/lab_kaggle/iceberg/fastai/core.py in <lambda>(*args, **kwargs)
     63 
     64 def SGD_Momentum(momentum):
---> 65     return lambda *args, **kwargs: optim.SGD(*args, momentum=momentum, **kwargs)
     66 
     67 def one_hot(a,c): return np.eye(c)[a]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/optim/sgd.py in __init__(self, params, lr, momentum, dampening, weight_decay, nesterov)
     54         if nesterov and (momentum <= 0 or dampening != 0):
     55             raise ValueError("Nesterov momentum requires a momentum and zero dampening")
---> 56         super(SGD, self).__init__(params, defaults)
     57 
     58     def __setstate__(self, state):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/optim/optimizer.py in __init__(self, params, defaults)
     40             group_set = set(group['params'])
     41             if not param_set.isdisjoint(group_set):
---> 42                 raise ValueError("some parameters appear in more than one "
     43                                  "parameter group")
     44             param_set.update(group_set)

ValueError: some parameters appear in more than one parameter group
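
For what it’s worth, the PyTorch side of that error is easy to reproduce on its own (my own minimal snippet, independent of fastai): it fires whenever the same tensor appears in two optimizer parameter groups, which suggests the layer-group splitting for these architectures is producing overlapping groups.

import torch.nn as nn
import torch.optim as optim

layer = nn.Linear(10, 10)
groups = [
    {'params': layer.parameters()},   # first group
    {'params': layer.parameters()},   # same parameters again
]
optim.SGD(groups, lr=1e-2, momentum=0.9)
# ValueError: some parameters appear in more than one parameter group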

I get the same error ValueError: some parameters appear in more than one parameter group when running:

  • vgg16
  • vgg19
  • dn121
  • dn161
  • dn169
  • dn201

However, ResNet, ResNeXt, and Inception models do not have this issue.
Also, what are the differences between the resnext101, resnext101_64, resnext_101_32x4d, and resnext_101_64x4d models? Which one should I use?

@abi be sure to give us the information shown here so we can help you: http://wiki.fast.ai/index.php/How_to_ask_for_Help

Just edited my post to add more context and how to reproduce the error.

Hi,
I was testing my new GTX 1080Ti and reinstalled the environment on my home box instead of using AWS.
So I pulled a brand new fastai repo on Nov 22; I hadn’t pulled since Nov 18.

In my case I get very high CPU utilization with the Nov 22 version of fastai. At one point the whole PC froze, and learner instantiation and training take very long.

When I fall back to the Nov 18 snapshot of the repo, everything works fine and fast.

I am wondering if anyone else has similar issues?

This is on the same box with the GTX 1080Ti, same notebook, same dogscats dataset:

Nov 22 repo: [screenshot]

Nov 18 repo: [screenshot]

P.S. I will test back on the AWS fastai AMI and will try to investigate the repo differences…

That’s really odd! Yeah would love to hear if you can narrow down which change is causing this.

I’m definitely able to reproduce it on my local box right after the cv change.

!git reset --hard b1ba89663c16f81ae33220319d08a4f87988eb94
HEAD is now at b1ba896 use cv2 and threadpool

But if no one else has this issue, I suspect something with my OpenCV library…
Will keep looking.

Can you say a little bit more about your config:

  1. How much CPU/RAM do you have on this machine?
  2. How do top, htop, or free -h look?
  3. How does nvidia-smi look during the process on both setups?
  4. Which notebook are you using for this test (lesson1)?

Also, can you give us both the before and after git commit hashes you are using for this test?
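
If it’s easier, all of that can be grabbed from inside the notebook (the fastai path below is just an example; point it at wherever your clone lives):

! free -h                               # CPU RAM
! nvidia-smi                            # GPU utilization and memory
! git -C ~/fastai log -1 --format=%H    # current fastai commit hash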

How did you install opencv? Did you just use the version in conda env update, or did you install by hand?

Yes, it was lesson 1 in the test. I have an i7-3770 and 32 GB RAM.
It was an opencv version problem: I was on 3.1.0, while the fastai AMI uses 3.3.0.

I forgot to do conda env update after conda env create -f environment.yml, so I was still on version 3.1.0.

After the conda update it is all fine.
Thanks @jeremy @ramesh
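
For anyone else hitting the slowdown, a quick way to confirm which opencv build your environment is actually using:

import cv2
print(cv2.__version__)   # should report 3.3.x after conda env update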

My notebook (lesson2-image_models) kernel dies whenever I try to run
log_preds, y = learn.TTA(is_test=False). I have my own GPU machine (1080Ti). It seems like the GPU freezes and I have to reboot. I did not have this problem before. Does anyone else have this problem?