Segmentation in V1

The last one is linked to PyTorch. What environment are you running this on?

https://colab.research.google.com/gist/tcapelle/4083ffd865fabc8703175515d521a7f2/tgs-fastai-v1.ipynb

OK, so on Colab there is an issue with PyTorch, as reported here: you need to increase the amount of shared memory available.
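
If you cannot raise the shared-memory limit (the usual situation on Colab), a common workaround for this class of error is to drop the DataLoader worker processes. Here is a minimal sketch at the raw PyTorch level; whether your DataBunch creator exposes a num_workers argument to do the same thing is an assumption:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset as a stand-in for the segmentation dataset built earlier in the thread.
train_ds = TensorDataset(torch.randn(8, 3, 64, 64), torch.zeros(8, dtype=torch.long))
# Worker processes exchange batches through shared memory (/dev/shm), so
# num_workers=0 loads data in the main process and sidesteps the limit,
# at the cost of slower data loading.
train_dl = DataLoader(train_ds, batch_size=4, shuffle=True, num_workers=0)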

Do you know why the first model just runs fine then and the segmentation one does not?

Thank you for your help and patience. :slight_smile:

But I am not sure if I understood it correctly.
I set it up like this (I also tried the other combination, see below):

def get_tfms_datasets(size, paths, tfms):
    datasets = get_datasets(paths)
    #print('get_tfms_datasets tfms:\n', tfms)
    return transform_datasets(*datasets, test_ds=None, tfms=tfms, tfm_y=True, size=size)

def get_data(bs, size, paths):
    return DataBunch.create(*get_tfms_datasets(size=size, paths=paths, tfms=[get_transforms(), get_transforms()]), bs=bs, size=size, tfms=rsna_norm)

data = get_data(bs, size, paths)

These are the results:

data.train_dl.tfms & data.valid_dl.tfms show the normalization function:

[functools.partial(<function _normalize_batch at 0x7f0162126510>, mean=tensor([0.4914, 0.4914, 0.4914]), std=tensor([0.2492, 0.2492, 0.2492]))]

data.train_ds.tfms & data.valid_ds.tfms show the data augmentation functions:

([RandTransform(tfm=TfmCrop (crop_pad), kwargs={'row_pct': (0, 1), 'col_pct': (0, 1)}, p=1.0, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmPixel (flip_lr), kwargs={}, p=0.5, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmCoord (symmetric_warp), kwargs={'magnitude': (-0.2, 0.2)}, p=0.75, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmAffine (rotate), kwargs={'degrees': (-10.0, 10.0)}, p=0.75, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmAffine (zoom), kwargs={'row_pct': (0, 1), 'col_pct': (0, 1), 'scale': (1.0, 1.1)}, p=0.75, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmLighting (brightness), kwargs={'change': (0.4, 0.6)}, p=0.75, resolved={}, do_run=True, is_random=True),
  RandTransform(tfm=TfmLighting (contrast), kwargs={'scale': (0.8, 1.25)}, p=0.75, resolved={}, do_run=True, is_random=True)],
 [RandTransform(tfm=TfmCrop (crop_pad), kwargs={}, p=1.0, resolved={}, do_run=True, is_random=True)])

But with:

show_image_batch(data.train_dl, data.train_ds.classes, rows=3, figsize=(5,5))

I still get the same AttributeError: 'list' object has no attribute 'tfm' from above.

When I debug the error, the dl at line 42 shows the normalization function:

/home/paperspace/fastai/fastai/vision/data.py(42)show_image_batch()
     40                      denorm:Callable=None) -> None:
     41     "Show a few images from a batch."
---> 42     x,y = next(iter(dl))
     43     if rows is None: rows = int(math.sqrt(len(x)))
     44     x = x[:rows*rows].cpu()

ipdb> dl
DeviceDataLoader(dl=<torch.utils.data.dataloader.DataLoader object at 0x7f016054b780>, device=device(type='cuda'), tfms=[functools.partial(<function _normalize_batch at 0x7f0162126510>, mean=tensor([0.4914, 0.4914, 0.4914]), std=tensor([0.2492, 0.2492, 0.2492]))], collate_fn=<function data_collate at 0x7f016922a840>)

This also happens when I swap the two tfms functions from above with each other:

def get_data(bs, size, paths):
    return DataBunch.create(*get_tfms_datasets(size=size, paths=paths, tfms=[rsna_norm, rsna_norm]), bs=bs, size=size, tfms=get_transforms())

With that setup I still get AttributeError: 'functools.partial' object has no attribute 'tfm', even though while debugging I can see the tfm attribute in the dl that raises the error:

DeviceDataLoader(dl=<torch.utils.data.dataloader.DataLoader object at 0x7f01605384a8>, device=device(type='cuda'), tfms=[[RandTransform(tfm=TfmCrop (crop_pad), kwargs={'row_pct': (0, 1), 'col_pct': (0, 1)}, p=1.0, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmPixel (flip_lr), kwargs={}, p=0.5, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmCoord (symmetric_warp), kwargs={'magnitude': (-0.2, 0.2)}, p=0.75, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmAffine (rotate), kwargs={'degrees': (-10.0, 10.0)}, p=0.75, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmAffine (zoom), kwargs={'row_pct': (0, 1), 'col_pct': (0, 1), 'scale': (1.0, 1.1)}, p=0.75, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmLighting (brightness), kwargs={'change': (0.4, 0.6)}, p=0.75, resolved={}, do_run=True, is_random=True), RandTransform(tfm=TfmLighting (contrast), kwargs={'scale': (0.8, 1.25)}, p=0.75, resolved={}, do_run=True, is_random=True)], [RandTransform(tfm=TfmCrop (crop_pad), kwargs={}, p=1.0, resolved={}, do_run=True, is_random=True)]], collate_fn=<function data_collate at 0x7f016922a840>)

I also checked the docs for ObjectDetectDataset and transform_datasets.

I am not sure what I am messing up or in which direction I should debug further.

If I understand v1 myself (a stretch!), the list error comes from tfms=[get_transforms(), get_transforms()], which should be ds_tfms=get_transforms(). I'm also not sure you need get_tfms_datasets at all; the DataBunch creator already calls it, so just pass in your datasets.
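
A minimal sketch of that suggestion, assuming an ImageDataBunch.create-style signature where ds_tfms takes the per-item augmentations and tfms takes the batch-level normalization (argument names and the tfm_y pass-through are assumptions if your version differs):

def get_data(bs, size, paths):
    # ds_tfms: per-item augmentations applied to the datasets
    # tfms: batch-level transforms such as normalization, applied by the DataLoader
    # tfm_y=True is assumed to be forwarded so the masks follow the image transforms
    return ImageDataBunch.create(*get_datasets(paths), ds_tfms=get_transforms(),
                                 tfm_y=True, size=size, tfms=rsna_norm, bs=bs)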


I think it’s due to our inefficient implementation of bounding boxes for data augmentation. It will change soon as we commit the transformations for points inside the main fastai library.


Nothing is working now =(, resnet34 is not there anymore, Darknet does not work, sniff…
What would you recommend for being able to help with dev, a Paperspace instance?
Colab is free, most people will try the library there first, and the K80 is not bad.
Kaggle is not working either.

I am trying unet with v1 and have also arrived at a state_dict error. I’m using the latest github code.

My datasets look fine when I inspect them. I create a learner as per docs/example with

body = create_body(tvm.resnet34(True), -2)
model = DynamicUnet(body, n_classes=2)  # tried with and without .cuda()
learn = Learner(data, model, metrics=metrics, loss_fn=CrossEntropyFlat())
learn.split([model[0][6], model[1]])
learn.freeze()

but it errors on any action like lr_find(learn) with Error(s) in loading state_dict for DynamicUnet: While copying the parameter named "0.0.weight", whose dimensions in the model are torch.Size([64, 3, 7, 7]) and whose dimensions in the checkpoint are torch.Size([64, 3, 7, 7]).

I see models/tmp.pth get created before the error. I can’t see anything glaringly incompatible between learn.model.state_dict()['0.0.weight'] and nn.Sequential(*list(tvm.resnet34(True).children())[:-2]).

Note that segmentation is still under development and not thoroughly tested. Saying nothing is working is perhaps a bit exaggerated and not very helpful. Giving full stack error messages or gists that reproduce bugs is a way to help us fix those annoying failures.

I just ran a DynamicUnet with resnet34 pretrained and didn’t get any problem in the lr_find. @digitalspecialists can you check whether you still have the bug with the latest version of fastai? If yes, on which type of hardware/instance? There was a bug with the current open_mask and the loss (the target values were 0 or 255 instead of 0 or 1), but I pushed a fix.
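
For context, the target problem described is the classic 0/255 binary-mask issue; here is a pure-PyTorch illustration (not the actual open_mask fix) of converting such a mask into the class indices a cross-entropy style loss expects:

import torch

# A binary mask saved as a grayscale image typically holds {0, 255},
# while CrossEntropyFlat/CrossEntropyLoss expect class indices {0, 1}.
mask = torch.tensor([[0, 255], [255, 0]])
target = (mask > 0).long()   # tensor([[0, 1], [1, 0]])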

Thanks I will give it a go. Totally understand that it is a work in progress and will be happy to report results :slight_smile:

Darknet is working now; I was hoping to build a Darknet-Unet.
The Darknet classifier is working after changing to the new ImageDataBunch.from_csv(). I can’t make resnet34 work; it is a different error this time, coming from the torchvision implementation of resnet.
Any tips on how to print tensor sizes at each stage of the resnet? (See the hook sketch after the traceback below.)

resnet = tvm.resnet34(True)
learn = Learner(md, resnet, metrics=accuracy)
lr_find(learn)
learn.recorder.plot()
>>epoch  train loss  valid loss  accuracy

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-117-dd390b1c8108> in <module>()
----> 1 lr_find(learn)
      2 learn.recorder.plot()

/usr/local/lib/python3.6/dist-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, **kwargs)
     24     cb = LRFinder(learn, start_lr, end_lr, num_it)
     25     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 26     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     27 
     28 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    136         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    137         fit(epochs, self.model, self.loss_fn, opt=self.opt, data=self.data, metrics=self.metrics,
--> 138             callbacks=self.callbacks+callbacks)
    139 
    140     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     89     except Exception as e:
     90         exception = e
---> 91         raise e
     92     finally: cb_handler.on_train_end(exception)
     93 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in fit(epochs, model, loss_fn, opt, data, callbacks, metrics)
     79             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     80                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 81                 loss = loss_batch(model, xb, yb, loss_fn, opt, cb_handler)[0]
     82                 if cb_handler.on_batch_end(loss): break
     83 

/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_fn, opt, cb_handler, metrics)
     17     if not is_listy(xb): xb = [xb]
     18     if not is_listy(yb): yb = [yb]
---> 19     out = model(*xb)
     20     out = cb_handler.on_loss_begin(out)
     21 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torchvision/models/resnet.py in forward(self, x)
    146         x = self.layer4(x)
    147 
--> 148         x = self.avgpool(x)
    149         x = x.view(x.size(0), -1)
    150         x = self.fc(x)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    475             result = self._slow_forward(*input, **kwargs)
    476         else:
--> 477             result = self.forward(*input, **kwargs)
    478         for hook in self._forward_hooks.values():
    479             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/pooling.py in forward(self, input)
    545     def forward(self, input):
    546         return F.avg_pool2d(input, self.kernel_size, self.stride,
--> 547                             self.padding, self.ceil_mode, self.count_include_pad)
    548 
    549 

RuntimeError: Given input size: (512x4x4). Calculated output size: (512x-2x-2). Output size is too small at /pytorch/aten/src/THCUNN/generic/SpatialAveragePooling.cu:63
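
On the question above about printing tensor sizes at each stage: one possible approach (a sketch using standard PyTorch forward hooks, nothing fastai-specific) is:

import torch
import torchvision.models as tvm

resnet = tvm.resnet34(True).eval()

def make_hook(name):
    def hook(module, inputs, output):
        # Forward hook: report the output shape produced by this stage.
        print(f'{name:10s} {tuple(output.shape)}')
    return hook

handles = [child.register_forward_hook(make_hook(name))
           for name, child in resnet.named_children()]
with torch.no_grad():
    resnet(torch.zeros(1, 3, 224, 224))  # dummy batch; 224x224 keeps the stock avgpool happy
for h in handles: h.remove()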

You should use a ConvLearner with resnets, even if you only want to do classification: the torchvision implementation doesn’t use an adaptive average pooling like fastai does.
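
For illustration, here is a minimal pure-PyTorch sketch of why the fixed pooling fails on small inputs and how an adaptive pool avoids it (this shows the idea only, not what ConvLearner actually builds):

import torch
import torch.nn as nn
import torchvision.models as tvm

resnet = tvm.resnet34(True)
# The stock resnet head assumes 224x224 inputs, where layer4 outputs 7x7 maps and a
# fixed 7x7 average pool works; smaller images give 4x4 maps and the error above.
# An adaptive pool always reduces to 1x1, whatever the input size:
resnet.avgpool = nn.AdaptiveAvgPool2d(1)
out = resnet(torch.zeros(1, 3, 128, 128))  # now runs on 128x128 inputs too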

OK, good to know. ConvLearner cuts the model and then appends a head.
It trains now; I get some warnings and no progress bar, but it trains.
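
Roughly what that cut-and-append step amounts to, as a hedged pure-PyTorch sketch (the real fastai head-building adds more layers such as batchnorm and dropout):

import torch.nn as nn
import torchvision.models as tvm

class Flatten(nn.Module):
    "Collapse the pooled feature map to a flat vector per image."
    def forward(self, x): return x.view(x.size(0), -1)

n_classes = 2  # hypothetical number of classes

# "Cut": keep the convolutional body, dropping the stock avgpool and fc.
body = nn.Sequential(*list(tvm.resnet34(True).children())[:-2])
# "Head": adaptive pooling + flatten + a fresh classifier for our classes.
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), Flatten(), nn.Linear(512, n_classes))
model = nn.Sequential(body, head)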

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 165, in __del__
    def __del__(self): del(self.model, self.data)
AttributeError: model
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 165, in __del__
    def __del__(self): del(self.model, self.data)
AttributeError: model

0      0.732525  

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 165, in __del__
    def __del__(self): del(self.model, self.data)
AttributeError: model
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 165, in __del__
    def __del__(self): del(self.model, self.data)
AttributeError: model

1      2.720228  

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/fastai/basic_train.py", line 165, in __del__
    def __del__(self): del(self.model, self.data)
AttributeError: model

Having access to one_cycle directly from the learner is pretty neat.
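
For reference, a hedged usage sketch, assuming the method is exposed as fit_one_cycle on the Learner in this version:

# One-cycle policy straight from the learner: number of epochs and peak learning rate.
learn.fit_one_cycle(4, max_lr=1e-3)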

Yeah, if you’re on colab, the progress bars don’t work. This is a known issue that I hope to address soon.

Can you give some hints on what kind of instance is best for using fastai now? Paperspace? AWS?

Paperspace and AWS should both work fine.

Thanks Sebastian for your time. I am really anxious for the updated notebooks.

On Paperspace, is the best option for using v1 to set up from scratch, or to start from the fastai template?

I think a fresh start, then following the instructions, is probably better. The fastai template is for 0.7 as far as I know.