Kaggle's Whale Competition

Hi folks,
A new competition was launched a week ago on Kaggle…

It's a bit different from the other competitions we might have worked on here:

  • Images don't all have the same size (not an issue)
  • But what if they also have different numbers of channels?
  • The training set is about half the size of the test set (or the test set contains more images)

What should be done?
(Using fastai is preferred, as it will help everyone.)

Any thoughts will be appreciated…

Thanks…

Link

  • What I thought is that we could separate out all the channels and add more images to the training set (and do the same for the test set)?

  • A bit of searching might lead to a paper (like the "You Only Look Once" model which Jeremy was talking about in the ML lectures).
    Is it feasible?

@ramesh(sorry)
@EricPB (sorry)
@alessa (sorry)
@yinterian (sorry)

2 Likes

Looks to be an interesting challenge. It's basically a multi-label classification problem, similar to the Planet (Understanding the Amazon from Space) competition, for which fastai works very well, and there's a sample notebook on the Planet problem in the courses repo.

Before jumping into building a model, I would start with data exploration. Some of these kernels might be a good place to start:

  1. https://www.kaggle.com/andersy005/getting-started
  2. https://www.kaggle.com/lextoumbourou/humpback-whale-id-data-and-aug-exploration

If you have images with a different number of channels, you may need to convert them to 3-channel images so that we can use a pre-trained network. I have not looked into the data yet, so I can't comment much on it.
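If anyone wants to check the channel situation quickly, here is a rough sketch (just an illustration, assuming Pillow is installed and the training images sit under {PATH}train; the path below is a placeholder) that counts how many images are greyscale vs. RGB:

from collections import Counter
from pathlib import Path
from PIL import Image

PATH = 'data/whale/'  # placeholder path, adjust to wherever the data lives

# PIL mode 'L' is single-channel greyscale, 'RGB' is 3 channels
modes = Counter(Image.open(str(p)).mode for p in Path(PATH, 'train').iterdir())
print(modes)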

3 Likes

Most of the images are greyscale, so is it worth converting all of them to greyscale?

I ran into this error and am having trouble understanding how to resolve it. I'm using the same notebook as for Planet and would appreciate any guidance. I'm running the notebook on AWS on a p2.xlarge. Thanks.

I read another thread about this error here: http://forums.fast.ai/t/cuda-runtime-error-59/9085 but still cannot resolve it.

train path:

ls {PATH}train -1 | wc
 9850    9850  246254
labels_csv = f'{PATH}train.csv'
n = len(list(open(labels_csv)))-1
val_idxs = get_cv_idxs(n)
n
9850
len(val_idxs)
1970
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorCopy.c:70:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-4344f2365a4d> in <module>()
----> 1 learn.fit(lr, 1)
      2 learn.precompute=False

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    211         self.sched = None
    212         layer_opt = self.get_layer_opt(lrs, wds)
--> 213         self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    214 
    215     def warm_up(self, start_lr=1e-5, end_lr=10, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    158         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    159         fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 160             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    161 
    162     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
     95             if stop: return
     96 
---> 97         vals = validate(stepper, data.val_dl, metrics)
     98         print(np.round([epoch, debias_loss] + vals, 6))
     99         stop=False

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
    109     for (*x,y) in iter(dl):
    110         preds,l = stepper.evaluate(VV(x), VV(y))
--> 111         loss.append(to_np(l))
    112         res.append([f(to_np(preds),to_np(y)) for f in metrics])
    113     return [np.mean(loss)] + list(np.mean(np.stack(res),0))

~/fastai/courses/dl1/fastai/core.py in to_np(v)
     35     if isinstance(v, (list,tuple)): return [to_np(o) for o in v]
     36     if isinstance(v, Variable): v=v.data
---> 37     return v.cpu().numpy()
     38 
     39 USE_GPU=True

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in cpu(self)
     33     def cpu(self):
     34         """Returns a CPU copy of this tensor if it's not already on the CPU"""
---> 35         return self.type(getattr(torch, self.__class__.__name__))
     36 
     37     def double(self):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in type(self, *args, **kwargs)
    368     def type(self, *args, **kwargs):
    369         with device(self.get_device()):
--> 370             return super(_CudaBase, self).type(*args, **kwargs)
    371 
    372     __new__ = _lazy_new

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _type(self, new_type, async)
     36     if new_type.is_sparse:
     37         raise RuntimeError("Cannot cast dense tensor to sparse tensor")
---> 38     return new_type(self.size()).copy_(self, async)
     39 
     40 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorCopy.c:70
2 Likes

I read through the thread you linked to; the problem typically has to do with the expected dimensions of your inputs and targets, or with their data types.

Follow the debugging advice in the post to help pinpoint which.

Actually, it looks like you are mixing up running things on the CPU and the GPU. Those CUDA exceptions have to do with the GPU, but looking at your stack trace you are running your model on the CPU.
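For the dimensions/data-types check, something like this rough sketch might help (assuming data is the ImageClassifierData object from your notebook; error 59 is usually an out-of-range target index or a dtype mismatch):

# Pull one validation batch and inspect it before fitting
x, y = next(iter(data.val_dl))
print(x.size(), x.type())                   # expect (bs, 3, sz, sz) and a FloatTensor
print(y.size(), y.type())                   # expect (bs,) and a LongTensor for single-label data
print(y.min(), y.max(), len(data.classes))  # every target must be < len(data.classes)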

@wgpubs thanks for the info. I did check the thread you referenced earlier but still could not resolve it. Reading the threads, I also thought it might be a GPU/CPU issue, which is why I initially mentioned that I am using an AWS p2.xlarge, which has a GPU. The AWS info for the p2.xlarge states 1 GPU and 4 vCPUs, so I'm not sure if that matters, but I have been using the same setup for all my fastai and Kaggle work.

Let me dig deeper…thanks

unless AWS is undercutting lol

OK, problem solved! I also noticed some cool updates to the fastai library. I updated the fastai folder and realized that quite a few changes had been made; after the update everything works just fine.

Also, kudos to all those involved in getting this update that labels what the 4 numbers are during training :grinning::+1: I believe @reshama requested this update a while ago!

Epoch
100% 2/2 [05:23<00:00, 161.82s/it]
epoch      trn_loss   val_loss   accuracy                   
    0      7.398534   7.753981   0.076028  
    1      6.994157   7.650898   0.078044  
2 Likes

Thank you to whoever added the labels! I believe it was @apil.tamang and @Matthew.
Dreams can come true :wink:

Actually, it's recommended to convert all of the grayscale images to have 3 channels, so that you can take advantage of pre-trained NNs.
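For example, a rough sketch of one way to do that with PIL, assuming you are happy to overwrite the files in place (the folder path below is just a placeholder):

from pathlib import Path
from PIL import Image

for p in Path('data/whale/train').iterdir():  # placeholder location
    img = Image.open(str(p))
    if img.mode != 'RGB':
        # convert('RGB') replicates the single greyscale channel into R, G and B
        img.convert('RGB').save(str(p))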

3 Likes

Getting the same error as @amritv when passing no custom metrics to ConvLearner.pretrained(), and it fails after successful completion of one epoch:

Planet notebook (lesson2-image_models.ipynb) runs fine on the same machine, same configuration afaics.

I suspect that it's the metric, and I began defining a MAP@5 metric, but I haven't been successful yet:

from average_precision import mapk

def mapk5(preds, targs):
    return mapk(actual=targs, predicted=preds, k=5)

metrics=[mapk5]

With average_precision being https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py

Guess it’s not THAT simple, and I’ll have to actually look at the metrics code.
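The direction I'm considering next is a pure numpy version (just a sketch, assuming fastai hands the metric an array of per-class scores of shape (batch, n_classes) and an array of integer targets, which I still need to verify):

import numpy as np

def mapk5(preds, targs, k=5):
    # preds: (batch, n_classes) per-class scores (e.g. log-probs from the model)
    # targs: (batch,) integer class indices
    top_k = np.argsort(-preds, axis=1)[:, :k]   # indices of the k highest-scoring classes
    hits = top_k == targs[:, None]              # (batch, k) booleans: where the true class appears
    first_hit = np.argmax(hits, axis=1)         # rank of the first hit (0 if there is none)
    return float(np.where(hits.any(axis=1), 1.0 / (first_hit + 1), 0.0).mean())

metrics = [mapk5]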

Any hints from you wonderful folks are happily appreciated if you see I'm on the wrong path.

1 Like

@farlion
I re-ran the code again today; this is the same code that was working 6 days ago but it isn't now. I am now getting the same error after 1 epoch.

f_model = resnet50
sz = 256
bs = 64
tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_top_down + transforms_side_on, max_zoom=1.05)
data = ImageClassifierData.from_csv(PATH, 'train', labels_csv, test_name='test', val_idxs=val_idxs, tfms=tfms, bs=bs)

error code:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCTensorCopy.c:65
2 Likes

Same thing here, except it has always thrown this error.

I think the issue is with the full dataset. I tested the code with a much smaller, painfully validated dataset :sweat_smile: and it worked, but with the full set the error occurs after 1 epoch, so there is something missing or mismatched in the full set.
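A rough sanity check along these lines might narrow it down (just a sketch, assuming train.csv has the competition's Image and Id columns and PATH points at the data folder):

import pandas as pd
from pathlib import Path

df = pd.read_csv(f'{PATH}train.csv')   # expected columns: Image, Id
train_dir = Path(PATH, 'train')

# Files referenced in the CSV but missing on disk, and vice versa
csv_files = set(df['Image'])
disk_files = set(p.name for p in train_dir.iterdir())
print('in csv but not on disk:', csv_files - disk_files)
print('on disk but not in csv:', disk_files - csv_files)

# Empty or NaN labels can also throw off the class indexing
print('rows with missing labels:', int(df['Id'].isnull().sum()))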

1 Like

Is this something we can easily do with the fastai library? And if so, how?

I just read @amritv's post about the data maybe not being valid. Thanks for the insight.

Hi folks

So I am doing this competition too - got sooo confused yesterday.

My y value from

x, y = next(iter(data.val_dl)) 

returns a single-dimension array (the same length as the batch size, 64), and I get seemingly random large numbers in it. I just can't for the life of me figure out why the one-hot encoding isn't working…

data = ImageClassifierData.from_csv(
  PATH, 
  'train',
  label_csv,
  tfms=tfms_from_model(archiecture_chosen, sz, aug_tfms=transforms_side_on),
  test_name='test',
  val_idxs=val_idxs
);


print(len(val_idxs)) # validation set indexes
print(len(data.classes)) #individual classes there are

1970
4251

x,y = next(iter(data.trn_dl)) 

# x = First batch, 64 images, 3(RGB) x 244 x 244 per image


print(y)
# truth label indexes against each category - is what I am expecting.
# but it looks like I am getting the softmax i.e max(0, x) of each image in the batch(bs=64) returned. 

0
2409
1484
1632
3784
3175
2118
0
2443
638
1407
3134
1194
2525
0
977
1323
3942
2148
1048
1147
0
1392
2276
1904
3816
0
2796
2619
120
52
567
944
2305
3445
0
2017
1363
3861
2784
1208
1146
409
3275
3232
2720
2620
2348
2516
3614
2409
2511
3037
310
1545
3996
353
1280
3608
2193
2156
4197
551
3942
[torch.cuda.LongTensor of size 64 (GPU 0)]

list(zip(data.classes, y)) 
# zips the class names with the y values in the batch,
# showing that this is indeed one number per item in the batch.
# This looks so wrong :( where are my 1's and 0's?

[(‘new_whale’, 0),
(‘w_0013924’, 2409),
(‘w_001ebbc’, 1484),
(‘w_002222a’, 1632),
(‘w_002b682’, 3784),
(‘w_002dc11’, 3175),
(‘w_0087fdd’, 2118),
(‘w_008c602’, 0),
(‘w_009dc00’, 2443),
(‘w_00b621b’, 638),
(‘w_00c4901’, 1407),
(‘w_00cb685’, 3134),
(‘w_00d8453’, 1194),
(‘w_00fbb4e’, 2525),
(‘w_0103030’, 0),
(‘w_010a1fa’, 977),
(‘w_011d4b5’, 1323),
(‘w_0122d85’, 3942),
(‘w_01319fa’, 2148),
(‘w_0134192’, 1048),
(‘w_013bbcf’, 1147),
(‘w_014250a’, 0),
(‘w_014a645’, 1392),
(‘w_0156f27’, 2276),
(‘w_015c991’, 1904),
(‘w_015e3cf’, 3816),
(‘w_01687a8’, 0),
(‘w_0175a35’, 2796),
(‘w_018bc64’, 2619),
(‘w_01a4234’, 120),
(‘w_01a51a6’, 52),
(‘w_01a99a5’, 567),
(‘w_01ab6dc’, 944),
(‘w_01b2250’, 2305),
(‘w_01c2cb0’, 3445),
(‘w_01cbcbf’, 0),
(‘w_01d6ca0’, 2017),
(‘w_01e1223’, 1363),
(‘w_01f211f’, 3861),
(‘w_01f8a43’, 2784),
(‘w_01f9086’, 1208),
(‘w_024358d’, 1146),
(‘w_0245a27’, 409),
(‘w_0265cb6’, 3275),
(‘w_026fdf8’, 3232),
(‘w_028ca0d’, 2720),
(‘w_029013f’, 2620),
(‘w_02a768d’, 2348),
(‘w_02b775b’, 2516),
(‘w_02bb4cf’, 3614),
(‘w_02c2248’, 2409),
(‘w_02c9470’, 2511),
(‘w_02cf46c’, 3037),
(‘w_02d5fad’, 310),
(‘w_02d7dc8’, 1545),
(‘w_02e5407’, 3996),
(‘w_02facde’, 353),
(‘w_02fce90’, 1280),
(‘w_030294d’, 3608),
(‘w_0308405’, 2193),
(‘w_0324b97’, 2156),
(‘w_032d44d’, 4197),
(‘w_0337aa5’, 551),
(‘w_034a3fd’, 3942)]


Any advice would be great.
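Or am I just misreading this? A quick sketch of what I suspect (assuming this is treated as single-label classification, so y holds one integer class index per image rather than a one-hot vector, and data.classes maps the index back to the label name):

x, y = next(iter(data.trn_dl))
decoded = [data.classes[int(i)] for i in y]  # label name for each image in the batch
print(decoded[:5])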

1 Like

You can just use OpenCV to convert the grayscale images to RGB using the function below.
img_rgb = cv2.cvtColor(gray,cv2.COLOR_GRAY2RGB)

3 Likes

Thanks! I'll check it out!!

Hi all, just a couple of quick questions, as I have been fiddling around with this competition as well.

  1. I will need to transform the grayscale images into RGB, and have been trying to do so according to https://stackoverflow.com/a/21709613. Unfortunately, it has not been working. Would anyone have any advice?
  2. Do you recommend doing the grayscale -> RGB conversion before I resize the photos?

Did you find a solution to your problem already? I have the same issue.