Humpback Whale Identification Challenge

muhajir · December 1, 2018, 5:27pm

I have joined the Kaggle Humpback Whale Identification challenge.

I ran everything the same way as the Planet competition explained in Lesson 3. Everything works okay until the training step. The data block and the error are:

data = (
    ImageItemList
    .from_csv(path, 'train.csv', folder='train')
    .random_split_by_pct(0.2)
    .label_from_df(sep=' ')
    .transform(tfms, size=64)
    .databunch()
    .normalize(imagenet_stats)
)

I am getting the following error:

IndexError: arrays used as indices must be of integer (or boolean) type

How can I fix this?

digitalspecialists · December 1, 2018, 6:07pm

The old playground version of this reinvigorated comp comes up regularly, and the same issue may be causing this symptom. The dataset has a great many classes (whales) with only 1 or 2 examples, so it may not split cleanly into train and valid. My guess is that this is what is causing your pain. My recommendation for all competitions (or indeed real-life problems) is to start by looking at and studying the data before going near a CNN.
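
A quick way to check the class distribution before training is to count images per Id. The sketch below uses a toy DataFrame as a stand-in for train.csv; the column names Image and Id match the ones used elsewhere in this thread, and the helper name class_count_summary is made up for illustration.

```python
import pandas as pd

def class_count_summary(df: pd.DataFrame, label_col: str = "Id") -> pd.Series:
    """Return how many classes have exactly k images, indexed by k."""
    per_class = df[label_col].value_counts()      # images per class
    return per_class.value_counts().sort_index()  # classes per image count

# Toy stand-in for train.csv (hypothetical rows, not real competition data)
toy = pd.DataFrame({
    "Image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
    "Id": ["w_1", "w_1", "w_2", "w_3", "new_whale", "new_whale"],
})
summary = class_count_summary(toy)
print(summary)
```

On the real data you would replace the toy frame with pd.read_csv on your own train.csv. A large count in the first row (classes with a single image) is the situation described above.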

salil_23 · December 1, 2018, 6:35pm

I am also getting an error while running learn.fit_one_cycle. How can I resolve it?

AttributeError: 'str' object has no attribute 'name'

bluesky314 · December 1, 2018, 7:19pm

I think this may be saying that your targets (class labels: 0, 1, 2…) should be integers, not floats or some other type.

bluesky314 · December 1, 2018, 7:20pm

Try a different metric, maybe; all the errors seem to be pointing there. Maybe remove it and see. Otherwise, share your notebook.

salil_23 · December 1, 2018, 8:22pm

Hey @bluesky314, I removed the metric from create_cnn and it worked, but then I ran into the same error as the OP. Can you tell me more about how to resolve it?

bluesky314 · December 1, 2018, 8:29pm

Try this: I think the error may be saying that your targets (class labels: 0, 1, 2…) should be integers, not floats or some other type. Make sure this is the case. Also maybe check the line ".label_from_df(sep=' ')".
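
For reference, NumPy raises this message whenever an array with a non-integer dtype is used as an index. The sketch below only reproduces the message in isolation, under the assumption that something similar happens inside the library; it does not show where in fastai the float array comes from.

```python
import numpy as np

values = np.array([10, 20, 30, 40])
float_idx = np.array([0.0, 2.0])   # float dtype, so not valid as an index

try:
    values[float_idx]
    error_message = None
except IndexError as err:
    error_message = str(err)

# Casting to an integer dtype makes the same positions usable as indices
picked = values[float_idx.astype(int)]
print(error_message)
print(picked)
```

So checking the dtype of whatever array ends up being used for indexing is a reasonable first debugging step.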

salil_23 · December 1, 2018, 8:36pm

I haven't used .label_from_df(sep=' ').
I don't understand why we haven't seen this error when Jeremy or others have worked with data that has weird names coming from a csv or from folders. Isn't one-hot encoding the category names the default in fastai?

digitalspecialists · December 1, 2018, 8:45pm

IIRC about 40% of your classes have 1 image only. So your valid dataset or train dataset will be empty for many classes, and it wouldn’t be surprising to me if some odd error sprouts up, given that is not a usual data distribution. There is probably a way to force enumeration of all classes and avoid the error but that wouldn’t do much to help training. For the playground version of the comp, I used heavy data augmentation to create more copies of such images.
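
One way to avoid classes that appear only in the validation set is to build the split per class instead of at random over all rows. This is a minimal sketch under stated assumptions, not something anyone in the thread used: the helper split_keep_rare_in_train and its parameters are hypothetical, and it assumes a DataFrame with an Id column and a default integer index.

```python
import numpy as np
import pandas as pd

def split_keep_rare_in_train(df, label_col="Id", valid_pct=0.2,
                             min_count=2, seed=0):
    """Return a boolean Series: True for rows assigned to validation.

    Classes with fewer than `min_count` images stay entirely in training;
    every other class keeps at least one image in training.
    """
    rng = np.random.RandomState(seed)
    is_valid = pd.Series(False, index=df.index)
    for _, idx in df.groupby(label_col).groups.items():
        n = len(idx)
        if n < min_count:
            continue  # too rare to spare an image for validation
        n_valid = min(n - 1, max(1, int(round(n * valid_pct))))
        chosen = rng.choice(np.asarray(idx), size=n_valid, replace=False)
        is_valid.loc[chosen] = True
    return is_valid

# Toy data: one singleton class, one pair, one class with ten images
toy = pd.DataFrame({"Id": ["w_1"] + ["w_2"] * 2 + ["w_3"] * 10})
toy["Image"] = [f"img{i}.jpg" for i in range(len(toy))]
is_valid = split_keep_rare_in_train(toy)
valid_idx = np.flatnonzero(is_valid.to_numpy())
print(valid_idx)
```

The resulting valid_idx could then be handed to an index-based splitter in your data pipeline (for example split_by_idx in the fastai data block API, if your version provides it). It guarantees that every validation class is also seen in training, but does nothing about how little data the rare classes have.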

larcat · December 4, 2018, 1:10pm

Do you guys have any advice, even vague, on how you're constructing submissions for the MAP@5 format?

I finally got a model trained last night and want to see if I’m beating “new_whale”
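
For checking locally against the all-"new_whale" baseline, MAP@5 with a single true label per image reduces to 1/rank of the first correct guess among the top five (0 if it is absent). The function below is a sketch of that reduction; the name map_at_5 and the toy labels are made up for illustration.

```python
def map_at_5(true_labels, predictions):
    """Mean average precision at 5, assuming one true label per image."""
    total = 0.0
    for truth, preds in zip(true_labels, predictions):
        for rank, guess in enumerate(preds[:5], start=1):
            if guess == truth:
                total += 1.0 / rank   # credit shrinks with the rank
                break
    return total / len(true_labels)

truths = ["w_1", "new_whale", "w_3"]
model_preds = [["w_1", "w_2"], ["w_2", "new_whale"], ["w_9", "w_8"]]
baseline_preds = [["new_whale"]] * len(truths)

model_score = map_at_5(truths, model_preds)        # (1 + 1/2 + 0) / 3
baseline_score = map_at_5(truths, baseline_preds)  # (0 + 1 + 0) / 3
print(model_score, baseline_score)
```

Run on a held-out part of the training data, the baseline score is just the fraction of held-out images labelled new_whale, which gives a number to beat before submitting.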

amitkayal · December 4, 2018, 1:53pm

Excellent point. But how can we handle such scenarios and treat the minority classes? Should we augment them to generate more images and then proceed to the train/validation split?

salil_23 · December 4, 2018, 3:31pm

Hi @larcat, could you share how you worked around the problem we were facing, or basically how you got your model trained?

larcat · December 4, 2018, 3:59pm

Sure. I’d like to strongly reiterate that I suspect my current approach is not good, but for what it is worth…

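import re
import pandas as pd
from fastai.vision import *  # fastai v1: get_transforms, open_image

df = pd.read_csv('whales/train.csv', engine='python')
# Number of images available for each whale Id
df['n'] = df.groupby('Id')['Id'].transform('count')
df_low_cnt = df[df.n < 5]
stem = 'whales/train/'
tfms = get_transforms()
df_test = df.copy()
new_imgs = {'Image': [], 'Id': []}
# For every image of a rare whale, write 5 - n augmented copies to disk
for index, row in df_low_cnt.iterrows():
    tsfm_num = 5 - row['n']
    cur_img = open_image(stem + row['Image'])
    for num in range(tsfm_num):
        # Strip only the trailing ".jpg" (the dot is escaped on purpose)
        img_name = re.sub(r"\.jpg$", "", row['Image']) + "_" + str(num) + ".jpg"
        new_imgs['Image'].append(img_name)
        new_imgs['Id'].append(row['Id'])
        # tfms[0] holds the training-time transforms
        out_img = cur_img.apply_tfms(tfms[0], size=224)
        out_img.save(stem + img_name)

new_imgs_df = pd.DataFrame(new_imgs)
# pd.concat instead of DataFrame.append, which newer pandas versions removed
df_test_new = pd.concat([df_test, new_imgs_df], ignore_index=True)
# Drop the catch-all "new_whale" class before writing the new label file
df_test_new = df_test_new[(df_test_new.Id != 'new_whale')]
df_test_new_out = df_test_new[['Image', 'Id']]
df_test_new_out.to_csv("./whales/train_munged.csv", index=False)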

And then use the new .csv for loading. If you need to delete the new images, just grep for '_'; none of the original images have that character in their names.
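
The cleanup can also be done from Python. This sketch relies on the same assumption stated above, that only the generated copies contain an underscore; the helper name find_augmented is hypothetical, and the example runs in a temporary folder so it does not touch real data.

```python
from pathlib import Path
import tempfile

def find_augmented(folder):
    """Return .jpg files whose stem contains '_', i.e. the generated copies."""
    return sorted(p for p in Path(folder).glob("*.jpg") if "_" in p.stem)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for one original image and two generated copies
    for name in ["0000e88ab.jpg", "0000e88ab_0.jpg", "0000e88ab_1.jpg"]:
        (Path(tmp) / name).touch()

    generated = find_augmented(tmp)
    generated_names = [p.name for p in generated]
    for p in generated:
        p.unlink()  # delete only the generated copies
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(generated_names, remaining)
```

Printing the list returned by find_augmented before unlinking anything is a cheap safety check when pointing it at the real train folder.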

larcat · December 5, 2018, 2:56pm

Anyone have a quick code snip for putting a submission together given a model?

Thanks.
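
A minimal sketch of the formatting step, assuming you already have a matrix of class probabilities for the test images (for example from your learner's prediction call) and the list of class names in the same column order. The helper build_submission and the toy values are hypothetical; the output layout, an Image column plus an Id column holding up to five space-separated labels, follows the competition's sample submission.

```python
import numpy as np
import pandas as pd

def build_submission(image_names, probs, classes, top_k=5):
    """Format the top_k most probable classes per image as Image/Id rows."""
    probs = np.asarray(probs)
    # Column indices of the top_k largest probabilities, best first
    top_idx = np.argsort(-probs, axis=1)[:, :top_k]
    ids = [" ".join(classes[j] for j in row) for row in top_idx]
    return pd.DataFrame({"Image": image_names, "Id": ids})

classes = ["new_whale", "w_1", "w_2"]
probs = [[0.1, 0.7, 0.2],
         [0.6, 0.1, 0.3]]
sub = build_submission(["a.jpg", "b.jpg"], probs, classes, top_k=2)
print(sub)
# sub.to_csv("submission.csv", index=False) would write the file
```

If new_whale was removed from the training labels, the model cannot predict it, so it would have to be inserted by a separate rule (for instance a probability threshold); that part is not shown here.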

Bhuvana_ka · December 7, 2018, 11:17am

Hi,

I have joined the Kaggle Humpback Whale Identification challenge.

Here is the link to my github repo - https://github.com/bhuvanakundumani/humb_back_whale_identification.git

I have
data = ImageDataBunch.from_csv(path, csv_labels='train.csv', ds_tfms=tfms, size=24)

I am getting an error when I run
data.show_batch(rows=3, figsize=(6,6))

FileNotFoundError Traceback (most recent call last)
in
----> 1 data.show_batch(rows=3, figsize=(6,6))

/opt/anaconda3/lib/python3.6/site-packages/fastai/basic_data.py in show_batch(self, rows, ds_type, **kwargs)
    121 if rows is None: rows = int(math.sqrt(len(b_idx)))
    122 ds = dl.dataset
--> 123 ds[0][0].show_batch(b_idx, rows, ds, **kwargs)
    124
    125 def alt_show_batch(data, rows:int=None, ds_type:DatasetType=DatasetType.Train, **kwargs)->None:

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
    413 def __getitem__(self,idxs:Union[int,np.ndarray])->'LabelList':
    414 if isinstance(try_int(idxs), int):
--> 415 if self.item is None: x,y = self.x[idxs],self.y[idxs]
    416 else: x,y = self.item ,0
    417 if self.tfms:

/opt/anaconda3/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
     80
     81 def __getitem__(self,idxs:int)->Any:
---> 82 if isinstance(try_int(idxs), int): return self.get(idxs)
     83 else: return self.new(self.items[idxs], xtra=index_row(self.xtra, idxs))
     84

/opt/anaconda3/lib/python3.6/site-packages/fastai/vision/data.py in get(self, i)
    288 def get(self, i):
    289 fn = super().get(i)
--> 290 res = self.open(fn)
    291 self.sizes[i] = res.size
    292 return res

/opt/anaconda3/lib/python3.6/site-packages/fastai/vision/data.py in open(self, fn)
    284 self.sizes={}
    285
--> 286 def open(self, fn): return open_image(fn)
    287
    288 def get(self, i):

/opt/anaconda3/lib/python3.6/site-packages/fastai/vision/image.py in open_image(fn, div, convert_mode, cls)
    440 "Return Image object created from image in file fn."
    441 #fn = getattr(fn, 'path', fn)
--> 442 x = PIL.Image.open(fn).convert(convert_mode)
    443 x = pil2tensor(x,np.float32)
    444 if div: x.div_(255)

/opt/anaconda3/lib/python3.6/site-packages/PIL/Image.py in open(fp, mode)
   2607
   2608 if filename:
-> 2609 fp = builtins.open(filename, "rb")
   2610 exclusive_fp = True
   2611

FileNotFoundError: [Errno 2] No such file or directory: 'data_whale/./0000e88ab.jpg'

How do I fix this?

larcat · December 7, 2018, 1:02pm

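Add folder='train' to your ImageDataBunch.from_csv line. See the path in the error:
'data_whale/./0000e88ab.jpg'
The images are being looked up directly under data_whale/ instead of under data_whale/train/.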
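
To see why the folder matters, here is a small sketch that mimics how the failing path seems to be assembled: root, then folder, then file name. It assumes the images sit in a train subfolder and that the folder argument falls back to "." when omitted (inferred from the "./" in the error, not checked against the library source). The helper image_path is hypothetical.

```python
import os
import tempfile

def image_path(root, folder, name):
    """Join root, folder and file name, as the path in the traceback suggests."""
    return os.path.join(root, folder, name)

with tempfile.TemporaryDirectory() as root:
    # Recreate the layout: <root>/train/0000e88ab.jpg
    os.makedirs(os.path.join(root, "train"))
    open(os.path.join(root, "train", "0000e88ab.jpg"), "wb").close()

    default_path = image_path(root, ".", "0000e88ab.jpg")
    train_path = image_path(root, "train", "0000e88ab.jpg")
    found_default = os.path.exists(default_path)  # looks in <root>/ itself
    found_train = os.path.exists(train_path)      # looks in <root>/train/

print(found_default, found_train)
```

A quick os.path.exists check on one joined path like this, before building the data object, tends to catch folder mix-ups early.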

Bhuvana_ka · December 7, 2018, 3:07pm

Thanks.

amitkayal · December 8, 2018, 1:49pm

Can anyone please help me understand what the attributes of the Learner class are? I want to override the model's optimizer and can't find out how to do this…

Thanks
Amit

amitkayal · December 8, 2018, 2:30pm

Has anyone been able to achieve good accuracy? I am stuck with 45% accuracy with resnet50…

Thanks
Amit

larcat · December 8, 2018, 3:31pm

Are you dropping new_whale or keeping it? Etc., etc.

This is a tough one to compare model accuracy because of the nature of the data and the varied ways people will approach the problem.