@jeremy is there a reason we used moms=(0.8,0.7) for the LM and text classifier learners? Did you arrive at that value through trial and error, or is there a general rule of thumb?
Also, reading through the fastai docs, the learners seem to use Adam with fixed weight decay. Is this now the preferred method over SGD with Nesterov momentum? What’s the process you use to determine the best weight decay values (it seems you only use custom values for the last model fitting stage)?
@jeremy,
All the data is as delivered by Kaggle.
The train.csv is as delivered by Kaggle (whale-categorization-playground).
It has the image filename (from train) + class name.
The train folder (again as delivered) contains all the images that are listed in train.csv.
There is a test folder with images that are not listed in the csv file, for obvious reasons (train.csv provides labels and test should not have any labels). My expectation was that random_split_by_pct(0.2) would split the images in train into train and valid groups.
(I can imagine that this kind of data representation is common to many Kaggle competitions.)
Interestingly, the KeyError points to a class name (KeyError: 'w_e15442c'), while the train directory contains images with names like 00022e1a.jpg, which definitely exist.
IMHO this isn’t a fabulous dataset for fastai CNNs unless the goal is to understand their limitations. IIRC about half the classes in train only have 1 or 2 images, and half the images in test aren’t even in train. So you’d either not be able to have a class in both val and train, or have just a single image in each. Again IIRC, Siamese/triplet networks worked best.
How do I wrap multichannel images (more than 3 channels), like brain scans, in SegmentationDataset, and how do I do transfer learning on them? I was thinking of freezing the whole body except the first conv layer. How would I do that in fast.ai?
@jeremy, thank you for your time. I think the train.csv (the label file) from Kaggle has some issues. From pdb.set_trace() I see that the number of self.classes is smaller than the number of unique label names in the csv file. I think there are some white space issues that are not easy to catch.
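One quick way to test the white space theory is to group the raw labels by their stripped form and see whether any class shows up under more than one spelling. This is only a rough sketch in plain Python, not anything from the fastai library: the function name `find_whitespace_variants` is hypothetical and the labels below (apart from `w_e15442c`, which appears in the error above) are made up for illustration.

```python
from collections import defaultdict

def find_whitespace_variants(labels):
    """Group raw labels by their stripped form.

    Returns only the groups that have more than one raw spelling,
    i.e. labels that differ solely by leading/trailing white space.
    """
    groups = defaultdict(set)
    for raw in labels:
        groups[raw.strip()].add(raw)
    return {clean: sorted(raws) for clean, raws in groups.items() if len(raws) > 1}

# Made-up labels for illustration (not taken from the real train.csv)
labels = ["w_e15442c", "w_e15442c ", "new_whale", " new_whale", "w_1287fbc"]
variants = find_whitespace_variants(labels)
print(variants)
```

If the result is empty for the real label column, white space is probably not the cause of the class count mismatch.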
As I understand what Jeremy is saying, for the DataBunch to work the train and valid sets need to have the same classes; if that is not the case, one needs to specify and pass all classes directly.
I wonder if the library could automate that check and the passing of all classes?
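Until something like that exists in the library, the check itself is easy to do by hand. Here is a minimal sketch in plain Python, assuming you already have the train and valid labels as two lists; `check_split_classes` is a hypothetical helper (not a fastai function) and the labels are made up.

```python
def check_split_classes(train_labels, valid_labels):
    """Return (all_classes, valid_only).

    all_classes: sorted union of the labels from both sets, i.e. the full
                 class list one would pass explicitly.
    valid_only:  sorted labels that appear in valid but never in train.
    """
    train_set, valid_set = set(train_labels), set(valid_labels)
    all_classes = sorted(train_set | valid_set)
    valid_only = sorted(valid_set - train_set)
    return all_classes, valid_only

# Made-up labels for illustration
train_labels = ["w_a", "w_b", "w_a", "w_c"]
valid_labels = ["w_a", "w_d"]
all_classes, valid_only = check_split_classes(train_labels, valid_labels)
print(all_classes)  # full class list
print(valid_only)   # classes the model would never see during training
```

A non-empty `valid_only` means the two sets do not share the same classes, which is the situation described above.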
I tried to experiment with a single-label classifier as in Lesson 1, but I wanted to increase the image size at a later stage, so I had to keep the validation split as a separate part of the code.
What would be the appropriate way to indicate that labels should be read from folders?
I used the following lambda function, which worked, but it seems a bit hacky:
@miwojc, I thought there was something wrong with the train.csv file (maybe some white space issue), but pandas opens it nicely in a dataframe.
Now the question is: we are relying on random_split_by_pct to split between train and valid. So how can one specify classes explicitly for the train and valid split?
Should I instead create separate train and valid directories, move 20% of the images from train to valid, and modify train.csv to prefix each image name with the directory name ‘train/’ or ‘valid/’? I could do that. I am sure that the train directory contains some classes (labels) with just one image. If that image goes into the valid directory, then that label will only be in valid and not in train. I think that is what digitalspecialists (RobG) also remembers of the dataset. So it's an interesting challenge. The #1 on the LB also mentions Siamese/triplet networks. I started this exercise w/o reading the prior discussion on Kaggle, and even if I had, I would not have guessed that fastai v3 would not address it. It’s been a learning experience!
Yes, as I said, approx 45% of test are ‘unknown’. For others there is only 1 example in the training data, so you can’t split val/train. And the image region of interest is long and thin (whale fluke), so it's not a great candidate for classification CNNs out of the box. I think I ended up with about 4% accuracy for the remainder.
If your datasets don’t work well with naive percentage sampling, you can use lower-level data constructors to specify which are train and validation images.
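As an illustration of choosing the validation images yourself, here is a rough sketch in plain Python that picks validation indices per class and never empties a class out of train, so single-image classes stay in train. It is not fastai code: `split_indices` is a hypothetical helper, the labels are made up, and how you hand the resulting indices to a data constructor depends on the library version you have.

```python
import random
from collections import defaultdict

def split_indices(labels, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) as sorted lists of positions in `labels`.

    Each class sends roughly valid_pct of its images to valid, but always
    keeps at least one image in train; single-image classes stay in train.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, valid_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Cap the count so the class never disappears from train
        n_valid = min(round(len(idxs) * valid_pct), len(idxs) - 1)
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Made-up labels: one large class, one small class, one single-image class
labels = ["a"] * 10 + ["b"] * 5 + ["c"]
train_idx, valid_idx = split_indices(labels)
print(len(train_idx), len(valid_idx))
```

Note that with many tiny classes the validation set ends up smaller than valid_pct of the data, because classes with only 1 or 2 images contribute nothing to it.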
@digitalspecialists RobG, what I have managed so far is to move 1970 random images (out of 9850, i.e. approx. 20%) from the train directory to the valid directory. I have built a labels.csv file that has:
My expectation is that a databunch should get created when I run:
data = ImageDataBunch.from_folder(path, ds_tfms=([]), bs=64)
It does not, and the error is KeyError: 'valid'
I presume that this is because the ImageDataBunch.from_folder code expects the same classes to exist in the train and valid folders.
How do I pass a full list of classes explicitly (as advised by Jeremy)? I have that ready (a list of 4251 classes) as well. I know we did that for camvid.
You can see that I am doing things by looking at patterns (how we did something somewhere else), but I was hoping that docs.fast.ai would be more explanatory in that regard.
To miwojc’s point, there is an opportunity for the library to automate this. Maybe even clean things up (the same df, called df_fn_labels, is feeding 2-3 objects).
I am sure the performance of the learner will be bad because the single-image classes in the validation dataset are never seen in the training dataset. But at least it’s a start.
Am I right that in the line
learn = Learner.create_unet(data, models.resnet34, metrics=metrics)
Jeremy makes a U-Net network, which will consist of 34 resnet layers as the encoder and another 34 resnet layers as the decoder?