Lesson 3 Advanced Discussion ✅

maxmatical · November 10, 2018, 11:34pm

@jeremy is there a reason we used moms=(0.8,0.7) for the LM and text classifier learners? Did you arrive at that value through trial and error, or is there a general rule of thumb?

Also, reading through the fastai docs, the learners seem to use Adam with fixed weight decay. Is this now the preferred method over SGD with Nesterov momentum? What process do you use to determine the best weight decay values (it seems you only use custom values for the last model fitting stage)?

miwojc · November 11, 2018, 12:20am

I update fastai frequently. The issue I had was a week ago, and I am not sure which version it was at that time…

jeremy · November 11, 2018, 2:09am

Sounds like there are classes in valid that aren’t in train. Try creating the full list of classes first and pass them explicitly.
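
A minimal sketch of the check described above, using only the standard library: build the full class list from every split first, then look for labels that appear in valid but never in train. The (filename, label) pairs and the helper names are hypothetical and are not part of fastai.

```python
# Hypothetical (filename, label) pairs standing in for a real dataset.
train_items = [("a.jpg", "w_1"), ("b.jpg", "w_2"), ("c.jpg", "new_whale")]
valid_items = [("d.jpg", "w_2"), ("e.jpg", "w_3")]


def all_classes(*splits):
    """Return the sorted union of labels across every split."""
    return sorted({label for split in splits for _, label in split})


def missing_from_train(train, valid):
    """Return labels that occur in valid but never in train."""
    train_labels = {label for _, label in train}
    return sorted({label for _, label in valid} - train_labels)


classes = all_classes(train_items, valid_items)
# One shared label -> index mapping, built once and used for both splits.
class_to_idx = {name: i for i, name in enumerate(classes)}
unseen = missing_from_train(train_items, valid_items)

print(classes)
print(unseen)
```

A non-empty `unseen` list is the situation that would trigger a KeyError when the label mapping is built from the training split alone.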

jeremy · November 11, 2018, 2:10am

They experimentally seem to work well for RNNs - we’re planning to make them the default.
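
For readers unfamiliar with the moms pair: a rough sketch of how two momentum values are typically used in a one-cycle schedule, assuming momentum moves from the first value to the second while the learning rate warms up and then back again. The cosine shape, the 30% warm-up fraction, and the function names are assumptions for illustration, not a copy of the library code.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_momentum(step, total_steps, moms=(0.8, 0.7), warmup_frac=0.3):
    """Momentum at a step: moms[0] -> moms[1] during warm-up, then back to moms[0]."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return cos_anneal(moms[0], moms[1], step / warmup_steps)
    rest = total_steps - warmup_steps
    return cos_anneal(moms[1], moms[0], (step - warmup_steps) / rest)


# Print a few points of a hypothetical 100-step cycle.
for s in (0, 15, 30, 65, 100):
    print(s, round(one_cycle_momentum(s, 100), 4))
```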

Yup.

https://www.fast.ai/2018/07/02/adam-weight-decay/
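
The linked post is about decoupled weight decay (often called AdamW) as opposed to adding an L2 penalty to the gradient. The sketch below is a scalar, single-weight illustration of that difference under standard Adam update rules; the function name, defaults, and `decoupled` flag are made up for this example and do not mirror any library signature.

```python
import math


def adam_step(w, grad, state, lr=1e-3, wd=1e-2,
              betas=(0.9, 0.999), eps=1e-8, decoupled=True):
    """One Adam update on a scalar weight; state is the tuple (m, v, t)."""
    m, v, t = state
    if not decoupled:
        # L2 regularization: the penalty is folded into the gradient,
        # so it is rescaled by Adam's adaptive denominator.
        grad = grad + wd * w
    t += 1
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad * grad
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: shrink the weight directly, outside the adaptive step.
        step += lr * wd * w
    return w - step, (m, v, t)


# With a zero loss gradient, only the decay term moves the weight.
w_decoupled, _ = adam_step(1.0, 0.0, (0.0, 0.0, 0), decoupled=True)
w_l2, _ = adam_step(1.0, 0.0, (0.0, 0.0, 0), decoupled=False)
print(w_decoupled, w_l2)
```

In this toy case the decoupled variant shrinks the weight by lr * wd, while the L2 variant moves it by roughly a full lr step because the adaptive scaling normalizes the penalty.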

sam2 · November 11, 2018, 9:42am

@jeremy,
All the data is as delivered by Kaggle.
The train.csv is as delivered by Kaggle (whale-categorization-playground).
It has the image filename (from train) plus the class name.
The train folder (again as delivered) contains all the images that are listed in train.csv.
There is a test folder whose images are not listed in the csv file, for obvious reasons (train.csv provides labels and test should not have any labels). My expectation was that random_split_by_pct(0.2) would split the images in train into train and valid groups.
(I can imagine that this kind of data representation is common to many Kaggle competitions.)

Interestingly, the KeyError points to a class name (KeyError: 'w_e15442c'), and the train directory contains images with names such as 00022e1a.jpg, which definitely exist.

digitalspecialists · November 11, 2018, 1:09pm

IMHO this isn’t a fabulous dataset for fastai CNNs unless the goal is to understand their limitations. IIRC about half the classes in train have only 1 or 2 images, and half the images in test aren’t even in train. So for those classes you’d either not be able to have the class in both val and train, or have just a single image in each. Again IIRC, Siamese/triplet networks worked best.

sam2 · November 11, 2018, 1:37pm

@digitalspecialists You could be so right, RobG! Nonetheless, what bugs me is that I could not even build a model!

sam2 · November 11, 2018, 4:34pm

Hello all,
Reaching out for help.
Can someone tell me what the equivalent of the following v0.7 call is?

md = ImageClassifierData.from_csv(path, 'train', path/'train.csv', tfms=tfms, bs=bs, test_name='test')

Clearly, data = ImageDataBunch.from_csv(path, folder=path/'train', csv_labels='labels.csv', ds_tfms=get_transforms(), size=128) fails.

swagman · November 11, 2018, 5:15pm

How do I wrap multichannel images (more than 3 channels), such as brain scans, in SegmentationDataset, and how do I do transfer learning on them? I was thinking of freezing the full body except the first conv layer. How would I do that in fast.ai?

sam2 · November 11, 2018, 6:02pm

Never mind, there is some issue with the train.csv file from Kaggle. It’s probably some whitespace issue that is hard to catch visually.

sam2 · November 11, 2018, 6:06pm

@jeremy, thank you for your time. I think the train.csv (the label file) from Kaggle has some issues. Using pdb.set_trace(), I see that the number of entries in self.classes is smaller than the number of unique label names in the csv file. I think there are some whitespace issues that are not easy to catch.
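
One way to test the whitespace hypothesis without relying on the eye: a small standard-library sketch that reports labels with leading or trailing whitespace and compares the unique-label count before and after stripping. The CSV content and helper names are hypothetical; only the Image and Id column names come from the thread.

```python
import csv
import io

# Hypothetical CSV content; the second label carries a trailing space.
raw = "Image,Id\n0001.jpg,w_1\n0002.jpg,w_1 \n0003.jpg,w_2\n"


def find_padded_labels(text, label_col="Id"):
    """Return (row_number, repr(label)) for labels with stray whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    return [(i, repr(row[label_col]))
            for i, row in enumerate(reader, start=2)  # row 1 is the header
            if row[label_col] != row[label_col].strip()]


def unique_labels(text, label_col="Id", strip=False):
    """Return the set of labels, optionally stripped of whitespace."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[label_col].strip() if strip else row[label_col] for row in reader}


print(find_padded_labels(raw))
print(len(unique_labels(raw)), len(unique_labels(raw, strip=True)))
```

If the two counts match and no padded labels are reported, whitespace is probably not the cause of the class-count mismatch.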

miwojc · November 11, 2018, 6:13pm

As I understand it, Jeremy is saying that for the DataBunch to work, the train and valid sets need to have the same classes; if that is not the case, you need to pass the full list of classes explicitly.

I wonder if the library could automate that check and the passing of all classes?

gbecon · November 11, 2018, 7:04pm

I tried to experiment with a single-label classifier as in Lesson 1, but I wanted to increase the image size at a later stage, so I had to keep the validation split as a separate part of the code.

What would be the appropriate way to indicate that labels should be read from folders?

I used the following lambda function, which worked, but it seems a bit hacky:

get_y_fn = lambda o: o.parent.name

src = (ImageFileList.from_folder(path)
   .label_from_func(get_y_fn)
   .random_split_by_pct(0.2))
data = (src.datasets()
    .transform(get_transforms(), size=size)
    .databunch(bs=bs)
    .normalize(imagenet_stats))
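
For reference, a standalone sketch of what a labelling function like get_y_fn does, written as a named function with pathlib. The example paths and the function name are hypothetical and assume a <root>/<class>/<file> folder layout.

```python
from pathlib import Path


def label_from_parent(path):
    """Use the name of the containing folder as the label."""
    return Path(path).parent.name


# Hypothetical paths following a <root>/<class>/<file> layout.
paths = [Path("data/pets/beagle/001.jpg"), Path("data/pets/pug/002.jpg")]
labels = [label_from_parent(p) for p in paths]
print(labels)
```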

sam2 · November 11, 2018, 7:53pm

@miwojc, I thought there was something wrong with the train.csv file (maybe some whitespace issue), but pandas opens it nicely in a dataframe.
Now the question: we are relying on random_split_by_pct to split between train and valid. So how can one specify classes explicitly for the train and valid splits?
Should I instead create separate train and valid directories, move 20% of the images from train to valid, and modify train.csv to prefix each image name with the directory name 'train/' or 'valid/'? I could do that. I am sure that the train directory contains some classes (labels) with just one image. If that image goes into the valid directory, then that label will be only in valid and not in train. I think that is what digitalspecialists (RobG) also remembers of the dataset. So, an interesting challenge. The #1 on the LB also mentions Siamese/triplet networks. I started this exercise without reading the prior discussion on Kaggle, and even if I had, I would not have guessed that fastai v3 would not address it. It’s been a learning experience!

digitalspecialists · November 11, 2018, 8:01pm

Yes, as I said, approx 45% of test are ‘unknown’. For others there is only 1 example in the training data, so you can’t split val/train. And the image region of interest is long and thin (whale fluke), so not a great candidate for classification CNNs out of the box. I think I ended up with about 4% accuracy for the remainder.

But on the wider point, constructing good validation sets is critical and there is no better description than at https://www.fast.ai/2017/11/13/validation-sets/

If your datasets don’t work well with a naive percentage sampling, you can use lower level data constructors to specify which are train and validation images.
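
As an illustration of choosing the split yourself rather than sampling a flat percentage, here is a standard-library sketch that splits per class and always keeps at least one image of every class in train, so single-image classes never end up only in validation. The function name, the per-class rounding rule, and the sample data are assumptions for this example, not a fastai API.

```python
import random
from collections import defaultdict


def split_keep_singletons(items, valid_pct=0.2, seed=42):
    """Split (filename, label) pairs so every validation label also occurs in train."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for fname, label in items:
        by_label[label].append(fname)

    train, valid = [], []
    for label, fnames in sorted(by_label.items()):
        fnames = sorted(fnames)
        rng.shuffle(fnames)
        # Never take the last remaining image of a class out of train.
        n_valid = min(round(len(fnames) * valid_pct), len(fnames) - 1)
        valid += [(f, label) for f in fnames[:n_valid]]
        train += [(f, label) for f in fnames[n_valid:]]
    return train, valid


# Hypothetical data: one class with 10 images, one with a single image.
items = [(f"a{i}.jpg", "big") for i in range(10)] + [("s.jpg", "single")]
train, valid = split_keep_singletons(items)
print(len(train), len(valid))
```

With per-class rounding, very small classes contribute nothing to validation, so the validation set under-represents rare classes; that trade-off should be kept in mind when reading the metrics.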

sam2 · November 11, 2018, 9:11pm

@digitalspecialists RobG, what I have managed so far is to move 1970 random images (out of 9850, i.e. approx. 20%) from the train directory to the valid directory. I have built a labels.csv file that has:

Image	Id
data/whales/train/007c3603.jpg	new_whale
data/whales/train/00863b8c.jpg	new_whale
data/whales/valid/92e33de7.jpg	w_dbda0d6
data/whales/valid/a2dbf46d.jpg	w_593485f
data/whales/valid/3173ae78.jpg	w_11adaae

So now I have data approximately as in the mnist example (https://github.com/fastai/fastai/blob/master/examples/vision.ipynb)

My expectation is that a databunch should get created when I run:

data = ImageDataBunch.from_folder(path, ds_tfms=([]), bs=64)

It does not, and the error is KeyError: 'valid'

I presume that this is because the ImageDataBunch.from_folder code expects same classes to exist in train and valid folders.

How do I pass a full list of classes explicitly (as advised by Jeremy)? I have that ready as well (a list of 4251 classes). I know we did that for camvid.

You can see that I am doing things by looking at patterns (how we did something somewhere else), but I was hoping that docs.fast.ai would be more explanatory in that regard.

sam2 · November 11, 2018, 9:59pm

Hello all (@miwojc, @digitalspecialists)
Thank you for your patience !

Finally my databunch is ready and a learner based on resnet34 is finding the lr.

Here is what I had to do.

1. Move 1970 random images (out of 9850, i.e. approx. 20%) from the train directory to the valid directory.
2. Build a labels.csv with two columns containing path+image_name and class name.
3. Create a codes file containing the unique class names from labels.csv.
4. Build the databunch using:

codes = np.loadtxt(path/'codes.txt', dtype=str)
df_fn_labels = pd.read_csv(path/'labels.csv', index_col=None)
fnames = list(df_fn_labels['Image'])
labels = list(df_fn_labels['Id'])
data = (ImageFileList.from_folder(path)
    .label_from_df(df_fn_labels, fn_col=0, label_col=1)
    .split_by_folder()
    .datasets(ImageClassificationDataset, fns=fnames, labels=labels, classes=codes)
    .transform(get_transforms(), size=128)
    .databunch()
    .normalize(imagenet_stats))

To miwojc’s point, there is an opportunity for the library to automate this, and maybe even clean things up (the same df, df_fn_labels, is feeding 2-3 objects).

I am sure the performance of the learner will be bad because some single-image classes are in the validation dataset but not seen in the training dataset. But at least it’s a start.
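
The preparation steps listed above (pick about 20% of the images for valid, write a two-column labels file, and collect the unique class names) can be sketched with the standard library alone. The mapping below is hypothetical sample data, the helper name is made up, and the actual moving of files between directories is left out.

```python
import csv
import io
import random

# Hypothetical filename -> label mapping standing in for Kaggle's train.csv.
labels_by_file = {f"{i:04d}.jpg": ("new_whale" if i % 2 else f"w_{i % 3}")
                  for i in range(10)}


def build_labels(labels_by_file, valid_pct=0.2, seed=0):
    """Assign about valid_pct of the files to 'valid'; return (rows, codes)."""
    fnames = sorted(labels_by_file)
    n_valid = int(len(fnames) * valid_pct)
    valid = set(random.Random(seed).sample(fnames, n_valid))
    rows = []
    for f in fnames:
        folder = "valid" if f in valid else "train"
        rows.append((f"{folder}/{f}", labels_by_file[f]))
    # Full class list across both folders, independent of the split.
    codes = sorted(set(labels_by_file.values()))
    return rows, codes


rows, codes = build_labels(labels_by_file)

# Write the labels table to an in-memory buffer instead of a real file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Image", "Id"])
writer.writerows(rows)
print(buf.getvalue())
print(codes)
```

Because `codes` is computed from every label rather than from the training rows only, classes that land solely in valid still receive an index.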

radikubwa · November 12, 2018, 12:25am

To my knowledge, ELU is mostly used in models for self-driving cars.

kuil · November 12, 2018, 1:42pm

Am I right that in the line
learn = Learner.create_unet(data, models.resnet34, metrics=metrics)
Jeremy creates a U-Net which will consist of 34 resnet layers as the encoder and another 34 resnet layers as the decoder?

Kaspar · November 12, 2018, 1:54pm

Yes, that’s it.