Error_rate not available for cnn_learner from csv

Hi there,

I’m following lesson 1 and working on the lesson-2 notebook. I successfully trained a model on my own image dataset, and everything seems to be working fine. After loading a cleaned dataset from a csv:

db = (ImageList.from_csv(data_path, 'cleaned.csv', folder='..')
        .split_none()
        .label_from_df()
        .transform(get_transforms(), size=224)
        .databunch())

learn_cln = cnn_learner(db, models.resnet34, metrics=error_rate)

learn_cln.load('stage-2');

I try to train the model again to see whether the error_rate improves now that unwanted images have been filtered out of the dataset:

learn_cln.fit_one_cycle(4, max_lr=slice(1e-5,1e-2))

This is what I get in return: a per-epoch table without the error_rate column, as opposed to the output that does include error_rate when I train the model from the folder directly instead of from a csv.

Please note that both models/stage-2.pth and cleaned.csv exist. I can also use learn_cln to predict a category:

img = open_image(cwd + '_test-predictions/pred-5.jpg')
img
pred_class,pred_idx,outputs = learn_cln.predict(img)
pred_class

This outputs a correct label.

I also tried removing the learning-rate limits, without any change. If necessary, I can post my file/folder structure. I am running the notebook on AWS via SSH. Is anybody able to help me with this? Thanks a lot in advance!

Lukas

That .split_none() is the issue: your model doesn’t have a validation set to run any metrics on.
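Under the hood, error_rate simply compares predicted classes against true labels, and during training fastai feeds it the validation set after each epoch. A tiny standalone sketch to illustrate (the tensors are made up, nothing from your dataset):

import torch
from fastai.metrics import error_rate

# error_rate compares the argmax of the predictions against the labels;
# with split_none() there is no validation data to feed it during training.
preds = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # raw model outputs
targs = torch.tensor([0, 1, 1])                             # true labels
print(error_rate(preds, targs))                             # tensor(0.3333)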

Thanks a lot @bwarner! That makes sense! I’m not quite sure what we achieve with split_none here, though. I understand that this workflow is used to clean out incorrect images and then save the resulting list to a csv file. But wouldn’t it be good to see how the “cleaned” dataset performs in comparison to the original dataset?

I tried removing .split_none(), and now I get the error message:

Exception: Your data isn’t split, if you don’t want a validation set, please use split_none.

Which also makes sense. I dug around in the docs a bit and found the parameter valid_pct. Trying to create a databunch with

db = (ImageList.from_csv(data_path, 'cleaned.csv', folder='..', valid_pct=0.2)
        .label_from_df()
        .transform(get_transforms(), size=224)
        .databunch())

as seen here. I now get the error:

TypeError: __init__() got an unexpected keyword argument 'valid_pct'

I assume this would split the databunch into training and validation set based on the percentage passed.

Again, thanks a lot for the help :slight_smile:

I’m not entirely sure what you mean. The validation set is a representative sample of your data so you can monitor how well the model is performing.

Rachel Thomas has a good article on How (and why) to create a good validation set, which I recommend reading. An excerpt:

The underlying idea is that:

  • the training set is used to train a given model
  • the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
  • the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
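To make those three roles concrete, here is a minimal sketch in plain Python (the items and the 10%/20% proportions are made up for illustration, and a random split is only appropriate when your data has no time or group structure):

import random

items = list(range(1000))  # stand-in for your labelled examples
random.seed(42)
random.shuffle(items)

n_test, n_valid = 100, 200                  # 10% test, 20% validation
test_set = items[:n_test]                   # touched only at the very end
valid_set = items[n_test:n_test + n_valid]  # used to compare models/settings
train_set = items[n_test + n_valid:]        # used to fit the model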

The documentation you linked to is for ImageDataBunch, while you are using the data block API, so you will want to use one of its split methods.
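Concretely, in the data block API valid_pct is an argument of the split step, not of from_csv. A sketch with your chain (the seed is my addition, just to make the split reproducible):

# valid_pct goes to the split method, not to ImageList.from_csv:
db = (ImageList.from_csv(data_path, 'cleaned.csv', folder='..')
        .split_by_rand_pct(valid_pct=0.2, seed=42)  # random 20% validation split
        .label_from_df()
        .transform(get_transforms(), size=224)
        .databunch())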

@bwarner Thanks a lot for your detailed response again!

I knew about the necessity of a train and a validation set, but I’ll make sure to read through the article. I thought passing valid_pct=0.2 to ImageList.from_csv() would split the dataset into an 80% train set and a 20% validation set. The hint that I’m actually dealing with the data block API and not ImageDataBunch solved the issue :slight_smile:

db = (ImageList.from_csv(data_path, 'cleaned.csv', folder='..')
        .split_by_rand_pct()  # defaults to a random 20% validation split
        .label_from_df()
        .transform(get_transforms(), size=224)
        .databunch())

and then

learn_cln.fit_one_cycle(4, max_lr=slice(1e-5,1e-2))

worked! With the cleaned dataset, the error rate dropped from the original 15% to 5%, so I’d say that’s a very nice improvement!

Thanks again and feel free to close this topic now!

EDIT: With 10 epochs, the error rate came down to 2.8%!