ImageClassifierData.from_csv() --val_idxs parameter is not returning the validation data set as expected

geetha.ai · April 12, 2018, 12:08pm

Hi, I am using ImageClassifierData.from_csv() to read the data. It has a parameter val_idxs, documented as: "index of images to be used for validation, e.g. output of get_cv_idxs. If None, default arguments to get_cv_idxs are used."

Problem: I split the training dataset in a stratified manner (it is a classification problem) to create a validation set, and I pass the indexes of the sampled validation rows, as they appear in the original training data, to the val_idxs argument of ImageClassifierData.from_csv(). I expected val_idxs to select exactly the validation data I sampled. Instead, it returns a different validation set of the same length (the length matches because the size is determined by the size of val_idxs).

Possible reason for the problem:
I think the input data is shuffled first, and the validation indexes I provide are then applied to the shuffled data set. As a result, a different sample is selected from the training data set.

Is there a way around this? Can I pass the validation data directly to the data object?

Thanks in advance for any help/suggestions

Update: reason found; please see the comments below.

digitalspecialists · April 12, 2018, 2:50pm

The way I do this, particularly for long running models, is to stratify k fold the data in the full training csv, and persist each of the train and val folds (as index or values) to csv or dataframe etc. I then use these val_idx’s without running get_cv_idxs(). It also means I can use these same folds for separate ensembled models for comparison.
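A minimal sketch of the approach described above: compute stratified folds once, persist the validation indexes to a CSV, and reload a fold later instead of calling get_cv_idxs(). The helper names (stratified_kfold_indices, save_folds, load_fold) and the folds.csv layout are hypothetical, chosen only for illustration; they are not part of fastai, and a library splitter such as scikit-learn's StratifiedKFold could be used instead of the hand-rolled split.

```python
import csv
import os
import random
import tempfile
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=42):
    """Return n_splits lists of validation indexes, stratified by label.

    Indexes refer to positions in `labels`, i.e. the row order of the data.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(idxs):
            folds[pos % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


def save_folds(folds, path):
    """Persist folds as rows of (fold number, validation index)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["fold", "val_idx"])
        for fold_no, fold in enumerate(folds):
            for idx in fold:
                writer.writerow([fold_no, idx])


def load_fold(path, fold_no):
    """Read back the validation indexes of one fold."""
    with open(path, newline="") as f:
        return [int(row["val_idx"]) for row in csv.DictReader(f)
                if int(row["fold"]) == fold_no]


# Toy data: 6 "cat" rows followed by 4 "dog" rows.
labels = ["cat"] * 6 + ["dog"] * 4
folds = stratified_kfold_indices(labels, n_splits=2)
fold_path = os.path.join(tempfile.mkdtemp(), "folds.csv")
save_folds(folds, fold_path)
val_idxs = load_fold(fold_path, 0)  # reuse the same fold across models
```

Because the folds are stored on disk, every model in an ensemble can be trained and compared on identical splits.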

geetha.ai · April 13, 2018, 11:31am

@digitalspecialists thanks for your time, but I didn't clearly understand the following statement:

“is to stratify k fold the data in the full training csv, and persist each of the train and val folds (as index or values) to csv or dataframe etc”

Can you clarify this?

geetha.ai · April 17, 2018, 2:16pm

I found the reason for this.

The from_csv method of class ImageClassifierData first calls csv_source and then samples from the fnames it returns:

    fnames,y,classes = csv_source(folder, csv_fname, skip_header, suffix, continuous=continuous)

    def csv_source(folder, csv_file, skip_header=True, suffix='', continuous=False):
        fnames,csv_labels = parse_csv_labels(csv_file, skip_header)
        return dict_source(folder, fnames, csv_labels, suffix, continuous)

    def parse_csv_labels(fn, skip_header=True):
        """Parse filenames and label sets from a CSV file.

        This method expects that the csv file at path :fn: has two columns. If it
        has a header, :skip_header: should be set to True. The labels in the
        label set are expected to be space separated.

        Arguments:
            fn: Path to a CSV file.
            skip_header: A boolean flag indicating whether to skip the header.

        Returns:
            a four-tuple of (
                sorted image filenames,
                a dictionary of filenames and corresponding labels,
                a sorted set of unique labels,
                a dictionary of labels to their corresponding index, which will
                be one-hot encoded.
            )
        .
        """
        with open(fn) as fileobj:
            reader = csv.reader(fileobj)
            if skip_header:
                next(reader)

            csv_lines = [l for l in reader]

        fnames = [fname for fname, _ in csv_lines]
        csv_labels = {a:b.split(' ') for a,b in csv_lines}
        return sorted(fnames), csv_labels

Here, parse_csv_labels returns a sorted list of filenames. When we provide a predetermined set of val_idxs (based on row order in the input CSV, i.e., the indexes we want to sample from the original CSV), they are applied to this sorted list instead. Hence, when val_idxs is provided, we don't get the expected validation set; we get a validation set of the same size that points to different filenames.

@jeremy I am not sure whether it is OK to tag you here, and I don't want to spam. I just wanted to understand whether this is done for a reason or needs to be corrected. Thanks so much for the course. I am highly indebted to you for creating it.
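The mismatch described in this post can be reproduced without fastai. The sketch below uses a toy list of filenames and a hypothetical helper, remap_val_idxs (not a fastai function), that translates indexes from CSV row order into positions in the sorted filename list. It assumes filenames are unique, and it is only a possible workaround on the caller's side, not a fix to the library.

```python
def remap_val_idxs(csv_fnames, val_idxs):
    """Translate indexes in CSV row order to indexes in sorted-filename order.

    Assumes every filename in csv_fnames is unique.
    """
    position_in_sorted = {fname: i for i, fname in enumerate(sorted(csv_fnames))}
    return sorted(position_in_sorted[csv_fnames[i]] for i in val_idxs)


# Filenames in the order they appear in the CSV.
csv_fnames = ["c.jpg", "a.jpg", "d.jpg", "b.jpg"]
# We want rows 0 and 2 of the CSV (c.jpg and d.jpg) for validation.
val_idxs = [0, 2]

sorted_fnames = sorted(csv_fnames)

# Applying CSV-order indexes to the sorted list picks the wrong files.
naive_pick = [sorted_fnames[i] for i in val_idxs]   # ['a.jpg', 'c.jpg']

# Remapping first recovers the intended files.
remapped = remap_val_idxs(csv_fnames, val_idxs)     # [2, 3]
fixed_pick = [sorted_fnames[i] for i in remapped]   # ['c.jpg', 'd.jpg']
```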

jeremy · April 17, 2018, 2:41pm

Thanks for letting me know - I didn’t see this thread since it was in the old 2017 course forum; I’ve updated it now.

It would be great if you could try to fix this issue, and then create a little notebook demonstrating the problem, and show that the fix works correctly - then pop that into a gist. If you manage to get it working, you could then send in a PR and we’ll be all fixed up!

basu · April 18, 2018, 2:15am

Hi, is there a way to change the test folder after training the model, rather than changing it in ImageClassifierData every time?

geetha.ai · April 18, 2018, 4:38am

Hi @basu I think learn.predict_dl() could work.

geetha.ai · April 18, 2018, 4:40am

Sure @jeremy, I will share the notebook today. But I am not sure how to fix the problem inside the fastai library.

tschoy · May 2, 2018, 11:44am

Hi, I ran into the same problem and got here by searching the forum. I'm glad @geetha.ai found the cause. In my case, since I'm doing multi-label classification with over-sampling, fixing from_csv seems to be the best bet.

The only change needed seems to be line 126, from:
return sorted(fnames), list(df.to_dict().values())[0]
to
return fnames, list(df.to_dict().values())[0]

I can't see a reason for the sort, so I just removed it.
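As a rough illustration of why dropping the sort helps, here is a standalone sketch modeled on the csv.reader version of parse_csv_labels quoted earlier in this thread. The function name parse_csv_labels_unsorted and the choice to accept a file object instead of a path are assumptions made so the example is self-contained; this is not the library's code, and the line 126 quoted above is a different, pandas-based variant.

```python
import csv
import io


def parse_csv_labels_unsorted(fileobj, skip_header=True):
    """Parse filenames and labels, keeping filenames in CSV row order."""
    reader = csv.reader(fileobj)
    if skip_header:
        next(reader)
    csv_lines = [row for row in reader]

    fnames = [fname for fname, _ in csv_lines]
    csv_labels = {fname: tags.split(' ') for fname, tags in csv_lines}
    return fnames, csv_labels  # no sorted() here


# A toy two-column CSV whose rows are deliberately not in alphabetical order.
sample = io.StringIO("id,tags\nc.jpg,dog\na.jpg,cat\nb.jpg,dog cat\n")
fnames, labels = parse_csv_labels_unsorted(sample)

# Indexes chosen against the CSV row order now select the intended files.
val_idxs = [0, 2]
val_fnames = [fnames[i] for i in val_idxs]
```

With the order preserved, hand-picked val_idxs refer to the same rows as in the input CSV.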

My notebook is a copy of lesson2, but with val_idxs set by hand:

Original result:

New result (after removing the sort):

Hope I didn’t miss anything.

geetha.ai · May 2, 2018, 4:55pm

@tschoy Yes, removing sorted works. I will prioritize submitting a PR and will do that in the next two days. I was held up with the lectures and work at the office.

annahaz · July 5, 2018, 5:26am

Hey @geetha.ai,
I am new to fastai, and I am having some issues running from_csv. I would like to read the notebook you shared. Would you please tell me where I can find it? Thanks.