ImageClassifierData.from_csv(): val_idxs parameter does not return the expected validation set

Hi, I am using ImageClassifierData.from_csv() to read the data. It has a parameter val_idxs: the indexes of the images to be used for validation (e.g. the output of get_cv_idxs). If it is None, the default arguments to get_cv_idxs are used.

Problem: This is a classification problem, so I split the training dataset in a stratified manner to get a validation set, and I pass the original training-data indexes of the sampled validation rows to the val_idxs argument of ImageClassifierData.from_csv(). I expected val_idxs to select exactly the validation rows I sampled. Instead it returns a different validation set of the same length (the length matches because it is determined by the size of val_idxs).

A possible reason for the problem:
I think the input data is shuffled first, and the validation indexes I provide are then applied to the shuffled data set. As a result, a different sample is selected from the training data.

Is there a way around this? Can I pass the validation data directly to the data object?

Thanks in advance for any help/suggestions

Update: I found the reason; please see the comments below.

The way I do this, particularly for long running models, is to stratify k fold the data in the full training csv, and persist each of the train and val folds (as index or values) to csv or dataframe etc. I then use these val_idx’s without running get_cv_idxs(). It also means I can use these same folds for separate ensembled models for comparison.
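A minimal sketch of that workflow, using only the Python standard library: it splits row indices into stratified folds and saves them to a CSV so the same folds can be reloaded later. The helper names (stratified_folds, save_folds, load_fold), the file layout, and the toy labels are all hypothetical and not part of fastai; the indices refer to rows in the order they appear in the training CSV.

```python
import csv
import os
import random
import tempfile
from collections import defaultdict

def stratified_folds(labels, k=5, seed=42):
    """Split row indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Deal each class's rows round-robin so every fold gets a share.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return [sorted(fold) for fold in folds]

def save_folds(folds, path):
    """Persist folds as (fold, row_index) rows for later reuse."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["fold", "row_index"])
        for fold_no, idxs in enumerate(folds):
            for idx in idxs:
                writer.writerow([fold_no, idx])

def load_fold(path, fold_no):
    """Read back the row indices of one fold."""
    with open(path, newline="") as f:
        return [int(row["row_index"]) for row in csv.DictReader(f)
                if int(row["fold"]) == fold_no]

# Toy labels, one per training row (hypothetical data).
labels = ["cat"] * 6 + ["dog"] * 4
folds = stratified_folds(labels, k=2)
path = os.path.join(tempfile.mkdtemp(), "folds.csv")
save_folds(folds, path)
val_idxs = load_fold(path, 0)  # reuse this list for every model in an ensemble
```

Because the folds live in a file, each model in a comparison can load the same val_idxs instead of drawing a fresh random split.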

@digitalspecialists thanks for your time, but I didn’t clearly understand the following statement:

“is to stratify k fold the data in the full training csv, and persist each of the train and val folds (as index or values) to csv or dataframe etc”

Can you clarify this?

I found the reason for this.

The from_csv method of class ImageClassifierData first calls csv_source and then samples from the fnames it returns:

    fnames,y,classes = csv_source(folder, csv_fname, skip_header, suffix, continuous=continuous)

    def csv_source(folder, csv_file, skip_header=True, suffix='', continuous=False):
        fnames,csv_labels = parse_csv_labels(csv_file, skip_header)
        return dict_source(folder, fnames, csv_labels, suffix, continuous)

    def parse_csv_labels(fn, skip_header=True):
        """Parse filenames and label sets from a CSV file.

        This method expects that the csv file at path :fn: has two columns. If it
        has a header, :skip_header: should be set to True. The labels in the
        label set are expected to be space separated.

        Arguments:
            fn: Path to a CSV file.
            skip_header: A boolean flag indicating whether to skip the header.

        Returns:
            a four-tuple of (
                sorted image filenames,
                a dictionary of filenames and corresponding labels,
                a sorted set of unique labels,
                a dictionary of labels to their corresponding index, which will
                be one-hot encoded.
            )
        .
        """
        with open(fn) as fileobj:
            reader = csv.reader(fileobj)
            if skip_header:
                next(reader)

            csv_lines = [l for l in reader]

        fnames = [fname for fname, _ in csv_lines]
        csv_labels = {a:b.split(' ') for a,b in csv_lines}
        return sorted(fnames), csv_labels

Here parse_csv_labels returns a sorted list of filenames. When we provide a predetermined set of val_idxs (based on the row order of the input CSV, i.e. the indexes we want to sample from the original CSV), the filenames are picked from this sorted list instead. Hence, when val_idxs is provided, we don’t get the expected validation set; we get a validation set of the same size that points to different filenames.

@jeremy I am not sure if it is OK to tag you here, but I just wanted to understand whether this is done for a reason or needs to be corrected. Thanks so much for the course. I am highly indebted to you for creating it.
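Until the library is changed, one possible workaround is to translate the indices on the caller's side. The sketch below is only an illustration: remap_val_idxs is a hypothetical helper (not a fastai function), the filenames are made up, and it assumes the filenames in the CSV are unique. It shows how indices into the CSV row order select the wrong files from a sorted list, and how remapping them restores the intended selection.

```python
def remap_val_idxs(csv_fnames, val_idxs):
    """Translate indices into CSV row order to indices into the sorted filename list."""
    sorted_pos = {fname: i for i, fname in enumerate(sorted(csv_fnames))}
    return [sorted_pos[csv_fnames[i]] for i in val_idxs]

# Filenames in the order they appear in the CSV (made-up data).
csv_fnames = ["img_c", "img_a", "img_d", "img_b"]
val_idxs = [0, 2]  # intended validation files: img_c and img_d

sorted_fnames = sorted(csv_fnames)

# Applying CSV-order indices to the sorted list picks the wrong files.
picked_naive = [sorted_fnames[i] for i in val_idxs]

# Remapping first picks the files that were actually intended.
remapped = remap_val_idxs(csv_fnames, val_idxs)
picked_fixed = [sorted_fnames[i] for i in remapped]

print(picked_naive)  # ['img_a', 'img_c']
print(picked_fixed)  # ['img_c', 'img_d']
```

The remapped list could then be passed as val_idxs in place of the original one, on the assumption that the library indexes into the sorted filenames as described above.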

Thanks for letting me know - I didn’t see this thread since it was in the old 2017 course forum; I’ve updated it now.

It would be great if you could try to fix this issue, and then create a little notebook demonstrating the problem, and show that the fix works correctly - then pop that into a gist. If you manage to get it working, you could then send in a PR and we’ll be all fixed up! :smiley:

Hi, is there a way to change the test folder after training the model, rather than changing it in ImageClassifierData every time?

Hi @basu I think learn.predict_dl() could work.

Sure @jeremy, I will share the notebook today. But I am not sure how to fix the problem inside the fastai library.

Hi, I ran into the same problem and got here by searching the forum. I’m glad @geetha.ai found the cause. In my case, since I’m doing multi-label classification with over-sampling, fixing from_csv seems to be the best bet.

The only change needed seems to be line 126, from:
return sorted(fnames), list(df.to_dict().values())[0]
to
return fnames, list(df.to_dict().values())[0]

I can’t see a reason for the sort, so I just removed it.
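A small self-contained check of the idea, as a sketch only: parse_csv_labels_demo is a simplified stand-in modelled on the parse_csv_labels code quoted earlier in this thread, the sort_fnames flag is hypothetical and exists only to compare the two behaviours, and the CSV contents are made up. It does not check whether anything else in the library depends on the sorted order.

```python
import csv
import io

def parse_csv_labels_demo(fileobj, skip_header=True, sort_fnames=True):
    """Simplified stand-in for the parser; sort_fnames toggles the old/new behaviour."""
    reader = csv.reader(fileobj)
    if skip_header:
        next(reader)
    csv_lines = [line for line in reader]
    fnames = [fname for fname, _ in csv_lines]
    csv_labels = {a: b.split(' ') for a, b in csv_lines}
    return (sorted(fnames) if sort_fnames else fnames), csv_labels

# Made-up CSV whose rows are deliberately not in alphabetical order.
CSV_TEXT = "id,tags\nimg_c,dog\nimg_a,cat\nimg_d,dog\nimg_b,cat\n"
val_idxs = [0, 2]  # rows 0 and 2 of the CSV: img_c and img_d

before, _ = parse_csv_labels_demo(io.StringIO(CSV_TEXT), sort_fnames=True)
after, labels = parse_csv_labels_demo(io.StringIO(CSV_TEXT), sort_fnames=False)

val_before = [before[i] for i in val_idxs]  # with the sort
val_after = [after[i] for i in val_idxs]    # without the sort

print(val_before)  # ['img_a', 'img_c']
print(val_after)   # ['img_c', 'img_d']
```

Without the sort, the returned filenames keep the CSV row order, so the hand-set val_idxs select the rows that were intended.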

My notebook is a copy of lesson2, but with val_idxs set by hand:

Original result:

New result (after removing the sort):

Hope I didn’t miss anything.

@tschoy Yes, removing sorted works. I will prioritize submitting a PR and will do that in the next two days. I was caught up with the lectures and work at the office.

Hey @geetha.ai,
I am new to fastai and am having some issues running from_csv. I would like to read the notebook you shared. Could you please tell me where I can find it? Thanks.