Create validation set when using from_path format

SnakeOne · May 15, 2018, 4:53am

Hi,

I’ve been looking at the Kaggle Plant Seedlings Classification competition. The problem I ran into is that the data uses the layout where each class has its own folder, but there is no valid folder, so I think I can’t use the cross-validation method that Jeremy used in the dog-breeds competition. Is there some part of the fast.ai library that can create a valid folder and move, say, 20% of the images there?

Thanks

radek · May 15, 2018, 6:38am

Here are two methods that can be useful. They both live in ‘dataset.py’

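import os
import shutil
from glob import glob

# Note: copy_or_move_with_subdirs is a helper that is not shown in this post.

def create_sample(path, r):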
    """ Takes a path to a dataset and creates a sample of specified size at <path>_sample

    Parameters:
    -----------
    path: dataset path
    r (float): proportion of examples to use as sample, in the range from 0 to 1
    """
    sample_path = path + '_sample'
    shutil.rmtree(sample_path, ignore_errors=True)
    subdirs = [os.path.split(p)[1] for p in glob(os.path.join(path, '*'))]
    copy_or_move_with_subdirs(subdirs, path, sample_path, r, move=False)

def create_val(path, r):
    """ Takes a path to a dataset and creates a validation set of specified size

    Note - this changes the dataset at <path> by moving files to the val set

    Parameters:
    -----------
    path: dataset path
    r (float): proportion of examples to use for validation, in the range from 0 to 1

    """
    # normpath strips a trailing slash, so 'data/train/' still maps to 'data/valid'
    val_path = os.path.join(os.path.dirname(os.path.normpath(path)), 'valid')
    subdirs = [os.path.split(p)[1] for p in glob(os.path.join(path, '*'))]
    copy_or_move_with_subdirs(subdirs, path, val_path, r, move=True)
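Since copy_or_move_with_subdirs is not shown here, below is a minimal standalone sketch of the same idea: move a fraction r of the files in each class folder into a matching folder under valid. The function name move_fraction_to_valid, its seed parameter and its return value are hypothetical, not part of fastai, and this is not necessarily how the helper in the fork works.

```python
import os
import random
import shutil


def move_fraction_to_valid(train_path, valid_path, r, seed=0):
    """Move a fraction r of the files in each class folder to valid_path.

    Hypothetical helper, not part of fastai.
    Returns a dict mapping each class name to the number of files moved.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    moved = {}
    for cls in sorted(os.listdir(train_path)):
        src_dir = os.path.join(train_path, cls)
        if not os.path.isdir(src_dir):
            continue  # skip stray files that are not class folders
        files = sorted(
            f for f in os.listdir(src_dir)
            if os.path.isfile(os.path.join(src_dir, f))
        )
        n_val = int(len(files) * r)  # rounds down
        dst_dir = os.path.join(valid_path, cls)
        os.makedirs(dst_dir, exist_ok=True)
        # Pick n_val files at random and move them out of the training set.
        for name in rng.sample(files, n_val):
            shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
        moved[cls] = n_val
    return moved
```

For a 20% split this would be called as move_fraction_to_valid('data/train', 'data/valid', 0.2), where both paths are placeholders. Like create_val, it changes the dataset in place by moving files, so it is worth trying on a copy first.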

SnakeOne · May 15, 2018, 8:41am

@radek
I am sorry, but I can’t find either of these functions in the dataset.py file, even in the most recent version on GitHub. Is this the file you are referring to? https://github.com/fastai/fastai/blob/master/fastai/dataset.py

Thanks

radek · May 15, 2018, 9:09am

Ah sorry about this - this is code I wrote that never got merged. I don’t think I even created a merge request for it, as I thought it might not be significant enough to be part of the library. Please feel free to pick whatever you might need from my fork.

There also used to be threads on this forum about how to do this with bash scripts - you might want to look for them if that is the route you would like to take.

radek · May 15, 2018, 9:32am

It is not easy to find the thread. Here is where some of this is discussed in the context of creating a sample. But there should be other posts / threads that discuss this and share scripts.

SnakeOne · May 15, 2018, 7:07pm

Wow, thanks a lot for the very useful answers.
I stole the create_val function and it seems to be working well, so thanks a lot for that.

I will probably get to writing my own at some point. I was thinking about partially rewriting fast.ai library as the final assignment of part 1. Do you think it is a good idea?

radek · May 15, 2018, 8:54pm

I had a similar idea at some point and even wrote quite a bit of code.

I think this boils down to what your motivation is. For me, it seemed like the best project I could undertake at the time in order to learn. I had tried working with the fastai library but was overwhelmed by it: I didn’t know enough PyTorch and didn’t understand the code conventions (which I have now come to really appreciate). I think it worked, in that through this activity I am now at a level where I can work fairly comfortably with the fastai lib.

To be honest, at this point I believe it doesn’t really matter what you write as long as you keep writing code. And the road to this leads through coming up with projects to work on. If you feel that this might be a good idea, by all means give it a go.

AbhigyanBose · June 21, 2018, 8:35am

Hi, I am working with the same dataset and have split it into training and validation sets myself. I am not sure whether the training is being done on all of the classes or only a couple of them.
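One quick sanity check for a hand-made split is to count the files in each class folder under both the training and validation directories; a class that is missing or empty in either place would stand out. A minimal sketch, assuming the folder-per-class layout discussed above; count_images_per_class is a hypothetical helper, not part of fastai, and it only inspects the folders, not what the model actually trains on.

```python
import os


def count_images_per_class(root):
    """Return {class_name: number_of_files} for each class folder under root.

    Hypothetical helper, not part of fastai.
    """
    counts = {}
    for cls in sorted(os.listdir(root)):
        cls_dir = os.path.join(root, cls)
        if not os.path.isdir(cls_dir):
            continue  # ignore stray files next to the class folders
        counts[cls] = sum(
            1 for f in os.listdir(cls_dir)
            if os.path.isfile(os.path.join(cls_dir, f))
        )
    return counts
```

Comparing count_images_per_class('data/train') with count_images_per_class('data/valid') (placeholder paths) shows whether both directories contain the same set of classes.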