Valid_pct vs split_by_rand_pct


(Shivam) #1

I prepare the same dataset using the following two methods, but I get different results:

method 1

data1 = (
    vision.ImageList.from_folder(path / "Training")
    .split_by_rand_pct(seed=1995)
    .label_from_folder()
    .transform(size=size)
    .databunch(bs=bs)
)

method 2

data2 = vision.ImageDataBunch.from_folder(
    path,
    train="Training",
    valid_pct=0.2,
    size=size, bs=bs
)

Method 1 gives me the following data:

ImageDataBunch;

Train: LabelList (672 items)
x: ImageList
Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227)
y: CategoryList
kinky,kinky,kinky,kinky,kinky
Path: ../data/hair/Training;

Valid: LabelList (168 items)
x: ImageList
Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227)
y: CategoryList
wavy,curly,curly,braids,kinky
Path: ../data/hair/Training;

Test: None

whereas method 2 gives the following:

ImageDataBunch;

Train: LabelList (840 items)
x: ImageList
Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227)
y: CategoryList
Testing,Testing,Testing,Testing,Testing
Path: ../data/hair;

Valid: LabelList (210 items)
x: ImageList
Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227),Image (3, 227, 227)
y: CategoryList
short-men,Testing,short-men,kinky,braids
Path: ../data/hair;

Test: None

Total training items are 840. Method 1 gives me an 80:20 split. Method 2 keeps the training data as is and adds a validation set with 20% of the images. Why is this inconsistency present? Is this the desired behaviour? To me method 1 seems correct, but if there is a reason behind this, I’d like to know.
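A quick arithmetic check on the counts in the two printouts above may help narrow this down. This is only a sketch: the variable names are made up for illustration, and the reading of the result is an inference from the numbers and the "Testing" labels, not something confirmed by the library.

```python
# Item counts copied from the two printouts above.
m1_train, m1_valid = 672, 168   # method 1: split_by_rand_pct on path / "Training"
m2_train, m2_valid = 840, 210   # method 2: from_folder(path, valid_pct=0.2)

# Total number of images each method picked up.
m1_total = m1_train + m1_valid
m2_total = m2_train + m2_valid

# Share of each total that ended up in the validation set.
m1_frac = m1_valid / m1_total
m2_frac = m2_valid / m2_total

# Images that method 2 sees but method 1 does not.
extra = m2_total - m1_total

print(f"method 1: {m1_total} images, {m1_frac:.0%} in validation")
print(f"method 2: {m2_total} images, {m2_frac:.0%} in validation")
print(f"extra images seen by method 2: {extra}")
```

Both methods put 20% of what they see into validation, but method 2 sees 1050 images rather than 840. Together with the "Testing" labels in its output, that is consistent with method 2 splitting everything under path (Training and Testing together) and labelling by top-level folder, rather than leaving the 840 training images untouched.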


(Sanjay Ashok) #2

(Shivam) #3

Thanks for the link man. Helped me a lot!!!


(Rohit) #4

Can you please share the solution to your query? I am facing the same problem.
I clicked on the link shared by @sanjay.ashok but was not able to resolve the problem.


(Shivam) #5

Hey @Rohitagarwal257, so that is an actual bug in the fastai library (the GitHub issue is filed here: https://github.com/fastai/fastai/issues/1552), which the devs aren’t sure how to solve. I went ahead with the first method. It works as expected, except that the trained models are saved in the Training directory instead of the root. To make the second method work, just supply the complete path of the Training folder. Note that the option train="Training" does not seem to work in this case.
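For anyone who wants the second method, here is a minimal sketch of that workaround. It assumes fastai v1 (the version used in this thread), and the helper names training_dir and make_databunch are made up for illustration. The only real change from method 2 is that the Training folder itself is passed as the path and train= is dropped.

```python
from pathlib import Path


def training_dir(root):
    """Return the Training subfolder of the dataset root (hypothetical helper)."""
    return Path(root) / "Training"


def make_databunch(root, size, bs, valid_pct=0.2):
    """Build the DataBunch from the Training folder only (assumes fastai v1)."""
    # Imported here so the helper above can be used without fastai installed.
    from fastai import vision

    # Pass the Training folder as the path and leave out train=, which,
    # according to the post above, does not seem to take effect with valid_pct.
    return vision.ImageDataBunch.from_folder(
        training_dir(root),
        valid_pct=valid_pct,
        size=size,
        bs=bs,
    )
```

Calling data2 = make_databunch(path, size=size, bs=bs) should then split only the images under Training. As with method 1, expect the DataBunch path (and so the saved models) to point at the Training directory.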