[Solved] What's the easiest way to get the list of file names in the training dataset

poppingtonic · October 24, 2018, 9:54pm

I’m trying to see what the limits of progressive resizing are, and would like to plot the number of images of each size, as in the screenshotted example from last year’s course below. Given a DataBunch, what’s the easiest way to get the filenames used so I can plot their sizes?

This screenshot is from last year’s course, so the API is different.

poppingtonic · October 24, 2018, 10:03pm

Got it. I was diving into basic_train.py unnecessarily.

data.train_ds.ds.x

Returns an ndarray of all file names.

danield · October 25, 2018, 1:12am

Just to add to this, I enjoy working with dataframes (can easily filter on specific img requirements and pull the remaining filenames), so an alternative (for viewing both sides) could be this

jeremy · October 25, 2018, 2:27am

Note that this is in DatasetTfm:

    def __getattr__(self,k):
        "Passthrough access to wrapped dataset attributes."
        return getattr(self.ds, k)

Which means that you can remove the ds:

data.train_ds.x
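
To see why the shorter form works, here is a minimal, self-contained sketch of the passthrough pattern. `FileDataset` and `WrappedDataset` are hypothetical stand-ins, not fastai classes; only the `__getattr__` body mirrors the snippet quoted above.

```python
class FileDataset:
    """Hypothetical inner dataset holding file names in `x` and labels in `y`."""
    def __init__(self, x, y):
        self.x, self.y = x, y


class WrappedDataset:
    """Hypothetical wrapper that forwards unknown attributes to `self.ds`."""
    def __init__(self, ds):
        self.ds = ds

    def __getattr__(self, k):
        # Python calls this only when normal lookup on the wrapper fails,
        # so `ds` itself is found directly and everything else is forwarded.
        return getattr(self.ds, k)


inner = FileDataset(["cat_1.jpg", "dog_7.jpg"], [0, 1])
train_ds = WrappedDataset(inner)

# Both spellings reach the same underlying list object.
print(train_ds.ds.x is train_ds.x)
```

Because `__getattr__` is only a fallback, attributes defined on the wrapper itself still take precedence over those of the wrapped dataset.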

jeremy · October 25, 2018, 2:41am

FYI, you can also do that somewhat more idiomatically like so:

df = pd.DataFrame([x.shape[1:] for x,y in data.train_ds], columns=('h','w'))

Note that a dataset’s items are tuples of the independent and dependent vars.
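
For the original goal of counting images per size, the same iteration can feed a counter instead of a DataFrame. The sketch below uses only the standard library; `FakeImage` and the `train_ds` list are hypothetical stand-ins for a real dataset whose `x` items expose a `(channels, h, w)` shape.

```python
from collections import Counter


class FakeImage:
    """Hypothetical stand-in for an image tensor with a (channels, h, w) shape."""
    def __init__(self, h, w):
        self.shape = (3, h, w)


# Each item is an (independent, dependent) tuple, as in a real dataset.
train_ds = [
    (FakeImage(224, 224), 0),
    (FakeImage(224, 224), 1),
    (FakeImage(300, 400), 0),
]

# Drop the channel dimension and count how often each (h, w) pair occurs.
size_counts = Counter(x.shape[1:] for x, y in train_ds)

for (h, w), n in size_counts.most_common():
    print(f"{h}x{w}: {n}")
```

The resulting counts can be passed straight to a bar plot, which is one way to reproduce the size histogram asked about at the top of the thread.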

danield · October 25, 2018, 4:17am

Always appreciate the more pythonic way of doing things. This led to some great albeit obvious insights (and correct me if I’m wrong):

dataset items are tuples! i.e. (image (size), breed index), which led me to…
data.train_ds[0][0] (the image) is the transformed version of the original image data.train_ds.ds[0][0]
when viewing the cropped version in data.train_ds[0][0], it changes every time. I’m putting all the pieces together now and presume that the “DatasetTfm” contains the original dataset and a transformer “tfm”, which is applied to the original image (data.train_ds.ds) and stored in data.train_ds.
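
A rough sketch of that mental model, assuming the transform is applied lazily each time an item is indexed (which would explain why the crop changes on every access). `TransformedDataset` and `random_crop` are hypothetical names for illustration, not the fastai implementation.

```python
import random


class TransformedDataset:
    """Hypothetical wrapper: keeps the original dataset and transforms x on access."""
    def __init__(self, ds, tfm):
        self.ds, self.tfm = ds, tfm

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, i):
        x, y = self.ds[i]
        # The transform runs on every access; the label passes through unchanged.
        return self.tfm(x), y


def random_crop(pixels, size=3):
    """Toy 1-D 'crop': return a random contiguous slice of length `size`."""
    start = random.randint(0, len(pixels) - size)
    return pixels[start:start + size]


original = [(list(range(10)), "beagle")]
train_ds = TransformedDataset(original, random_crop)

print(train_ds.ds[0][0])  # untouched original item
print(train_ds[0][0])     # a random crop of it
print(train_ds[0][0])     # possibly a different crop
```

In this sketch nothing transformed is cached: the wrapper holds only the original data and the transform, so indexing twice can return two different crops.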

Apologies if I’ve just stated the obvious, but might be insightful for anyone else like me starting out.

jeremy · October 25, 2018, 4:22am

You are extremely not wrong, and have summarized this beautifully! I’m not sure we’ve done a great job of explaining these lower-level details in the docs, so if you want to really test your understanding, feel free to try adding more info to the docs to help others too! (And do let me know if you decide to do this, and want help understanding how to contribute).

joshfp · October 25, 2018, 5:02am

And what about getting the file names for the test set, since there is no test_ds property? Is test_dl.dataset.x the way to go?

jeremy · October 25, 2018, 5:03am

Yes that’s right. Feel free to send in a PR to add test_ds if you want BTW

joshfp · October 26, 2018, 5:34am

Done! My first PR

mmiakashs · March 17, 2019, 12:13pm

This is a much-needed feature. But do you know how we can retrieve the file names in the test dataset?