[Solved] What's the easiest way to get the list of file names in the training dataset

I’m trying to see what the limits of progressive resizing are, and would like to plot the number of images of each size, like in the screenshotted example from last year’s course below. Given a DataBunch, what’s the easiest way to get the filenames used so I can plot their sizes?


This screenshot is from last year’s course, so the API is different.

Got it. I was diving into basic_train.py unnecessarily.

data.train_ds.ds.x

Returns an ndarray of all file names.

Just to add to this, I enjoy working with dataframes (can easily filter on specific img requirements and pull the remaining filenames), so an alternative (for viewing both sides) could be this

Note that this is in DatasetTfm:

    def __getattr__(self,k):
        "Passthrough access to wrapped dataset attributes."
        return getattr(self.ds, k)

Which means that you can remove the ds:

data.train_ds.x
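
For anyone unsure why dropping the ds works, here is a minimal standalone sketch of the same passthrough idea. `FileDataset` and `TfmWrapper` are hypothetical stand-ins written for illustration, not the library's classes; only the `__getattr__` body mirrors the snippet quoted above.

```python
class FileDataset:
    """Hypothetical stand-in for the wrapped dataset; `x` holds file names."""
    def __init__(self, x, y):
        self.x, self.y = x, y


class TfmWrapper:
    """Hypothetical wrapper that forwards unknown attributes to `self.ds`."""
    def __init__(self, ds):
        self.ds = ds

    def __getattr__(self, k):
        # Python calls __getattr__ only when normal lookup on the wrapper
        # fails, so `ds` itself is found directly and never reaches here.
        return getattr(self.ds, k)


inner = FileDataset(x=["cat_1.jpg", "dog_7.jpg"], y=[0, 1])
wrapped = TfmWrapper(inner)

# Both spellings reach the very same object.
same = wrapped.x is wrapped.ds.x
print(wrapped.x, same)
```

Because the wrapper defines no `x` of its own, the lookup falls through to the wrapped dataset, which is why the shorter spelling gives the same result.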

FYI, you can also do that somewhat more idiomatically like so:

df = pd.DataFrame([x.shape[1:] for x,y in data.train_ds], columns=('h','w'))

Note that a dataset’s items are tuples of the independent and dependent vars.
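
If the goal is just a count of images per size for plotting, a rough sketch of the same tuple-unpacking idea with only the standard library might look like this. `FakeImage` and the `train_ds` list are hypothetical stand-ins so the snippet runs on its own; with a real dataset you would iterate over it directly, assuming each image exposes a (channels, height, width) shape as in the line above.

```python
from collections import Counter


class FakeImage:
    """Hypothetical stand-in for an image with a (channels, h, w) shape."""
    def __init__(self, c, h, w):
        self.shape = (c, h, w)


# Hypothetical dataset: each item is an (image, label) tuple.
train_ds = [
    (FakeImage(3, 224, 224), 0),
    (FakeImage(3, 224, 224), 1),
    (FakeImage(3, 128, 160), 0),
]

# Drop the channel dimension and count how often each (h, w) pair occurs.
size_counts = Counter(tuple(x.shape[1:]) for x, y in train_ds)

for (h, w), n in sorted(size_counts.items()):
    print(f"{h}x{w}: {n}")
```

The resulting counts can then be passed to whatever plotting tool you prefer.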

Always appreciate the more pythonic way of doing things. This led to some great albeit obvious insights (and correct me if I’m wrong):

  • dataset items are tuples! i.e. (image (size), breed index), which led me to…
  • data.train_ds[0][0] (the image) is the transformed version of the original image data.train_ds.ds[0][0]
  • when viewing the cropped version in data.train_ds[0][0], it changes every time. I’m putting all the pieces together now and presume that the “DatasetTfm” contains the original dataset and a transformer “tfm”, which is applied to the original image (from data.train_ds.ds) to produce the item returned by data.train_ds.
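
To make the points above concrete, here is a toy sketch of a dataset wrapper that applies a random crop on every access. `RandomCropDataset` is a hypothetical class written for illustration, and the "image" is simplified to a 1-D list; it is not how DatasetTfm is actually implemented, only the same general idea.

```python
import random


class RandomCropDataset:
    """Hypothetical wrapper: keeps the original dataset, crops on access."""
    def __init__(self, ds, crop, seed=None):
        self.ds, self.crop = ds, crop
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, i):
        x, y = self.ds[i]  # the original, untouched item
        # Pick a fresh random window on every access.
        start = self.rng.randint(0, len(x) - self.crop)
        return x[start:start + self.crop], y


# Toy "image": a list of ten values, with label 0.
original = [(list(range(10)), 0)]
tfm_ds = RandomCropDataset(original, crop=4, seed=0)

# Repeated access to the same index gives different crops.
crops = {tuple(tfm_ds[0][0]) for _ in range(50)}
print(len(crops), original[0][0])
```

The wrapped dataset is never modified; only the value returned by indexing the wrapper differs between accesses.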

Apologies if I’ve just stated the obvious, but might be insightful for anyone else like me starting out.

You are extremely not wrong, and have summarized this beautifully! I’m not sure we’ve done a great job of explaining these lower-level details in the docs, so if you want to really test your understanding, feel free to try adding more info to the docs to help others too! :slight_smile: (And do let me know if you decide to do this, and want help understanding how to contribute).

And what about getting the file names for the test set, since there is no test_ds property? Is test_dl.dataset.x the way to go?

Yes that’s right. Feel free to send in a PR to add test_ds if you want BTW :slight_smile:

Done! My first PR :slight_smile:

This is a much-needed feature. But do you know how we can retrieve the file names in the test dataset?