How to use a PyTorch dataset

I have a large number of small images (about 2.5 million) stored as NumPy arrays in a bcolz file, along with the classification of each image.
I have created a PyTorch dataset to access them. How do I use that with the fastai library?

I’m in a similar situation. Any pointers on converting a set of PyTorch datasets/dataloaders into fastai dataset classes?

@AlanJ @echon - Can you see if this helps? I have not tested it end-to-end yet. Can you guys try and let me know if this solves your problem?


So you have now created an object of class ModelData. How do you use that with one of the pretrained models?
ConvLearner is a subclass of Learner, and that class has some documentation (at last) that says the data is an object of type ModelData! So it should work. Thanks.

But does it have to be so difficult?


I have created a github issue for this. There may be a simpler way to address this -

Also, if you just have NumPy arrays, you can use ArraysIndexDataset directly. There’s no need to wrap them in a PyTorch Dataset.

So the order is

  • create a fastai DataSet
  • create a fastai DataLoader
  • create a fastai ModelData
  • feed that to your Learner
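
The first two steps can be sketched in plain Python. This is only an illustration, not fastai's or PyTorch's own code: LazyArrayDataset and iter_batches are hypothetical names, and the toy nested lists stand in for NumPy or bcolz arrays. A PyTorch map-style dataset is any object with __len__ and __getitem__, so in real code the class would subclass torch.utils.data.Dataset and torch.utils.data.DataLoader would replace iter_batches. The ModelData and Learner steps are not shown.

```python
import random


class LazyArrayDataset:
    """Map-style dataset: anything with __len__ and __getitem__.

    `images` and `labels` are any indexable stores of equal length
    (assumed here: lists, NumPy arrays, or on-disk arrays). Items are
    read one index at a time, so this class never copies the whole store.
    """

    def __init__(self, images, labels):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


def iter_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield (images, labels) batches; a stand-in for what a DataLoader does."""
    order = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        items = [dataset[i] for i in order[start:start + batch_size]]
        xs, ys = zip(*items)
        yield list(xs), list(ys)


# Toy data: ten 2x2 "images" (nested lists) with alternating labels 0/1.
images = [[[i, i], [i, i]] for i in range(10)]
labels = [i % 2 for i in range(10)]

ds = LazyArrayDataset(images, labels)
batches = list(iter_batches(ds, batch_size=4))
```

With ten items and a batch size of 4, the last batch is short (two items) because nothing is dropped.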

And JH says that there are a lot more things to do using Keras!


Thank you very, very much. I’ll try it and keep you updated on the results.

If you are dealing with images, you can use shortcuts to do all that for you -

md = ImageClassifierData.from_arrays(...)
md = ImageClassifierData.from_paths(...)
md = ImageClassifierData.from_csv(...)

Or, more generally, you can create a Dataset and a DataLoader and then feed them in using
md = ModelData.from_dls(..)


The size of my data will just about fit into memory so I can probably use

md = ImageClassifierData.from_arrays(…)
Otherwise I will write my own DataLoader.
I do not suppose there is a manual around that I am missing? It seems such a shame that it is so hard to do anything ever so slightly away from what was covered in the lectures when it all seems to be there.
Anyway thanks for all your help.


I’m not sure why you’re saying that. from_arrays already exists and does exactly what you want AFAICT in a single line of code?

If they’re in bcolz arrays they don’t need to fit in memory.

Isn’t this far less work than Keras? In addition, fine-tuning a model in Keras using the method shown by the author of the library takes hundreds of lines of code. It also runs far slower, and is less accurate.

It seems churlish to criticise when you are providing so much, but I find your very concise programming style very hard to read. I am sure that you use exactly the same variable names throughout your library, so it is easy for you to jump into your own code and quickly pick up the thread, but I personally find it very difficult. The only way I can make progress is to unravel your multi-statement lines and write liberal comments, which you almost never use.

If Fastai is a comprehensive library, as it seems to be, then you cannot cover all of it in the lectures, so documentation is essential. Even a docstring would help.

I understood that your whole philosophy with the library was for users to concentrate on solving the problem, not the maths or the programming.

I have been playing with deep learning ever since 2002 when I stumbled on Tesauro’s paper on learning backgammon so I am pretty familiar with the concepts. I have been programming (as an amateur) since I finished a degree in Computer Science at Cambridge in 1971 (yes I am that old!) so although I certainly am not a professional programmer I probably have more experience than most of your students. Therefore if I am finding it tough to find my way around your library I expect I am not alone.

As I said before I think this is a shame because I think you are so close to succeeding in your aim.

As for comparing Fastai to Keras - I am not. I discovered Keras after fighting raw Tensorflow for a while and it was a breath of fresh air. I immediately wrote myself a very small library that also allows me to create and train a model in a couple of lines of code though at that time I had not realized the power of using pretrained models. Fastai obviously has a lot more features built in than Keras.
I should not have made the comparison, but you had shown Keras to be verbose and I was surprised that to supply data to a model I have to create DataSet, DataLoader and ModelData objects, which superficially at least looks a little long-winded.

I thoroughly enjoyed your 2017 lectures and I am looking forward to the 2018 version of part 2.

Do you mean a pytorch dataset or dataloader? Do you know of an example notebook that does this?