By memory do you mean RAM or disk? In the code I shared, we only use ~1% of the 50 million available examples; I don’t recall the exact figure, but at 128x128 resolution that is only a couple of GB.
None of these files ever gets loaded into RAM all at once. They are written to disk one by one, and during training they are loaded and transformed on a per-batch basis.
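To illustrate the idea, here is a minimal sketch of that pattern: write examples to disk one at a time, then load them lazily per batch so only one batch is ever in memory. The file names and JSON format are made up for illustration; they are not the actual preprocessing code from the notebook.

```python
import os, tempfile, json

def write_examples(root, n):
    # Write n toy "drawing" files to disk one by one
    # (a stand-in for the real preprocessing step).
    paths = []
    for i in range(n):
        p = os.path.join(root, f"drawing_{i}.json")
        with open(p, "w") as f:
            json.dump({"id": i, "strokes": [[0, 1], [1, 0]]}, f)
        paths.append(p)
    return paths

def batches(paths, batch_size):
    # Load examples per batch: only `batch_size` files
    # are read from disk at a time.
    for start in range(0, len(paths), batch_size):
        yield [json.load(open(p)) for p in paths[start:start + batch_size]]

root = tempfile.mkdtemp()
paths = write_examples(root, 10)
first = next(batches(paths, 4))
print(len(first))  # 4
```

In the real pipeline a PyTorch `DataLoader` plays the role of the `batches` generator, with the transforms applied as each batch is assembled.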
fastai v1 exposes many nice ways to work with image files on disk - those are the class methods in vision/data.py, starting around line 271.
In v2 of the course I remember Jeremy saying: ‘if you have an option to change the data into a format supported by the tool of your choice, go ahead and do it, it will save you a lot of hassle vs writing your own way of interfacing with the data directly’. There is a way to use all 50 million examples by generating them on the fly, without storing anything to disk, but I heeded Jeremy’s advice and took the easy way out.
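For what it's worth, the on-the-fly approach boils down to making each example a deterministic function of its index, so the dataset exists only virtually. This is a hypothetical sketch (random pixels standing in for rendering a Quick Draw stroke list to a bitmap), not the code I actually ran:

```python
import random

def make_example(index, size=8):
    # Deterministically generate example `index` from a seed, so nothing
    # is ever written to disk. In the real case this would render the
    # stroke data for that example into an image instead.
    rng = random.Random(index)
    return [[rng.random() for _ in range(size)] for _ in range(size)]

# The same index always yields the same "image", so a DataLoader-style
# loop can request any of the 50M examples per batch, on demand.
a = make_example(3)
b = make_example(3)
print(a == b)  # True
```

The downside is that you pay the generation cost every epoch, which is why pre-rendering to disk is the hassle-free option.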
As a side note, I do wonder whether, even with a big, well-tuned model, it makes a difference to train on 50 million examples vs, say, 2 million with data augmentation. Realistically, there must be a lot of redundancy in 50_000_000 / 340 ≈ 147,000 drawings of snowman!