Dear Gerardo,
I guess you can only try what has been mentioned so far:
- pandas with chunking and multiple CPUs (this is already implemented in the imdb notebook; see the sketch after this list)
- dask
- get an online machine (i.e. one with a lot of RAM) capable of handling the huge data files with the options above.
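To illustrate the first option, here is a minimal sketch of the chunking + multiprocessing idea; the file name, chunk size, and the per-chunk function are placeholders, not taken from the imdb notebook:

```python
import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: replace with whatever per-chunk transformation you need.
    return chunk.dropna()

if __name__ == "__main__":
    # Read the file lazily in chunks instead of all at once.
    reader = pd.read_csv("big_file.csv", chunksize=1_000_000)  # hypothetical file
    with Pool(processes=4) as pool:
        # imap feeds chunks to the workers one by one, keeping memory low.
        parts = pool.imap(process_chunk, reader)
        df = pd.concat(parts, ignore_index=True)
```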
Here are some links I was going through when I had a similar problem:
(However, in the end, setting up the pandas data types properly did the trick for me on this project.)
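In case it helps, this is what I mean by setting up the data types; the column names and dtypes below are made up for illustration:

```python
import pandas as pd

dtypes = {
    "user_id": "int32",    # instead of the default int64
    "rating": "float32",   # instead of the default float64
    "genre": "category",   # repeated strings take far less memory as category
}
df = pd.read_csv("ratings.csv", dtype=dtypes)  # hypothetical file
print(df.memory_usage(deep=True))              # check the savings per column
```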
If you have 120k files totalling 60 GB, you could try using dask to load them into one dask DataFrame, save it, and then load it again with pandas, including chunking and multiple CPUs.
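A rough sketch of that workflow, assuming the 120k files are CSVs with the same schema (paths are placeholders):

```python
import glob

import dask.dataframe as dd
import pandas as pd

# Lazily treat all files as one dask DataFrame and write it out as
# a partitioned parquet dataset (needs pyarrow or fastparquet installed).
ddf = dd.read_csv("data/*.csv")
ddf.to_parquet("combined_parquet/")

# Later: read it back partition by partition with plain pandas,
# which plays the same role as chunking a single big file.
for path in sorted(glob.glob("combined_parquet/*.parquet")):
    part = pd.read_parquet(path)
    # process each partition here, e.g. in parallel with multiprocessing
```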
I guess it will be tricky if you need to apply functions that depend on the entire data set.
I’m currently trying to figure out how to load my 3 GB data set onto a GPU with 8 GB of RAM, as I’m getting a CUDA out-of-memory error.
You will very likely run into the same problem with your 60 GB of data.
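The usual workaround seems to be moving the data to the GPU batch by batch instead of all at once; a PyTorch-style sketch, where the tensors and batch size are just stand-ins:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Small stand-in for the real data set; in practice this stays in CPU RAM.
features = torch.randn(1_000_000, 100)
labels = torch.randint(0, 2, (1_000_000,))

loader = DataLoader(TensorDataset(features, labels), batch_size=1024, shuffle=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for x, y in loader:
    # Only one batch lives on the GPU at a time, so GPU memory stays bounded.
    x, y = x.to(device), y.to(device)
    # forward/backward pass goes here
```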
Best regards
Michael