I have been reading up on this a little more after the insights from the 1st place winners. Here is the best stuff I could find, and I think we have the explanation now. So we have compounding problems in Python itself (copy-on-access), PyTorch (the way multiprocessing is used) and fastai (objects that are too large and not wrapped “correctly”).
The core is this: there is no way of storing arbitrary Python objects (such as pandas DataFrames, dicts of Path objects, or even simple lists) in shared memory in Python without triggering copy-on-write behaviour, because every time something reads from these objects their refcounts get updated, and that update is a write to the memory page holding the object. The pages are copied one by one as they are touched, which is why the consumption grows slowly, whereas spawning the processes (and/or copying the entire objects in the beginning) would make it jump up immediately. Either way, the processes end up having all/most of the memory copied over bit by bit, which is why we get the memory overflow problem. Best description of this behaviour is here.
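To make the refcount point concrete, here is a minimal sketch (my own illustration, not taken from any of the linked answers) showing that merely taking a reference to an object changes its refcount, i.e. writes to the object's memory. The function name is made up for the example.

```python
import sys

def refcount_delta_on_read():
    """Return how much an object's refcount changes when we merely reference it."""
    data = [object() for _ in range(3)]   # stand-in for a list of Path objects
    before = sys.getrefcount(data[0])
    alias = data[0]                       # a plain "read": no data is modified...
    after = sys.getrefcount(data[0])      # ...but the object's header has been written to
    return after - before

print(refcount_delta_on_read())
```

In a forked worker that write lands on a page shared with the parent, so the kernel has to copy the page for the worker.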
-> Hacky Workaround Solution 1: Check the memory consumption of your main process, divide the total free memory by this, and set the number of workers to the resulting number (absolute maximum). This is the background for @larcat’s num_workers=2 solution working for him, I assume. It is also why you never get these problems with huge memory (like @hwasiti showed on GCP), because as long as num_workers x total_mem_of_main_process < total_mem_available, everything is fine!
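The rule of thumb from Workaround 1 can be written down as a tiny helper. This is only a sketch of the arithmetic; the function name and the example numbers are made up, and you have to measure the two inputs yourself (e.g. with `top`, or with the psutil package if you have it installed).

```python
def max_safe_workers(available_bytes: int, main_process_bytes: int) -> int:
    """Upper bound on num_workers, assuming each worker eventually
    ends up with a full copy of the main process's memory."""
    if main_process_bytes <= 0:
        raise ValueError("main_process_bytes must be positive")
    return max(0, available_bytes // main_process_bytes)

GiB = 2 ** 30
# Hypothetical numbers: 16 GiB free, main process currently at 6 GiB.
print(max_safe_workers(16 * GiB, 6 * GiB))
```

Treat the result as the absolute maximum, not as a recommendation.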
-> Hacky Workaround Solution 2: Make sure the main process occupies as little memory as possible, by a) not storing lists of Path objects (i.e. using .from_csv methods and not .from_folders, as suggested by Jeremy), b) removing any unnecessary intermediate objects/lists/stuff from your main process (i.e. using del), and c) running gc.collect() before starting the fit process, so before the workers get forked (not sure if b) and c) really help much, but it can’t hurt).
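Steps b) and c) could look roughly like this. It is just a sketch: the helper and the variable names (`file_list`, `learn`) are hypothetical, and in a notebook you would pass `globals()` as the namespace.

```python
import gc

def shrink_before_fork(namespace, names):
    """Drop the named objects from `namespace`, then force a full collection."""
    for name in names:
        namespace.pop(name, None)   # like `del name`, but tolerant of missing names
    return gc.collect()             # number of unreachable objects found

# Stand-in for the notebook's globals(): one big intermediate list, one thing we keep.
scratch = {"file_list": [str(i) for i in range(1000)], "learn": "kept"}
shrink_before_fork(scratch, ["file_list"])
```

Run it right before calling fit, since the workers are forked from whatever the main process holds at that moment.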
Real Solutions (not tested yet)
-> A) Using multiprocessing like now: in order for Python multiprocessing to work without these refcount effects, the objects have to be made “compatible with” and wrapped in multiprocessing.Array before the process pool is created and the workers are forked. This supposedly ensures that the memory will really be shared and no copy-on-write happens. This explains how to do it for numpy arrays and this explains the reasoning behind it again. Don’t get confused by some false statements, even by the authors of these good answers, stating that copy-on-write makes all of this unnecessary, which is not true. One comment also points to this:
“Just to note, on Python fork() actually means copy on access (because just accessing the object will change its ref-count).”
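For what option A) means in practice, here is a minimal stdlib-only sketch (untested in the fastai setting, and the function names are made up). It assumes a Unix system where the "fork" start method is available. The buffer is allocated with multiprocessing.Array before the workers are forked, and the workers write into the same memory rather than into private copies.

```python
import multiprocessing as mp

def fill_squares(shared, start, stop):
    # Runs in a worker: writes go straight into the shared buffer.
    for i in range(start, stop):
        shared[i] = float(i * i)

def run_shared_array_demo(n=8):
    ctx = mp.get_context("fork")              # assumption: Unix-only start method
    shared = ctx.Array("d", n, lock=False)    # raw shared doubles, created BEFORE forking
    half = n // 2
    workers = [
        ctx.Process(target=fill_squares, args=(shared, 0, half)),
        ctx.Process(target=fill_squares, args=(shared, half, n)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return list(shared)                       # the parent sees what the workers wrote

if __name__ == "__main__":
    print(run_shared_array_demo())
```

For numpy, the usual trick (as far as I understand the linked answers) is to view such a buffer with np.frombuffer so that no copy is made. Note this only works for flat numeric data, not for arbitrary Python objects like Path lists.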
-> B) Using external tools/managers for storing the shared-access objects, instead of storing them in the main process and the forked processes. Solutions could be the Pyro library as mentioned by the winners, but something like Redis might also be interesting. @vitaliy experimented with this in the context of this competition almost 2 months ago, unfortunately without any replies. We should take a closer look at that, I think!
Disclaimer: I am not an expert in any of this, I just followed a lot of Stack Overflow link trails.
If you think it is useful maybe I should split this long post out into a separate topic, after working in some of you guys’ comments/corrections.