ZeroMQ library

Hi folks.

I have issues with the ZeroMQ library on Ubuntu 18.04.
It is a low-level library for asynchronous messaging, and the system makes heavy use of it for interprocess communication when you work with fastai and pytorch.

As I meddle with data inside my notebooks, the kernel sometimes hangs (even when there are plenty of resources available).

If I run top during data-intensive operations, there are several ZeroMQ processes, and that's normal: the threads do communicate.
But then, sometimes, the notebook kernel hangs, and I can see, in top, a single ZeroMQ process occupying 100% of one core. The process is unkillable (even when zombified, it keeps eating 100% of one core).
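
In case it helps anyone reproduce this, here is a minimal sketch of how one could confirm which process is stuck and in what state (psutil assumed installed; the 90% threshold is arbitrary):

```python
import time

import psutil  # assumed installed: pip install psutil

# Prime the per-process CPU counters, then sample again after a short wait.
for p in psutil.process_iter():
    try:
        p.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(2)

for p in psutil.process_iter(['pid', 'name', 'status']):
    try:
        cpu = p.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
    if cpu > 90:  # candidate for the stuck 100%-of-one-core process
        print(p.info['pid'], p.info['name'], p.info['status'], f"{cpu:.0f}%")
```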

This problem happens randomly, but if the data-intensive operation is long, you can rest assured it will happen: for example, I am unable to finish tokenizing a large corpus because of it.

I even reinstalled my 18.04 from scratch: now the system has just fastai 0.7 and 1.x on it, but the problem still persists.

My question is: have you encountered the same problem? I'm desperately trying to root out the cause.

Thanks!

Check this: https://stackoverflow.com/questions/29212006/random-freezing-hanging-in-python-zeromq

Thanks for your reply.

Nonetheless, the Stack Overflow thread is written from the perspective of a programmer who needs to use ZMQ directly.

Since the issue happens while I manipulate data (without even feeding it to a learner, so I think pytorch is blameless), I think it should be addressed at the fastai level.
The more data you manipulate, the more often it happens, and always when multiprocessing is involved; see the reproduction sketch below.
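
To separate fastai from plain multiprocessing, something like this tiny reproduction could help (fake_tokenize and the chunk sizes are placeholders, not the real fastai tokenizer):

```python
from multiprocessing import Pool

def fake_tokenize(chunk):
    # Stand-in for the real per-chunk work (e.g. spaCy tokenization).
    return [len(line.split()) for line in chunk]

if __name__ == '__main__':
    # Dummy corpus roughly mimicking many short documents.
    corpus = ['some text to process'] * 1_000_000
    chunks = [corpus[i:i + 10_000] for i in range(0, len(corpus), 10_000)]
    with Pool(processes=8) as pool:
        results = pool.map(fake_tokenize, chunks)
    print(sum(len(r) for r in results), 'documents processed')
```

If the hang also shows up here, the problem sits below fastai; if not, the fastai data-manipulation code becomes the prime suspect.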

It happens on two of my personal DL boxes, on a P3 AWS instance, and even on an Nvidia DGX Station I use at work, so I think the hardware is blameless as well.

So you may need to look at this side of things (https://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended); maybe something is leaking on the fastai side: https://forums.fast.ai/search?q=memory%20leak
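
For a first pass without installing anything, the standard-library tracemalloc can already show where the memory goes; just a rough sketch, with the suspicious data-manipulation step standing in for the commented line:

```python
import tracemalloc

tracemalloc.start()

# ... run the suspicious data-manipulation step here ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)  # top 10 allocation sites by size
```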

We have forum-chatted about this topic before, and I have no real solutions, so I had hoped someone else could contribute more insights to this thread…

But anyway, to maybe help with the “forensics” a bit, I just wanted to point out that the problem could very well have nothing to do with either fastai or pytorch; it could be in some other module that you use and lots of other people do NOT use, hence they have no problems.

Check this in the Jupyter docs: Jupyter notebooks themselves use ZeroMQ to communicate with the Python kernels. So even with no deep learning library involved there will be ZeroMQ processes, and if something goes wrong there, it could just be something “random” in a Jupyter notebook.
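
One quick thing worth checking is which ZeroMQ versions the notebook server is actually running (this snippet only reports versions, nothing more):

```python
import zmq
import jupyter_client

# libzmq is the underlying C library, pyzmq the Python binding Jupyter uses.
print('libzmq:', zmq.zmq_version())
print('pyzmq:', zmq.pyzmq_version())
print('jupyter_client:', jupyter_client.__version__)
```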

I have myself had big problems with Jupyter this weekend and had to hard-reset my computer twice to get out of it (but I couldn’t even look at top or ps anymore, so I don’t know whether there was a hanging ZeroMQ process).

I had the distinct feeling that this was caused by my heavy use of feather files for reading data (a file close to 2 GB). Sometimes it opened in seconds, and sometimes the fan would come on roaring and the machine would die. So I have stopped using feather for now. But I had been using it without problems for months, so I don’t know whether the issue is size or some new incompatibility introduced by some update somewhere.

I also had a problem when using widgets and many matplotlib plots in a Jupyter notebook: I could not save the notebooks anymore. The kernels would run and I could execute the code, but I could not save, make a copy, save as, or duplicate. All of those would lead to hanging Jupyter browser windows. Only restarting Jupyter AND the browser solved this.

So my point is, maybe there is some bug/instability on the Jupyter side. You could try downgrading Jupyter to a version from a few months ago and check whether your problems persist?!
Sorry if this is not the help you are looking for… :frowning:

Same here. It is a big hindrance, and nobody seems to care about it.

Just to rule Jupyter itself out, I exported my notebook as Python source and ran it (the part of lesson 10 / part 2 which tokenizes the text corpus; only I ran it not over IMDB but over a big fat Wikipedia dump of mine).

Same outcome. A single, unkillable zmq process sucking up 100% of one core.

To be honest, since it seems the problem persists independently of Jupyter, I just downgraded libzmq (from .5 to .2, the minimum available as a conda package; the same for pyzmq). Same result :frowning:

No, you have been very helpful: at least now I know it’s not something related to my DL setups. Could you elaborate a bit on feather, and how you managed to get rid of it in your workflow? Which parts of a typical fastai 0.7/1.0 setup make use of feather?

Thanks! :slight_smile:

So of course, if the problems persist even when you are not using Jupyter, then that part is out of the question. I had to google this again and came up with this, which in turn led me here (the ZeroMQ FAQ). It basically says that ZeroMQ does not protect against memory overflow and that messages need to fit in RAM. It also says that messages take RAM on both sides (sender and receiver), which could mean the memory footprint potentially gets multiplied. So if you were using huge files/corpora, then maybe the process dies uncontrolled due to memory issues?! You could check by gradually increasing the size and seeing whether that changes something, and also by monitoring RAM through system tools (see the small sketch after the quoted FAQ below). I guess this is a long shot, but if this were the issue, then something about the handling/loading of the data on the pytorch or fastai side would be very strange…

Does ZeroMQ buffer my entire message in memory before sending to its recipients?

Yes. So if you are sending multi-gigabyte messages, it will consume RAM to store those messages on both sides. The memory on the sender side is freed after last part of the message has been sent. Also, see point above.

Is it possible to receive EAGAIN or to block when sending a multi-part message?

Multipart messages are sent as an atomic unit. If one part is sent successfully, then the socket is guaranteed to not block or return EAGAIN until all parts have been sent. Of course, if you run out of memory in the meantime then the guarantee doesn’t mean much, so don’t try to send messages that exceed available memory.

Are you trying to tell me that ZeroMQ won’t magically save me from out-of-memory conditions?

You are a perfect candidate for using AMQP or JMS. Please use Google to find more information on libraries that support those platforms and protocols.
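
For the RAM monitoring mentioned above, something as simple as this could run in a second terminal while the tokenization job is going (psutil assumed installed; the 5-second interval is arbitrary):

```python
import time

import psutil  # assumed installed: pip install psutil

# Log overall memory and swap usage every few seconds while the heavy job runs.
while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB, "
          f"swap used: {swap.used / 1e9:.1f} GB")
    time.sleep(5)
```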

Re feather: I always use it when I deal with e.g. csv files, which are quite slow to load. So I do my modifications to the csv (like pd.to_datetime etc.) and then I save it as feather. From then on I only use the feather file instead of the original csv, as it loads orders of magnitude faster and retains all (well, most) datatypes and properties of the original dataframe. So there is much less manipulation required after reading it back in (I basically only set the index).
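
In code that workflow looks roughly like this (file and column names are made up, and feather support in pandas needs pyarrow installed):

```python
import pandas as pd

# One-time conversion: clean up the slow-to-load csv and store it as feather.
df = pd.read_csv('events.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.to_feather('events.feather')

# From then on, load the feather file instead; it keeps most dtypes intact,
# so only the index needs to be set again.
df = pd.read_feather('events.feather')
df = df.set_index('timestamp')
```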

Getting rid of it means ‘simply’ using other storage formats like pickle, zipped pickle, or hdf, all of which have their own pros and cons. Feather is just very convenient, and it’s also good for exchanging data in pipelines which include R users (even if we don’t like them :wink: ), and possibly for large-scale stuff that supports the arrow format (although that has never been my use case so far).

Reading the memory stuff above just now, it could very well be that my problems had to do with the size of the files I was handling (and possible memory issues), which would explain why I never had those problems before (having mostly used much smaller files).

So the solution seems to be: get more RAM :wink:

On my personal machine (64 GB), it occupies all the RAM and part of the swap, then hangs.

But the DGX Station has 256 GB of RAM. When the process hangs, no more than 90 GB are occupied (on average, over various attempts).

I’m afraid that won’t solve the problem: see the DGX example.

Thanks for your clarifications about feather, but I think it’s blameless as well (I don’t use it).

I even tried to use just ONE thread for processing the entire corpus: apart from the slowdown, it complains that it cannot serialize any object larger than 4 GB.
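
If that 4 GB complaint comes from pickling the payload handed to the worker processes (which is my guess), one workaround might be to split the corpus into many smaller chunks so that no single serialized object gets anywhere near that size. A rough sketch, not the actual fastai tokenizer:

```python
from concurrent.futures import ProcessPoolExecutor

def tokenize_chunk(texts):
    # Stand-in for the real tokenizer; returns one token list per document.
    return [t.split() for t in texts]

def tokenize_corpus(corpus, chunk_size=5_000, workers=4):
    # Keep each pickled payload small so no single object nears the 4 GB limit.
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(tokenize_chunk, chunks))
    return [tokens for chunk in results for tokens in chunk]
```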

You know, I’m beginning to feel frustrated by the fact that if you want to do deep learning today, you are basically caged in a single, strict ecosystem: Nvidia hardware and libraries on Linux. No choice whatsoever about hardware, libraries, or OS.