I was just watching the part about recreating DataLoaders from scratch, and it reminded me of an issue I ran into while working on a StarSpace model implemented in PyTorch. StarSpace is rather generic, but to illustrate the problem, think of it as mimicking word2vec: 5 token ids + a label id make up a single training example.
Now, the problem is the overhead of loading data through PyTorch's DataLoader abstraction. The way DataLoader (and Dataset) are designed, you have to "touch" every training example separately (because __getitem__ takes a single index as an argument), and that ends up killing dataloading performance. In my specific scenario, I was training on around 1 billion examples and training was taking over 5 days, yet GPU utilization was in the low single digits. When I looked more closely, I realized that training was entirely CPU-bound, due to Python overhead. As in, even if you served constant data (from an array), you would still get low throughput purely from the per-example Python calls.
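To make that concrete, here's a tiny benchmark sketch (the sizes and the 6-wide example layout are just illustrative): the dataset does no I/O or parsing at all and returns the same pre-built tensor every time, so whatever throughput you measure is a ceiling set purely by Python call and collation overhead.

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class ConstantDataset(Dataset):
    """Serves the same pre-built example every time: no I/O, no parsing.
    Any slowness measured here is pure Python / DataLoader overhead."""
    def __init__(self, n_examples):
        self.n_examples = n_examples
        # 5 token ids + 1 label id, as in the StarSpace setup above
        self.example = torch.zeros(6, dtype=torch.long)

    def __len__(self):
        return self.n_examples

    def __getitem__(self, idx):
        # called once per example -- a billion examples means a billion calls
        return self.example

loader = DataLoader(ConstantDataset(1_000_000), batch_size=1024)
start = time.time()
for batch in loader:
    pass
print(f"{1_000_000 / (time.time() - start):,.0f} examples/sec")
```

Even with multiple workers, one __getitem__ call per example plus default collation puts a hard cap on throughput that has nothing to do with your actual data.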
I wonder if other people have run into an issue like this? I believe it naturally arises any time your examples are small (which is often the case for categorical data).
PS. I ended up fixing this problem and got training down to 37 hours.
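In case it's useful to others, the gist (a sketch of the pattern, not my exact code) is to move batching into the dataset itself, so each __getitem__ call serves a whole batch with one vectorized indexing op instead of one Python call per example. Passing a BatchSampler as the sampler with batch_size=None makes the DataLoader hand your dataset a whole list of indices at once:

```python
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, RandomSampler

class BatchedDataset(Dataset):
    """Keeps all examples in one big tensor and serves whole batches:
    one fancy-indexing gather per batch instead of one call per row."""
    def __init__(self, data):
        self.data = data  # shape (n_examples, 6): 5 token ids + label id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # receives a whole *list* of indices from the BatchSampler
        return self.data[indices]

data = torch.randint(0, 50_000, (1_000_000, 6))  # made-up vocab and sizes
sampler = BatchSampler(RandomSampler(range(len(data))),
                       batch_size=1024, drop_last=False)
# batch_size=None disables automatic batching; the sampler already
# yields full batches, so no per-example collation happens
loader = DataLoader(BatchedDataset(data), sampler=sampler, batch_size=None)

for batch in loader:  # each batch is a (1024, 6) LongTensor
    pass
```

The win comes from amortization: the Python-level work per batch is constant, so the per-example overhead shrinks by roughly a factor of the batch size.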