I have a 184GB bcolz file (13GB compressed). I want to split it into train/valid and run a forward pass through a number of pretrained models. Does anyone have any useful tips for doing this at this scale?
Currently I am splitting by using an index to randomly access the train/valid subsets, so every batch of 64 needs 64 separate reads/decompresses. I am doing a .copy() on the bcolz file to ensure the random access happens in memory rather than on disk. However, the read takes about 50 minutes, and the resnet18 forward pass then takes 2.25 hours on a p2.xlarge.
I am wondering if it would be better to first split into train/test by reading the whole file sequentially, appending each row to train with probability 0.9 and to test with probability 0.1. Then for actual training I could read sequential batches.
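A minimal sketch of that one-pass split, assuming only that the source supports `len()` and slice reads (as a NumPy array does) and that each sink has an `append` method taking a block of rows. The function name `sequential_split`, its parameters, and the default chunk size are made up for illustration; they are not part of bcolz.

```python
import numpy as np

def sequential_split(source, train_sink, valid_sink, valid_frac=0.1,
                     chunk_rows=4096, seed=0):
    """Read `source` once in contiguous chunks and route each row to
    `valid_sink` with probability `valid_frac`, otherwise to `train_sink`.

    Hypothetical helper: `source` needs len() and slicing, the sinks need
    .append(block_of_rows).
    """
    rng = np.random.default_rng(seed)
    n_train = n_valid = 0
    for start in range(0, len(source), chunk_rows):
        # One contiguous slice read per chunk instead of one read per row.
        chunk = np.asarray(source[start:start + chunk_rows])
        # Independent coin flip per row: True means the row goes to valid.
        to_valid = rng.random(len(chunk)) < valid_frac
        if to_valid.any():
            valid_sink.append(chunk[to_valid])
        if (~to_valid).any():
            train_sink.append(chunk[~to_valid])
        n_valid += int(to_valid.sum())
        n_train += int((~to_valid).sum())
    return n_train, n_valid
```

With plain Python lists as sinks, each `append` stores one block and the blocks can be joined later with `np.concatenate`. On-disk containers that expose a row-wise `append` could be passed instead. Note that the split sizes are only approximately 90/10, since each row is assigned independently.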
Another possibility is to split the file into chunks of, say, 40GB, convert each chunk to numpy, and do the forward pass on it. Then I would split the output into train/test. The output is a smaller file, but still potentially bigger than RAM when decompressed.
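A rough sketch of the chunked variant, under the assumption that the source supports `len()` and slice reads and that the model is wrapped in a callable mapping a NumPy batch to a NumPy array of features. The names `chunked_features`, `split_indices`, and `forward_fn` are hypothetical; with PyTorch, `forward_fn` would typically move the batch to the GPU and call the model under `torch.no_grad()`.

```python
import numpy as np

def chunked_features(source, forward_fn, chunk_rows=10_000, batch_size=64):
    """Yield one feature array per chunk of `source`.

    Hypothetical helper: each chunk is read with a single contiguous slice,
    then pushed through `forward_fn` in mini-batches.
    """
    for start in range(0, len(source), chunk_rows):
        chunk = np.asarray(source[start:start + chunk_rows])
        outputs = [forward_fn(chunk[i:i + batch_size])
                   for i in range(0, len(chunk), batch_size)]
        yield np.concatenate(outputs)

def split_indices(n, valid_frac=0.1, seed=0):
    """Return sorted (train_idx, valid_idx) for a random split of n rows."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_valid = int(round(n * valid_frac))
    return np.sort(perm[n_valid:]), np.sort(perm[:n_valid])
```

Each yielded array can be appended to an output store as it is produced, so only one chunk of inputs is held in memory at a time. The random split is then done on the feature rows via `split_indices`, which only needs the row count.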
Any other suggestions?