Bcolz vs datagenerator


I am struggling with the following points:

  1. When should bcolz be used instead of a Keras data generator? It looks like a Keras model has APIs that accept either an array (with a batch size) or a data generator.
  2. Is there a performance improvement when using bcolz with the fit() API over using a data generator with fit_generator()?

Another post also mentions dask.

  1. Is dask better than bcolz?


  1. bcolz should be used when you need to precompute the images and store them in array format. This saves time and memory later during training, because you precompute the data once rather than every time you use it.

  2. I can't say much about the performance difference, as I have never measured it; I always prefer fit_generator() over bcolz.

I think bcolz uses dask internally, but that is still debatable. I would be happy if someone here could help me understand the practical use cases for “dask” versus “bcolz”.


bcolz is just a disk-based file format that lets you work with more data than fits in memory.

fit_generator is often used with ImageDataGenerator, which batches up images already stored on disk. In that case there is no need to convert them to bcolz or anything else. However, it can also be used with any other generator, e.g. a bcolz array iterator, if you have more data than fits in memory.
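
To make the "any other generator" point concrete, here is a minimal sketch of a batch generator. It assumes only that x and y support len() and slicing, which a disk-backed array such as a bcolz carray does; the plain lists below are stand-ins for the demo, and batch_generator is a made-up name, not a Keras or bcolz API.

```python
import itertools


def batch_generator(x, y, batch_size):
    """Yield (x_batch, y_batch) pairs forever, looping over the data."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Slicing reads only this batch; with a disk-backed array
            # the rest of the data stays on disk.
            yield x[start:stop], y[start:stop]


# Demo: plain lists standing in for disk-backed arrays (assumption).
features = list(range(10))
labels = [v % 2 for v in features]

gen = batch_generator(features, labels, batch_size=4)
first_three = list(itertools.islice(gen, 3))

# Number of batches that make up one pass over the data (ceiling division).
steps_per_epoch = -(-len(features) // 4)
```

A generator like this would then be handed to fit_generator together with the number of batches per epoch. The generator never stops on its own, so the caller has to say how many batches count as one epoch.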

Another use is when you want to compute the output of a CNN layer and that output is too big for memory. In that case you can write the output to a file as you go along, for example to a bcolz array.
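
A rough sketch of the "write as you go" idea follows. The names predict_in_chunks, predict_fn and write are hypothetical: predict_fn stands in for something like a model's predict method, and write for whatever appends a chunk to your on-disk store (with bcolz that would probably be the carray's append method, followed by a flush at the end).

```python
def predict_in_chunks(predict_fn, data, chunk_size, write):
    """Run predict_fn over data one chunk at a time and pass each result to write."""
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        out = predict_fn(chunk)
        # Only one chunk of output is held in memory at a time.
        write(out)
        total += len(out)
    return total


def fake_predict(chunk):
    """Stand-in for a real model: doubles every input value."""
    return [2 * v for v in chunk]


# Demo: a list plays the role of the on-disk array (assumption).
stored = []
written = predict_in_chunks(fake_predict, list(range(7)), chunk_size=3,
                            write=stored.extend)
```

Passing the writer in as a callable keeps the loop the same whether the output goes to a list, a bcolz array or some other store.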

dask is for managing tasks in parallel. It has its own native formats and can read bcolz arrays.