Bcolz_array_iterator

ben.bowles · March 19, 2017, 10:29pm

Hi there,

Has anyone found this module anywhere? I can’t find it in the downloadable materials. If anyone knows where it lives, I’d appreciate a pointer, thanks!

Ben

Matthew · March 19, 2017, 11:22pm

https://github.com/jph00/part2/blob/master/bcolz_array_iterator.py

import numpy as np
import bcolz
import threading

class BcolzArrayIterator(object):
    """
    Returns an iterator object into Bcolz carray files
    Original version by Thiago Ramon Gonçalves Montoya
    Docs (and discovery) by @MPJansen
    Refactoring, performance improvements, fixes by Jeremy Howard j@fast.ai
        :Example:
        X = bcolz.open('file_path/feature_file.bc', mode='r')
        y = bcolz.open('file_path/label_file.bc', mode='r')
        trn_batches = BcolzArrayIterator(X, y, batch_size=64, shuffle=True)
        model.fit_generator(generator=trn_batches, samples_per_epoch=trn_batches.N, nb_epoch=1)
        :param X: Input features
        :param y: (optional) Input labels
        :param w: (optional) Input feature weights
        :param batch_size: (optional) Batch size, defaults to 32
        :param shuffle: (optional) Shuffle batches, defaults to false

This file has been truncated.

iNLyze · April 16, 2017, 10:24pm

Has anyone had this issue when creating an on-disk carray:

I tried to create a bcolz carray of shape (11000, 500, 500, 3) and got this error. It works with about 8000 samples (i.e. 8000 × 500 × 500 × 3 × 4 bytes ≈ 24 GB). I get the impression that, even though I set “rootdir=mydirectory”, bcolz still tries to create an np.array as an intermediate step (see the error output cited above). If that is true, it would be a major issue with using bcolz for larger-than-RAM data.
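For anyone checking the numbers, here is a quick sketch of the size arithmetic. The 4-byte item size (float32) is an assumption inferred from the ~24 GB figure, and `array_nbytes` is just an illustrative helper, not part of bcolz.

```python
from math import prod


def array_nbytes(shape, itemsize=4):
    """Bytes needed for a dense array of `shape`; itemsize=4 assumes float32."""
    return prod(shape) * itemsize


GB = 10**9
small = array_nbytes((8000, 500, 500, 3))
large = array_nbytes((11000, 500, 500, 3))
print(small / GB)  # 24.0
print(large / GB)  # 33.0
```

So the full 11000-sample array would need roughly 33 GB if it were ever materialized in memory in one piece.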

iNLyze · April 16, 2017, 11:49pm

Workaround: do not pre-allocate the entire array. Instead, create an empty carray and append to it:

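import numpy as np
import bcolz
# carray() takes array data, not a shape: start from a zero-row array (float32 assumed).
c = bcolz.carray(np.empty((0, height, width, channels), dtype='float32'), rootdir=mydir, mode='w', **kwargs)
# ...
c.append(myarray)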
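A slightly fuller sketch of the same idea, filling the carray chunk by chunk so that only one chunk is in RAM at a time. The names `chunk_bounds`, `build_carray`, `load_chunk` and `chunk_size` are hypothetical and only here for illustration; the float32 dtype is also an assumption.

```python
def chunk_bounds(n_samples, chunk_size):
    """Yield (start, stop) index pairs that together cover range(n_samples)."""
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)


def build_carray(rootdir, n_samples, sample_shape, load_chunk,
                 chunk_size=256, dtype='float32'):
    """Fill an on-disk carray one chunk at a time.

    `load_chunk(start, stop)` is a hypothetical callback that returns an
    array of shape (stop - start,) + sample_shape.
    """
    # Optional dependencies, imported here so chunk_bounds works without them.
    import numpy as np
    import bcolz

    # Zero rows to begin with, so nothing the size of the full array is allocated.
    empty = np.empty((0,) + tuple(sample_shape), dtype=dtype)
    c = bcolz.carray(empty, rootdir=rootdir, mode='w')
    for start, stop in chunk_bounds(n_samples, chunk_size):
        c.append(load_chunk(start, stop))
    c.flush()  # make sure buffered data is written to disk
    return c


print(list(chunk_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Peak memory then depends on `chunk_size` rather than on the total number of samples.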

shy · May 17, 2017, 5:19am

It seems that this iterator cannot run multi-threaded (through the workers parameter of fit_generator) because of the lock. Is there any way to make it parallel? Thanks.

jeremy · May 19, 2017, 11:45pm

There’s not really any reason to, since it’s not doing any processing.
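To illustrate what a lock in an iterator like this typically protects, here is a generic sketch. It is not the code from the linked file, and it assumes the lock only guards the cheap batch bookkeeping; `LockedBatchIndexer` and `collect` are made-up names.

```python
import threading


class LockedBatchIndexer:
    """Hands out (start, stop) batch bounds to several threads safely."""

    def __init__(self, n_samples, batch_size=32):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.position = 0
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only the bookkeeping is serialized; it is cheap, so the lock
        # is held very briefly.
        with self.lock:
            if self.position >= self.n_samples:
                raise StopIteration
            start = self.position
            stop = min(start + self.batch_size, self.n_samples)
            self.position = stop
        return start, stop


def collect(indexer, out):
    """Drain the shared indexer from one worker thread."""
    for bounds in indexer:
        out.append(bounds)  # list.append is thread-safe in CPython


indexer = LockedBatchIndexer(n_samples=100, batch_size=32)
seen = []
threads = [threading.Thread(target=collect, args=(indexer, seen))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(seen))  # [(0, 32), (32, 64), (64, 96), (96, 100)]
```

Each batch is handed out exactly once regardless of how many workers pull from the iterator. With no per-batch processing to spread across threads, extra workers mostly just take turns on the lock.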

iNLyze · April 26, 2018, 8:18pm

I recently ran into another issue, along the lines of:

        RuntimeError('fatal error during Blosc
        decompression: -1',) in
        'bcolz.carray_ext.chunk._getitem' ignored

I figured it must be related to several threads accessing the same carrays on disk. I had been preparing data in one notebook and reading from it in another using a generator. When I stopped one of the notebooks the error disappeared.