Speed up Keras fit generator process

I have a 5 GB CSV file of data which I would like to use to train an autoencoder in Keras. As per suggestions in the Keras GitHub issues, I used a generator to read from this file:

def my_generator():
    # Loop forever so Keras can keep requesting batches across epochs
    while True:
        # Read the CSV lazily, 128 rows at a time
        un_training_data = pd.read_csv('swissprot_bacteria_seqs.fa_trigrams',
                                       names=sorted(all_tri_grams),
                                       chunksize=128)
        for chunk in un_training_data:
            # Convert the pandas chunk into a float numpy array
            batch_features = chunk.values.astype(float)
            # Autoencoder: the input is also the target
            yield batch_features, batch_features

autoencoder_4000_bn.fit_generator(my_generator(), samples_per_epoch=332926,
                                  nb_epoch=10, callbacks=[checkpoint])

But this is much slower than calling the Keras fit function on a numpy array. I think the pandas read function is the bottleneck here. Obviously, I cannot load a 5 GB file into a single numpy array. Can anyone please suggest how I can speed up this process?

Are you sure you can’t read a 5 GB file into memory?

Try dask.dataframe.read_csv instead of pandas.read_csv if you really can’t. Then you can use the normal Keras.fit method instead of fit_generator.
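
A minimal sketch of what that could look like (the file, column names, model, and callback come from the post above; everything else is an assumption, and as discussed further down the resulting array needs known chunk sizes for Keras to slice it):

import dask.dataframe as dd

# Read the CSV lazily with Dask instead of pandas
df = dd.read_csv('swissprot_bacteria_seqs.fa_trigrams', names=sorted(all_tri_grams))

# Lazy Dask array view of the dataframe (input == target for the autoencoder)
x = df.astype(float).values

# Plain fit instead of fit_generator (Keras 1 style keywords, matching the post above)
autoencoder_4000_bn.fit(x, x, batch_size=128, nb_epoch=10, callbacks=[checkpoint])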


Many thanks. I tried with dask and it worked. The training process is way faster now.

Hi @nafizh, I'm also trying to read pandas files batch-wise, and I have code similar to yours. No matter how many times I run next(my_generator()), it always returns the first 128 rows. How can I know for sure that Keras is going through all of the data?
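
One thing to check when testing this (a small sketch, assuming the my_generator definition above): each call to my_generator() builds a brand-new generator object, so next(my_generator()) always starts again at the first chunk. Keras holds on to a single generator and keeps advancing it, which you can mimic yourself:

gen = my_generator()      # create the generator once
batch_1 = next(gen)       # first 128 rows
batch_2 = next(gen)       # next 128 rows, since the same object keeps its position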

Hi guys.

How are you able to execute Keras.fit() when using dask dataframes?
I get this error when trying to fit:
ValueError: ('Arrays chunk sizes are unknown: %s', (nan, 28))

The code I’m running is this:


import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import dask.dataframe as dd

df = dd.read_csv('HIGGS.csv', header=None)

# Column 0 is the label; columns 1-28 are the features
x = df.drop([0], axis=1)
y = df.drop([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28], axis=1)

model = Sequential()
model.add(Dense(1, input_dim=28, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='sgd')
model.fit(x, y)

I have only used Dask arrays, not Dataframes.

You would need to find a workaround for this issue (unknown chunk sizes):

From Dask.dataframe
You can create dask arrays from dask dataframes using the .values attribute or the .to_records() method.

x = df.values
x = df.to_records()
However these arrays do not have known chunk sizes (dask.dataframe does not track the number of rows in each partition) and so some operations like slicing will not operate correctly.

http://dask.pydata.org/en/latest/array-creation.html

Using a Dask array in Keras requires the ability to slice and index, so unknown chunk sizes won’t work.
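
If your Dask version supports it, one possible workaround (a sketch, not tested against this exact setup; to_dask_array(lengths=True) and compute_chunk_sizes() only exist in newer Dask releases) is to have Dask compute the partition lengths so the resulting array has known chunks:

import dask.dataframe as dd

df = dd.read_csv('HIGGS.csv', header=None)

# Ask Dask to compute partition lengths up front, giving known chunk sizes
x = df.drop([0], axis=1).to_dask_array(lengths=True)
y = df[[0]].to_dask_array(lengths=True)

# Or, on an existing array with unknown chunks:
# x = df.drop([0], axis=1).values.compute_chunk_sizes()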