I have a 5 GB CSV file of data that I would like to use to train an autoencoder in Keras. As suggested in other Keras GitHub issues, I used a generator to read from this file:
def my_generator():
    # Stream the CSV in chunks of 128 rows and yield each chunk as a batch
    while True:
        un_training_data = pd.read_csv('swissprot_bacteria_seqs.fa_trigrams',
                                       names=sorted(all_tri_grams),
                                       chunksize=128)
        for chunk in un_training_data:
            # pull the values out of the pandas chunk into a numpy array
            batch_features = chunk.values.astype(float)
            # the autoencoder's input and target are the same batch
            yield batch_features, batch_features
autoencoder_4000_bn.fit_generator(my_generator(), samples_per_epoch=332926,
                                  nb_epoch=10, callbacks=[checkpoint])
But this is much slower than calling Keras's fit on a numpy array; I think the pandas read_csv call is the bottleneck here. Obviously, I cannot load a 5 GB file into a numpy array in one go. Can anyone suggest how to speed this up?
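One workaround that usually helps here (a sketch, assuming the trigram CSV is entirely numeric; the file name 'trigrams.npy' and the memmap_generator name are made up, and all_tri_grams is the variable from the snippet above): pay the CSV-parsing cost once by converting the file to a .npy memmap, then yield batches straight from the memmap on every epoch.

import numpy as np
import pandas as pd

# One-off conversion: stream the CSV in chunks and write the rows into a
# .npy memmap so later epochs never have to re-parse the CSV.
# The shape is an assumption: 332926 rows (samples_per_epoch above) by
# len(all_tri_grams) columns.
n_rows, n_cols = 332926, len(all_tri_grams)
mm = np.lib.format.open_memmap('trigrams.npy', mode='w+',
                               dtype='float32', shape=(n_rows, n_cols))
offset = 0
for chunk in pd.read_csv('swissprot_bacteria_seqs.fa_trigrams',
                         names=sorted(all_tri_grams), chunksize=4096):
    values = chunk.values.astype('float32')
    mm[offset:offset + len(values)] = values
    offset += len(values)
mm.flush()

def memmap_generator(path='trigrams.npy', batch_size=128):
    data = np.load(path, mmap_mode='r')       # memory-mapped, not loaded into RAM
    while True:
        for start in range(0, data.shape[0], batch_size):
            batch = np.asarray(data[start:start + batch_size])
            yield batch, batch                 # autoencoder: input == target

The resulting generator should drop into fit_generator in place of my_generator() without changing anything else.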
Hi @nafizh, I'm also trying to read pandas files batch-wise, with code similar to yours. No matter how many times I run next(my_generator()), it always returns the first 128 rows. How can I know for sure that Keras is going through all of the data?
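If it helps: every call to my_generator() builds a brand-new generator object, so next(my_generator()) always restarts at the first chunk. fit_generator holds on to a single generator and keeps calling next() on it, so it does advance through the file. A quick way to check by hand:

gen = my_generator()          # create the generator once
first_batch, _ = next(gen)    # rows 0-127
second_batch, _ = next(gen)   # rows 128-255, showing the generator advances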
How are you able to run Keras's fit() with dask dataframes? When I try to fit, I get this error:
ValueError: ('Arrays chunk sizes are unknown: %s', (nan, 28))
The code I'm running is this:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import dask.dataframe as dd
df = dd.read_csv('HIGGS.csv', header=None)
x = df.drop([0], axis=1)
y = df.drop([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28], axis=1)
model = Sequential()
model.add(Dense(1, input_dim=28, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
model.fit(x,y)
You would need to find a workaround for this issue (unknown chunk sizes). From the dask.dataframe documentation:
You can create dask arrays from dask dataframes using the .values attribute or the .to_records() method.
x = df.values
x = df.to_records()
However, these arrays do not have known chunk sizes (dask.dataframe does not track the number of rows in each partition), and so some operations like slicing will not operate correctly.
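For what it's worth, here is a sketch of one way around the unknown chunk sizes. It assumes a reasonably recent dask (to_dask_array(lengths=True) is only available in newer releases, with compute_chunk_sizes() on the array as an alternative), and whether model.fit then accepts the dask arrays directly still depends on your Keras version.

import dask.dataframe as dd

df = dd.read_csv('HIGGS.csv', header=None)
x = df.drop([0], axis=1)
y = df[[0]]   # equivalent to the long drop(...) above: keep only the label column

# Ask dask to compute the length of every partition up front, so the
# resulting arrays have known chunk sizes instead of (nan, 28).
x_arr = x.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

# On newer dask versions an alternative is:
# x_arr = x.values
# x_arr.compute_chunk_sizes()

# model is the Sequential model defined above; whether fit accepts dask
# arrays directly depends on the Keras version.
model.fit(x_arr, y_arr)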