How to format data/Keras shapes and define an encoding / input layer for Keras

Hi All,

I have been doing the course for the last couple of months and loving it.

Whilst I can follow the examples in the course, I thought I would go and try to create my own deep learning problem, as recommended by Jeremy. Off piste, in other words.

Whilst the actual problem might take a while to explain, what I am trying to do is create a 1D conv input layer and then train on my data.

My aim is to associate a certain array with a certain ‘decision’ in my neural net. The input is codified as a list of 6 numbers, and the neural net must return a decision based on their order. This ‘essentially’ sorts the array.

I have spent a good five hours trying to work out what I am doing wrong, and believe it's something to do with:

  • my array shapes
  • my input or encoding layer
  • my use of numpy or python arrays

Here's a sample of my data and what I am trying to do.

[array([2, 3, 4, 5, 0, 1]), array([2, 3, 1, 0, 4, 5]), array([0, 1, 2, 3, 4, 5])]

encoding
array([4, 1, 0]) 

Essentially I want the neural net to learn that certain arrays require a certain step to get to the new (sorted) array.

I have tried this approach, but with no success:

vocab_size = 7
model = Sequential([
    Convolution1D(nb_filter=32, filter_length=3, border_mode='valid', input_dim=128, input_length=6),
    # also tried: Convolution1D(nb_filter=32, filter_length=3, border_mode='valid', input_shape=(6, 1)),
    Dense(100, activation='relu'),
    Dense(vocab_size, activation='sigmoid')
])

When I try to fit the data:

model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

trn[:2]

array([[2, 3, 4, 5, 0, 1],
       [2, 3, 1, 0, 4, 5]])

trn_labels[:2]

array([4, 1])

model.fit(trn, trn_labels, batch_size=4, nb_epoch=4)

ValueError: Error when checking model input: expected convolution1d_input_6 to have 3 dimensions, but got array with shape (61, 6)
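
Reading the error, it sounds like Convolution1D wants 3D input of shape (samples, steps, input_dim), so maybe my 2D (61, 6) array needs a trailing axis? A rough guess at the reshape (with a stand-in for trn so the snippet runs on its own):

import numpy as np

# stand-in for the (61, 6) training array above
trn = np.array([[2, 3, 4, 5, 0, 1],
                [2, 3, 1, 0, 4, 5]])

# Convolution1D expects (samples, steps, input_dim); add a trailing
# "channels" axis so each digit becomes one step with one channel
trn3d = np.expand_dims(trn, axis=-1)
print(trn3d.shape)  # (2, 6, 1) -- input_shape=(6, 1) on the layer would then match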

My full data that I will be stepping through and training my net on is:

[[array([2, 3, 4, 5, 0, 1]), array([2, 3, 1, 0, 4, 5]), array([0, 1, 2, 3, 4, 5])]
 [array([1, 0, 2, 3, 5, 4]), array([5, 4, 2, 3, 0, 1]), array([0, 1, 2, 3, 4, 5]), array([0, 1, 4, 5, 3, 2]), array([0, 1, 2, 3, 4, 5])]
 [array([1, 0, 3, 2, 4, 5]), array([1, 0, 4, 5, 2, 3]), array([5, 4, 1, 0, 2, 3]), array([5, 4, 2, 3, 0, 1]), array([5, 4, 0, 1, 3, 2]), array([5, 4, 2, 3, 0, 1]), array([0, 1, 2, 3, 4, 5])]
 [array([2, 3, 5, 4, 1, 0]), array([0, 1, 5, 4, 2, 3]), array([0, 1, 2, 3, 4, 5])]
 [array([5, 4, 2, 3, 0, 1]), array([5, 4, 0, 1, 3, 2]), array([2, 3, 0, 1, 5, 4]), array([2, 3, 5, 4, 1, 0]), array([4, 5, 2, 3, 1, 0]), array([4, 5, 0, 1, 2, 3]), array([4, 5, 2, 3, 1, 0]), array([0, 1, 2, 3, 4, 5])]
 [array([2, 3, 5, 4, 1, 0]), array([5, 4, 3, 2, 1, 0]), array([2, 3, 5, 4, 1, 0]), array([2, 3, 0, 1, 5, 4]), array([1, 0, 2, 3, 5, 4]), array([5, 4, 2, 3, 0, 1]), array([0, 1, 2, 3, 4, 5])]
 [array([4, 5, 2, 3, 1, 0]), array([0, 1, 2, 3, 4, 5])]
 [array([3, 2, 5, 4, 0, 1]), array([3, 2, 0, 1, 4, 5]), array([5, 4, 0, 1, 3, 2]), array([2, 3, 0, 1, 5, 4]), array([2, 3, 4, 5, 0, 1]), array([1, 0, 4, 5, 2, 3]), array([1, 0, 2, 3, 5, 4]), array([5, 4, 2, 3, 0, 1]), array([1, 0, 2, 3, 5, 4]), array([4, 5, 2, 3, 1, 0]), array([0, 1, 2, 3, 4, 5])]
 [array([1, 0, 3, 2, 4, 5]), array([4, 5, 3, 2, 0, 1]), array([0, 1, 3, 2, 5, 4]), array([2, 3, 0, 1, 5, 4]), array([2, 3, 4, 5, 0, 1]), array([2, 3, 1, 0, 4, 5]), array([0, 1, 2, 3, 4, 5])]
 [array([2, 3, 0, 1, 5, 4]), array([1, 0, 2, 3, 5, 4]), array([4, 5, 2, 3, 1, 0]), array([4, 5, 1, 0, 3, 2]), array([2, 3, 1, 0, 4, 5]), array([1, 0, 3, 2, 4, 5]), array([3, 2, 0, 1, 4, 5]), array([0, 1, 2, 3, 4, 5])]]

and my corresponding output labels are:



[array([4, 1, 0]) array([6, 6, 3, 4, 0]) array([3, 1, 3, 3, 4, 6, 0]) array([5, 3, 0])
 array([3, 5, 3, 1, 4, 3, 5, 0]) array([2, 1, 4, 1, 6, 6, 0]) array([5, 0])
 array([3, 5, 5, 4, 5, 3, 6, 5, 5, 5, 0]) array([6, 6, 1, 4, 4, 1, 0])
 array([1, 5, 3, 5, 2, 2, 2, 0])]

Any advice would be appreciated on how to approach this problem. This is a simple test case before I step up the scale of what I am trying to achieve.

Thanks a heap in advance.

It looks like you have lists of arrays with inconsistent numbers of elements.

Keras can handle data with one or more dimensions of shape None, but any batch you feed to your model needs to be expressible as a single numpy array (i.e. every row has the same number of elements, etc.).
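
For example, something like this with Keras's pad_sequences (note 0 is also a real label in your data, so in practice you'd want a dedicated padding value):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# two of the ragged label arrays from above
labels = [[4, 1, 0], [6, 6, 3, 4, 0]]

# pad at the end so every row has the same length,
# giving a single (2, 5) numpy array that Keras can batch
padded = pad_sequences(labels, padding='post', value=0)
print(padded.shape)  # (2, 5)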

Thanks @davecg.
So what you're saying is I either need my data structured like this:

i) padded out to create the same shape

input
[[array([2, 3, 4, 5, 0, 1]), array([2, 3, 1, 0, 4, 5]), array([0, 1, 2, 3, 4, 5]),array([0, 1, 2, 3, 4, 5]),array([0, 1, 2, 3, 4, 5])],
[array([1, 0, 2, 3, 5, 4]), array([5, 4, 2, 3, 0, 1]), array([0, 1, 2, 3, 4, 5]), array([0, 1, 4, 5, 3, 2]), array([0, 1, 2, 3, 4, 5])] ]

output labels
[ array([4, 1, 0, 0 ,0]),
array([6, 6, 3, 4, 0]) ]

or ii) flattened


input
[array([2, 3, 4, 5, 0, 1]),
 array([2, 3, 1, 0, 4, 5]),
 array([0, 1, 2, 3, 4, 5]),
 array([1, 0, 2, 3, 5, 4]),
 array([5, 4, 2, 3, 0, 1]),
 array([0, 1, 2, 3, 4, 5]),
 array([0, 1, 4, 5, 3, 2]),
 array([0, 1, 2, 3, 4, 5])]

output labels
[ 4, 1, 0, 6, 6, 3, 4, 0 ]
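
(For concreteness, I think I'd build the flattened version with numpy something like this, though I'm not sure it's the right approach:)

import numpy as np

# made-up nested data: one list of 6-digit arrays per sequence,
# with one label per step
sequences = [[np.array([2, 3, 4, 5, 0, 1]), np.array([0, 1, 2, 3, 4, 5])],
             [np.array([1, 0, 2, 3, 5, 4]), np.array([0, 1, 2, 3, 4, 5])]]
step_labels = [[4, 0], [1, 0]]

# flatten: every step becomes one (6,) row, with a matching flat label
flat_data = np.vstack([step for seq in sequences for step in seq])  # (4, 6)
flat_labels = np.array([l for seq in step_labels for l in seq])     # (4,)
print(flat_data.shape, flat_labels.shape)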

Can you suggest a simple Keras sequential model input that would deal with these two scenarios?
My best guesses are below, but I think I've mucked them up.

i)
model = Sequential([
    Convolution1D(nb_filter=32, filter_length=3, border_mode='valid', input_shape=(5, 6)),
    # or should this be input_length=6?
    Dense(vocab_size, activation='sigmoid')
])

ii)
vocab_size = 7
model = Sequential([
    Convolution1D(nb_filter=32, filter_length=3, border_mode='valid', input_dim=128, input_length=6),
    Dense(vocab_size, activation='sigmoid')
])

I think I'm not getting exactly what you're trying to do, but I can at least say what shape input and output you would get from your network.

Input would be (batch_size, length, 1). Make sure your input is a single numpy array with that shape (not a list of arrays).

Conv output would be (batch_size, length-2, nb_filter), given filter_length=3 and 'valid' border mode.

Final output will be (batch_size, vocab_size). Make sure your labels match this shape for each batch.

You may want softmax instead of sigmoid on that last layer if you want the total output to sum to 1 (otherwise it can add up to anything from 0 to vocab_size).
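
A quick numpy illustration of the difference:

import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0, 0.0, 1.5, -0.5])  # 7 class scores

sigmoid = 1 / (1 + np.exp(-logits))              # each unit squashed independently
softmax = np.exp(logits) / np.exp(logits).sum()  # normalised across all classes

print(sigmoid.sum())  # anywhere between 0 and 7
print(softmax.sum())  # exactly 1.0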

Thanks @davecg, I will give it a shot and post what works for me (for anyone else that reads this thread).

Hi @davecg and all (@jeremy),

So I had a bit of a break, and came back to this problem. I seem to be having a conceptual hurdle when it comes to understanding embeddings and output categorisations.

What I want is to train on data like [3, 2, 5, 4, 0, 1] and return a digit in the range 0-6 inclusive that represents my categories.

Thus I want the NN to learn that, when presented with 6 digits, it can associate their features with a returned digit in a way that's relevant to my model. Over time I want to give my model ways of working with larger data sets, e.g. 0-20 or 0-40. Essentially there's a sorting process going on that I want to try to model, but I'm just using a simple case to start with.

I have done a bunch of web searching and re-watched the course. I have been using this tutorial to help with getting my data into the right shape:
https://elitedatascience.com/keras-tutorial-deep-learning-in-python

You will see below that I have tried reshaping my data, but I haven't found it really works the way I think, e.g. like this:
MSFr = MSF.reshape(MSF.shape[0], 1, 6)

and I have been reviewing my understanding of embeddings:
http://wiki.fast.ai/index.php/Lesson_5_Notes#Keras_Functional_API_.3C00:23:40.3E

You can see in the output below that I have my last layer as

Dense(1, activation='sigmoid')])

This just seems to give me a number between 0 and 1. What I thought I might get is a number 0-6.
I have tried the following, but it complains that my output shape is wrong:

Dense(7, activation='sigmoid')])

ValueError: Error when checking model target: expected dense_18 to have shape (None, 7) but got array with shape (6477, 1)

Do I need to turn my labels into some sort of encoding for this to work? I think part of my problem is that, when not working with images in folders, I have to do some data massaging to communicate the training values: one hot encoding or enumeration. I really don't know, and my understanding isn't good enough.

I would really appreciate some more help to understand how to work on this problem, or custom data in general.

I have included my latest attempts at getting the data in and the predictions out of my model. I haven't got a validation set here at the moment.

Thanks in Advance


# the data

MSF.shape
(6477, 6)

MSF[:10]
array([[4, 5, 3, 2, 0, 1],
       [3, 2, 5, 4, 0, 1],
       [5, 4, 2, 3, 0, 1],
       [5, 4, 1, 0, 2, 3],
       [1, 0, 4, 5, 2, 3],
       [4, 5, 0, 1, 2, 3],
       [3, 2, 0, 1, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [4, 5, 3, 2, 0, 1],
       [1, 0, 3, 2, 4, 5]])

MSFr = MSF.reshape(MSF.shape[0], 1, 6)
MSFr.shape
(6477, 1, 6)
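
(To check my understanding of what the different shapes mean to different layers:)

import numpy as np

MSF = np.zeros((6477, 6), dtype=int)  # stand-in for the data above

# (samples, 1, 6): one timestep of six channels
a = MSF.reshape(MSF.shape[0], 1, 6)

# (samples, 6, 1): six timesteps of one channel -- what Convolution1D expects
b = MSF.reshape(MSF.shape[0], 6, 1)

# (samples, 6) of integer ids needs no reshape at all -- it's already what
# Embedding(vocab_size, ..., input_length=6) expects
print(MSF.shape, a.shape, b.shape)  # (6477, 6) (6477, 1, 6) (6477, 6, 1)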

# the labels

MSF_labels.shape
(6477,)

MSF_labelsr = MSF_labels.reshape(MSF_labels.shape[0], 1)

MSF_labelsr.shape
(6477, 1)

MSF_labels[:10]
array([2, 2, 4, 2, 2, 5, 2, 0, 5, 2])

MSF_labelsr[10:]
array([[2],
       [0],
       [4],
       ..., 
       [0],
       [1],
       [0]])

# the model
vocab_size = 6   # the range of digits
seq_len = 6      # the number of digits in each input sequence
latent_factors = 32   # the number of coefficients we use to describe each digit?
output_factors = 7

model = Sequential([
    Embedding(vocab_size, 16, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(1, activation='sigmoid')])

model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_8 (Embedding)          (None, 6, 16)         96          embedding_input_8[0][0]          
____________________________________________________________________________________________________
flatten_8 (Flatten)              (None, 96)            0           embedding_8[0][0]                
____________________________________________________________________________________________________
dense_15 (Dense)                 (None, 100)           9700        flatten_8[0][0]                  
____________________________________________________________________________________________________
dense_16 (Dense)                 (None, 1)             101         dense_15[0][0]                   
====================================================================================================
Total params: 9,897
Trainable params: 9,897
Non-trainable params: 0


model.fit(MSF, MSF_labels , nb_epoch=1, batch_size=4)
Epoch 1/1
6477/6477 [==============================] - 7s - loss: -29.4885 - acc: 0.1359     
<keras.callbacks.History at 0x11a883210>

I realise this is not very accurate, and I haven't split my data to provide a validation set. I didn't think that was an immediate problem; it only matters once I want to see how accurate my model is (after I get my input and output working).

# the output
x = MSFr[0]
print(x)
p = model.predict(x, batch_size=2, verbose=0)
print(p[0])

input
[[4 5 3 2 0 1]]
[ 1.]

What I want is not 0 or 1, but a number 0-6.

Your labels need to be a “one hot” encoded matrix (batch size by # of categories).

Also, use softmax, not sigmoid, assuming you can only have one category be true at a time.
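
Something like this (untested), using Keras's to_categorical; with one-hot targets you'd typically pair the softmax output with categorical_crossentropy:

import numpy as np
from keras.utils.np_utils import to_categorical

MSF_labels = np.array([2, 2, 4, 2, 2, 5, 2, 0, 5, 2])  # raw 0-6 labels

# (n,) integer labels -> (n, 7) one-hot matrix
MSF_labels_onehot = to_categorical(MSF_labels, 7)
print(MSF_labels_onehot.shape)  # (10, 7)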

Hi all and @davecg,

Sorry I haven't gotten back sooner. I'm forging ahead with my model and it's going great, so you really helped to unstick me. Thanks for your support. :relaxed:

To help others, here's my code that's been working. Interestingly, you may notice that the val_acc is not in the 0.90s. I thought this was due to the basic nature of the model or data, but I now believe it's probably due to the features in my training data. I create the data through a random process, which is not always an optimal process, so my examples are not as efficient as the optimal ones. It seems my NN has worked this out: it rarely goes down non-optimal routes, and instead biases towards the optimal direction.

Next steps might include:

  • an RNN model that remembers state in some way
  • a model that includes a second output regarding the ‘steps’ to a sorted solution, giving twin outputs to train on, e.g. https://github.com/fchollet/keras/issues/1320 (a rough sketch follows below)
  • 1D convolutions (though I think they're probably not necessary, given the size of this data set)
  • increasing my data size, e.g. 1-50 vs 1-6
  • learning rate adjustment
  • some sort of feedback mechanism / unsupervised learning to improve the model outside of my training data
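
For the twin-output idea, a rough functional-API sketch (all names and sizes here are made up, and I haven't trained this):

from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense

inp = Input(shape=(6,), dtype='int32')
x = Flatten()(Embedding(6, 6, input_length=6)(inp))
x = Dense(100, activation='relu')(x)

move_out = Dense(7, activation='softmax', name='move')(x)   # which step to apply
steps_out = Dense(1, activation='linear', name='steps')(x)  # e.g. steps-to-sorted

model = Model(input=inp, output=[move_out, steps_out])
model.compile(optimizer='adam',
              loss={'move': 'categorical_crossentropy', 'steps': 'mse'})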

If anyone has any comments please chime in.
Cheers

PS looking forward to part 2 of course @jeremy



labels_raw[:20]
> array([4, 1, 0, 2, 2, 0, 6, 0, 5, 0, 3, 4, 1, 4, 5, 5, 0, 3, 3, 4])

labels_raw.shape
> (3059,)

labels_oneHot = onehot(labels_raw)   # onehot() helper from the course's utils.py
labels_oneHot[:4]
> array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.]])


data_raw.shape
> (3059, 6)

data_raw[:4]
> array([[2, 3, 4, 5, 0, 1],
       [2, 3, 1, 0, 4, 5],
       [0, 1, 2, 3, 4, 5],
       [1, 0, 3, 2, 4, 5]])


from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

data = data_raw
labels = labels_oneHot

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.3)
data_test, data_val, labels_test, labels_val = train_test_split(data_test, labels_test, test_size=0.5)

vocab_size = 6
label_num = 7

seq_len = vocab_size   # the number of digits in each input sequence
latent_factors = 32    # the number of coefficients we use to describe each digit?
output_factors = label_num

model = Sequential([
    Embedding(vocab_size, 6, input_length=seq_len),
    # Conv1D(input_shape=(6, 16), padding="valid", filters=16, kernel_size=3),   # gave this a try but didn't improve results
    Flatten(),
    Dense(100, activation='relu'),
    Dense(7, activation='softmax')])


model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])  # note: categorical_crossentropy is the usual pairing with a softmax output
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 6, 6)              36        
_________________________________________________________________
flatten_5 (Flatten)          (None, 36)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 100)               3700      
_________________________________________________________________
dense_10 (Dense)             (None, 7)                 707       
=================================================================
Total params: 4,443
Trainable params: 4,443
Non-trainable params: 0
_________________________________________________________________

model.fit(data_train, labels_train, validation_data=(data_val,labels_val),nb_epoch=2, batch_size=4)

Train on 2141 samples, validate on 459 samples
Epoch 1/2
2141/2141 [==============================] - 1s - loss: 0.3404 - acc: 0.8761 - val_loss: 0.3227 - val_acc: 0.8820
Epoch 2/2
2141/2141 [==============================] - 0s - loss: 0.3094 - acc: 0.8875 - val_loss: 0.3223 - val_acc: 0.8820

data_test[:4]
> array([[0, 1, 4, 5, 3, 2],
       [1, 0, 4, 5, 2, 3],
       [3, 2, 4, 5, 1, 0],
       [4, 5, 1, 0, 3, 2]])

preds = model.predict(data_test,batch_size=4, verbose=1) 
  4/459 [..............................] - ETA: 0s

preds[:2]
> array([[ 0.0022,  0.0523,  0.0701,  0.0661,  0.6548,  0.045 ,  0.1095],
       [ 0.0047,  0.1752,  0.1334,  0.1825,  0.1576,  0.1533,  0.1933]], dtype=float32)


# convert the softmax probabilities into a predicted class (argmax)
preds_choice = np.argmax(preds, axis=1)

preds_choice[:10]
> array([4, 6, 2, 5, 6, 4, 0, 6, 2, 2])
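
As a quick sanity check, comparing against the one-hot test labels from the split above:

# argmax turns the one-hot test labels back into raw 0-6 classes
true_choice = np.argmax(labels_test, axis=1)
print('test accuracy:', (preds_choice == true_choice).mean())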

:eyeglasses: PPS I have tidied up the posts in this thread with some BBCode tags. I hope it's a bit more readable :eyeglasses:

Please, what does “ETA: 0s” mean?

ETA (learning rate ?)
s (seconds??)

Thanks

ETA: Estimated Time of Arrival, i.e. an estimate of how much time is still needed to complete processing. “ETA: 0s” suggests the epoch is about to finish.

Great. Thanks!