Help needed to process structured time series dataset for RNN using fastai lib

Hi All,

I was wondering if someone on the fastai forums could help me work out how I can use PyTorch, fastai, or other libraries to structure my training dataset or design an iterator for it. This intro is a summary; the detail is below the line.

What I want to do is essentially run an RNN across each row of data in my dataset, with each column element being a time step, then skip to the next row and repeat. In my earlier dense and conv nets I just flattened the data, as sequence was not important.

Now I want to use the time / step nature of my data in an RNN, but the end of each row is an end state that I am trying to find. So I want the RNN to learn how to reach this end state by examining each row and learning the patterns in state structure and time.

To compare this with something like a character RNN that predicts the next character, it's as if the full stop of each sentence needs to be a final state of the data, and each sentence start resets the pattern-learning process.

Have a look below for an example of the data.

I thought maybe we could make each row a ‘batch’ and cycle through, but that seems like the wrong way to go about it. We would rather want a batch that spans a number of rows, and a different way of feeding each row to the RNN.
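A minimal sketch of the "batch that spans a number of rows" idea, assuming the array shapes given in the detail section further down. The helper name `row_batches` and the random placeholder data are hypothetical, not part of fastai or PyTorch. Each item in a batch is one complete row, so no sequence is ever cut in the middle:

```python
import numpy as np

def row_batches(x, y, bs, shuffle=True, seed=0):
    """Yield (x_batch, y_batch) pairs where every item is one whole row (sequence)."""
    idx = np.arange(len(x))
    if shuffle:
        # Shuffle the order of rows only; the time steps inside a row stay in order.
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(idx), bs):
        sel = idx[start:start + bs]
        yield x[sel], y[sel]

# Placeholder data with the shapes from this post (value ranges are assumptions).
x = np.random.randint(0, 55, size=(1000, 20, 45))
y = np.random.randint(0, 45, size=(1000, 20))

xb, yb = next(row_batches(x, y, bs=64))
print(xb.shape, yb.shape)  # (64, 20, 45) (64, 20)
```

With batches shaped (rows, time steps, features), an RNN can process all rows of a batch in parallel while still treating each row as its own sequence.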

Does anyone have some suggestions of how to handle this scenario?

@jeremy @rachel I was wondering if you could help or point me to some really good posts / literature on how to deal with unique / custom datasets. This seems like a really important area, though a lot of the fastai course material focuses on datasets that are well known and pre-prepared.

I’d love some of your advice or direction about how to create better iterators and data-shaping methods to get any dataset to work with models.


The detail:

Here's a sample of what my dataset looks like:

x.shape = (1000, 20, 45)
y.shape = (1000, 20)

They essentially look like this:

**x** data 
[ [ T, R, T, F, ... C, A ],  length 20
 [  X, Z, R, T, F ... , A ] length 20

A…Z are data arrays that look like this

array([39, 33, 34, 13, 50, 38, 43, 26, 30, 45, 35, 46, 42, 41, 47, 12, 15, 48, 16, 11, 37, 20, 23,  .., 17, 21, 18, 40, 19, 49, 32, 51, 28, 53, 27, 25,  1,  4, 54, 29, 14, 24,  9,  6,  3])

I have just called them A…Z to avoid making this a huge post. In actuality there are n! = 45! possible permutations, not 26 (A…Z).

y output state (length 20)

[array([37, 26, 20, 13, 17, 44, 34, 37, 38, 30, 28, 40, 33, 33, 37, 24,  4, 40, 14, 22,  0]), 
 array([ 3, 27, 26, 16,  4,  7, 26, 22, 40, 22, 31, 43,  6, 24,  3, 23, 23, 35,  5, 31,  0]), 

Now, when I built my first dense and conv models, I just flattened the data, so I had a dataset of shape

(20,000, 45)
(20,000, 1)

Then just took the 45 values and used embedding / conv / dense layers to match to the single output.
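For reference, the flattening step can be sketched with NumPy reshapes. This assumes the shapes above and uses zero-filled placeholder arrays, not my real data:

```python
import numpy as np

# Placeholder arrays with the shapes described above.
x = np.zeros((1000, 20, 45), dtype=np.int64)
y = np.zeros((1000, 20), dtype=np.int64)

# Merge the row and time-step axes: every time step becomes an independent sample.
x_flat = x.reshape(-1, x.shape[-1])
y_flat = y.reshape(-1, 1)

print(x_flat.shape, y_flat.shape)  # (20000, 45) (20000, 1)
```

This is fine for dense and conv models, but it throws away the information about which row each time step came from.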

I now want to use PyTorch and the fastai libraries to create an RNN that uses an arbitrary number of recurrent units to calculate the output state.

So I guess I want to take each row of data and perhaps prepad it with the start state (I haven’t quite decided). Here's an example with a padding of 3, using the start state for x and a null output for y:

x      [ T,  R,  T,  F,...,  C, A ]
y      [37, 26, 20, 13,..., 22, 0]
# prepadded approach
x      [ T, T, T,   T,  R,  T,  F,...,  C, A ]   padded with first datapoint
y      [ 0, 0, 0,  37, 26, 20, 13,..., 22, 0]   padded with null result
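A minimal sketch of this prepadding, assuming x has shape (rows, steps, features), y has shape (rows, steps), and 0 is the null label. The helper name `prepad` and the random placeholder data are hypothetical:

```python
import numpy as np

def prepad(x, y, pad, null_label=0):
    """Repeat each row's first time step `pad` times in x and prepend null labels to y."""
    first = np.repeat(x[:, :1], pad, axis=1)                 # (rows, pad, features)
    x_pad = np.concatenate([first, x], axis=1)               # (rows, pad + steps, features)
    nulls = np.full((y.shape[0], pad), null_label, dtype=y.dtype)
    y_pad = np.concatenate([nulls, y], axis=1)               # (rows, pad + steps)
    return x_pad, y_pad

# Placeholder data with the shapes from this post (value ranges are assumptions).
x = np.random.randint(0, 55, size=(1000, 20, 45))
y = np.random.randint(1, 45, size=(1000, 20))

x_pad, y_pad = prepad(x, y, pad=3)
print(x_pad.shape, y_pad.shape)  # (1000, 23, 45) (1000, 23)
```

Padding is done per row, before any batching, so the padded start state always belongs to the row that follows it.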

The idea is that I can then present the model with a state, e.g. T, and predict the next y.

So, using the v2 lesson 6 code, I want to go through each row, train the RNN, then restart on the next row, and so on.

[ T,T,T, T, R, T, F, … C, A ]
[ T,T,T, T, R, T, F, … C, A ]
[0, 0, 0, 37, 26, 20, 13,…, 22, 0]

You can see that if I flatten these after padding, I will get weird starting inputs that might not make sense to the model, e.g. C, A, T, T = 0 with a 4-input RNN, where the window straddles the end of one row and the padded start of the next.

So I am guessing I have to edit this section of code, either creating a different ColumnarModelData setup or a different iterator that brings the data together.
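To make the problem concrete, here is a small sketch with two made-up padded rows (the symbols are hypothetical stand-ins for my data arrays). It lists the 4-step windows that cross a row boundary when the rows are flattened into one stream:

```python
def windows(seq, size):
    """All contiguous windows of length `size` in `seq`."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

# Two toy rows, each prepadded with 3 copies of its first symbol.
rows = [list("TTTTRCA"), list("XXXXZBA")]
size = 4
row_len = len(rows[0])

# Flattened stream: windows can start in one row and end in the next.
flat = [s for row in rows for s in row]
straddling = [
    flat[i:i + size]
    for i in range(len(flat) - size + 1)
    if i // row_len != (i + size - 1) // row_len
]

# Windowing each row separately never mixes rows.
row_windows = [w for row in rows for w in windows(row, size)]

print(straddling)        # three windows that mix symbols from both rows
print(len(row_windows))  # 8
```

Building the windows (or whole sequences) per row avoids these mixed inputs entirely.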

# hold out a random set of indices for validation
val_idx = get_cv_idxs(len(idx)-cs-1)
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)
m = CharLoopModel(vocab_size, n_fac)
# pull one batch from the training data loader to inspect it
it = iter(md.trn_dl)
*xs, yt = next(it)
opt = optim.Adam(m.parameters(), 1e-2)
fit(m, md, 1, opt, F.nll_loss)
Comments and help would be greatly appreciated.

Ok, I think I am making progress with my specific problem. So hold off on solutions.

However, if you have a more general suggestion regarding my question about dealing with unique or custom datasets and how to prepare them, please add your thoughts.

I’ll post an update when I have something more.