Hi All,
I was wondering if someone on the fastai forums could help me with how I can use PyTorch, fastai, or other libraries to structure my training dataset and iterator design. This intro is a summary; the detail is below the line.
What I want to do is essentially run an RNN across each row of data in my dataset, with each column element being a time step, then skip to the next row and repeat. In my earlier dense and conv nets I just flattened the data, as sequence order was not important.
Now I want to use the time-step nature of my data in an RNN, but the end of each row is like an end state that I am trying to find. So I want the RNN to learn how to reach this end state by examining each row and learning the patterns in state structure and time.
To compare this with something like a character RNN that predicts the next character, it's as if the full stop of each sentence needs to be a final state of the data, and each sentence start a reset of the pattern-learning process.
Have a look below for an example of the data.
I thought maybe we could make each row a 'batch' and cycle through, but that seems to me like the wrong way to go about it. We would rather want a batch that spans a number of rows, and a different way of feeding each row to the RNN.
Does anyone have some suggestions of how to handle this scenario?
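To make the per-row reset idea concrete, here is a toy PyTorch sketch (all sizes are hypothetical, and I'm pretending each time step is a single token id rather than a length-45 array): the hidden state is zeroed before every row, so each row is treated as an independent sequence, like a fresh sentence in a char-RNN.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, emb_dim, hidden_dim, seq_len = 55, 8, 16, 20

emb = nn.Embedding(vocab_size, emb_dim)
rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
out = nn.Linear(hidden_dim, vocab_size)

rows = torch.randint(0, vocab_size, (3, seq_len))  # 3 example rows

for row in rows:
    h = torch.zeros(1, 1, hidden_dim)  # fresh hidden state: "sentence start" reset
    x = emb(row.unsqueeze(0))          # (1, seq_len, emb_dim)
    states, h = rnn(x, h)              # one RNN step per column element
    preds = out(states)                # (1, seq_len, vocab_size): a prediction per step
    print(preds.shape)
```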
@jeremy @rachel I was wondering if you could help or point me to some really good posts / literature on how to deal with unique / custom datasets. This seems like a really important area, though a lot of the fastai course material focuses on datasets that are well known and pre-prepared.
I'd love some advice or direction on how to create better iterators and data-shaping methods to get any dataset working with models.
Thanks
The detail:
Here's a sample of what my dataset looks like:
x.shape = (1000, 20, 45)
y.shape = (1000, 20 )
They essentially look like this
**x** data
[ [ T, R, T, F, ... C, A ], length 20
[ X, Z, R, T, F ... , A ] length 20
A…Z are data arrays that look like this
array([39, 33, 34, 13, 50, 38, 43, 26, 30, 45, 35, 46, 42, 41, 47, 12, 15, 48, 16, 11, 37, 20, 23, .., 17, 21, 18, 40, 19, 49, 32, 51, 28, 53, 27, 25, 1, 4, 54, 29, 14, 24, 9, 6, 3]
I have just called them A…Z to avoid making this a huge post. In actuality there are n! (here 45!) permutations, not 26 (A…Z).
y output state (length 20)
[array([37, 26, 20, 13, 17, 44, 34, 37, 38, 30, 28, 40, 33, 33, 37, 24, 4, 40, 14, 22, 0]),
array([ 3, 27, 26, 16, 4, 7, 26, 22, 40, 22, 31, 43, 6, 24, 3, 23, 23, 35, 5, 31, 0]),
...
Now when I built my first dense and conv models, I just flattened the data so I had a dataset:
(20,000, 45)
(20,000, 1)
Then just took the 45 values and used embedding / conv / dense layers to match to the single output.
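The flattening step above can be sketched in NumPy like this (the data values are hypothetical stand-ins with the shapes from the post):

```python
import numpy as np

# Toy data matching the shapes described above (values are made up).
x = np.random.randint(0, 55, size=(1000, 20, 45))
y = np.random.randint(0, 45, size=(1000, 20))

# Collapse the row/time-step structure: each of the 1000 * 20 steps
# becomes an independent training example of 45 values.
x_flat = x.reshape(-1, 45)   # (20000, 45)
y_flat = y.reshape(-1, 1)    # (20000, 1)

print(x_flat.shape, y_flat.shape)
```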
I now want to use PyTorch and the fastai library to create an RNN that uses an arbitrary number of recurrent units to calculate the output state.
So I guess I want to take each row of data and perhaps pre-pad it with the start state (I haven't quite decided). Here's an example with padding of 3 using the start state and a null output:
x  [ T, R, T, F, ..., C, A ]
y  [37, 26, 20, 13, ..., 22, 0]
# pre-padded approach
x  [ T, T, T, T, R, T, F, ..., C, A ]        padded with the first data point
y  [ 0, 0, 0, 37, 26, 20, 13, ..., 22, 0]    padded with a null result
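A minimal NumPy sketch of that pre-padding, for a single row (pad length 3, repeating the first step of x and using 0 as the null target; the data is made up):

```python
import numpy as np

pad = 3
x_row = np.random.randint(0, 55, size=(20, 45))  # one row: 20 steps of 45 values
y_row = np.random.randint(1, 45, size=(20,))     # its targets
y_row[-1] = 0                                    # null marker for the end state

# Repeat the first time step `pad` times in front of x,
# and prepend `pad` null (0) targets to y.
x_padded = np.concatenate([np.repeat(x_row[:1], pad, axis=0), x_row])
y_padded = np.concatenate([np.zeros(pad, dtype=y_row.dtype), y_row])

print(x_padded.shape)  # (23, 45)
print(y_padded.shape)  # (23,)
```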
The idea is I can then present the model a state, e.g. T, and predict the next y.
So using the v2 lesson 6 code, I want to go through each row, train the RNN, then restart on the next row, and so on.
[ T,T,T, T, R, T, F, … C, A ]
[ T,T,T, T, R, T, F, … C, A ]
[0, 0, 0, 37, 26, 20, 13,…, 22, 0]
You can see that if I flatten these after padding, I will get weird starting inputs that might not make sense to the model, e.g. C, A, T, T = 0 with a 4-input RNN: a window spanning the end of one row and the start of the next.
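One way to avoid those nonsense cross-row windows would be to generate the length-cs windows per row, so no window ever spans a row boundary. A rough sketch (the function name and data are my own, for illustration):

```python
import numpy as np

def windows_per_row(x, y, cs):
    """Yield (input window, next target) pairs that never cross a row boundary.

    x: (n_rows, seq_len, n_feats) inputs; y: (n_rows, seq_len) targets.
    """
    n_rows, seq_len = y.shape
    xs, ys = [], []
    for r in range(n_rows):
        for t in range(seq_len - cs):          # windows stay inside row r
            xs.append(x[r, t:t + cs])
            ys.append(y[r, t + cs])
    return np.stack(xs), np.array(ys)

# Toy data with the shapes from the post (5 rows for brevity).
x = np.random.randint(0, 55, size=(5, 20, 45))
y = np.random.randint(0, 45, size=(5, 20))
xw, yw = windows_per_row(x, y, cs=4)
print(xw.shape)  # (5 * (20 - 4), 4, 45) == (80, 4, 45)
```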
So I am guessing I have to edit this section of code, either creating a different ColumnarModelData setup or a different iterator which brings the data together.
val_idx = get_cv_idxs(len(idx)-cs-1)
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs = 512)
m = CharLoopModel(vocab_size, n_fac)
it = iter(md.trn_dl)
*xs, yt = next(it)
t=m(*V(xs))
opt = optim.Adam(m.parameters(), 1e-2)
fit(m, md, 1, opt, F.nll_loss)
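As one possible direction (plain PyTorch rather than ColumnarModelData, so the class name and sizes here are my own sketch, not fastai API): a Dataset whose items are whole rows, with the DataLoader batching across rows, and a fresh hidden state per batch so every row starts from a reset:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RowSequenceDataset(Dataset):
    """Each item is one full row: a (seq_len, n_feats) input and (seq_len,) target."""
    def __init__(self, x, y):
        self.x = torch.as_tensor(x, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]

# Toy data with the shapes from the post (values are made up).
x = torch.randint(0, 55, (1000, 20, 45)).float()
y = torch.randint(0, 45, (1000, 20))

dl = DataLoader(RowSequenceDataset(x, y), batch_size=64, shuffle=True)

rnn = torch.nn.GRU(input_size=45, hidden_size=32, batch_first=True)
head = torch.nn.Linear(32, 45)

xb, yb = next(iter(dl))
h0 = torch.zeros(1, xb.size(0), 32)  # fresh hidden state for every batch of rows
states, _ = rnn(xb, h0)              # one RNN step per column element
preds = head(states)                 # (batch, 20, 45): one prediction per time step
print(preds.shape)
```

With this shape of loader, the batch dimension spans rows (as hoped for above), while the sequence dimension walks the 20 column elements within each row.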
Comments and help would be greatly appreciated.