Time series / sequential data study group

Hello, I wanted to ask for ideas on how to improve my model for Kaggle’s LANL Earthquake competition. Here’s my kernel: https://www.kaggle.com/manyregression/fastai-tabular-nn. I borrowed the preprocessing from other kernels, and the model is a standard tabular NN. I also have a random forest kernel.
The first idea that came to mind after finding this thread was to try predicting on images, so I found this repo: https://github.com/mb4310/Time-Series.
And then I got lost: there is too much there for my current knowledge, and I am not even sure where to start. So I am going to start with CNN_multicore.ipynb, and I welcome any ideas or directions to move in.

I am also working on this competition, we could team up.

Great thread. Will follow it more closely. Hope this is the right place to ask this question.

I’m looking for some guidance on how to phrase a problem better. Assume the toy set below.

Time      Temp  Pressure  Windspeed
11:00:00  23.4  100       4
11:05:00  23.6  99.8      8
11:10:00  23.7  99.5      4
11:15:00  23.7  99.4      12

If I want to use the first 3 rows to predict the entire row 4 observation for all the variables at 11:15, how do I describe this problem?

So x =

Temp Pressure Windspeed
23.4 100 4
23.6 99.8 8
23.7 99.5 4

& y =

Temp Pressure Windspeed
23.7 99.4 12

Perhaps: multivariate-to-multivariate time series prediction?

Is this something an RNN / LSTM / GRU can do (simply) without an overly complex architecture? Is anyone aware of any examples or tutorials? I’ve been searching for hours on this.
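
For concreteness, the x / y framing above can be sketched as a sliding window in plain Python. This is only an illustration: the helper name make_windows and the window size of 3 are assumptions made for the example, not part of any library.

```python
# Toy observations copied from the table above: (time, temp, pressure, windspeed).
rows = [
    ("11:00:00", 23.4, 100.0, 4.0),
    ("11:05:00", 23.6, 99.8, 8.0),
    ("11:10:00", 23.7, 99.5, 4.0),
    ("11:15:00", 23.7, 99.4, 12.0),
]

def make_windows(series, window):
    """Return (x, y) pairs: x is `window` consecutive rows, y is the row right after them."""
    pairs = []
    for start in range(len(series) - window):
        x = series[start:start + window]   # `window` past observations, all variables
        y = series[start + window]         # the next observation, all variables
        pairs.append((x, y))
    return pairs

# Drop the timestamp and keep the three variables per time step.
features = [list(r[1:]) for r in rows]
pairs = make_windows(features, window=3)
x, y = pairs[0]
print(len(pairs), x, y)
```

With four rows and a window of three there is exactly one training pair. A longer series would yield one pair per possible window position.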

Hi @chrisoos,
I’m not aware of any single model that can easily manage this type of problem.
There are alternative approaches though that would require 3 models (based on your toy example), each one predicting a single target variable. Each model would take as input either a univariate time series (for the corresponding target) or multivariate time series (in case there is some correlation between the variables).
In your example, I’d probably build a multivariate dataset for each of the targets.
There is a detailed discussion (Air Pollution at different sites dataset) that covers some of the available approaches for this type of problem here.

Thank you. I had looked at the referenced link before posting. I suppose it helped inspire my toy set :)

I will look into your proposed approaches.

I can’t help but feel that there should be a more elegant solution. I’ve seen tabular models, for example, that can predict both stock price and variance. Surely RNNs should be able to handle the extra dimension.

Here is an example of a multi-target DNN predicting 6 targets at once using TensorFlow.

I basically want to do this, but using an RNN / LSTM / GRU for time series.


Hi Chris. I feel your frustration trying to find any decent tutorials and code for basic time series. I have spent the past two weeks searching, adapting, tracing, and debugging low quality code and my own toy examples. At last the landscape is becoming clearer. I think your toy problem is completely amenable to a simple RNN.

First, the world’s shortest overview of RNNs. It’s for my benefit to explain verbally and for yours perhaps to understand better. Let’s forget about batches for now and consider just one time series, like yours with three time points. The RNN processes these one at a time in sequential order. At each step it outputs a “hidden state”, usually a vector. The hidden state and the next sequence element are fed into the same RNN again until the end of the series is reached. There are a bunch of gates and learnable weights inside the RNN which calculate the next hidden state from the previous one and the new input. But that’s all the raw RNN itself does: (hs(n),s(n)) -> hs(n+1)

The hidden state carries the memory of what has come earlier in the series. Otherwise, the RNN would not know anything about what came before. Except for the hidden state it has complete amnesia for inputs that have come earlier, just like an ordinary one-shot model. But this hidden state, though it carries much information about the series, is useless to us, because we do not know the meaning of any of its elements. So we add another layer that takes in the hidden state and outputs the quantities we are interested in knowing or predicting. This is usually a Linear (fully connected) layer. It is applied to the hidden state at each time step to get the output or prediction at that step. The last piece is to apply a loss function to Linear’s output vs. the target at each time step. Then backpropagation and gradient update allow the weights inside the RNN and Linear to be trained.
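
A minimal sketch of the loop described above, in plain Python so that every step is visible. It makes several assumptions for illustration: a plain tanh RNN with no gates stands in for the LSTM, the weights are random and untrained, the hidden size of 5 is arbitrary, and the inputs are the unnormalized toy values from earlier in the thread (real inputs would be scaled first).

```python
import math
import random

def rnn_step(h, x, W_h, W_x, b):
    """One step of a plain tanh RNN: (hs(n), s(n)) -> hs(n+1)."""
    return [
        math.tanh(
            sum(W_h[i][j] * h[j] for j in range(len(h)))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(h))
    ]

def linear(h, W_o, b_o):
    """Fully connected readout: hidden state -> predicted quantities."""
    return [sum(W_o[i][j] * h[j] for j in range(len(h))) + b_o[i] for i in range(len(W_o))]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_in, n_hidden, n_out = 3, 5, 3          # 3 features in, 3 predictions out
W_h = rand_matrix(n_hidden, n_hidden, rng)
W_x = rand_matrix(n_hidden, n_in, rng)
b = [0.0] * n_hidden
W_o = rand_matrix(n_out, n_hidden, rng)
b_o = [0.0] * n_out

series = [[23.4, 100.0, 4.0], [23.6, 99.8, 8.0], [23.7, 99.5, 4.0]]
targets = [[23.6, 99.8, 8.0], [23.7, 99.5, 4.0], [23.7, 99.4, 12.0]]  # next row at each step

h = [0.0] * n_hidden                     # the hidden state starts empty
predictions, loss = [], 0.0
for x, t in zip(series, targets):
    h = rnn_step(h, x, W_h, W_x, b)      # update the memory with the new input
    y = linear(h, W_o, b_o)              # readout applied at every time step
    predictions.append(y)
    loss += sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / n_out  # MSE at this step
loss /= len(series)
print(predictions[-1], loss)
```

Training would then backpropagate this loss into W_h, W_x, W_o and the biases, which a framework such as PyTorch does automatically.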

A few glosses… for a time series, the output is typically a prediction of the next input, but the output could be any quantity or quantities you want to predict (careful not to confuse “predict a class” with “predict the future”). You do not have to compute the output and loss at every time step. And you can decide how often to do backpropagation/gradient update. The whole picture gets more complex with language models: encoding and decoding, padding, bptt, partial sorting, etc.

In PyTorch, nn.LSTMCell implements a single step of the RNN. It does exactly (hs(n),s(n)) -> hs(n+1), where for an LSTM the state hs is actually a pair: the hidden state h and the cell state c. It would appear in the model’s forward() in a loop that processes the time series in order, passing hs to the next iteration. You decide how to process the hidden state into a prediction and loss. PyTorch also provides nn.LSTM, which processes an entire time series in one call. Although it must operate sequentially internally, it’s a gazillion times faster than LSTMCell in a Python loop. Its output is therefore a series of hidden states, one for each time step. You can then apply Linear to them to get the predictions at each time step.

As for your particular toy problem, if you look at the docs, nn.LSTM lets you set the number of input features per time step, so you can certainly use your three. Number of outputs is determined by the Linear layer, so you can make it three or whatever you choose. You might also want to consider using the whole time series at once for input, rather than just groups of three.

You will also need to read up on PyTorch’s Dataset and DataLoader to create individual elements of your training and validation sets. Remember, each single element of the Dataset is a time series. Once you have DataLoaders you can make a DataBunch, a Learner, and use fastai’s conveniences.

I do not know whether fastai can handle time series data directly. I asked on the forum, received no reply, and so went directly to PyTorch. As for batches, I have not figured them out! My Datasets return a single time series, so bs=1, one batch per epoch, and GPU saturation even so. Maybe you will figure out how to use batches and explain them to me.

I have pasted some code fragments for the model and training loop to help you get started. And if anyone finds bugs in my code or explanations, please tell me!

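import torch
from torch import nn, optim
from fastai.basics import DataBunch, Learner  # fastai v1, only needed for the last three lines

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class LSTMSimpleMdl(nn.Module):
    def __init__(self, ni, nh, nl, aInput):
        # ni - number of input features
        # nh - number of hidden features
        # nl - number of stacked LSTM layers
        # aInput - True: append input before sending to linear (an experiment)
        super().__init__()  # must run before submodules are assigned

        self.NLAYERS = nl
        self.NHSIZE = nh
        self.NINPUT = ni
        self.aInput = aInput

        self.lstm1 = nn.LSTM(self.NINPUT, self.NHSIZE, self.NLAYERS, batch_first=True)
        self.linear = nn.Linear(self.NHSIZE + (ni if self.aInput else 0), 1)

    def forward(self, input):
        # input: (batch, time steps, features) because batch_first=True
        bs = input.shape[0]
        # zero initial hidden and cell state for each batch element, on the input's device
        h_t = torch.zeros(self.NLAYERS, bs, self.NHSIZE, dtype=torch.float, device=input.device)
        c_t = torch.zeros(self.NLAYERS, bs, self.NHSIZE, dtype=torch.float, device=input.device)
        output, (h_t, c_t) = self.lstm1(input, (h_t, c_t))  # output: hidden state at every time step

        if self.aInput:
            output = torch.cat((output, input), dim=2)  # append the original inputs, skipping around the RNN

        output = self.linear(output)  # one prediction per time step
        return output.flatten(1)      # (batch, time steps)

model = LSTMSimpleMdl(1, 100, 1, False).to(device)

loss_fn = nn.MSELoss()

def lossFlat(p, t):
    return loss_fn(p.flatten(), t.flatten())

# training_generator and validation_generator are DataLoaders (not shown here)
def trainN(N, lr):
    # globals so the last batch and predictions can be inspected after training
    global test_pred, vmtarget, output, mtarget, mbatch, vmbatch
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for i in range(N):
        for mbatch, mtarget in training_generator:
            mbatch, mtarget = mbatch.to(device), mtarget.to(device)
            optimizer.zero_grad()
            output = model(mbatch)
            mtarget = mtarget[:, -output.shape[1]:]  # shorten target to match shortened output
            loss = lossFlat(output, mtarget)
            loss.backward()   # backpropagate through Linear and the LSTM
            optimizer.step()  # gradient update

        with torch.no_grad():  # validation: no gradients needed
            for vmbatch, vmtarget in validation_generator:
                vmbatch, vmtarget = vmbatch.to(device), vmtarget.to(device)
                test_pred = model(vmbatch)
                vmtarget = vmtarget[:, -test_pred.shape[1]:]  # shorten target to match shortened output
                vloss = lossFlat(test_pred, vmtarget)
        print('%i %2.9f %2.9f' % (i, loss.item(), vloss.item()))


# Alternatively, train with fastai instead of calling trainN:
data = DataBunch(training_generator, validation_generator)
learn = Learner(data, model, loss_func=lossFlat)
learn.fit_one_cycle(10, max_lr=.01, wd=0)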

Thank you so much for the code. Very helpful. I’ve been able to train a model predicting a time series of 20 features, using the fastai API calls at the bottom and a batch size of 1.

The only caveat is that window_length needs to equal the number of features. If not, this generates a size mismatch error.

I will play around and see if I can figure out the batches.

There was an interesting presentation at a recent local meetup about time series forecasting: https://docs.google.com/presentation/d/1QnOQbToEHmMyO8GkAww6IctqeoTYO71iX17olcl55uI/mobilepresent?slide=id.g3b380269f8_0_0

The material is based on articles from the Uber Engineering blog:

The data from Uber looks very promising!

This could be an interesting project. Is there no PyTorch implementation yet?


Thanks for posting this. I have followed the Uber papers as well. There is C++ code from the author, but I am not a C++ programmer and cannot understand it well. There has been a proposal to put this forecasting model into Uber’s Pyro library, but the authors said there were difficulties and put the project on hold. I cannot find any related PyTorch code, though.

It would be very interesting to implement the model. Is there a recording of the presentation?

I just found ESRNN-GPU, which seems to be a PyTorch implementation of Slawek’s method that won M4. I was able to get the code up and running, but I cannot yet confirm its results against those from the competition. You can create a copy of this Colab notebook that I’ve created to play with the code.


Thanks a lot for sharing, @ali.panahi!
I’ll try it in a problem I’m working on.

Any thoughts on adding a ReLU activation to the self.linear layer?
Does this make sense at all?

How would you choose training, validation and test data for time series data like this? Also, can you recommend any good, practical tutorials for people starting out with time series? Can you do time series using the fastai LSTM?
Thanks in advance!

Hi Chris. I could generate a thought or two, but don’t think they would have much value in practice. Better is for you to try it and see what happens. Your output layer can be designed with any structure that takes the hidden state and outputs the target values. You can also preprocess the series that gets sent to the RNN to extract features you suspect are relevant.

What I have learned in a year of fastai study, in case it’s of any value, is that an hour of actual coding/experimenting, with all its frustrations and mistakes, is worth many hours of “collecting”. I do read a few papers each week, recommended in this forum and on Jeremy’s Twitter feed, for inspiration and for techniques to try out.

HTH! Malcolm

I share the same experience. Guidance found on the internet is often conflicting and/or uninformed. Time series material is interspersed with text examples, which is frustrating.
I’ve extracted some features from the time series and am pushing them through my LSTM. My dataset is just a tad big, which makes it difficult to experiment. Each epoch takes 30 minutes and learning is slow, as it is starting from scratch. As a part-time hobby coder, this time is gold. Perhaps I’ll set up a smaller subset to make experimenting easier and bed down my architecture first before committing to training the full model.
At the moment I’m concerned, at a theoretical level, that passing the flattened LSTM output of 13,760 features into a (13760, 20) nn.Linear is too drastic and will cause too much information to be lost. I was wondering if a few dense layers with ReLU activations at the back might help conserve the important information, as well as pick up any correlation that might be present in the features.

For time series data, I use the following chronological split as a guideline:

  1. First 60%: training
  2. Next 20% (60% to 80%): validation
  3. Last 20%: testing
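
The split above can be sketched in a few lines of plain Python. The function name chronological_split and the integer-percentage parameters are assumptions made for the example. The key point is that the data is sliced in time order and never shuffled, so validation and test data always come after the training data.

```python
def chronological_split(series, train_pct=60, valid_pct=20):
    """Split a time-ordered sequence into train / validation / test without shuffling."""
    n = len(series)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return series[:train_end], series[train_end:valid_end], series[valid_end:]

series = list(range(100))          # stand-in for 100 time-ordered observations
train, valid, test = chronological_split(series)
print(len(train), len(valid), len(test))   # 60 20 20
```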

Your plot does not appear to be stationary. The higher peak 3-5 and lower peak 6 will cause some issues for you with autocorrelation. I suspect you should preprocess the data to extract independent variables or features that can help explain it.

This is a decent introductory course, although it uses Keras and not PyTorch or fastai: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

I feel a bit hesitant because we may be talking about different things, and I’m not right now working with LSTM. But here goes.

The LSTM output should be a tensor of #hiddenStates x #timeSteps. Linear’s in_features would equal #hiddenStates. Each timeStep looks to Linear like one element of a minibatch containing #hiddenStates values. So I don’t see how 13760 could be Linear’s in_features. You would need to have a hidden state that long.

During training of one full time series, Linear sees that whole minibatch of hidden states and makes #timeSteps single predictions. Loss is applied between the predictions and targets for each time step, then backpropagation/gradient update. To make a single prediction, only the last hidden state matters. Linear does not need to apply to the previous hidden states, though the LSTM needs to be fed enough history to bring its hidden state forward to the present.
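
To put rough numbers on the two readout designs, here is a small arithmetic sketch. The factorization 688 time steps x 20 hidden units = 13760 is purely hypothetical (the actual window length and hidden size were not given in the thread). Only the 13760 and the 20 targets come from the posts above.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

n_steps, n_hidden, n_targets = 688, 20, 20   # hypothetical: 688 * 20 == 13760

# Design 1: flatten every hidden state of the series into one long vector.
flattened = linear_params(n_steps * n_hidden, n_targets)

# Design 2: apply one small Linear to the hidden state of each time step.
per_step = linear_params(n_hidden, n_targets)

print(flattened, per_step)   # 275220 420
```

Under these assumed sizes the per-step readout has far fewer weights, and it does not tie the model to one fixed window length.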

If you run out of memory in Linear because the series is too long you can break the series into training segments. The last hidden state is passed into the LSTM to start the next segment. At least I am pretty sure this is possible but have not tried it, nor searched for working code.
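
The idea of carrying the hidden state across segments can be checked with a toy example. The sketch below is an assumption-laden illustration: it uses a scalar tanh RNN with fixed, made-up weights instead of an LSTM, and it only covers the forward pass. It shows that processing a series in two segments, with the last hidden state of the first segment passed into the second, gives the same hidden states as processing the whole series at once.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=0.3, b=0.1):
    """Scalar tanh RNN step with fixed illustrative weights."""
    return math.tanh(w_h * h + w_x * x + b)

def run(series, h=0.0):
    """Process a segment starting from hidden state h; return all states and the last one."""
    states = []
    for x in series:
        h = rnn_step(h, x)
        states.append(h)
    return states, h

series = [0.1, 0.4, -0.2, 0.7, 0.3, -0.5]

full_states, full_last = run(series)                 # whole series in one go
first_states, h_mid = run(series[:3])                # first segment
second_states, seg_last = run(series[3:], h=h_mid)   # second segment starts from h_mid

print(first_states + second_states == full_states)   # True
```

In PyTorch training the carried state is usually detached (h.detach()) before the next segment, so that backpropagation stops at the segment boundary while the forward memory is kept.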

Backpropagation/gradient update can be done at any interval. I think this interval would correspond to bptt in language training.

Does this overview make sense? I am writing from personal understanding, not from working code at the moment. Let’s keep talking until we are both clear on exactly how this works.

I’m not that familiar with dataset formatting norms.

Can anyone tell me how to distinguish between the classes in this dataset?


I think it has something to do with the fact that each array’s last element is a string ‘1’ or ‘-1’ but I can’t see where those strings are associated with either class.

Here’s a quick peek at the Gramian field that comes from an element of this dataset

Hi @MichaelWoodburn,
In all UCR univariate time series datasets, the target is the first column in the dataframe.
In this particular one classes are -1 and 1 as you thought!
I think these are pretty interesting datasets to get familiarized with some of the time series techniques. Good luck!!

Hi @oguiza - I am trying to classify similar images, except that I have one dataset of known good images and another dataset of everything else (or, better put, images with anomalies). Is there a way to build an image classifier that recognizes the one known class and puts everything else into the other class?

Thank you so much in advance!