Time series / sequential data study group

Saw that Google has a new model for time-series forecasting using transformers; maybe someone is interested in it.


9 Likes

I am currently trying that challenge; really cool seeing it here. I am new to ML/DL, so I am struggling with the approach.

Did you find an approach to using time series regression for this challenge? I tried using Tabular, but the results are meh. It only looks at the individual CDMs rather than treating them as a time series.

The timeseriesAI repo has been helpful, but I am struggling to get the Kelvins challenge data into the right format for a DataBunch. Maybe you figured out a way to do it?

Hi, welcome to the community!

We tried different approaches, not all of them focused on DL. Our best attempt used LSTMs on only the last values of the target time series. However, the leaderboard results showed that ML was not playing a huge role, probably because of the differences between the training and test sets.

To use timeseriesAI and DataBunches, each time series must be in a separate row, and if you go multivariate, the row order is important and must be preserved across the different variables.

I think I wrote a function to transform the input dataset from the challenge into a format for timeseriesAI. It is written in R, though; I can share it with you if you are interested.
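The pandas equivalent of the idea would look roughly like this (a minimal sketch with made-up column names):

```python
# Pivot a long table so each time series ends up in its own row,
# as timeseriesAI expects. Column names here are made up.
import pandas as pd

long_df = pd.DataFrame({
    "series_id": ["a", "a", "a", "b", "b", "b"],
    "step":      [0, 1, 2, 0, 1, 2],
    "value":     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# One row per series, one column per time step.
wide = long_df.pivot(index="series_id", columns="step", values="value")
print(wide)
# step         0    1    2
# series_id
# a          1.0  2.0  3.0
# b          4.0  5.0  6.0

# For multivariate data, build one such block per variable and keep
# the row (series) order identical across all of them.
```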

Best!

1 Like

Dear community,
I would love to hear from you what you currently consider “best practice” for working with time series data in fastai.
Do people stick to tabular transformations, or do they use functionality originally intended for text?

1 Like

Thanks for the response! I would be interested in the R code. Maybe I can use the underlying idea to make it work in Python.

For those who are interested in fastai v2, I shared my timeseries module for fastai v2 in this post. You will find more information there, as I'm avoiding duplicating the same information over here.

Here is a crazy idea that I want to bounce off you.
I have a time series classification problem.
My users want to see if we are going to have a maintenance issue in 6 hours.
I have 120+ sensors feeding data to me every minute.

Here is my approach.

  1. I’m generating an independent chart for each of the 120 sensors, covering 6 hours at 5-minute intervals, and labeling each window with whether the target value 6 hours in the future is Normal or Error (a rough sketch of this step follows the list).
  2. Run the charts through a CNN to classify them.
  3. Use CAM to highlight the charts labeled Error, to find out which variables affect the outcome in 6 hours based on today’s readings.
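Roughly what I have in mind for step 1 (a sketch; file layout and names are made up):

```python
# Render one small chart per sensor for a 6-hour window sampled every
# 5 minutes (72 points), filed under the label of the state 6 h ahead.
import os
import numpy as np
import matplotlib.pyplot as plt

def save_sensor_charts(window, label, out_dir="charts"):
    """window: (n_sensors, n_steps) array, e.g. (120, 72);
    label: 'normal' or 'error'."""
    os.makedirs(os.path.join(out_dir, label), exist_ok=True)
    minutes = np.arange(window.shape[1]) * 5
    for i, readings in enumerate(window):
        fig, ax = plt.subplots(figsize=(2, 2))
        ax.plot(minutes, readings)
        ax.axis("off")  # the CNN only needs the curve, not the axes
        fig.savefig(os.path.join(out_dir, label, f"sensor_{i:03d}.png"))
        plt.close(fig)

save_sensor_charts(np.random.rand(120, 72), "normal")
```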

Have any of you tried something like this?
Any pointers I should be aware of?

Do you mind sharing the data so we can try it? I hope it has a train and test set…

My only question is: do we know whether CAM will highlight the variables that were important? (Has this been successfully demonstrated before?) If so, then it checks out in my book, at least.

I don’t know. That’s what I’m trying to figure out.
If this works out, CAM will show me the variables I need to observe, and maybe the ranges of values, so the operator can correct the problem and prevent an outage.
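For reference, the CAM computation itself is small. Here is a bare-bones sketch with a toy model (plain PyTorch, not my actual pipeline; CAM assumes the network ends in global average pooling plus a single linear layer):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):
        fmap = self.features(x)                       # (B, 32, H, W)
        logits = self.fc(self.pool(fmap).flatten(1))  # (B, n_classes)
        return logits, fmap

model = TinyCNN().eval()
x = torch.randn(1, 3, 64, 64)                  # a fake chart image
with torch.no_grad():
    logits, fmap = model(x)
cls = logits.argmax(1).item()

# CAM: weight each feature map by the chosen class's linear weights.
w = model.fc.weight[cls]                       # (32,)
cam = torch.einsum("c,chw->hw", w, fmap[0])    # (H, W) heatmap
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)  # torch.Size([64, 64])
```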


1 Like

How does everyone in the group keep up to date with the latest research/work (particularly time series, but maybe other work too)? What do you guys follow besides this group?

1 Like

@gerardo, IMHO your use case falls more under probabilistic forecasting. That’s a fancy way of saying that instead of predicting one value per time step, you predict a range of values within certain percentile intervals:

[image: forecast with percentile interval bands]

As you can see, the predictions fall within different percentile intervals. You could then decide that if an observation falls outside the 90% interval (for example), it is considered an anomaly and triggers an alert. Bear in mind, this is a simplified explanation and just one way to do anomaly detection :wink: . Another way would be to use time series classification. In your case, however, you are also forecasting your time series.
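To make the percentile idea concrete, here is a tiny numpy sketch (all the numbers are fake; assume you already have sampled forecasts per time step):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 72))   # 500 sampled forecast paths, 72 steps
observed = rng.normal(size=72)         # the actual readings

lo, hi = np.percentile(samples, [5, 95], axis=0)  # 90% interval per step
anomalies = (observed < lo) | (observed > hi)     # outside the interval
print(f"{anomalies.sum()} of {observed.size} steps flagged as anomalous")
```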

You may check @takotab’s time series forecasting module fastseq (see here above), which he implemented in fastai v2 (it’s for univariate time series, i.e. one variable), or Amazon Labs’ GluonTS tutorial. I talked about it here.

You can also search Google for time series forecasting anomaly detection. Be prepared: there is a lot of information.

4 Likes

I have a question regarding sequence data. I have some input data, for example
[[2,4,5],[4,6,8],[4,9,4],[categorical_var]], and the output would be another sequence like [8,5,9]. Do you have any suggestions on how to approach this problem, or is there a notebook available where such problems are handled? Any help is highly appreciated.
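For concreteness, here is roughly how I picture the shapes (a hypothetical PyTorch sketch; the layer sizes and names are made up, not a solution):

```python
import torch
import torch.nn as nn

class SeqWithCat(nn.Module):
    def __init__(self, n_cats=10, emb_dim=4, hidden=32, out_len=3):
        super().__init__()
        self.emb = nn.Embedding(n_cats, emb_dim)  # the categorical_var
        self.rnn = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden + emb_dim, out_len)

    def forward(self, seq, cat):
        # seq: (batch, 3 timesteps, 3 features), e.g. [[2,4,5],[4,6,8],[4,9,4]]
        # cat: (batch,) integer category ids
        _, (h, _) = self.rnn(seq)
        x = torch.cat([h[-1], self.emb(cat)], dim=1)
        return self.head(x)               # (batch, 3), e.g. [8,5,9]

model = SeqWithCat()
out = model(torch.randn(2, 3, 3), torch.tensor([1, 7]))
print(out.shape)  # torch.Size([2, 3])
```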

@ourownstory, welcome to the fastai community.

I will try to answer your question. Different approaches have been used to process time series data with deep learning. Time series tasks can be divided into 3 categories:

Time series classification/regression: these two categories can be put under the same umbrella, as they share some common background. For both, time series can be treated either as tabular data or as 2D tensors, analogous to the 3D tensors used for images (as expressed in the fastai v2 module).

For the tabular approach, you can check both @oguiza’s TimeseriesAI repo (using fastai v1) and @tcapelle’s timeseries_fastai (using fastai v2).

For the 2D tensor approach, you may check timeseries (using fastai v2). In this approach, I draw a similarity between TensorImage (a fastai v2 native class) and TensorTS (which I introduced in the timeseries module). In fact, we have the following mapping:

TensorImage  <---> TensorTS
Conv2D       <---> Conv1D
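To make the mapping concrete, here is a tiny sketch in plain PyTorch (illustrative only, not the actual module API):

```python
import torch
import torch.nn as nn

image = torch.randn(8, 3, 64, 64)   # images: (batch, channels, H, W)
series = torch.randn(8, 3, 140)     # time series: (batch, variables, steps)

conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv1d = nn.Conv1d(3, 16, kernel_size=3, padding=1)

print(conv2d(image).shape)   # torch.Size([8, 16, 64, 64])
print(conv1d(series).shape)  # torch.Size([8, 16, 140])
```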

Time series forecasting is a separate category. Lately, a lot of research has been published in this domain. LSTM seems to be one of the popular approaches, and it has shown some strong results. Time series forecasting benefits from the LSTM architecture as it inherently takes into account the sequential nature of the data, much like training a language model (predicting the next word or the next sequence of words).
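As a toy illustration of that language-model analogy, here is a minimal next-step forecaster in plain PyTorch (not any of the modules mentioned here):

```python
import torch
import torch.nn as nn

class NextStepLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):          # x: (batch, steps, 1)
        out, _ = self.rnn(x)
        return self.head(out)      # one prediction per step

series = torch.sin(torch.linspace(0, 12, 101)).reshape(1, -1, 1)
x, y = series[:, :-1], series[:, 1:]   # predict step t+1 from steps <= t
model = NextStepLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```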

You may also check @takotab’s time series forecasting module fastseq, which he implemented in fastai v2.

I hope this gives you a summary of the different modules (that I’m aware of) developed in fastai v1 and v2 for time series processing using deep learning.

6 Likes

Thanks a lot for organizing this thread. Are there any resources, either here or in other threads that you are aware of, where fastai is used for time series anomaly detection?

2 Likes

I have used the old fastai course v3 approach to apply deep learning to a dataset with a time component. I am wondering what methods exist to apply this to a big dataset when you have limited memory.

My understanding of the method is that when you create the DataBunch, all the categorical variables need to be present to build the embeddings. Is it possible, using PyTorch, to read new files from a CSV and train on data too big to fit in memory? Say I had a 50 GB tabular file and 8 GB of RAM. In the existing framework, would it be possible to chunk the training data into small pieces and keep calling learn.fit_one_cycle() on new pieces of data? I can see issues with IDs that occur that weren’t in the first chunk. Or say your categorical variables change, like month: when you use entity embeddings, it will likely mess things up. How can these methods be applied without increasing your computer specs? Is it just impossible?

Hi @tabularguy,

I was having the same issue with some of my datasets that are larger than RAM. I investigated the issue and found np.memmap. This allows you to use larger-than-RAM numpy arrays on disk almost as if they were in memory.
I’ve created a notebook to practice with np.memmap a bit. I don’t know if it may be useful to you.
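A quick illustration of the idea (file name and sizes are made up):

```python
import numpy as np

# Write: allocate a float32 array on disk and fill it in chunks.
arr = np.memmap("big.dat", dtype="float32", mode="w+", shape=(100_000, 72))
arr[:1000] = np.random.rand(1000, 72)
arr.flush()

# Read: reopen with the same dtype/shape; only the slices you touch
# are actually paged into RAM.
arr = np.memmap("big.dat", dtype="float32", mode="r", shape=(100_000, 72))
batch = np.array(arr[0:64])  # copy one minibatch into memory
print(batch.shape)           # (64, 72)
```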

3 Likes

@oguiza Thank you for the reply. Do you think this would work if, say, I had 4 years of data and wanted to predict the next half year? Within epoch training, can you use this with the DataBunch to build minibatches as well? Even if I can perform computations on this, can it be piped into the learner without causing memory issues? My understanding of memory usage in an epoch is that the dataset is stored in memory, and the current minibatch is then sent to the GPU to update the weights.

If I only had 8 GB of RAM and wanted to train on, say, a 50 GB file, can this method be used with epochs and minibatches to train the tabular learner? I am still fairly new to this.

I’ve used np.memmap as a replacement for numpy arrays. In my case I also have 8 GB of RAM, and a 20 GB dataset. Data is stored on disk, and the dataloader creates each batch on the fly.

A limitation of this approach is that an np.memmap can only store data of a single dtype, so it’d be more complex if you want to use multiple dtypes in your tabular data.
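As a sketch of the batch-on-the-fly part (plain PyTorch; the file name, shapes, and the "last column is the target" layout are made up):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Create a dummy disk-backed file so the sketch runs end to end.
np.memmap("big.dat", dtype="float32", mode="w+", shape=(10_000, 72)).flush()

class MemmapDataset(Dataset):
    """Indexes into a memmap so only sampled rows are pulled into RAM.
    Note the single-dtype limitation: everything here is float32."""
    def __init__(self, path, n_rows, n_cols):
        self.data = np.memmap(path, dtype="float32", mode="r",
                              shape=(n_rows, n_cols))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        row = np.array(self.data[i])          # copies one row into RAM
        return torch.from_numpy(row[:-1]), torch.tensor(row[-1])

ds = MemmapDataset("big.dat", n_rows=10_000, n_cols=72)
dl = DataLoader(ds, batch_size=64, shuffle=True)
x, y = next(iter(dl))
print(x.shape, y.shape)  # torch.Size([64, 71]) torch.Size([64])
```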

So, for example, if I wanted to work with Rossmann-type data that is 50 GB and has several different types of columns, some numerical and some categorical, could the np.memmap only hold one column of data, or could it hold all of the character columns? I am also wondering how things like entity embeddings would work in this case. Sorry, I am trying to avoid having to upgrade my hardware in order to train on bigger data.