Time series/ sequential data study group

geoHeil · April 19, 2021, 1:55pm

I also struggled with this initially. @Jdemlow @duality, perhaps you will find these discussions useful for getting started with tsai on custom datasets/tasks: https://github.com/timeseriesAI/tsai/issues/25 and https://github.com/timeseriesAI/tsai/issues/85.

Jdemlow · April 19, 2021, 1:57pm

I will take a look at these, thanks a lot. I will edit this post to say whether they were helpful.

I think the discussions will be something I can use to learn more. It is interesting that the library focuses on modeling only. Pre-processing isn’t ignored, but there is a recognition that there are so many different kinds of pre-processing that it’s too many to actually have tutorials on.

Maybe a good solution is that we all post examples on this thread, so that everyone who comes later can see how it works and we get more people using this magical library.

If I can solve my problem I will post the solution here (redacted, of course).

My problem right now is

Many different venues with many different lifts have varying wait times.

For our POC we used VAR and a deep LSTM model, getting around 1.30 MAE. The first stab with tsai gave 1.15-2.00 MAE depending on the lift.

What I would ideally love to do is something that https://github.com/timeseriesAI/tsai/issues/85 talked about: stacking the datasets per lift as “sessions/windows” and having the model take these into account. This problem is interesting because each lift is correlated with the others.
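One possible reading of the stacking idea is to treat each lift as one variable (channel) and cut the shared timeline into windows. The following is only a minimal sketch under stated assumptions: the wait times are already aligned on a common time grid in a hypothetical array `waits` of shape (n_steps, n_lifts), and `make_windows`, `window_len` and `horizon` are illustrative names, not part of tsai. The output follows the (samples, variables, time steps) layout that tsai uses for X.

```python
import numpy as np

def make_windows(waits, window_len, horizon=1):
    """Cut an aligned (n_steps, n_lifts) array into supervised windows.

    Returns X with shape (n_windows, n_lifts, window_len) and
    y with shape (n_windows, n_lifts): the wait time of every lift
    `horizon` steps after the end of each window.
    """
    waits = np.asarray(waits, dtype=float)
    n_steps, n_lifts = waits.shape
    n_windows = n_steps - window_len - horizon + 1
    if n_windows <= 0:
        raise ValueError("series too short for this window_len and horizon")
    X = np.stack([waits[i:i + window_len].T for i in range(n_windows)])
    y = np.stack([waits[i + window_len + horizon - 1] for i in range(n_windows)])
    return X, y

# Toy data: 10 time steps, 3 lifts (values are made up for illustration).
waits = np.arange(30, dtype=float).reshape(10, 3)
X, y = make_windows(waits, window_len=4, horizon=1)
print(X.shape, y.shape)  # (6, 3, 4) (6, 3)
```

Because every window contains all lifts side by side, a model can in principle pick up the correlation between lifts; whether that beats one model per lift would have to be checked on the real data.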

I will flesh out the objective once I get a better example going.

tombh · April 19, 2021, 2:27pm

I’m an open source software maintainer myself, so I very much know the struggles of both creating complex projects and dealing with users. So above all, the first thing that has to be said is: it’s impossible to please everybody all the time. There are always things that can be improved.

I’ve found the tsai project to be both useful and inspiring. It brings together an unfathomable amount of cutting edge technology, wisdom and experience. The fact that it is free to the world is extraordinary.

I’m not so much of a newbie yet I still very much sympathise with the comments here about the project’s inaccessibility. As always, there are some clear areas for improvement, for example documentation formatting, global imports, tests and types. Attending to those would certainly make things easier for newcomers (even experienced devs become “newbies” when returning after an absence), but even then it’s not going to overcome the inherent difficulties in grappling with the fundamental complexities of this field.

What’s abundantly clear is that the spirit of this project is generous and welcoming, which more than makes up for the strains of climbing the learning curve. So whilst it would certainly be nice to see some “UX” improvements, I’d never want that to overshadow the gratitude that all this wonderful work is made so freely available.

Jdemlow · April 19, 2021, 2:51pm

Couldn’t agree more with the sentiment. This is amazing, has huge unlock potential, and is on the bleeding edge of everything. I love this community and have nothing but gratitude. I have been following this forum just to read it because it is so cool.

fastai has always been about getting people in the door and letting them go wild with helping, adding, or coming up with ideas. Jeremy talks about how huge advancements in DL don’t always come from PhDs or software developers. The time series barrier for me, and I am sure for others, is getting a dataset into the np.arrays this library expects in a way that makes sense. The structure of the data and the sequence ordering are so important, and I think getting that understanding is crucial to making strides, though maybe I am wrong there. I have always found that once I get the underlying structure of the data, the rest falls into place.

To be clear, this wasn’t a complaint. It was sorrow that I can’t yet use the library, and that the UCR datasets are extremely odd. That takes nothing away from this fantastic work.

duality · April 20, 2021, 9:37pm

Thank you for the replies everyone. I get a sense of the high levels of support offered here. I wish I could share the sentiment about the library but I can’t get started so I have nothing to share, though I’m sure I would be grateful if I did!

So now I’m tossing up whether to persist with image classification, drag myself over the coals and learn tsai, or learn the more time-tested and thoroughly documented methods like random forests.

Which one would be the quickest to learn from here? As stated above, I am fluent (not expert) in python, and have custom datasets that can be rebuilt as necessary for any implementation.

Daniel.R.Armstrong · April 21, 2021, 9:00am

@duality if you are starting out I would take a look at the Rossmann sales competition. Jeremy has talked about it in several courses. He uses the entity embeddings method, which is very well documented in lots of blog posts.

rossman_data_clean.ipynb
and
lesson6-rossmann.ipynb

@jdemlow
If you want to look at an example using the M5 competition, SidNg is a fastai student who did very well on the M5 competition using the entity embedding method.
His fastai post
His blog post
His kaggle kernel

If I ever have time, I think it would be a lot of fun to create a study group or video content that helps people learn about time series forecasting using deep learning methods.

oguiza · April 21, 2021, 4:16pm

Thanks for your feedback @duality. I’m sorry to hear this. It wasn’t my intent to make it so difficult.

I understand how frustrating it is to try to learn something and find that for whatever reason you can’t make substantial progress. I’ve been in that situation (and continue to be) quite often.

I’ll give you a bit of background on tsai. I’ve been working with time series for a few years now. When I started, I tried to apply DL to time series, but couldn’t find any library that met my needs. So I decided to create my own library. I did it while I took the fastai course. I always applied everything I learned to time series. Initially, I built it in a private repo. But later, I decided to make it public and share it with the fastai community, just in case anybody else might find it useful. I did this because I’m very grateful for everything I’ve learned here, and wanted to give something back. This is why it’s open-source. I share everything I build for my own use.

Unfortunately, I have my family and my own work, and I can only work on tsai in my spare time (which tends to be very limited). I’d love to be able to dedicate more time to tsai because I have lots of ideas I’d like to implement. But quite frankly I can’t.

I never intended to build a structured course (like fastai) that anybody with a little background on time series could follow. I have created a few tutorials though to try and demonstrate how the library can be used. But they are very time-consuming.

Data preparation is a particularly complex area. There are many types of data sources, formats, labels, etc. That’s why I have built some functionality to try to show what the input to tsai needs to look like. But apart from that, data preparation is out of the scope of this library.

Having said this, I’d be more than happy to have contributors to tsai. Actually, the last 2 tutorial nbs have been prepared in conjunction with other forum participants (@williamsdoug and @Pomo).

I know the library can be improved in many different ways. And I appreciate any feedback I get. I always try to improve.

williamsdoug · April 21, 2021, 6:52pm

Hi @duality. Building on the comments by @vrodriguezf and @oguiza, timeseries analysis can be very challenging since it is in a less mature state than other machine learning problems such as image, language and tabular data. While fastai offers both high-level (turn-key) and intermediate-level APIs, you’ll likely need to use the more complex intermediate-level APIs for timeseries analysis, such as the APIs provided by tsai.

To the extent that you are able to transform a timeseries problem into an image problem (e.g. a plot or spectrogram image) and the results are good enough, then that is a fine approach. I believe Jeremy Howard discussed this in one of his lectures and I’ve seen this approach used in some of the Physionet Challenge competitions.
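As a rough illustration of the spectrogram route, here is a minimal sketch that turns a 1D signal into a 2D magnitude array using only NumPy. The function name `spectrogram`, the frame length, the hop size and the synthetic sine wave are all assumptions made for the example; a real project would more likely use an established signal-processing library.

```python
import numpy as np

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (frame_len // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic test signal: a 50 Hz sine sampled at 512 Hz for one second.
fs = 512
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 50 * t)

spec = spectrogram(sig)
print(spec.shape)  # (33, 15)
```

The resulting 2D array can be saved or plotted as an image and then fed to an ordinary image classifier.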

To the extent that image conversion proves insufficient, you’ll likely need to deal with the more complex intermediate-level APIs. tsai includes a very powerful set of models and data transforms, but at the cost of a steeper learning curve. As with many packages still in the early development stage, documentation is sometimes limited. The tutorials can help, particularly if you add your own code to print out sample data (or types) when the documentation is insufficient.

In terms of your specific question, “What is X and y?”: X and y are a fairly standard convention used in various machine learning packages such as scikit-learn, where:

X is the input data (the features)
y is the label data (the targets)

Index ordering can vary by package, but is usually obvious if you print or plot the first couple rows or columns.
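To make the convention concrete, here is a small sketch with random placeholder data. It assumes the 3D layout tsai uses for X, (samples, variables, time steps), with one label per sample in y; the sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_vars, seq_len = 8, 2, 50   # made-up sizes
X = rng.normal(size=(n_samples, n_vars, seq_len))  # inputs: one row of X per sample
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])             # labels: one entry of y per sample

# Printing shapes and a slice is usually enough to see the index ordering.
print(X.shape)        # (8, 2, 50)
print(y.shape)        # (8,)
print(X[0, :, :3])    # first sample, all variables, first three time steps

assert len(X) == len(y), "X and y must describe the same samples"
```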

duality · April 27, 2021, 11:27am

@Daniel.R.Armstrong @oguiza @williamsdoug

Thank you for the replies everyone, and I must apologise if I came across as critical of the work you are doing here. I guess I was venting some frustration. I totally understand the amount of effort it takes to put forward something like this for free for the community. Then to have someone come along and complain must seem harsh.

I must admit that I overcame some of my fear by dissecting the UCR examples detail by detail. I found a dataset that is similar to mine and modelled it. I am still stuck, although now I have manually created my X, y and split variables. I just realised that the shape of the X data is back-to-front from what I expected, so I will have to rework the whole thing. Hopefully it’s a simple fix.
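If “back-to-front” means that the time-step and variable axes are swapped (an assumption, since the post does not say which axes are involved), the fix can be a single axis swap rather than a full rework. A minimal sketch with a toy array:

```python
import numpy as np

# Assumed starting layout: (samples, time steps, variables).
X_wrong = np.arange(2 * 5 * 3).reshape(2, 5, 3)

# Swap the last two axes to get (samples, variables, time steps).
X = np.swapaxes(X_wrong, 1, 2)

print(X_wrong.shape, "->", X.shape)  # (2, 5, 3) -> (2, 3, 5)

# The values are only reordered, never changed: variable 0 of sample 0
# over time is the same sequence in both layouts.
assert np.array_equal(X[0, 0], X_wrong[0, :, 0])
```

Note that `reshape` is not a substitute here: it keeps the memory order and would mix variables and time steps together.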

I am hoping it is simple because I don’t know what I’m doing and if I get an error it’s unlikely I’ll be able to fix it easily. But so far everything else seems to be loading ok. If I get a valid prediction tomorrow I will be ridiculously happy.

duality · April 28, 2021, 12:29am

It didn’t work. My accuracy never gets any better than 50/50 for a classification task.

How do I troubleshoot this? Where should I go, what should I read?

Thanks

oguiza · April 28, 2021, 7:29am

Can you share a gist showing what you are doing? Otherwise, it’s difficult to help with the information you have provided so far.

gkumarg · April 29, 2021, 1:11pm

@oguiza First of all, thank you for your valuable contributions!

I am looking for some guidance on a time series classification project.
I have a dataset with 1500 sample meters, 1 feature and 1440 time intervals of data. The twist is that I am also given a target variable for each of the time intervals.
So it is not the typical y shape of (1500, ). I need to do target prediction for each time interval in the test set as well. I could not find a way to prepare my data using df2xy mentioned in tsai. What is the best way to approach this problem?
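For reference, the array shapes described above can be sketched as follows. The data here is random placeholder data and the binary targets are an assumption; the sketch only illustrates the shapes and does not show how df2xy or any particular tsai model would handle a per-interval target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meters, n_features, n_steps = 1500, 1, 1440

# Placeholder data standing in for the real meter readings and targets.
readings = rng.normal(size=(n_meters, n_steps))         # one row per meter
targets = rng.integers(0, 2, size=(n_meters, n_steps))  # one label per time step

X = readings[:, np.newaxis, :]   # (samples, variables, time steps)
y = targets                      # (samples, time steps) instead of (samples,)

print(X.shape, y.shape)  # (1500, 1, 1440) (1500, 1440)
```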

This is what the data looks like, if it helps for visualization:

vrodriguezf · April 29, 2021, 2:57pm

Maybe you want to approach it as a segmentation problem? There is one discussion around that topic in the tsai discussion section:

gkumarg · April 30, 2021, 12:55am

Thanks @vrodriguezf. I see the situation is the same. I just need to figure out how to get my input data into a form that tsai/minirocket can use.

duality · April 30, 2021, 7:54am

I’m not sure what to share since the data itself can be created in so many ways. E.g. I can either predict price movement up or down (categorical) or a target price in the future. I can also use up to 14000 variables or as few as 1, but I’ve tried 1 to 8 variables and it’s the same result: 50/50. My problem is broad. Like, could it be that the way I have upsampled the data is the problem? Or do I need to include more variables, like maybe 50 to 100? Or do I need to use sliding windows? Or do all my predictions need to be the same number of steps in the future, or can they vary by sample? Or should I persist with image classification or tabular versions instead? Or… there are like a million questions and different things I can change, but what I am asking is how I can figure out for myself what I need to learn/change/do differently. All the discussions here are so high level and beyond my current understanding that the leap from where I am to where you guys are seems like walking on the moon.

duality · April 30, 2021, 8:13am

@oguiza also, thanks for being patient, even though I was rude.

I just thought of something looking at your regression notebook. Would there likely be a material difference in accuracy if one model was set to predict a 50/50 increase/decrease in stock price (categorisation) vs predicting the stock price itself (regression), using the exact same set of data and all else otherwise the same? In other words, are some problems likely to be more accurate as a regression problem than as a categorisation problem? My initial intuition is that categorisation would be easier, but as I think about it, maybe regression helps the algorithm learn quicker?
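Both framings can be derived from the same price series, which makes it possible to compare them empirically on identical inputs. A minimal sketch with made-up prices; `make_targets` and `horizon` are hypothetical names used only for illustration:

```python
def make_targets(prices, horizon=1):
    """Build regression and classification targets from one price series.

    For each step t with a known price `horizon` steps ahead:
    - regression target: the future price itself
    - classification target: 1 if the future price is higher than today's, else 0
    """
    future = prices[horizon:]
    current = prices[:len(prices) - horizon]
    y_reg = list(future)
    y_clf = [1 if f > c else 0 for f, c in zip(future, current)]
    return y_reg, y_clf

prices = [100.0, 101.5, 101.0, 102.0, 102.0]  # made-up values
y_reg, y_clf = make_targets(prices, horizon=1)
print(y_reg)  # [101.5, 101.0, 102.0, 102.0]
print(y_clf)  # [1, 0, 1, 0]
```

The class label discards the size of the move, while the regression target keeps it, so the two targets carry different amounts of information even though they come from the same data.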

remapears · May 6, 2021, 6:17pm

Dear Dr. Ignacio,

I have tried this latest tutorial for multi-label classification. I only have a problem with testing on a new set of data.
I tried doing this :

valid_dl = dls.valid

test_ds = valid_dl.dataset.add_test(X_test, y_test.values)

test_dl = valid_dl.new(test_ds)

_, temp_targets, temp_preds = learn.get_preds(dl=test_dl, with_decoded=True, save_preds=None, save_targs=None)

but I get predictions as :

So is this how I should be getting the predictions? If yes, how would I be able to specify the labels’ names and their corresponding columns?

Thank you for all your help!

remapears · May 7, 2021, 11:18am

Updates:
I was able to map the integers to their corresponding labels by manually investigating the test instances of the original test set … Then, I computed metrics for each label:
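When the label names are known in column order, the manual mapping can be replaced by a simple lookup. The sketch below is plain Python and makes several assumptions: `vocab` is a hypothetical list of label names in the same order as the prediction columns (fastai DataLoaders usually expose such a list as `dls.vocab`, which is worth verifying for your setup), the probabilities are made up, and 0.5 is an arbitrary threshold.

```python
def decode_multilabel(probs, vocab, threshold=0.5):
    """Turn per-label probabilities into lists of label names."""
    decoded = []
    for row in probs:
        decoded.append([name for name, p in zip(vocab, row) if p >= threshold])
    return decoded

vocab = ["label_a", "label_b", "label_c"]   # hypothetical label names, in column order
probs = [
    [0.91, 0.08, 0.62],   # made-up predicted probabilities, one row per sample
    [0.10, 0.77, 0.30],
]
print(decode_multilabel(probs, vocab))
# [['label_a', 'label_c'], ['label_b']]
```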

shado · May 11, 2021, 2:02am

Hi everyone,
This looks like an interesting group. I need to do time series forecasting and am hoping deep learning can work.

This might be a stupid question, but when you talk about number of samples, does this mean one long time series split into multiple parts, similar to a sliding window?

What if I have multiple datasets from different sources? Can the samples come from different sources and not be split up? I.e. use each whole series as a sample and have many different samples?
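Both readings of “sample” can be illustrated with a short sketch. It assumes fixed-length windows and sources recorded at the same rate; `sliding_windows`, the window length and the stride are illustrative choices, and the two sources are made-up data.

```python
import numpy as np

def sliding_windows(series, window_len, stride=1):
    """Split one 1D series into windows of shape (n_windows, 1, window_len)."""
    series = np.asarray(series, dtype=float)
    starts = range(0, len(series) - window_len + 1, stride)
    return np.stack([series[s:s + window_len] for s in starts])[:, np.newaxis, :]

# Two made-up sources of different lengths.
source_a = np.arange(10)        # 10 steps
source_b = np.arange(100, 107)  # 7 steps

Xa = sliding_windows(source_a, window_len=4, stride=2)  # windows start at 0, 2, 4, 6
Xb = sliding_windows(source_b, window_len=4, stride=2)  # windows start at 0, 2

# Windows never cross a source boundary; they are only stacked as samples.
X = np.concatenate([Xa, Xb])
print(Xa.shape, Xb.shape, X.shape)  # (4, 1, 4) (2, 1, 4) (6, 1, 4)
```

Using each whole series as a single sample fits the same 3D layout only when all series have the same length, since a regular array cannot hold rows of different lengths.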

Hope that makes sense.

Thanks in advance!

williamsdoug · May 11, 2021, 10:01pm

Hi @remapears. I’ve created an updated version of the 01a_MultiClass_MultiLabel_TSClassification.ipynb tutorial notebook to help answer your questions. Examples of label mapping are shown in cells 16-25 for multi-class and cells 44-51 for multi-label.

The updated tutorial is currently available as a gist at either:

Let me know if this addresses your questions. If so, I’ll submit this update to tsai as well.