Time series/ sequential data study group

Yeah, just a fantastic read mate; it really helped me see where the key variables are and how to make the best use of the .min and .max functions. Can’t wait to start applying this stuff.

Hi everyone!

I want to give a time series classification problem a try, probably using an image representation of some sort.

The data I have is unevenly spaced/sampled and of different lengths, so I was wondering what steps some of you have taken to deal with these issues? Are there common best practices for this kind of time series?

Thanks!


(Edited to reply to neoyipeng.)

Hi Neo,

Thanks for the data-generating notebook. Perhaps I am a bit lazy, but to learn from your part 2 notebook I only need the file WSE_labels.csv and the folder cnn_input. The folder cnn_input is already on your github page. If you would upload the file, I can easily get started!

Another option is to post a read-only link to your gdrive cnn folder, either in a private message to me or publicly if you are ok with anyone who has the link getting read access. I learned only today about the feature to share gdrive folders.

Quandl looks like a great resource, especially if you choose a free database. Thanks for finding it.

Finally, I agree with @oguiza that augmentation on the data images is unlikely to help accuracy. It would amount to some unintuitive operation on the function that generates the image. I’d hate to figure out what it does to the math. Still, it is worth an experiment - you might discover something amazing.

P.S. I did try the notebook WSE_0.3D_Open_Close_Volume_GAF.ipynb, but it references the file ‘WSE_metadata.csv’, which is not in the github repository.

P.P.S. ‘WSE_metadata.csv’ can be downloaded from Quandl with…
https://www.quandl.com/api/v3/databases/WSE/metadata?api_key=yourapikey

However, Quandl then sends back errors for gen_labels() (API request limit).

WHICH COLOR MAP?

In terms of turning a time series into a GASF, it’s possible to choose the cmap (colormap) for the image.

So far I have been choosing “rainbow” because it makes use of the most colours. However, I notice that in this post a green color map is used. Is it “viridis”?

Question:

Which color map do you use to create a GASF that is fed into a resnet50 CNN?

        # GAF transformations (series_scaled: a scaled array of shape (n_samples, n_timestamps))
        from pyts.image import GASF   # pyts < 0.8 API
        import matplotlib.pyplot as plt

        image_size = 80
        gasf = GASF(image_size, sample_range=None)
        X_gasf = gasf.fit_transform(series_scaled)

        # Show the result for the first time series
        plt.figure(figsize=(8, 8))
        plt.imshow(X_gasf[0], cmap='rainbow', origin='lower')
        plt.title("GASF", fontsize=16)

@oguiza, any thoughts?

If your question is more about which specific cmap gives better performance, I don’t think it makes much of a difference. I’ve seen differences with particular TS examples, but not one that is consistently superior to the others. But you may want to test a few of them.
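A minimal way to run such a test, sketched under the assumption that the X_gasf array from the code above is available (the candidate cmaps are just illustrative):

    import matplotlib.pyplot as plt

    # render the same GASF under a few candidate colormaps, side by side
    cmaps = ['rainbow', 'viridis', 'jet', 'gray']
    fig, axes = plt.subplots(1, len(cmaps), figsize=(16, 4))
    for ax, cmap in zip(axes, cmaps):
        ax.imshow(X_gasf[0], cmap=cmap, origin='lower')
        ax.set_title(cmap)
        ax.axis('off')
    plt.show()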

I’ve found a few ways to use the encoders that are faster than going through a matplotlib figure.

  1. Calculate it and apply a cmap without creating a figure (faster than the previous method):

     # GAF transformations
     image_size = 80
     gasf = GASF(image_size, sample_range=None)
     X_gasf = gasf.fit_transform(series_scaled)

     cmap = 'jet'  # you can select any plt cmap
     # apply the cmap to a single sample, X_gasf[0], so the permute below
     # turns a (size, size, 3) array into a (3, size, size) tensor
     Image(torch.Tensor(plt.get_cmap(cmap)(X_gasf[0])[..., :3]).permute(2, 0, 1))
    
  2. Calculate it and add it to a 3D array of zeros (as a single channel):

     ImgArr = np.zeros((3, image_size, image_size))
     ImgArr[0] = X_gasf[0]   # single sample into channel 0; other channels stay zero
     Image(torch.from_numpy(ImgArr))
    
  3. Or load it as 3 channels:

     ImgArr = np.zeros((3, image_size, image_size))
     for i in range(3):
         ImgArr[i] = X_gasf[0]   # copy the same sample into all 3 channels
     Image(torch.from_numpy(ImgArr))
    

I’ve tested these options with some time series and the results are pretty similar. I tend to use the 2nd option now because it’s faster, and I can do it on the fly, thus avoiding the need to save all the images first.
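To illustrate the on-the-fly idea, here is a minimal sketch that wraps option 2 in a helper (series_to_image is a hypothetical name; it assumes the pyts GASF API used earlier in this thread and fastai v1’s Image):

    import numpy as np
    import torch
    from pyts.image import GASF
    from fastai.vision import Image

    def series_to_image(series, image_size=80):
        # encode one scaled 1D series as a GASF and place it in channel 0
        gasf = GASF(image_size, sample_range=None)
        enc = gasf.fit_transform(series.reshape(1, -1))[0]  # (image_size, image_size)
        arr = np.zeros((3, image_size, image_size), dtype=np.float32)
        arr[0] = enc  # the other two channels stay zero
        return Image(torch.from_numpy(arr))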


Great,

I mainly wanted to know whether one particular cmap had superior performance.

It’s amazing that a ‘rainbow’ can be processed so well by the CNN, especially considering how much the pixel values vary for a linear increase in the series value.

I’ll show you the model I’ve got going now, which classifies children as ADHD or control based on their pupil dilation.

Great! Sounds interesting!

Hello everyone.

First of all, thank you very much for providing so many resources in here; it’s a pleasure (and kind of an overload as well :-)) to have so much information regarding time series analysis in one place. I could use some input on a project of mine.

I am working on non-destructive material analysis. I get a time series of an amplitude, and as a first step I want to label each time series with a class (0/1/2) and learn which features of the time series best predict the class.

Is it viable to first do feature extraction (e.g. with FRESH, or is there something else that would work better for a periodic signal?) and then treat this as a deep learning classification problem?
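(For concreteness: the FRESH algorithm is implemented in the tsfresh package. A minimal sketch of that route, assuming a long-format DataFrame df with columns ‘id’, ‘time’ and ‘amplitude’, and a label vector y indexed by id:)

    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    # extract a large set of candidate features per time series
    X = extract_features(df, column_id='id', column_sort='time')
    impute(X)                      # replace NaN/inf produced by some extractors
    X_sel = select_features(X, y)  # keep only features relevant to the labels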

Sorry for my newbish questions and thanks for your input.

Hi @marteen,
I’d like to help you but I’d need to have some more details on your data.
I’d need to know:

  • How many samples do you have?
  • Is it a univariate time series, or multivariate? If multivariate, how many channels?
  • What is the sequence length?

In general, there are at least 3 common approaches you can use with deep learning and time series:

  • raw data
  • transform raw data into images (like GASF, GADF, MTF, RP, etc.). One option that might be interesting in your case is to transform the TS into a spectrogram, since you mention that the signal is periodic (see the sketch after this list).
  • calculate features
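A minimal sketch of the spectrogram option, assuming a 1D signal x sampled at fs Hz (scipy’s default parameters would need tuning for a real signal):

    import numpy as np
    from scipy import signal

    # short-time Fourier transform: rows are frequency bins, columns are time windows
    f, t, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)
    img = np.log1p(Sxx)  # log scaling usually makes a better CNN input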

Hi @oguiza,

amazing, thanks for your fast answer.

I don’t have any samples right now since I am planning to take measurements myself. Let me explain.

I am using a sensor that records an electric signal (univariate time series). I am interested, long term, in predicting material properties with the signal of the sensor. Sadly I don’t have the possibility to create a data set with good labels for that purpose just now. So my plan is the following:

Take 12 material pieces (same material, but from 3 different positions in a material coil, because we suspect the material properties vary over the coil) and take 10 measurements per piece. The label for each time series is the position in the coil that the material piece is from. My main question is then:

What features are most relevant for the prediction of the position of the piece?

It makes sense (and is standard practice) to do an FFT to get a spectrogram of the samples. Hypothetically, does it make sense to feed the resulting spectrogram to a CNN? Are there best practices for converting such a spectrogram to a visual representation where CNNs work best?

Not sure about the sequence length atm, since I am exploring the sensor right now. The measurement duration is 1s and I measure between 200 and 600 Hz, so between 200 and 600 data points.

Thanks again.

I think it’s impossible to know a priori which features will be most relevant. And you may not even need that if you use a DL approach.
If your sequence length will be around 200-600, you won’t need to create summary (rolling) features. Those are useful when you have super long time series (seq len > 1000s).
For this type of problem I would use as input just the raw data or an image transformation. The model will try to identify the most relevant features.
A concern I’d have is the number of samples. If it’s too small (<100) you may still be able to use DL, but in that case you may be better off using the conversion into an image and transfer learning (using a pretrained vision model), similar to the Olive Oil model I demo’d before in this thread.
Having said this, this is a very experimental area, meaning you’ll need to run tests to really learn what works and what doesn’t.
Please let me know if you need any more clarification on my response.

You got me wrong here, since I was not precise enough: this is my a posteriori research question, not my question for you. I want to know what in the time series made the model predict position 0/1/2.

Again, thanks for your answer. You are right, it is very experimental, and next week I will start generating the data. The data is pretty sparse, but it’s just a start to see if the experiments lead to something.

Ok, I understand.
Some models will allow you to visualize which parts of the time series contribute to the predicted label. This is a bit different from a feature, though.
Good luck!

No worries, I formatted my article into markdown on my GitHub here 🙂

A colleague told me about this paper that has supposedly beaten the winner of the M4 competition using a pure RNN model. It could be interesting to see if the results can be replicated!


Just incredible.

One question:

When you use 3 encoders per image, how is the final image arranged?
i.e. are they panels side by side, overlaid, etc.?


@oguiza

Thanks for all the assistance in coming to understand GASFs, especially how to scale data appropriately to the range [-1, 1]. I’m looking forward to publishing the data soon.
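(For anyone else following along, a minimal sketch of that scaling: min-max into [-1, 1], which the GASF encoding requires because of its arccos step.)

    import numpy as np

    def scale_series(x):
        # min-max scale a 1D series into [-1, 1] before GASF encoding
        x = np.asarray(x, dtype=np.float64)
        return 2 * (x - x.min()) / (x.max() - x.min()) - 1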

Here’s an interesting application for TSC. In August I’ll be studying in a psychiatric hospital where patients routinely undergo Electroconvulsive Therapy (ECT). Before and during the procedure their brainwaves are measured with EEG.

Hypothesis: the pre-shock EEG predicts depression severity and the post-shock EEG predicts improvement after ECT.

Challenges: the EEG is a 5-channel (at least) signal, i.e. a multivariate time series.

Where should I start if I’m considering feeding the EEG data into a resnet as a multivariate time series (a method mentioned in one of the papers pinned to this thread)?
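One possible starting point, sketched under the assumption that each EEG channel is encoded separately (e.g. as a GASF) and the encodings are stacked as image channels; the first conv layer of a torchvision resnet can then be swapped to accept 5 channels instead of 3:

    import torch.nn as nn
    from torchvision.models import resnet50

    model = resnet50(pretrained=True)
    # replace the 3-channel RGB stem with a 5-channel one for the 5 EEG channels;
    # this layer loses its pretrained weights and is trained from scratch
    model.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)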


Hi everyone,

First I wanted to thank you for all the content in this thread, this has been a great read. I’m not completely new to Fastai but this is my first post in the Forums, and I wanted to get your thoughts about a time-series problem I wanted to work on.

I’m looking to do anomaly detection, predicting whether a machine will fail by looking at IoT sensor data. I’ve found a toy dataset in this repo that consists of 124K rows with one date column (with daily aggregation by device), 9 attributes (some categorical, some numerical), and one binary target (1 for failure, 0 for no failure). When we look at the data by rows, only 0.01% are failures, but when we look at unique devices, it’s around 10%. So there’s also a severe class-imbalance problem when we want to predict failures (1).

Here’s a way to solve this using boosting methods, but I’m interested in exploring the possibility of using Fastai, especially LSTMs/RNNs. Would this be a good example for these techniques, or am I better off with boosting?

Looking at some of the last lessons from Part 1 of the course, I converted the time-series data into a bunch of categorical variables (DayOfWeek, EndOfMonth, etc.) and fit that with a TabularLearner using different loss functions/metrics (weighted F1, AUROC, etc.), but I cannot get a good result without a lot of false positives. Do you have any recommendations for reducing false positives, or should I discard the tabular approach?
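(One standard lever for the false-positive/false-negative trade-off is class weighting in the loss; a minimal sketch, where the 100x weight on the failure class is purely illustrative and would need tuning:)

    import torch
    import torch.nn as nn

    # up-weight the rare failure class (index 1); tune the ratio on a validation set
    loss_func = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 100.0]))
    # with a fastai v1 TabularLearner, this could be assigned via learn.loss_func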

Thank you very much for your time and have a great weekend!


Hello,

(Quoting my earlier post:) Not sure about the sequence length atm, since I am exploring the sensor right now. The measurement duration is 1s and I measure between 200 and 600 Hz, so between 200 and 600 data points.

I have produced some data and it appears I was completely wrong.

Actually, the sensor measures at a frequency of 4 MHz, so every measurement has a length of 4,000,000 samples. After applying an FFT and a moving average over time and frequency, this reduces to 128 (frequency bins) × 7,500 data points.

I suspect that this is way too big to convert to an image and use a CNN for classification? I have read about time series lengths of around 7,600 being encoded as images and then classified by a CNN.

If anyone has experience with classifying very long time series, I would be thankful for any hint; sadly, I have almost no experience in signal processing. Sorry again for my dumb questions. I am a mathematician making the transition into the field of ML.

Hey @marteen,
in order to get some ideas of how to deal with your massive number of timesteps, maybe have a look at this kaggle competition:


The signals there consisted of 800,000 timesteps, which is less than yours but still much too large for standard time-series-to-image methods…

What most people there did was calculate summary statistics and/or features on windowed subsamples of the data, generating a multivariate time series with fewer samples. For example, using a window size of 800 steps, you could transform the 800,000 timesteps into a more manageable 1,000 timesteps of, say, min/max/mean/median, and then continue working with that time series (a sketch follows below). There is a multitude of features/statistics you can use; have a look at the kernels in that competition.
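A minimal sketch of that windowing, assuming a 1D numpy array x with 800,000 samples:

    import numpy as np

    window = 800
    # trim to a multiple of the window size, then view as (n_windows, window)
    chunks = x[: len(x) // window * window].reshape(-1, window)
    # one multivariate timestep per window: min / max / mean / median
    features = np.stack([chunks.min(axis=1), chunks.max(axis=1),
                         chunks.mean(axis=1), np.median(chunks, axis=1)], axis=1)
    print(features.shape)  # (1000, 4) for the 800,000-sample example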

You can then still decide to use the generated data with an image approach, or instead use GBMs, LSTMs, etc., as people did in the competition.
