This is my first time posting something to the forum so forgive me if I do something wrong. I was wondering if anybody had any advice on how to model time series image data.
Basically, I have ~90,000 satellite images captured at 10-minute intervals that I would like to use to predict the amount of solar generation in a particular region in the future. The label of each image is the actual solar generation at that time. The issue I am having is that the high correlation between sequential images produces high validation results, but this does not hold when I use a test set from outside the train/validation time range.
Any advice would be much appreciated.
How do you create the train/validation split? Maybe the problem is simply there.
I originally did it randomly and thought it worked very well. Unfortunately, when I went to classify brand new images, my accuracy went from 83% to 25%. I’ve now split the train/valid data based on time: 2015-2017 is train and 2018 is test.
Now I have a major overfitting problem on my hands which I’m trying to sort out.
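A time-based split like that can be done in a couple of lines if you keep the image paths and timestamps in a DataFrame. This is just a minimal sketch with made-up column names (`path`, `timestamp`, `generation_mw`) standing in for your actual metadata:

```python
import pandas as pd

# Toy metadata: one row per image, with a capture timestamp and a label.
# Column names are hypothetical -- adapt them to your real metadata file.
df = pd.DataFrame({
    "path": [f"img_{i}.png" for i in range(8)],
    "timestamp": pd.to_datetime([
        "2015-03-01", "2015-09-01", "2016-02-01", "2016-08-01",
        "2017-01-01", "2017-07-01", "2018-04-01", "2018-10-01",
    ]),
    "generation_mw": [10.0, 42.5, 15.3, 38.1, 12.7, 40.0, 14.2, 36.9],
})

# Chronological split: everything before 2018 trains, 2018 is held out.
train = df[df["timestamp"] < "2018-01-01"]
test = df[df["timestamp"] >= "2018-01-01"]

print(len(train), len(test))  # 6 2
```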
Would you please describe how your training data is set up? That would help in understanding why overfitting occurs.
Is there a data leak between val and train?
Is test data completely different from train/val data?
The test data is somewhat different from train/valid. The images are lidar images of the wind field, with an image captured every 6 minutes. The train/valid is randomly split 80/20 from the years 2015-2017 and the test is 2018. The test should be similar, but not exactly the same, since wind is chaotic in nature.
How do you handle time series nature of the data?
I’m not entirely sure what you mean. The training data is now 2015-2017 with images captured every few minutes. The test/valid data is 2018.
The title of this thread says Time series cnn data, and you mention a sequence of images as well, so I thought you were somehow capturing the time series structure. Do I understand correctly that you are doing standard image classification?
If you get high train accuracy and low validation accuracy, then your model is for some reason memorizing the training data instead of generalizing. Make sure there is no data leakage between train and val, i.e. that validation is not somehow done on training data. Also try standard regularization techniques (dropout, weight decay, data augmentation, etc.).
If your val accuracy is high and test accuracy is low, make sure your data sets have similar properties. Otherwise your model has learned everything it can from historical data, but that is of no use for predictions.
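A cheap way to rule out one common form of leakage is to check that no sample identifier (file path, frame ID, etc.) appears in both splits. A minimal sketch with made-up file names:

```python
# Hypothetical sample IDs for each split -- in practice, build these sets
# from the file lists actually fed to your training and validation loaders.
train_ids = {"img_0.png", "img_1.png", "img_2.png", "img_3.png"}
val_ids = {"img_4.png", "img_5.png"}

# Set intersection reveals any sample present in both splits.
overlap = train_ids & val_ids
assert not overlap, f"Data leakage: {sorted(overlap)}"
print("no overlap between train and val")
```

With highly correlated frames taken minutes apart, even disjoint file sets can leak information, which is why a split by time (rather than by file) is the safer check.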
What model architecture are you using?
Since your test set is in the future, you can split your train/val set such that the val set contains future data points not in the training set. In other words, use 2015 and 2016 data to train the model and 2017 data for validation. This configuration reflects your test condition, so the model is more likely to learn features that help it predict future data points. Training on a random split is unlikely to work.
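The three-way split above can be sketched like this, using a hypothetical list of `(capture_date, label)` pairs in place of the real images:

```python
from datetime import date

# Hypothetical (capture_date, label) pairs standing in for the images.
samples = [
    (date(2015, 6, 1), 20.0), (date(2016, 6, 1), 25.0),
    (date(2017, 6, 1), 30.0), (date(2018, 6, 1), 28.0),
]

train = [s for s in samples if s[0].year <= 2016]  # fit the model here
val = [s for s in samples if s[0].year == 2017]    # tune / early-stop here
test = [s for s in samples if s[0].year == 2018]   # final held-out evaluation

print(len(train), len(val), len(test))  # 2 1 1
```

The key property is that validation data is strictly later than all training data, so the validation score measures the same kind of extrapolation the 2018 test set demands.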
Take a look at the blog post by Rachel where she discusses this issue (see the time series section).