Help with Tabular project

Mark_F · December 3, 2020, 10:17pm

Hi. I am looking for help or advice for a tabular project that I had hoped to do.

Just some quick background. I am a physician who has been studying data science, neural nets and fastai in my spare time for a couple of years, so I have no formal training. I am hoping to build a model with a tabular learner that predicts the number of inpatient MRI requests per day. I will have the relevant data for about four years, so roughly 1500 data points (one per day). My concerns are mostly about this relatively small dataset.

First, I’m concerned about whether a 70/15/15 split will give me enough data for training, validation and test.
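As a rough sketch of what that split works out to for about 1500 daily rows, assuming the rows are sorted by date and cut chronologically rather than at random. The helper name `chronological_split` is made up for illustration and is not a fastai function.

```python
def chronological_split(n_rows, train_frac=0.70, valid_frac=0.15):
    """Return (train, valid, test) index ranges over rows sorted by date.

    The test set takes whatever is left after the train and validation rows.
    """
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = range(0, n_train)                   # earliest days
    valid = range(n_train, n_train + n_valid)   # next block of days
    test = range(n_train + n_valid, n_rows)     # most recent days
    return train, valid, test


train, valid, test = chronological_split(1500)
print(len(train), len(valid), len(test))  # 1050 225 225
```

So validation and test would each cover roughly seven and a half months of days.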

One alternative is to use K-fold cross-validation, possibly with a hold-out test set. But the fastai course suggests that for this sort of time series prediction, it is important not to select randomly from the data when building training and validation sets. We should use temporally contiguous training data, followed by temporally contiguous validation data whose dates fall after the training set dates. The notion, as I understand it, is that we want our algorithm to predict future data from past data, so we need to train it that way. Does this make K-fold cross-validation impossible?
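To make the "validation strictly after training" constraint concrete, here is a minimal sketch of one way to get several folds while still respecting it. This approach is usually called forward-chaining or expanding-window validation. It is not something taken from the fastai course; the function name and the fold sizes are assumptions chosen for illustration, and rows are assumed to be sorted by date.

```python
def forward_chaining_folds(n_rows, n_folds, valid_size):
    """Yield (train_idx, valid_idx) pairs where validation always follows training.

    The last n_folds * valid_size rows are cut into consecutive validation
    blocks. Each fold trains on every row that precedes its validation block.
    """
    first_valid_start = n_rows - n_folds * valid_size
    if first_valid_start <= 0:
        raise ValueError("not enough rows for the requested folds")
    for k in range(n_folds):
        valid_start = first_valid_start + k * valid_size
        train_idx = range(0, valid_start)
        valid_idx = range(valid_start, valid_start + valid_size)
        yield train_idx, valid_idx


folds = list(forward_chaining_folds(n_rows=1500, n_folds=4, valid_size=150))
for train_idx, valid_idx in folds:
    print(f"train 0..{train_idx[-1]}  valid {valid_idx[0]}..{valid_idx[-1]}")
```

The trade-off is that the earliest folds train on less data than the later ones, so the fold scores are not directly comparable in the way ordinary K-fold scores are.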

The other complication is that COVID happened, and its effect falls almost entirely in the later dates. I can include COVID case numbers as a feature, but the value will be zero for most dates, and for almost all of the training data if I use earlier dates for training and later dates for the validation/test sets.

The only workarounds I can think of are to shuffle the data, or at least to allow validation sets that precede the training sets. But as I understand it, if I use shuffled data I will not be training the neural net to make future predictions, but rather to fit patterns within time periods.

Perhaps the problem is not as bad if I use contiguous dates but relax the requirement that validation data follows training data. Then I could use K-fold cross-validation, and COVID data would be in both validation and training sets. For example, Jan 1-31, 2015 could be the validation set, and Feb 1, 2015 - Sep 30, 2020 could be the training set (just an example).
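A minimal sketch of that relaxed scheme, assuming rows sorted by date and never shuffled: each validation fold is one contiguous block of days, and the training set is everything outside it, whether earlier or later. The function name `blocked_kfold` and the choice of five folds are illustrative assumptions.

```python
def blocked_kfold(n_rows, n_folds):
    """Yield (train_idx, valid_idx) with each validation fold a contiguous block.

    Training rows are all rows outside the validation block, so they may
    fall both before and after it in time.
    """
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)  # spread any remainder over early folds
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, valid_idx
        start = stop


folds = list(blocked_kfold(n_rows=1500, n_folds=5))
for train_idx, valid_idx in folds:
    print(f"valid {valid_idx[0]}..{valid_idx[-1]}  train rows: {len(train_idx)}")
```

With this layout, only the last fold validates purely on the future. Every other fold has training rows on both sides of its validation block, which is exactly the requirement being relaxed.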

Does anyone have any thoughts? Thanks in advance.