Timeseries Sequence From Dataloader


I have a time series dataset similar to the example below, and I have spent a couple of days trying in vain to work out how to modify .dataloaders() (assume a batch size of 1) so that each batch is a tensor of shape (3, 6) holding the rows for a single ID (0, 1, and so on). Is this possible?

import pandas as pd
import numpy as np

# Generate time series data
n = 4
ts1 = np.random.randn(n)
ts2 = np.random.randn(n)
ts3 = np.random.randn(n)

# Create unique ID values for each time point
ids = range(n)

# Repeat each ID three times
repeated_ids = np.repeat(ids, 3)

# Generate random binary labels
labels = np.random.randint(2, size=n*3)
cat_col = np.random.randint(n, size=n*3)

# Combine the data into a dataframe
data = {'ID': repeated_ids, 'TimeSeries1': np.tile(ts1, 3), 'TimeSeries2': np.tile(ts2, 3), 
        'TimeSeries3': np.tile(ts3, 3), "c_col": cat_col, 'y': labels}
df = pd.DataFrame(data)

# Preview the dataframe
df.head()

Bumping to see if anyone can help

Hello, @DannyK! I don’t quite understand the task, but you could try creating your own dataset and dataloader. Something like this:

from fastai.data.all import *

class MyDataset:
    "One item per ID: all rows that share that ID."
    def __init__(self, df, name='train'):
        self.df = df
        self.name = name
    def __len__(self):
        # Number of distinct IDs, i.e. number of sequences
        return self.df['ID'].nunique()
    def __getitem__(self, j):
        # Assumes the IDs are the integers 0..len(self)-1
        XY = self.df[self.df['ID']==j]
        X = XY[['ID','TimeSeries1','TimeSeries2','TimeSeries3','c_col']]
        y = XY['y']
        return tensor(X), tensor(y)

train_ds = MyDataset(df, name='train')
dls = DataLoaders.from_dsets(train_ds, bs = 1)

Here I have assumed that you need a label column, so the output is not one tensor of shape (3, 6) but two tensors: one of shape (1, 3, 5) for the features and one of shape (1, 3) for the labels. The first dimension is the batch dimension. You can check the result like this:

len(dls.train_ds) # --> 4
x, y = next(iter(dls[0]))
x.shape, y.shape # --> (torch.Size([1, 3, 5]), torch.Size([1, 3]))
x, y
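
One thing to watch: __getitem__ above looks rows up with self.df['ID']==j, so it relies on the IDs being exactly 0, 1, ..., n-1. If that does not hold in the real data, a variant along these lines might help. It is only a sketch: the class name GroupedByID, its id_col and y_col parameters, and the toy frame demo are made up for illustration, and it returns NumPy arrays rather than tensors so that it runs with pandas and NumPy alone.

```python
import numpy as np
import pandas as pd

class GroupedByID:
    """Hypothetical map-style dataset: item j is every row of the j-th distinct ID."""
    def __init__(self, df, id_col='ID', y_col='y'):
        self.y_col = y_col
        # Split the frame once; sort=True keeps the IDs in ascending order.
        self.groups = [g for _, g in df.groupby(id_col, sort=True)]

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, j):
        g = self.groups[j]
        # Every column except the label becomes a feature.
        X = g.drop(columns=self.y_col).to_numpy(dtype=np.float32)
        y = g[self.y_col].to_numpy()
        return X, y

# Toy data with IDs that are NOT 0..n-1 (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'ID': np.repeat([10, 20, 30, 40], 3),
    'TimeSeries1': rng.standard_normal(12),
    'TimeSeries2': rng.standard_normal(12),
    'TimeSeries3': rng.standard_normal(12),
    'c_col': rng.integers(4, size=12),
    'y': rng.integers(2, size=12),
})

demo_ds = GroupedByID(demo)
X0, y0 = demo_ds[0]
print(len(demo_ds), X0.shape, y0.shape)
```

Wrapping the two returned arrays in tensor(...) inside __getitem__ should bring it back in line with the version above.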

@krasin Thank you so much! You nailed my question to a T.

The basic task is that I wanted to load my dataset one sequence at a time, with each item being as long as that sequence is in the data. I kept trying to control this through the dataloader instead of a Dataset.
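
For reference, when the sequences differ in length between IDs, a batch size of 1 works as is, but larger batches need the sequences padded to a common length before they can be stacked. A rough sketch of that idea, assuming zero padding is acceptable; pad_and_stack and the toy frame are hypothetical names, not fastai functions:

```python
import numpy as np
import pandas as pd

def pad_and_stack(seqs, pad_value=0.0):
    """Stack arrays of shape (len_i, n_features) into (batch, max_len, n_features)."""
    max_len = max(len(s) for s in seqs)
    n_features = seqs[0].shape[1]
    out = np.full((len(seqs), max_len, n_features), pad_value, dtype=np.float32)
    for i, s in enumerate(seqs):
        out[i, :len(s)] = s  # rows past len(s) keep pad_value
    return out

# Toy frame where the three IDs have 3, 2 and 1 rows respectively
toy = pd.DataFrame({
    'ID': [0, 0, 0, 1, 1, 2],
    'TimeSeries1': np.arange(6.0),
    'TimeSeries2': np.arange(6.0) * 10,
})

lengths = toy.groupby('ID').size()  # number of rows per ID
seqs = [g[['TimeSeries1', 'TimeSeries2']].to_numpy() for _, g in toy.groupby('ID')]
batch = pad_and_stack(seqs)
print(lengths.tolist(), batch.shape)
```

The lengths series is worth keeping next to the padded batch so that the padded rows can be masked out later.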

Actually, I do have one more question. I changed dls to use TabularDataLoaders, but now I’m getting an error: AssertionError: Match length mismatch.

dls = TabularDataLoaders.from_dsets(train_ds, cat_names=['c_col'], cont_names=['TimeSeries1', 'TimeSeries2'], y_names='y', bs=4)

Any idea how to fix this issue?