Handling Streaming data

Hi everyone,

I’m in a pickle here. I have a problem where I get streaming data and a Pytorch model which is to be continuously trained on the said streaming data (in batches of 16 multi-feature data points). Now I have been trying to come up with a class/code structure which would facilitate this, but I have been unable to come up with a solution. My model is initialized through a class. I tried something like this:

def train(model, data, batch_size):

# Get data in appropriate format
train_data = TensorDataset(torch.from_numpy(np.array(data)), torch.from_numpy(np.array(data)))
train_loader = DataLoader(train_data, shuffle=False, batch_size=batch_size)

model = model.double()

# Model Hyperparameters
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters())

# Keep track of training loss
train_loss = 0.
# Train the model
for data, label in train_loader:
    data = data.double()
    label = label.double()
    # Clear gradients of all optimized variables
    # Convert data to appropriate format of : (batch_size, seq_len, input_dimensions)
    data = data.view(batch_size, 1, data.size(1))

    # Forward pass
    output = model(data)
    # Calcualte batch loss
    loss = criterion(output.squeeze(), label)
    # Backward pass
    # Perform a single optimization step
    return model, loss

And then calling the function on each collected batch individually, such as:

model, loss = train(model, data[:16], 16)

But this is a very stupid approach in my mind and there has got to be a better approach. Moreover, this does not give me the right losses anyway (I want to get losses per batch as an output) as the losses are calculated as if the model is re-tranied from scratch at every call (I thought that passing back the model would ensure that its weights after each update are maintained, but I seem to be wrong).

Any help would e appreciated.

P.S.: I am not a software engineer but more of a domain expert data scientist and so please forgive me if I made any naive mistake.