Handling Streaming data

akshayb7 · October 22, 2019, 4:31am

Hi everyone,

I’m in a pickle here. I receive streaming data and have a PyTorch model that should be trained continuously on that stream (in batches of 16 multi-feature data points). I have been trying to come up with a class/code structure that would facilitate this, but so far without success. My model is initialized through a class. I tried something like this:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train(model, data, batch_size):

    # Get data in appropriate format (the inputs are also used as the targets)
    array = np.array(data)
    train_data = TensorDataset(torch.from_numpy(array), torch.from_numpy(array))
    train_loader = DataLoader(train_data, shuffle=False, batch_size=batch_size)

    model = model.double()

    # Model hyperparameters (note: a fresh optimizer is created on every call)
    criterion = nn.MSELoss()
    optimizer = optim.RMSprop(model.parameters())

    # Train the model
    model.train()
    for data, label in train_loader:
        data = data.double()
        label = label.double()

        # Clear gradients of all optimized variables
        optimizer.zero_grad()

        # Convert data to appropriate format of: (batch_size, seq_len, input_dimensions)
        # data.size(0) also handles a final batch smaller than batch_size
        data = data.view(data.size(0), 1, data.size(1))

        # Forward pass
        output = model(data)

        # Calculate batch loss
        loss = criterion(output.squeeze(), label)

        # Backward pass
        loss.backward()

        # Perform a single optimization step
        optimizer.step()

    # Return only after every batch has been processed (loss is the last batch's loss)
    return model, loss

I then call the function on each collected batch individually, for example:

model, loss = train(model, data[:16], 16)

This approach seems clumsy to me, and there must be a better one. It also does not give me the right losses (I want the loss for each batch as output): the losses look as if the model were re-trained from scratch at every call. I thought that passing the model back would ensure that its weights are kept after each update, but I seem to be wrong.
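One structure that might fit is sketched below, under stated assumptions: the model, loss function and optimizer live in a single object for as long as the stream runs, and one method is called per incoming batch and returns that batch's loss. Creating the optimizer once keeps RMSprop's running averages between calls; recreating it on every call resets them, which may be one reason the losses look off, though that is a guess rather than a confirmed diagnosis. The names `StreamTrainer` and `step`, the learning rate, the `nn.Linear` stand-in model and the simulated stream are all hypothetical and not taken from the code above.

```python
import numpy as np
import torch
from torch import nn, optim


class StreamTrainer:
    """Hypothetical helper that owns the model and optimizer across batches."""

    def __init__(self, model, lr=1e-3):
        self.model = model.double()
        self.criterion = nn.MSELoss()
        # Created once, so the optimizer state persists between calls to step().
        self.optimizer = optim.RMSprop(self.model.parameters(), lr=lr)

    def step(self, batch):
        """Run one update on a single batch and return its loss as a float."""
        x = torch.as_tensor(np.asarray(batch), dtype=torch.float64)
        # Shape the input as (batch_size, seq_len, input_dimensions).
        inputs = x.view(x.size(0), 1, x.size(1))

        self.model.train()
        self.optimizer.zero_grad()
        output = self.model(inputs)
        # Assumption: the target equals the input, as in the original train().
        loss = self.criterion(output.reshape(x.shape), x)
        loss.backward()
        self.optimizer.step()
        return loss.item()


# Usage with a stand-in model and a simulated stream of 16 x 4 batches.
torch.manual_seed(0)
model = nn.Linear(4, 4)
trainer = StreamTrainer(model)
stream = (np.random.rand(16, 4) for _ in range(5))
losses = [trainer.step(batch) for batch in stream]
print(losses)
```

Each call to `step` returns a plain Python float for that batch, so the losses can be logged or collected as the stream arrives without holding on to the autograd graph.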

Any help would be appreciated.

P.S.: I am not a software engineer but more of a domain-expert data scientist, so please forgive any naive mistakes.