You need to pass all of your inputs to your `forward` method. Below is a toy example that prepares the data by:
- converting low-cardinality categorical data (e.g. male/female) to a one-hot encoding (dummies). Note that each unique value in the one-hot encoding becomes a new input column for the first layer of your NN.
- converting high-cardinality categorical data to integer codes (0 through N-1, where N is the number of unique values in your column); see the pandas sketch after this list.
- normalizing continuous values (for example, into a range of 0.0-1.0, or to a standard normal distribution with mean 0 and standard deviation 1).
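If your high-cardinality column holds strings, a minimal pandas sketch for the integer-coding step (the `city` column here is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['london', 'paris', 'london', 'tokyo']})
# Convert to the 'category' dtype, then take the integer codes (0 through N-1)
df['city_code'] = df['city'].astype('category').cat.codes
print(df)
#      city  city_code
# 0  london          0
# 1   paris          1
# 2  london          0
# 3   tokyo          2
```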
In `forward`, use an `Embedding` layer that maps your high-cardinality categorical data to a lower-dimensional space. (Learn more about embedding layers in lessons 7 and 8 on collaborative filtering.)
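As for choosing the embedding dimension, there is no single right answer; a common rule of thumb (the one fastai uses, as far as I know) grows slowly with the column's cardinality:

```python
def emb_dim(n_categories: int) -> int:
    # Rule of thumb: grow slowly with cardinality, cap at 600
    return min(600, round(1.6 * n_categories ** 0.56))

print(emb_dim(2))    # 2  -- tiny columns get tiny embeddings
print(emb_dim(100))  # 21 -- e.g. the 100-city feature below (the toy example just uses 5)
```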
Note that the `Embedding` layer for high-cardinality data and the normalization steps are not necessary if you use Random Forests/decision trees instead (learn about these in lesson 6). Also note from that lesson that, as of 2022, tree-based models are state of the art for tabular data.
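For comparison, here is a minimal (hypothetical) tree-based sketch with scikit-learn; it consumes raw values and integer codes directly, with no scaling and no embeddings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(18, 70, n),      # age, unscaled
    rng.normal(50000, 15000, n),  # income, unscaled
    rng.integers(0, 100, n),      # city as integer codes, no embedding needed
])
y = rng.integers(0, 2, n)         # binary target

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(rf.fit(X_train, y_train).score(X_valid, y_valid))
```

And here is the full neural-network toy example promised above: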
```python
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Seed for reproducibility
np.random.seed(0)
torch.manual_seed(0)
# Generate a toy dataset
num_samples = 1000
# Continuous features
age = np.random.randint(18, 70, size=num_samples)  # Ages 18 to 69 (upper bound exclusive)
income = np.random.normal(50000, 15000, size=num_samples) # Income with mean 50k and std 15k
# Low-cardinality categorical feature
gender = np.random.choice(['male', 'female'], size=num_samples)
# High-cardinality categorical feature
num_cities = 100 # High cardinality categorical feature
city = np.random.randint(0, num_cities, size=num_samples) # City IDs from 0 to 99
# Binary target variable
target = np.random.randint(0, 2, size=num_samples) # Binary classification (0 or 1)
# Create a DataFrame
df = pd.DataFrame({
    'age': age,
    'income': income,
    'gender': gender,
    'city': city,
    'target': target
})
# Preprocess continuous features
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# One-hot encode low-cardinality categorical feature 'gender'
df = pd.get_dummies(df, columns=['gender'])
# Separate features and target
continuous_features = ['age', 'income']
one_hot_features = ['gender_female', 'gender_male']
categorical_feature = 'city'
target_column = 'target'
# Convert DataFrame columns to tensors
X_continuous = torch.tensor(df[continuous_features].values, dtype=torch.float32)
X_one_hot = torch.tensor(df[one_hot_features].values, dtype=torch.float32)
X_categorical = torch.tensor(df[categorical_feature].values, dtype=torch.long) # For embeddings
y = torch.tensor(df[target_column].values, dtype=torch.float32).unsqueeze(1) # Binary target
# Define the neural network model
class SimpleTabularModel(nn.Module):
    def __init__(self, num_continuous, num_one_hot, num_embeddings, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.fc1 = nn.Linear(num_continuous + num_one_hot + embedding_dim, 32)
        self.fc2 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_continuous, x_one_hot, x_categorical):
        # Get embeddings for the high-cardinality categorical feature
        x_embedding = self.embedding(x_categorical)
        # Concatenate continuous features, one-hot features, and embeddings
        x = torch.cat([x_continuous, x_one_hot, x_embedding], dim=1)
        # Pass through the network
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x
# Instantiate the model
num_continuous = X_continuous.shape[1] # Number of continuous features
num_one_hot = X_one_hot.shape[1] # Number of one-hot encoded features
embedding_dim = 5 # Dimension of embeddings for 'city'
model = SimpleTabularModel(num_continuous, num_one_hot, num_cities, embedding_dim)
# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Simple training loop
num_epochs = 5
for epoch in range(num_epochs):
    # Zero gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(X_continuous, X_one_hot, X_categorical)
    # Compute loss
    loss = criterion(outputs, y)
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    # Print loss
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
```
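After training, you would typically switch the model to eval mode and disable gradient tracking for predictions; a minimal sketch:

```python
model.eval()
with torch.no_grad():
    probs = model(X_continuous, X_one_hot, X_categorical)  # probabilities in [0, 1]
    preds = (probs > 0.5).float()                          # hard 0/1 class predictions
```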
If you are using a `Dataset` and `DataLoader` instead of calling the forward pass directly with `model(X_continuous, X_one_hot, X_categorical)`, then you will want to implement `__getitem__` in your custom `Dataset` subclass so that it returns values as a tuple of `(x_continuous, x_one_hot, x_categorical)`, matching what your `forward` implementation expects.
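A minimal sketch of such a subclass (the name `TabularDataset` is just illustrative), reusing the tensors from the toy example above:

```python
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    def __init__(self, x_continuous, x_one_hot, x_categorical, y):
        self.x_continuous = x_continuous
        self.x_one_hot = x_one_hot
        self.x_categorical = x_categorical
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Return inputs in the order forward() expects, plus the target
        return self.x_continuous[idx], self.x_one_hot[idx], self.x_categorical[idx], self.y[idx]

loader = DataLoader(TabularDataset(X_continuous, X_one_hot, X_categorical, y),
                    batch_size=64, shuffle=True)
for x_cont, x_oh, x_cat, y_batch in loader:
    outputs = model(x_cont, x_oh, x_cat)  # one batch, just to show the shapes line up
    break
```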
Extra: Time Series Analysis
Also, just because you mentioned the “historical nature” of your data, you might want to check out the free book Forecasting: Principles and Practice, which covers fundamental techniques in time series forecasting. The book uses R, but you can find Python implementations of some of these methods in statsmodels' Time Series Analysis module.
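For example, a minimal Holt-Winters (exponential smoothing) sketch with statsmodels; the synthetic monthly series is just for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with an upward trend and yearly seasonality
idx = pd.date_range('2018-01-01', periods=60, freq='MS')
values = np.linspace(100, 160, 60) + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
series = pd.Series(values, index=idx)

# Holt-Winters with additive trend and additive seasonality
fit = ExponentialSmoothing(series, trend='add', seasonal='add', seasonal_periods=12).fit()
print(fit.forecast(12))  # forecast the next 12 months
```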