Predictions Dependent on Two Variables

Hey folks, I’ve just finished lesson 3 (it’s starting to get hard here) but wanted to try out tabular data modelling (which is my main goal at the end of the course).

What I don’t understand currently is how I can predict something if I don’t provide all the features. For example, let’s say I want to predict which basketball team might win given some historical data.

I might have something like

home_team    away_team    result    featureA…featureN
teamA        teamB        1         some cont feature
teamB        teamC        0
teamA        teamD        1

How would I tell the model that I want to see the matchup between teamA and teamC, for example?

With all the examples so far (i.e. the bird one), you simply feed in a picture of a bird (which makes total sense).

With the tabular example, you’d have to feed in all the different features related to the salary prediction (age, sex etc.).

In real-life models, I’d expect that this is something that is relatively easy to do, but I can’t for the life of me understand how to structure the code. It makes sense that I might be able to see the probability of someone winning if I can feed in all the features, but that doesn’t seem realistic?


You need to pass all of your inputs to your forward method. Below is a toy example that prepares the data by:

  • converting low-cardinality categorical data (male/female) to one-hot encodings (dummies). Note that each value in the one-hot encoding adds a new input column to the first layer of your NN.
  • converting high-cardinality categorical data to integer codes (0 through N-1, where N is the number of unique values in your column); see the short sketch after this list.
  • normalizing continuous values (for example, into a range of 0.0-1.0, or to a standard normal distribution with mean 0 and standard deviation 1).
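
For the codes step, pandas can do the mapping for you. Here’s a minimal sketch with hypothetical city names (pd.Categorical assigns codes in sorted category order):

import pandas as pd

# Hypothetical raw column: convert city names to integer codes that an
# embedding layer can index directly (codes run from 0 to N-1)
cities = pd.Series(['boston', 'tokyo', 'paris', 'tokyo'], dtype='category')
print(cities.cat.codes.tolist())   # [0, 2, 1, 2] -> boston=0, paris=1, tokyo=2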

In forward, use an Embedding layer that maps your high-cardinality categorical data to a lower-dimensional space. (Learn more about embedding layers in lessons 7 and 8 on collaborative filtering.)
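
To illustrate what the embedding does on its own, here’s a quick standalone snippet (the sizes are hypothetical and simply match the example below):

import torch
import torch.nn as nn

# A learnable lookup table: each of 100 city codes maps to a 5-dim vector
emb = nn.Embedding(num_embeddings=100, embedding_dim=5)
city_codes = torch.tensor([3, 42, 99])   # three samples' city codes
print(emb(city_codes).shape)             # torch.Size([3, 5])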

Note that the Embedding layer for high-cardinality data and the normalization steps are not necessary if you use Random Forests/decision trees (learn about these in lesson 6) instead, since trees split on value thresholds rather than doing arithmetic on the inputs. Also note from that lesson that, as of 2022, tree-based models are state of the art for tabular data.
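
For comparison, here’s a minimal scikit-learn sketch of that route on hypothetical raw, unscaled data (RandomForestClassifier consumes integer category codes and raw continuous values directly):

from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Hypothetical raw data: no scaling, no embeddings needed for trees
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 70, 500),
    'income': rng.normal(50000, 15000, 500),
    'city': rng.integers(0, 100, 500),   # integer category codes, used as-is
})
y = rng.integers(0, 2, 500)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict_proba(X.iloc[:3]))      # class probabilities for 3 rows

Now, here is the full neural-network version: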

import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Seed for reproducibility
np.random.seed(0)
torch.manual_seed(0)

# Generate a toy dataset
num_samples = 1000

# Continuous features
age = np.random.randint(18, 70, size=num_samples)            # Ages 18-69 (randint's upper bound is exclusive)
income = np.random.normal(50000, 15000, size=num_samples)    # Income with mean 50k and std 15k

# Low-cardinality categorical feature
gender = np.random.choice(['male', 'female'], size=num_samples)

# High-cardinality categorical feature
num_cities = 100    # Number of unique city values
city = np.random.randint(0, num_cities, size=num_samples)    # City IDs from 0 to 99

# Binary target variable
target = np.random.randint(0, 2, size=num_samples)           # Binary classification (0 or 1)

# Create a DataFrame
df = pd.DataFrame({
    'age': age,
    'income': income,
    'gender': gender,
    'city': city,
    'target': target
})

# Preprocess continuous features
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])

# One-hot encode low-cardinality categorical feature 'gender'
df = pd.get_dummies(df, columns=['gender'])

# Separate features and target
continuous_features = ['age', 'income']
one_hot_features = ['gender_female', 'gender_male']
categorical_feature = 'city'
target_column = 'target'

# Convert DataFrame columns to tensors
X_continuous = torch.tensor(df[continuous_features].values, dtype=torch.float32)
X_one_hot = torch.tensor(df[one_hot_features].values, dtype=torch.float32)
X_categorical = torch.tensor(df[categorical_feature].values, dtype=torch.long)  # For embeddings
y = torch.tensor(df[target_column].values, dtype=torch.float32).unsqueeze(1)     # Binary target

# Define the neural network model
class SimpleTabularModel(nn.Module):
    def __init__(self, num_continuous, num_one_hot, num_embeddings, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.fc1 = nn.Linear(num_continuous + num_one_hot + embedding_dim, 32)
        self.fc2 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x_continuous, x_one_hot, x_categorical):
        # Get embeddings for high-cardinality categorical feature
        x_embedding = self.embedding(x_categorical)
        # Concatenate continuous features, one-hot features, and embeddings
        x = torch.cat([x_continuous, x_one_hot, x_embedding], dim=1)
        # Pass through the network
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

# Instantiate the model
num_continuous = X_continuous.shape[1]     # Number of continuous features
num_one_hot = X_one_hot.shape[1]           # Number of one-hot encoded features
embedding_dim = 5                          # Dimension of embeddings for 'city'
model = SimpleTabularModel(num_continuous, num_one_hot, num_cities, embedding_dim)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Simple training loop
num_epochs = 5
for epoch in range(num_epochs):
    # Zero gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(X_continuous, X_one_hot, X_categorical)
    # Compute loss
    loss = criterion(outputs, y)
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    # Print loss
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
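
And to answer the original question about scoring one specific matchup: at prediction time you build a single input row, apply exactly the same preprocessing (the fitted scaler, the same one-hot columns, the same category codes), and pass it through the model. A minimal sketch with hypothetical values:

# Hypothetical new sample: reuse the fitted scaler so the row is
# preprocessed exactly like the training data
new_row = pd.DataFrame([[35, 62000]], columns=['age', 'income'])
new_cont = torch.tensor(scaler.transform(new_row), dtype=torch.float32)
new_one_hot = torch.tensor([[0.0, 1.0]])           # gender_female=0, gender_male=1
new_city = torch.tensor([42], dtype=torch.long)    # this sample's city code

model.eval()
with torch.no_grad():
    prob = model(new_cont, new_one_hot, new_city)
print(f'Predicted probability of class 1: {prob.item():.3f}')

In your basketball case, home_team and away_team would each be a coded column with its own embedding, so “teamA vs teamC” is just a single row containing those two team codes plus whatever other features you have for that matchup.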

If you are using a Dataset and DataLoader instead of calling the forward pass directly with model(X_continuous, X_one_hot, X_categorical), then you will want to implement __getitem__ in your custom Dataset subclass so that it returns a tuple of (x_continuous, x_one_hot, x_categorical) plus the target, matching what your forward implementation expects.
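
A minimal sketch of that, reusing the tensors from the example above (the batch size is an arbitrary choice):

from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    def __init__(self, x_continuous, x_one_hot, x_categorical, y):
        self.x_continuous, self.x_one_hot = x_continuous, x_one_hot
        self.x_categorical, self.y = x_categorical, y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Return inputs in the same order forward() expects them
        return (self.x_continuous[idx], self.x_one_hot[idx],
                self.x_categorical[idx], self.y[idx])

loader = DataLoader(TabularDataset(X_continuous, X_one_hot, X_categorical, y),
                    batch_size=64, shuffle=True)
for x_cont, x_hot, x_cat, y_batch in loader:
    loss = criterion(model(x_cont, x_hot, x_cat), y_batch)
    # ...backward pass and optimizer step as above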

Extra: Time Series Analysis

Also, just because you mentioned “historical data”, you might want to check out the free book Forecasting: Principles and Practice, which goes over some fundamental techniques in time series forecasting. The book uses R, but you can find Python implementations of some of these methods in statsmodels Time Series Analysis.
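
For instance, here’s a minimal statsmodels sketch that fits an ARIMA model to a hypothetical series (the order (1, 1, 1) is an arbitrary choice for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical random-walk-like series, just to have something to fit
series = pd.Series(np.cumsum(np.random.default_rng(0).normal(size=200)))
result = ARIMA(series, order=(1, 1, 1)).fit()
print(result.forecast(steps=5))   # forecast the next 5 points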
