Explanation of "index out of range" Errors

bsalita · July 28, 2024, 8:35pm

Getting pesky “index out of range” errors? The main issue is that the error message isn’t descriptive. The usual cause is a failure to load the data properly, but that wasn’t my problem. I spent days trying to nail down where the error occurs and why. Here’s what’s happening, as demonstrated in the program below.

Let me explain in terms of a tabular data classification problem. The y variable is categorical, perhaps a string dtype. The data is split into training and validation sets. If a class of the category is in the validation set but not in the training set, fit() throws an “index out of range” error. The training set doesn’t need to contain every class of the category. It only needs to contain every class that appears in the validation set. Of course, a training set that doesn’t cover all the classes in the test set is a separate issue.

In the example below, contracts_all is assigned all possible classes of the category “Contract”. The training set is randomly selected and will normally contain almost all of the classes. Occasionally, however, some classes are not chosen. If any one of those classes is in the validation set, the error occurs.

Possible ways to avoid the issue are to drop any validation rows whose class is not in the training set, or to make sure that all classes are represented in the training data.

import pandas as pd
import random
from fastai.tabular.all import *

contracts_all = ['PASS']+[str(level+1)+strain+dbl+direction for level in range(7) for strain in 'CDHSN' for dbl in ['','X','XX'] for direction in 'NESW']
directions_all = ['N','E','S','W']
mp_total_n_all = range(0,10000)

print(len(contracts_all))
k = 3000 # needs to be at least 4000 to avoid IndexError: index out of range in self
#np.random.seed(42) doesn't work
#random.seed(42) doesn't work
d = {
    'Contract':random.choices(contracts_all,k=k),
    'Dealer':random.choices(directions_all,k=k),
    'mp_total_n':random.choices(mp_total_n_all,k=k)
}
df = pd.DataFrame(d)
df.info(verbose=True)
print(df)

splits = RandomSplitter()(range_of(df))
# Classes of contracts_all that never appear in each split (plain set difference).
missing_values_in_training_data = set(contracts_all).difference(df['Contract'][splits[0]])
missing_values_in_validate_data = set(contracts_all).difference(df['Contract'][splits[1]])
# Missing from training but not missing from validation, i.e. present only in validation.
values_missing_in_training_but_are_in_validation_data = missing_values_in_training_data.difference(missing_values_in_validate_data)
print('missing values in training:',missing_values_in_training_data)
print('missing values in validation:',missing_values_in_validate_data)
print('values missing in training but are in validation data:',values_missing_in_training_but_are_in_validation_data)

y_names = ['Contract']
cat_names = ['Dealer']
cont_names = ['mp_total_n']
to = TabularPandas(
    df[y_names+cat_names+cont_names],
    procs=[Categorify, FillMissing, Normalize],
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=y_names,
    splits=splits, # warn("Your generator is empty.") if I don't use range_of(df)
    y_block=CategoryBlock()
)

dls = to.dataloaders(100,device='cpu')

print(dls.classes)

learn = tabular_learner(dls)

dls.show_batch()

learn.fit_one_cycle(3)    # IndexError: index out of range occurs when a class is in the validation data but missing from the training data
assert values_missing_in_training_but_are_in_validation_data == set(), values_missing_in_training_but_are_in_validation_data # should never fire, because the error is raised in learn.fit_one_cycle first
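
The first workaround mentioned above, dropping validation rows whose class never appears in the training rows, could look something like the sketch below. It uses only pandas. The helper name drop_unseen_classes and the tiny demo DataFrame are made up for illustration and are not part of fastai. It assumes the split indices are positional row numbers, as in the program above.

```python
import pandas as pd

def drop_unseen_classes(df, train_idx, valid_idx, y_name):
    """Return valid_idx without the rows whose y_name class is absent from the training rows.

    Hypothetical helper for illustration; indices are treated as positional (iloc).
    """
    train_classes = set(df[y_name].iloc[list(train_idx)])
    keep = df[y_name].iloc[list(valid_idx)].isin(train_classes)
    return [i for i, k in zip(valid_idx, keep) if k]

# Tiny made-up example: class 'c' appears only in the validation rows.
demo = pd.DataFrame({'Contract': ['a', 'b', 'a', 'c', 'b']})
train_idx, valid_idx = [0, 1, 2], [3, 4]
safe_valid_idx = drop_unseen_classes(demo, train_idx, valid_idx, 'Contract')
print(safe_valid_idx)  # row 3 (class 'c') is dropped
```

In the program above, the filtered list would take the place of splits[1] before TabularPandas is built, for example splits = (splits[0], drop_unseen_classes(df, splits[0], splits[1], 'Contract')). That wiring is an untested assumption, not a confirmed fastai recipe. The trade-off is that the dropped rows are no longer evaluated at all, so the other approach (making sure every class is represented in the training data) may be preferable when those classes matter.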