Entity Embeddings - Lesson 6 Course 2022

Hey folks,

I have been trying to use embeddings for the categorical features for the Blue Book of Bulldozers data as discussed in Lesson 6 of the course.
Here is the notebook:

The function that replaces the categorical columns with their embeddings (copied from the forum) is:

def add_embeds(learn, x):
    x = x.copy()
    for i, cat in enumerate(cat_nn):
        emb = learn.embeds[i]  # embedding layer for the i-th categorical column
        vec = tensor(x[cat], dtype=torch.int64) # this is on cpu
        emb_data = emb(vec)
        emb_names = [f'{cat}_{j}' for j in range(emb_data.shape[1])]

        # swap the categorical column for its embedding columns
        emb_df = pd.DataFrame(emb_data, index=x.index, columns=emb_names)
        x = x.drop(columns=cat)
        x = x.join(emb_df)
    return x

Then I call the function with the learner and the dataset:
xs_emb = add_embeds(learn, df_nn_final)

I keep getting an index-out-of-bounds error: IndexError: index out of range in self

The line emb_data = emb(vec) throws the error.
vec = tensor(x[cat], dtype=torch.int64) is a tensor with one entry per row of the dataset, torch.Size([412698]), while the first embedding has shape torch.Size([73, 18]).
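
For anyone trying to reproduce this outside the notebook, here is a minimal sketch of the same failure. It uses a plain torch.nn.Embedding as a stand-in for the first embedding in the learner (73 rows, 18 columns); the index values are made up for illustration and are not taken from the Bulldozers data.

```python
import torch
from torch import nn

# Stand-in for the first embedding in the post: 73 rows, 18 columns.
emb = nn.Embedding(num_embeddings=73, embedding_dim=18)

# Valid indices are 0..72, so this lookup works.
ok = emb(torch.tensor([0, 5, 72], dtype=torch.int64))
print(ok.shape)  # torch.Size([3, 18])

# An index of 73 or more has no row in the table, so the lookup fails on CPU.
try:
    emb(torch.tensor([73], dtype=torch.int64))
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```

So the length of vec is not the problem; what matters is whether every value in vec is a valid row index for that embedding.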

Can anyone take a look, please?

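I believe I found the error. The dataframe I was passing to the function was the raw pandas dataframe, which had not been processed for the categorical data (it wasn't 'categorified'), so the values in its categorical columns were not valid indices into the embeddings.

What I needed to do was pass the processed data instead: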
add_embeds(learn, to_nn.xs)
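
A quick check like the one below can catch this before calling add_embeds. It is only a sketch: out_of_range_columns is a hypothetical helper (not part of fastai or pandas), it assumes the categorical columns already hold numeric values, and the toy dataframe stands in for the real data. In the setup from this thread, sizes would presumably be [e.num_embeddings for e in learn.embeds] and cat_names would be cat_nn.

```python
import pandas as pd

def out_of_range_columns(df, cat_names, sizes):
    """Hypothetical helper: return {column: (min, max, size)} for every
    column whose values fall outside the valid index range 0..size-1."""
    bad = {}
    for name, size in zip(cat_names, sizes):
        lo, hi = int(df[name].min()), int(df[name].max())
        if lo < 0 or hi >= size:
            bad[name] = (lo, hi, size)
    return bad

# Toy data: 'a' holds valid codes for a 4-row embedding, 'b' does not.
toy = pd.DataFrame({"a": [0, 1, 3], "b": [2, 7, 1]})
report = out_of_range_columns(toy, ["a", "b"], sizes=[4, 4])
print(report)  # {'b': (1, 7, 4)}
```

An empty result means every column can be looked up safely; any entry points to the column and value range that would trigger the IndexError.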