Column with the text description: embeddings, tokenization, neither or both?


Let’s assume we have a dataset with columns like:
['paint_id', 'paint_width', 'paint_height', 'author', 'style', (…) 'text_description', 'some_random_column_to_predict']

Our goal is to predict the last column, which is a float.

The df has cat_vars && contin_vars, so we could go with ColumnarModelData.from_data_frame().

To make good use of embeddings, should we run CountVectorizer() + tf-idf on the ‘text_description’ column first, leave the text as it is and rely fully on embeddings with a larger size (200-500), or do something else? What would be the best approach here?

How are you guys doing it?

Ah, interesting, so you’re wondering if you could get away with treating the text_description column as categorical? I had honestly never thought of that :slight_smile: I have to imagine that wouldn’t work very well: if you feed the network an example whose text_description differs even slightly from anything it has seen before, it would have to treat it as an instance of an “unknown” text_description category, no?
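To make that concern concrete, here is a minimal sketch in plain Python. The descriptions are made up, and reserving index 0 for "unknown" is just an assumption for illustration, not how any particular library does it:

```python
# Hypothetical toy descriptions, invented for illustration.
train_descriptions = [
    "oil portrait of a woman",
    "landscape with river at dusk",
    "oil portrait of a woman",
]

# Treat each whole description as one category; index 0 is reserved for "unknown".
UNKNOWN = 0
category_index = {}
for text in train_descriptions:
    category_index.setdefault(text, len(category_index) + 1)


def encode(text):
    """Return the category id of a description, or UNKNOWN if it was never seen."""
    return category_index.get(text, UNKNOWN)


seen_id = encode("oil portrait of a woman")
unseen_id = encode("oil portrait of a young woman")  # differs by a single word
print(seen_id, unseen_id)
```

The second description shares almost every word with the first, but as a category it carries no information at all, which is why free text usually needs a representation built from its words rather than from the whole string.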

I’m curious too though about the best way to handle text columns. I would guess that, like you were suggesting, you’ll want to preprocess them to a fixed size vector, e.g. a bag of words. It would be neat to try something fancier like feeding the text column to an RNN.
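A rough sketch of the fixed-size-vector idea, assuming scikit-learn and pandas are available. The toy rows and the `max_features` cap are assumptions for illustration; `TfidfVectorizer` does the job of `CountVectorizer()` followed by tf-idf weighting in one step:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy rows; only the column names come from the question above.
df = pd.DataFrame({
    "paint_width": [30.0, 50.0, 40.0],
    "paint_height": [40.0, 70.0, 40.0],
    "text_description": [
        "oil portrait of a woman",
        "landscape with river at dusk",
        "abstract oil study in blue",
    ],
})

# Every description becomes a row of the same width, whatever its length.
vectorizer = TfidfVectorizer(max_features=1000)  # cap is an arbitrary choice
text_features = vectorizer.fit_transform(df["text_description"]).toarray()

# Stack the tf-idf columns next to the continuous columns.
numeric_features = df[["paint_width", "paint_height"]].to_numpy()
features = np.hstack([numeric_features, text_features])
print(features.shape)
```

The resulting columns behave like extra continuous variables, so they would sit alongside the other continuous inputs rather than go through an embedding layer.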


Yeah feeding into an RNN is probably best if it varies enough that categorical embedding won’t work


Has anyone tried working with this sort of data? This is really interesting because a lot of Kaggle competitions pertaining to products have descriptions in them. I’m working with EHRs, and along with categorical data, these medical records have clinical notes for each patient’s visit. I’m trying to come up with a way of including that information when building a patient representation.

@jeremy, would it be possible for you to elaborate on your suggestion? When you say feed that into an RNN, are you talking about the entire data including the structured part, or just the textual part, which is then somehow integrated back into the entity embeddings to feed a new neural network?

Just the textual part. Then fuse it with the tabular data in fully connected layers.
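For anyone wanting a picture of that, here is a minimal PyTorch sketch. The class name, layer sizes, and the choice of a GRU are all assumptions for illustration, not the fastai implementation; it also ignores padding, which a real version would handle (for example with packed sequences):

```python
import torch
import torch.nn as nn


class TextTabularRegressor(nn.Module):
    """Hypothetical model: a GRU reads the description, then its output is
    concatenated with tabular features and passed through dense layers."""

    def __init__(self, vocab_size, n_categories, n_continuous,
                 token_dim=32, cat_dim=8, hidden_dim=64):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, token_dim)
        self.rnn = nn.GRU(token_dim, hidden_dim, batch_first=True)
        self.cat_embedding = nn.Embedding(n_categories, cat_dim)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + cat_dim + n_continuous, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, tokens, category, continuous):
        # last_hidden has shape (num_layers, batch, hidden_dim)
        _, last_hidden = self.rnn(self.token_embedding(tokens))
        text_vec = last_hidden[-1]
        # Fuse the text vector with the categorical embedding and continuous columns.
        fused = torch.cat(
            [text_vec, self.cat_embedding(category), continuous], dim=1
        )
        return self.head(fused).squeeze(1)


model = TextTabularRegressor(vocab_size=100, n_categories=5, n_continuous=2)
tokens = torch.randint(0, 100, (4, 12))   # 4 descriptions, 12 token ids each
category = torch.randint(0, 5, (4,))      # e.g. an encoded 'style' column
continuous = torch.randn(4, 2)            # e.g. scaled width and height
prediction = model(tokens, category, continuous)
print(prediction.shape)
```

The inputs here are random stand-ins; in practice the token ids would come from tokenizing text_description and the other tensors from the encoded tabular columns.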

So we would train 2 neural networks? One separately with just the texts (descriptions/notes etc). Once we have trained that, we would rip out the weights (embeddings?) and just concat that with the input embeddings of the categorical variables from the structured data to train a new neural network for whatever we are predicting.

Did I understand that correctly or am I just making no sense?

Train it all end to end! :slight_smile:
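In other words, one network and one loss, rather than two separately trained networks. A tiny PyTorch sketch of a single training step, with made-up sizes and random data standing in for a real batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical, deliberately tiny pieces of one model.
token_embedding = nn.Embedding(50, 16)
text_rnn = nn.GRU(16, 32, batch_first=True)
style_embedding = nn.Embedding(4, 4)   # e.g. the 'style' column
head = nn.Linear(32 + 4 + 2, 1)        # text + style + 2 continuous columns

params = (
    list(token_embedding.parameters()) + list(text_rnn.parameters())
    + list(style_embedding.parameters()) + list(head.parameters())
)
optimizer = torch.optim.Adam(params, lr=1e-2)

# Random stand-in batch of 8 rows.
tokens = torch.randint(0, 50, (8, 10))
style = torch.randint(0, 4, (8,))
continuous = torch.randn(8, 2)
target = torch.randn(8)


def forward():
    _, hidden = text_rnn(token_embedding(tokens))
    fused = torch.cat([hidden[-1], style_embedding(style), continuous], dim=1)
    return head(fused).squeeze(1)


optimizer.zero_grad()
loss = nn.MSELoss()(forward(), target)
loss.backward()

# The same loss produces gradients for the text encoder and the tabular embedding.
text_grad_norm = text_rnn.weight_ih_l0.grad.norm().item()
style_grad_norm = style_embedding.weight.grad.norm().item()
optimizer.step()
print(text_grad_norm > 0, style_grad_norm > 0)
```

Because the gradients reach every part in the same backward pass, there is no need to train the text model first and copy its weights over.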

Unfortunately, I’m not smart enough to understand what that means or how to do that :frowning: Let me start out a new thread with a specific Kaggle project and share my progress there with this kind of data. I’m hoping to gain enough insight from the smart people here to solve this problem.


This probably isn’t something you want to tackle until you’ve completed both parts of the course. It’s not about being smart, but about having the foundational knowledge and skills you’ll need to do this project.

Thank you for your reply. I look forward to the release of the 2nd part of the course. Until then I will be honing the skills that I learned in the first part.

It’s reassuring that it’s something that is indeed doable!