Column type classification for a Google Merchant feed

(Albert Benatov) #1


in brief, I’d like to classify each column of an unseen e-commerce feed as one of the following predefined column types: title, description, price, link, image_link, condition, brand, mpn, etc.

for anyone who has pushed feeds to Google Merchant, Facebook, etc., the above should look familiar.

the app’s UX should offer a mapping from the unseen feed’s columns to the column types above. that is, even if someone supplies a feed with a column named full_url, I should still be able to classify it as a link column

unseen feed example:

id | title_en | desc_en | full_price | full_url | img_url | prod_condition | prod_brand | prod_gtin
nite-as-324 | Nite running shoes Black | Suitable for every day running, soft and comfortable... | 299.50 EUR | | | NEW | Nite | 003355847788
rbt-9432-i77 | Ronibat swimming shorts Pink | Suitable for night swimming... | 99.75 CAD | | | NEW | Ronibat | 05336543254

my ‘training/existing’ data is tabular and has the following structure:
original_column_name: the column’s name in the source feed (e.g. full_url, list_price, desc_en_full)
column_data: the actual content (URLs, image URLs, title words, description words, etc.)
column_type: title, description, price, link, image_link, etc. (this is the label I want to predict)


original_column_name | column_data | language | column_type (label)
title_original | Addibar walking sandals Red | en | title
description_original_column_name | Some description on Addibar walking sandals Red... | en | description
condition_original_column_name | used as new | en | condition
price_b4_discount | 299.88 GBP | en | price

I have about 3000 rows with the above structure, of which fewer than 1000 are unique.

my questions are:

  • is this suitable for deep learning?
  • is it really a classification problem, like the sentiment analysis example in the fast.ai course?
  • could it be solved using the collaborative filtering sample from the course?
  • does it make sense to use a large underlying wiki language model for transfer learning? is there any hidden value in the text (the word sequence)? I have columns where a language model doesn’t make sense, like urls, conditions (maybe), prices, ids
  • shouldn’t I instead ‘flatten’ the long textual data (titles and descriptions), since I don’t care what’s written there, only what type it is: is it long or short, how many unique words, word cardinality, etc.
  • on the other hand, with the above I’ll lose the linguistic analysis that separates similar columns, like brand vs category, or mpn vs id, etc.
  • column content such as link or image_link can easily be detected with conventional methods rather than DL, regex for example
  • I am tempted by more conventional approaches, such as generating several features and applying naive Bayes or another ML algorithm
  • or even worse: using column name similarity (like Levenshtein distance) to determine the column type
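
To make the ‘flatten’ and regex ideas from the list above concrete, here is a minimal sketch that turns one column’s cells into shape statistics plus regex hit rates. The function name column_features, the three patterns, and the feature names are assumptions made for illustration; the patterns are deliberately loose and would need tuning on real feeds.

```python
import re
from statistics import mean

# Hypothetical, deliberately loose patterns; real feeds will need more cases.
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)
IMAGE_RE = re.compile(r"\.(?:jpe?g|png|gif|webp)(?:\?\S*)?$", re.IGNORECASE)
PRICE_RE = re.compile(r"^\d+(?:[.,]\d{1,2})?\s?[A-Z]{3}$")  # e.g. "299.50 EUR"


def column_features(values):
    """Summarise a column's cells as shape statistics and regex hit rates."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return {}
    tokens = [t.lower() for v in values for t in v.split()]
    n = len(values)
    return {
        "mean_chars": mean(len(v) for v in values),
        "mean_tokens": mean(len(v.split()) for v in values),
        "unique_token_ratio": len(set(tokens)) / len(tokens),
        "url_rate": sum(bool(URL_RE.match(v)) for v in values) / n,
        "image_rate": sum(
            bool(URL_RE.match(v) and IMAGE_RE.search(v)) for v in values
        ) / n,
        "price_rate": sum(bool(PRICE_RE.match(v)) for v in values) / n,
    }
```

A column whose price_rate or url_rate is close to 1 can be labelled directly, while the length and cardinality numbers can feed whatever classifier handles the remaining text columns.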

thanks in advance,

ps. I’d also like to label a column as ‘none’ or similar if it doesn’t fit any of the column types with enough confidence
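
One simple way to get the ‘none’ behaviour is to threshold the best class score. The sketch below assumes the classifier returns a dict of per-type probabilities; the name pick_label and the 0.6 cut-off are hypothetical and would have to be chosen on validation data.

```python
def pick_label(scores, threshold=0.6):
    """Return the best-scoring column type, or 'none' if it is not confident.

    `scores` maps column type -> probability; 0.6 is an arbitrary example value.
    """
    if not scores:
        return "none"
    label, best = max(scores.items(), key=lambda kv: kv[1])
    return label if best >= threshold else "none"
```

For example, pick_label({"brand": 0.4, "mpn": 0.35}) falls back to "none", because neither type clears the threshold.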


(Matthew Teschke) #2

Is it suitable for deep learning?
Deep learning should work well, but simpler approaches may be “good enough”.

Is it really a classification problem?
Yes, that’s the first way I would approach it. You may want to do some preprocessing to convert prices to a generic placeholder: whether a price is $198, $197 or $190, the difference isn’t going to matter for classifying the text. All you care about is that it is a price, so I’d replace it with something like fld_price.
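
As a rough illustration of that preprocessing step, the sketch below swaps anything that looks like a price for the fld_price token. The pattern and the function name mask_prices are assumptions: it only covers a leading currency sign or a trailing sign or three-letter code, and it will also catch lookalikes such as "12 USB".

```python
import re

# Assumed price shapes: "$198", "€ 19,99", "299.88 GBP", "15$".
PRICE_IN_TEXT = re.compile(
    r"(?:[$€£]\s?\d+(?:[.,]\d+)?"                 # sign first: $198
    r"|\d+(?:[.,]\d+)?\s?(?:[A-Z]{3}\b|[$€£]))"   # amount first: 299.88 GBP
)


def mask_prices(text, token="fld_price"):
    """Replace every price-like span with one placeholder token."""
    return PRICE_IN_TEXT.sub(token, text)
```

After masking, "$198" and "$190" become the same token, so the classifier only sees that a price is present.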

Could it be solved using the collaborative filtering course sample?
I don’t see how that would work.

Should you flatten it?
That’s certainly another approach you could try, but the concern you raise right after it is valid.

As you mention, simpler approaches like regex might be good enough, so I would try those first. Typically you want to start with a simpler approach and work up to more complicated. If the simpler approach works well enough - great!


(Albert Benatov) #3

thanks Matthew,

I’ve implemented a naive Bayes classifier using the simple methods mentioned above. it works well.
I’m tweaking the feature section to guess ‘obvious’ types like price, url, img, etc. it’s accurate, except in a few edge cases (which are acceptable for now)
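
For readers who want to see the general shape of such a classifier, here is a minimal multinomial naive Bayes sketch in plain Python. It is not the implementation described in this post: the class TinyNaiveBayes, the tokenize helper, and the <num> token are assumptions, and the four training rows are just the sample rows from the first post.

```python
import math
import re
from collections import Counter, defaultdict

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def tokenize(cell):
    """Lowercase the words and collapse every number into one shared token."""
    return ["<num>" if NUMBER_RE.fullmatch(t) else t.lower() for t in cell.split()]


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over cell tokens."""

    def fit(self, cells, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for cell, label in zip(cells, labels):
            self.token_counts[label].update(tokenize(cell))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def log_scores(self, cell):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            scores[label] = math.log(n / total) + sum(
                math.log((counts[t] + 1) / denom) for t in tokenize(cell)
            )
        return scores

    def predict(self, cell):
        scores = self.log_scores(cell)
        return max(scores, key=scores.get)


# Sample rows taken from the first post; a real model needs far more data.
model = TinyNaiveBayes().fit(
    ["Addibar walking sandals Red",
     "Some description on Addibar walking sandals Red...",
     "used as new",
     "299.88 GBP"],
    ["title", "description", "condition", "price"],
)
```

With only these four rows, model.predict("99.75 CAD") picks price because of the shared number token, which is the same intuition as replacing prices with a placeholder.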

I’ll also add a ‘column name approximation’ feature, additionally using a column->column_orig_name lookup I have in my data. will post how it goes
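
A column name approximation along those lines could be sketched as below, using plain Levenshtein distance against a lookup of known names. The KNOWN_NAMES entries reuse the example names from the first post, but their type assignments, the function names, and the 0.4 cut-off are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


# Hypothetical lookup: original column name -> column type.
KNOWN_NAMES = {
    "full_url": "link",
    "list_price": "price",
    "desc_en_full": "description",
    "title_original": "title",
}


def guess_type_by_name(name, known_names=KNOWN_NAMES, max_ratio=0.4):
    """Return the type of the closest known name, or 'none' if all are far."""
    name = name.lower()
    best_type, best_ratio = "none", None
    for known, col_type in known_names.items():
        distance = levenshtein(name, known.lower())
        ratio = distance / max(len(name), len(known), 1)
        if best_ratio is None or ratio < best_ratio:
            best_type, best_ratio = col_type, ratio
    if best_ratio is not None and best_ratio <= max_ratio:
        return best_type
    return "none"
```

Dividing the distance by the longer name’s length keeps short and long column names comparable; the result is probably best used as one feature among others rather than as the final answer.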

but back to deep learning: I am still going to implement a language model for the ‘text’ and ‘id’ column types. hopefully this will help distinguish title from description better (than my current token cardinality and length stats), tell id apart from gtin and ean, and detect brands

since I am dealing with multilingual data, I need the language detected accurately, and then I guess I’ll have one model per language

will post my progress