hi,
in brief, I’d like to classify each column of an unseen e-commerce feed into one of the following pre-defined column types: title, description, price, link, image_link, condition, brand, mpn, etc…
for anyone who has pushed feeds to Google Merchant, Facebook, etc… the above should look familiar.
the app should then present a mapping from the unseen feed’s columns to the column types above. that is, even if someone supplies a feed with a column named full_url, I should still be able to classify it as a link column type
unseen feed example:
id | title_en | desc_en | full_price | full_url | img_url | prod_condition | prod_brand | prod_gtin |
---|---|---|---|---|---|---|---|---|
nite-as-324 | Nite running shoes Black | Suitable for every day running, soft and comfortable... | 299.50 EUR | https://nite-runner.com/nite-as-324 | https://nite-runner.com/nite-as-324-89583.jpg | NEW | Nite | 003355847788 |
rbt-9432-i77 | Ronibat swimming shorts Pink | Suitable for night swimming... | 99.75 CAD | https://ronibat.com/rbt-9432-i77 | https://nite-runner.com/rbt-9432-i77.jpg | NEW | Ronibat | 05336543254 |
my ‘training/existing’ data is tabular and has the following structure:
- original_column_name: the column name as it appears in the feed, e.g. full_url, list_price, desc_en_full
- column_data: the actual content (like urls, img urls, title words, description words, etc…)
- language: en, nl, de, fr
- column_type: title, description, price, link, image_link (this is the label I should predict)
example:
original_column_name | column_data | language | column_type (label) |
---|---|---|---|
title_original | Addibar walking sandals Red | en | title |
description_original_column_name | Some description on Addibar walking sandals Red... | en | description |
condition_original_column_name | used as new | en | condition |
price_b4_discount | 299.88 GBP | en | price |
I have about 3000 rows with the above structure, of which fewer than 1000 are unique.
my questions are:
- is this suitable for deep learning?
- is it really a classification problem, like the sentiment analysis example in the fast.ai course?
- could it be solved along the lines of the collaborative filtering example from the course?
- is there any logic in using an underlying huge wiki language model for transfer learning? is there any hidden value in the text (word sequence)? some of my columns hold content where a language model doesn’t make sense, like urls, conditions (maybe), prices, ids
- shouldn’t I instead ‘flatten’ the long textual data (for titles and descriptions), since I care less about what’s written there and more about what kind of content it is: is it long or short, how many unique words, word cardinality, etc…
- on the other hand, with the above, I’ll lose linguistic analysis for similar columns, like brand vs category, or mpn vs id, etc
- column content, such as link or image_link, can be easily detected using conventional methods rather than dl - regex for example
- I am tempted to use more conventional approaches, such as generating several features and applying naive Bayes or another ml algo
- or even worse: going for column name similarity (like Levenshtein distance) to determine the column type
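
to make the ‘conventional’ route from the bullets above a bit more concrete, here is a rough sketch of what I have in mind: a few ‘flattened’ features per column, plus regex rules for the easy types. the function names, the regexes and the 0.9 threshold are all just assumptions for illustration, nothing here is tuned on real feeds:

```python
import re
from statistics import mean

# Illustrative patterns only; real feeds will need broader rules.
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)
IMAGE_RE = re.compile(r"\.(jpe?g|png|gif|webp)(\?\S*)?$", re.IGNORECASE)
# Assumes prices look like "299.50 EUR": amount followed by a 3-letter code.
PRICE_RE = re.compile(r"^\d+(?:[.,]\d{1,2})?\s?[A-Z]{3}$")


def column_features(values):
    """Summarise a column's content as a handful of numeric features."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return {"avg_chars": 0, "avg_words": 0, "unique_ratio": 0,
                "url_ratio": 0, "image_ratio": 0, "price_ratio": 0}
    n = len(values)
    return {
        "avg_chars": mean(len(v) for v in values),
        "avg_words": mean(len(v.split()) for v in values),
        "unique_ratio": len(set(values)) / n,
        "url_ratio": sum(bool(URL_RE.match(v)) for v in values) / n,
        "image_ratio": sum(
            bool(URL_RE.match(v) and IMAGE_RE.search(v)) for v in values
        ) / n,
        "price_ratio": sum(bool(PRICE_RE.match(v)) for v in values) / n,
    }


def rule_based_type(values, threshold=0.9):
    """Return a column type when a regex rule fires, otherwise None."""
    f = column_features(values)
    if f["image_ratio"] >= threshold:  # check images first: they are URLs too
        return "image_link"
    if f["url_ratio"] >= threshold:
        return "link"
    if f["price_ratio"] >= threshold:
        return "price"
    return None  # leave the decision to a learned classifier
```

the idea would be that the rules catch link, image_link and price directly, and the remaining numeric features (length, word count, uniqueness) go into naive Bayes or whatever classifier handles the harder columns such as title vs description.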
thanks in advance,
Albert
ps. I’d also like to label a column as ‘none’ (or similar) if it doesn’t fit any of the column types with enough confidence
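
for the ‘none’ label, what I picture is a simple threshold on whatever per-type scores the classifier produces. a minimal sketch, assuming the model gives me a dict of probabilities per column type (the function name and the 0.6 cut-off are made up for illustration):

```python
def predict_with_fallback(probabilities, min_confidence=0.6, fallback="none"):
    """Pick the most probable column type, or the fallback if unsure.

    `probabilities` maps column type -> score from some classifier
    (hypothetical; any model that outputs per-class scores would do).
    """
    if not probabilities:
        return fallback
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= min_confidence else fallback


# A confident prediction keeps its label.
print(predict_with_fallback({"link": 0.92, "image_link": 0.05, "title": 0.03}))
# An ambiguous one (e.g. brand vs mpn) falls back to "none".
print(predict_with_fallback({"brand": 0.40, "mpn": 0.35, "title": 0.25}))
```

the cut-off would presumably have to be chosen on held-out feeds rather than fixed up front.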