in brief, I’d like to be able to classify each column of an unseen e-commerce feed as one of the following pre-defined column types: title, description, price, link, image_link, condition, brand, mpn, etc…
for anyone who has pushed feeds to Google Merchant, Facebook, etc… the above should look familiar.
the UX of the app should provide a mapping from the unseen feed’s columns to the above-mentioned column types. that is, even if someone supplies a feed with a column named full_url, I should still be able to classify it as a link column type
unseen feed example:
|nite-as-324||Nite running shoes Black||Suitable for every day running, soft and comfortable...||299.50 EUR||https://nite-runner.com/nite-as-324||https://nite-runner.com/nite-as-324-89583.jpg||NEW||Nite||003355847788|
|rbt-9432-i77||Ronibat swimming shorts Pink||Suitable for night swimming...||99.75 CAD||https://ronibat.com/rbt-9432-i77||https://nite-runner.com/rbt-9432-i77.jpg||NEW||Ronibat||05336543254|
my ‘training/existing’ data is tabular and has the following structure:
original_column_name: the column’s name in the source feed, e.g. full_url, list_price, desc_en_full
column_data: the actual content (like urls, img urls, title words, description words, etc…)
column_type: title, description, price, link, image_link, etc… (this is the label I should predict)
|title_original||Addibar walking sandals Red||en||title|
|description_original_column_name||Some description on Addibar walking sandals Red...||en||description|
|condition_original_column_name||used as new||en||condition|
I have about 3000 rows with the above structure, of which fewer than 1000 are unique.
my questions are:
- is this suitable for deep learning?
- is it really a classification problem, like the sentiment analysis example in the fast.ai course?
- could it be solved using the collaborative filtering example from the course?
- is there any logic in using an underlying huge wiki language model for transfer learning? is there any hidden value in the text (word sequence)? I have columns where a language model doesn’t make sense, like urls, conditions (maybe), prices, ids
- shouldn’t I instead ‘flatten’ the long textual data (for titles and descriptions), since I don’t care what’s written there, but more what type it is - is it long or short, how many unique words, word cardinality, etc…
- on the other hand, with the above, I’ll lose linguistic analysis for similar columns, like brand vs category, or mpn vs id, etc
- column content, such as link or image_link, can be easily detected using conventional methods rather than dl - regex for example
- I am tempted by more conventional approaches, such as generating several features and applying Naive Bayes or another ML algorithm
- or, even worse, going by column name similarity (e.g. Levenshtein distance) to determine the column type
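
to make the ‘flatten’ and regex ideas from the list above concrete, here is a minimal sketch of what I have in mind. it reduces a column’s values to a few numeric features and lets regex rules claim the easy types first. the function names (`column_features`, `guess_by_rules`), the regex patterns and the 0.9 cut-off are all my own assumptions for illustration, not something taken from a library or from the course:

```python
import re
from statistics import mean

# Assumed patterns, tuned only to the sample feed above; real feeds will need more.
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)
IMAGE_RE = re.compile(r"^https?://\S+\.(?:jpe?g|png|gif|webp)(?:\?\S*)?$", re.IGNORECASE)
PRICE_RE = re.compile(r"^\d+(?:[.,]\d{1,2})?\s?[A-Z]{3}$")  # e.g. "299.50 EUR"


def column_features(values):
    """Summarise one column's values as numeric features (None if the column is empty)."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return None
    n = len(values)
    return {
        "mean_chars": mean(len(v) for v in values),      # long vs short text
        "mean_words": mean(len(v.split()) for v in values),
        "unique_ratio": len(set(values)) / n,            # low for 'condition'-like columns
        "url_frac": sum(bool(URL_RE.match(v)) for v in values) / n,
        "image_frac": sum(bool(IMAGE_RE.match(v)) for v in values) / n,
        "price_frac": sum(bool(PRICE_RE.match(v)) for v in values) / n,
    }


def guess_by_rules(features, min_frac=0.9):
    """Return a column type when a regex rule is near-unanimous, else None."""
    if features is None:
        return None
    # image_link is checked before link, because every image URL is also a URL.
    if features["image_frac"] >= min_frac:
        return "image_link"
    if features["url_frac"] >= min_frac:
        return "link"
    if features["price_frac"] >= min_frac:
        return "price"
    return None  # leave title/description/brand/... to a learned model


links = ["https://nite-runner.com/nite-as-324", "https://ronibat.com/rbt-9432-i77"]
titles = ["Nite running shoes Black", "Ronibat swimming shorts Pink"]
print(guess_by_rules(column_features(links)))   # rule-based guess for the link column
print(guess_by_rules(column_features(titles)))  # no rule fires for free text
```

columns that no rule claims would still have their feature dict, which could then be fed to Naive Bayes or whichever classifier turns out to work.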
thanks in advance,
ps. I’d also like to label a column as ‘none’ (or similar) if it doesn’t fit any of the column types with enough confidence
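
one way I imagine the ‘none’ fallback: take whatever per-type scores the classifier produces and refuse to answer when the best one is below a threshold. the sketch below uses column name similarity (normalised Levenshtein distance) purely as a stand-in scorer; `label_or_none`, `classify_by_name`, the type list and the 0.5 threshold are assumptions I made up for the example:

```python
COLUMN_TYPES = ["title", "description", "price", "link",
                "image_link", "condition", "brand", "mpn"]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1.0 for identical names, 0.0 for nothing in common."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def label_or_none(scores, threshold=0.5):
    """Pick the best-scoring type, or 'none' if even the best is below the threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "none"


def classify_by_name(column_name, threshold=0.5):
    """Stand-in scorer: compare the column name against each known type name."""
    scores = {t: name_similarity(column_name, t) for t in COLUMN_TYPES}
    return label_or_none(scores, threshold)


print(classify_by_name("image_url"))  # close enough to a known type name
print(classify_by_name("full_url"))   # too far from every type name
```

with these settings full_url ends up as ‘none’, which is exactly why names alone feel too weak to me. `label_or_none` itself doesn’t care where the scores come from, so the same cut-off could be applied to the probabilities of a trained model.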