At my current company, we try to tag raw merchant strings with their cleaned names. Some examples:
HYATT REGENCY CHICAGO F& --> HYATT REGENCY
WAL-MART SUPERCENTER 5039 --> WALMART
MARSHALLS M015 --> MARSHALLS
BP#8274714CIRCLE K ST 27 --> BP
HDD BRANDS # 8119 --> HDD
ABERCROMBIE & FITCH #0762BREA CA --> ABERCROMBIE & FITCH
BART-CLIPPER CIVIC CTR 5104646474 CA --> BART-CLIPPER
TICKET FULFILLMENT SERVIC --> TICKET FULFILLMENT SERVICES
MJ WILMINGTON HOTEL ASSOC --> MJ WILMINGTON HOTEL ASSOCIATES
IHOP #0583 --> IHOP
PAYPAL *IMPORTGET --> PAYPAL
GIANT FOOD INC #196 --> GIANT FOOD
STARBUCKS CORP 989079742503 --> STARBUCKS
HOME DEPOT 6557 --> HOME DEPOT
NORDSTROM DIRECT INC --> NORDSTROM
PLANTATION FINE WINE & GALLATIN TN --> PLANTATION FINE WINE & GALLATIN
COSTCO WHSE #1147 NEW ORLEANS LA --> COSTCO
CHIPOTLE 1956 MOORESVILLE NC --> CHIPOTLE
Currently, we use regular expressions to strip out the junk and extract the correct name, and in some cases we replace it with a new, better name.
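To make the regex step concrete, here is a minimal sketch of the kind of cleanup I mean. The patterns and the alias table are made up for illustration (they are not our production rules), and they only cover some of the examples above; strings like MARSHALLS M015 or HDD BRANDS # 8119 would need further rules.

```python
import re

# Hypothetical alias table: cleaned prefix -> preferred display name.
ALIASES = {"WAL-MART": "WALMART", "COSTCO WHSE": "COSTCO"}

# Hypothetical junk patterns, applied in order.
JUNK_PATTERNS = [
    re.compile(r"\s*[#*].*$"),                # everything from the first '#' or '*'
    re.compile(r"\s+\d{3,}\b.*$"),            # a store/phone number and whatever follows it
    re.compile(r"\s+(INC|CORP|LLC)\b\.?$"),   # trailing legal suffix
]

def clean_merchant(raw):
    """Strip junk from a raw merchant string, then apply known aliases."""
    name = raw.upper().strip()
    for pattern in JUNK_PATTERNS:
        name = pattern.sub("", name)
    name = re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace
    for prefix, alias in ALIASES.items():
        if name.startswith(prefix):
            return alias
    return name

for raw in ["IHOP #0583", "STARBUCKS CORP 989079742503", "WAL-MART SUPERCENTER 5039"]:
    print(raw, "-->", clean_merchant(raw))
```

Every new merchant format tends to need another pattern or alias entry, which is exactly why this does not scale for us.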
We then use TF-IDF and DBSCAN clustering to group similar strings together.
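For reference, the clustering step looks roughly like the sketch below. It is a dependency-free illustration of the idea, assuming character trigrams, cosine distance, and hand-picked `eps` and `min_samples` values; in practice one would more likely use a library such as scikit-learn, and the function names here are my own.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a string padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tfidf_vectors(docs, n=3):
    """L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    total = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF, so n-grams present in every string keep a small weight.
        vec = {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())

def dbscan(vectors, eps=0.5, min_samples=2):
    """Plain O(n^2) DBSCAN; returns one label per vector, -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels

names = ["STARBUCKS", "STARBUCKS", "STARBUCKS CORP", "HOME DEPOT", "HOME DEPOT", "IHOP"]
for name, label in zip(names, dbscan(tfidf_vectors(names))):
    print(label, name)
```

The pairwise distance computation is quadratic in the number of strings, so at millions of merchants this would need blocking or an approximate nearest-neighbour index first.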
Every month we get lots of new merchants (mostly small ones), and we are dealing with millions of merchants overall. Is there a better way to deal with this problem? Maybe word embeddings followed by an LSTM, or something similar? Any help and ideas are appreciated.
Note that with regular expressions we are able to tag the top 500 merchants, but this method does not scale: we have millions of merchants and can’t write rules for all the small ones.
Looking forward to hearing any ideas on how to tackle this.