Using Embeddings to Clean and Group Product Catalogs
How Transformer Models Improve Product Categorization for Retailers
A few months ago, as part of customer discovery, I had the chance to observe store associates at Belk, a multi-store retailer in the Southeast. While one associate was picking and packing items, I noticed that during the picking step she struggled to reconcile how an item was labeled on the shirt itself with how it appeared in the product listing in the system. She would consistently pause and spend a few minutes comparing the UPC code. Her picking velocity dropped, and she was racing to keep up with the day's SLA. This is not unique to Belk; I have observed the same problem at many retailers.
Some of the inconsistencies retailers see in their product listings:
“Blue Men’s Polo Shirt – L” vs. “Men Polo Tee Large Blue”
Different categories assigned across brands: Apparel -> Tops vs Clothing -> Men -> Shirts
Duplicate or near-duplicate listings, missing metadata, or wrong labels
These inconsistencies break downstream analytics, personalization, inventory decisions, and search.
At ManoloAI, we implemented transformer-based embeddings to group, cluster, and clean product catalogs for a mid-sized retailer, replacing rule-based string matching with vector-powered semantic understanding. The retailer saw improvements in both conversion rates and search results.
Traditional Approach: Fragile and Manual
Historically, catalog cleanup used:
Keyword rules and regex
Manual mapping tables
TF-IDF cosine similarity
Problem: these methods don't understand meaning, so "Men's Polo" and "Polo Tee" could be treated as entirely different products.
Transformers and Embeddings
Transformer models like BERT, DistilBERT, or SentenceTransformers encode product titles/descriptions into dense vectors (aka embeddings) that capture semantic similarity, not just keyword overlap.
"Red cotton shirt" and "Cotton tee in red" will live close together in vector space, even though they share no exact words.
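"Close together" here means high cosine similarity between the vectors. The sketch below illustrates the idea with tiny, made-up 4-dimensional vectors (real models such as `all-MiniLM-L6-v2` produce 384-dimensional embeddings; the numbers here are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means semantically close, near 0.0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for illustration only -- real vectors come
# from a transformer model, not hand-written numbers.
red_cotton_shirt  = np.array([0.82, 0.10, 0.55, 0.05])
cotton_tee_in_red = np.array([0.78, 0.15, 0.60, 0.02])
denim_jeans       = np.array([0.05, 0.90, 0.10, 0.40])

print(cosine_similarity(red_cotton_shirt, cotton_tee_in_red))  # high (~0.99)
print(cosine_similarity(red_cotton_shirt, denim_jeans))        # low (~0.21)
```

Exact-string or keyword matching would score the first pair near zero; in embedding space they are nearly identical.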
How we set up the workflow
Ingest and Preprocess Product Titles
Generate Embeddings using a Transformer model
Cluster Similar Items (e.g. KMeans, HDBSCAN)
Manually Label Clusters (if needed)
Use Clean Clusters as Canonical Categories or for Deduplication
Sample Python Code: Sentence-BERT + Clustering
Running the workflow produces groups of semantically similar titles. Notice how the model groups similar concepts, even when the words differ.
Why This Works for Retail
Handles brand variation: “Nike Jogger” vs “Athletic Pants”
Works across languages or abbreviations
Adapts as your catalog grows or diversifies
There are other use cases where this approach can also be applied:
SKU rationalization
Marketplace onboarding
Search tuning and personalization
Product taxonomy rebuilding
Impact
Cut catalog de-duplication time by 60%
Improved classification F1 score by 20% compared to TF-IDF
Created auto-clustered collections that required minimal manual tagging
Deployment Tips from Our Team
Use HDBSCAN for variable-size clusters (great for retail)
Combine embeddings with structured metadata (e.g., size, color, gender)
Visualize clusters with UMAP or t-SNE to QA results
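As a sketch of the second tip, structured attributes can be one-hot encoded and appended to the text embedding so the clusterer sees both signals. The field name, category list, and `weight` factor below are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def combine_features(text_emb, value, categories, weight=0.5):
    """Append a weighted one-hot encoding of a structured metadata
    field (e.g., color) to an L2-normalized text embedding.
    `weight` controls how strongly metadata influences distances."""
    emb = np.asarray(text_emb, dtype=float)
    emb = emb / np.linalg.norm(emb)          # normalize the text signal
    one_hot = np.zeros(len(categories))
    one_hot[categories.index(value)] = weight
    return np.concatenate([emb, one_hot])

# Hypothetical 3-d text embedding plus a "color" attribute.
colors = ["red", "blue", "green"]
vec = combine_features([0.2, 0.9, 0.4], "blue", colors, weight=0.5)
print(vec.shape)  # (6,)
```

Tuning `weight` up makes metadata dominate (items of different colors separate even with similar titles); tuning it down lets the text embedding dominate.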
Want to Clean Up Your Catalog?
If you're a retailer or marketplace struggling with catalog entropy, we can help. ManoloAI’s embedding-powered cleanup pipelines are customizable, scalable, and already proven in production.
Contact us or let us show you a demo using your real product data.