Using Embeddings to Clean and Group Product Catalogs

How Transformer Models Improve Product Categorization for Retailers

A few months ago, as part of customer discovery, I observed store associates at Belk, a department-store chain in the Southeast. While one associate was picking and packing items, I noticed that during the picking step she struggled to reconcile how an item was labeled on the shirt itself with how it appeared in the product listing in the system. She would consistently pause and spend a few minutes comparing UPC codes. Her picking velocity dropped, and she was racing to keep up with the day's SLA. This is not unique to Belk; I have observed the same problem at many retailers.

Some of the inconsistencies retailers see in product listings:

  • “Blue Men’s Polo Shirt – L” vs. “Men Polo Tee Large Blue”

  • Different categories assigned across brands: Apparel -> Tops vs Clothing -> Men -> Shirts

  • Duplicate or near-duplicate listings, missing metadata, or wrong labels

These inconsistencies break downstream analytics, personalization, inventory decisions, and search.

At ManoloAI, we implemented transformer-based embeddings to group, cluster, and clean product catalogs, replacing rule-based string matching with vector-powered semantic understanding. For one mid-sized retailer, this improved both conversion rates and search results.

Traditional Approach: Fragile and Manual

Historically, catalog cleanup used:

  • Keyword rules and regex

  • Manual mapping tables

  • TF-IDF cosine similarity

Problem: these methods don't understand meaning, so "Men's Polo" and "Polo Tee" can be treated as different products.
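To make that limitation concrete, here is a small illustrative sketch (not our production pipeline) using scikit-learn's TfidfVectorizer. TF-IDF only sees token overlap, so "Shirt" and "Tee" look completely unrelated and the two listings score as a weak match:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two listings a human would treat as the same product
titles = ["Men's Polo Shirt", "Polo Tee for Men"]

# TF-IDF vectors share only the tokens "men" and "polo"
vectors = TfidfVectorizer().fit_transform(titles)
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.2f}")  # a low score, despite identical meaning
```

A rule-based pipeline built on a similarity threshold would likely keep these as separate catalog entries.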

Transformers and Embeddings

Transformer models like BERT, DistilBERT, or SentenceTransformers encode product titles/descriptions into dense vectors (aka embeddings) that capture semantic similarity, not just keywords.

"Red cotton shirt" and "Cotton tee in red" will live close together in vector space, even if they share no exact words

How We Set Up the Workflow

  1. Ingest and Preprocess Product Titles

  2. Generate Embeddings using a Transformer model

  3. Cluster Similar Items (e.g. KMeans, HDBSCAN)

  4. Manually Label Clusters (if needed)

  5. Use Clean Clusters as Canonical Categories or for Deduplication

Sample Python Code: Sentence-BERT + Clustering

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd

# Sample product titles from a messy catalog
titles = [
    "Men's Blue Polo Shirt - L",
    "Large Polo Tee Blue for Men",
    "Women's Running Shoes Size 7",
    "Size 7 Ladies Jogging Footwear",
    "Organic Cotton Baby Onesie - 0-6M"
]

# Step 1: Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(titles)

# Step 2: Cluster embeddings (simple KMeans example)
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Step 3: Display results
df = pd.DataFrame({'Title': titles, 'Cluster': labels})
print(df.sort_values(by='Cluster'))

And here is the output of the workflow:

                               Title  Cluster
0          Men's Blue Polo Shirt - L        0
1        Large Polo Tee Blue for Men        0
2       Women's Running Shoes Size 7        1
3     Size 7 Ladies Jogging Footwear        1
4  Organic Cotton Baby Onesie - 0-6M        2

Notice how the model groups similar concepts, even when the words differ.

Why This Works for Retail

  • Handles brand variation: “Nike Jogger” vs “Athletic Pants”

  • Works across languages or abbreviations

  • Adapts as your catalog grows or diversifies

There are other use cases where this approach could also be used:

  • SKU rationalization

  • Marketplace onboarding

  • Search tuning and personalization

  • Product taxonomy rebuilding

Impact

  • Cut catalog de-duplication time by 60%

  • Improved classification F1 score by 20% compared to TF-IDF

  • Created auto-clustered collections that required minimal manual tagging

Deployment Tips from Our Team

  • Use HDBSCAN for variable-size clusters (great for retail)

  • Combine embeddings with structured metadata (e.g., size, color, gender)

  • Visualize clusters with UMAP or t-SNE to QA results
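One way to act on the second tip, combining embeddings with structured metadata, is to one-hot encode fields like color and gender and concatenate them onto the embedding before clustering. A hypothetical sketch with toy vectors (the 0.5 weight is an assumption you would tune per catalog):

```python
import numpy as np

def one_hot(values):
    """One-hot encode a list of categorical values, one column per category."""
    categories = sorted(set(values))
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

# Toy 3-d "embeddings" standing in for real transformer output
embeddings = np.array([
    [0.9, 0.1, 0.2],   # Men's Blue Polo Shirt
    [0.8, 0.2, 0.1],   # Large Polo Tee Blue for Men
    [0.1, 0.9, 0.8],   # Women's Running Shoes Size 7
])

# Structured metadata for the same three items
color_features  = one_hot(["blue", "blue", "white"])
gender_features = one_hot(["men", "men", "women"])

# Weight metadata so it nudges -- not dominates -- the semantic signal
features = np.hstack([embeddings, 0.5 * color_features, 0.5 * gender_features])
print(features.shape)  # (3, 7): 3 embedding dims + 2 + 2 one-hot dims
```

The combined feature matrix can then be fed straight into KMeans or HDBSCAN, so items that are semantically similar but differ on a hard attribute (men's vs. women's) are pushed apart.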

Want to Clean Up Your Catalog?

If you're a retailer or marketplace struggling with catalog entropy, we can help. ManoloAI’s embedding-powered cleanup pipelines are customizable, scalable, and already proven in production.

Contact us or let us show you a demo using your real product data.
