Using Embeddings to Clean and Group Product Catalogs

How Transformer Models Improve Product Categorization for Retailers

A few months ago, as part of customer discovery, I observed store associates at Belk, a department-store chain in the Southeast. While one associate was picking and packing items, I noticed that during the picking step she struggled to reconcile how an item was labeled on the shirt itself with how it appeared in the product listing in the system. She would consistently pause and spend a few minutes comparing UPC codes. Her picking velocity dropped, and she was racing to keep up with the day's SLA. This is not unique to Belk; I have observed the same problem at many retailers.

Some of the inconsistencies retailers see in product listings:

  • “Blue Men’s Polo Shirt – L” vs. “Men Polo Tee Large Blue”

  • Different categories assigned across brands: Apparel -> Tops vs Clothing -> Men -> Shirts

  • Duplicate or near-duplicate listings, missing metadata, or wrong labels

These inconsistencies break downstream analytics, personalization, inventory decisions, and search.

At ManoloAI, we implemented transformer-based embeddings to group, cluster, and clean product catalogs, replacing rule-based string matching with vector-powered semantic understanding. For one mid-sized retailer, this improved both conversion rates and search results.

Traditional Approach: Fragile and Manual

Historically, catalog cleanup used:

  • Keyword rules and regex

  • Manual mapping tables

  • TF-IDF cosine similarity

Problem: these methods don't understand meaning, so "Men's Polo" and "Polo Tee" can be treated as different products.
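To make that limitation concrete, here is a small illustrative sketch (not our production pipeline) using scikit-learn's TfidfVectorizer. TF-IDF only sees token overlap, so "Shirt" and "Tee" look completely unrelated and the two listings score as a weak match:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two listings a human would treat as the same product
titles = ["Men's Polo Shirt", "Polo Tee for Men"]

# TF-IDF vectors share only the tokens "men" and "polo"
vectors = TfidfVectorizer().fit_transform(titles)
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.2f}")  # a low score, despite identical meaning
```

A rule-based pipeline built on a similarity threshold would likely keep these as separate catalog entries.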

Transformers and Embeddings

Transformer models like BERT, DistilBERT, or SentenceTransformers encode product titles/descriptions into dense vectors (aka embeddings) that capture semantic similarity, not just keywords.

"Red cotton shirt" and "Cotton tee in red" will live close together in vector space, even if they share no exact words

How We Set Up the Workflow

  1. Ingest and Preprocess Product Titles

  2. Generate Embeddings using a Transformer model

  3. Cluster Similar Items (e.g. KMeans, HDBSCAN)

  4. Manually Label Clusters (if needed)

  5. Use Clean Clusters as Canonical Categories or for Deduplication

Sample Python Code: Sentence-BERT + Clustering

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd

# Sample product titles from a messy catalog
titles = [
    "Men's Blue Polo Shirt - L",
    "Large Polo Tee Blue for Men",
    "Women's Running Shoes Size 7",
    "Size 7 Ladies Jogging Footwear",
    "Organic Cotton Baby Onesie - 0-6M"
]

# Step 1: Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(titles)

# Step 2: Cluster embeddings (simple KMeans example)
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Step 3: Display results
df = pd.DataFrame({'Title': titles, 'Cluster': labels})
print(df.sort_values(by='Cluster'))

And here is the output of the workflow:

                               Title  Cluster
0          Men's Blue Polo Shirt - L        0
1        Large Polo Tee Blue for Men        0
2       Women's Running Shoes Size 7        1
3     Size 7 Ladies Jogging Footwear        1
4  Organic Cotton Baby Onesie - 0-6M        2

Notice how the model groups similar concepts, even when the words differ.

Why This Works for Retail

  • Handles brand variation: “Nike Jogger” vs “Athletic Pants”

  • Works across languages or abbreviations

  • Adapts as your catalog grows or diversifies

There are other use cases where this approach could also be used:

  • SKU rationalization

  • Marketplace onboarding

  • Search tuning and personalization

  • Product taxonomy rebuilding

Impact

  • Cut catalog de-duplication time by 60%

  • Improved classification F1 score by 20% compared to TF-IDF

  • Created auto-clustered collections that required minimal manual tagging

Deployment Tips from Our Team

  • Use HDBSCAN for variable-size clusters (great for retail)

  • Combine embeddings with structured metadata (e.g., size, color, gender)

  • Visualize clusters with UMAP or t-SNE to QA results
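One way to act on the second tip, combining embeddings with structured metadata, is to one-hot encode fields like color and gender and concatenate them onto the embedding before clustering. A hypothetical sketch with toy vectors (the 0.5 weight is an assumption you would tune per catalog):

```python
import numpy as np

def one_hot(values):
    """One-hot encode a list of categorical values, one column per category."""
    categories = sorted(set(values))
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

# Toy 3-d "embeddings" standing in for real transformer output
embeddings = np.array([
    [0.9, 0.1, 0.2],   # Men's Blue Polo Shirt
    [0.8, 0.2, 0.1],   # Large Polo Tee Blue for Men
    [0.1, 0.9, 0.8],   # Women's Running Shoes Size 7
])

# Structured metadata for the same three items
color_features  = one_hot(["blue", "blue", "white"])
gender_features = one_hot(["men", "men", "women"])

# Weight metadata so it nudges -- not dominates -- the semantic signal
features = np.hstack([embeddings, 0.5 * color_features, 0.5 * gender_features])
print(features.shape)  # (3, 7): 3 embedding dims + 2 + 2 one-hot dims
```

The combined feature matrix can then be fed straight into KMeans or HDBSCAN, so items that are semantically similar but differ on a hard attribute (men's vs. women's) are pushed apart.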

Want to Clean Up Your Catalog?

If you're a retailer or marketplace struggling with catalog entropy, we can help. ManoloAI’s embedding-powered cleanup pipelines are customizable, scalable, and already proven in production.

Contact us or let us show you a demo using your real product data.
