Using De-duplication, Fuzzy Matching & Schema Unification

Jul 23

In retail supply chains, supplier data is often:

Stored across systems (ERP, spreadsheets, portals)
Structured differently (e.g., Supplier_ID vs VendorCode)
Duplicated with slight variations (Acme Inc. vs Acme, Inc.)
Missing contact fields or using inconsistent country codes

Our Approach (Step-by-Step)

We'll walk through:

Step 1 – Ingest data from multiple sources  
Step 2 – Normalize & unify schema  
Step 3 – De-duplicate using fuzzy matching  
Step 4 – Assign canonical IDs  
Step 5 – Store as a single vendor master table

High level flow

Sample Python Code – Fuzzy Vendor Deduplication

import pandas as pd
from thefuzz import fuzz
from thefuzz import process
import uuid

# Sample: Fragmented input sources
df_a = pd.DataFrame({
    'VendorName': ['Acme Inc.', 'Global Textiles', 'Sun Traders'],
    'Country': ['US', 'IN', 'US'],
    'Email': ['contact@acme.com', 'info@globaltextile.co', 'sales@suntraders.com']
})

df_b = pd.DataFrame({
    'Supplier_Name': ['Acme Incorporated', 'Global Textiles Ltd.', 'Sun Trading Co.'],
    'Country_Code': ['USA', 'IND', 'USA'],
    'EmailAddress': ['support@acme.com', None, 'info@suntraders.com']
})

# Step 2: Normalize schema
def unify_schema(df, mappings):
    df = df.rename(columns=mappings)
    return df[['VendorName', 'Country', 'Email']]

df_b['VendorName'] = df_b['Supplier_Name']
df_b['Country'] = df_b['Country_Code']
df_b['Email'] = df_b['EmailAddress']
df_b = df_b[['VendorName', 'Country', 'Email']]

combined = pd.concat([df_a, df_b]).reset_index(drop=True)

# Step 3: Deduplicate using fuzzy match on VendorName
def dedupe_vendors(df, threshold=85):
    matched_indices = set()
    vendor_groups = []
    
    for i, row_i in df.iterrows():
        if i in matched_indices:
            continue
        group = [i]
        name_i = row_i['VendorName']
        for j, row_j in df.iterrows():
            if j <= i or j in matched_indices:
                continue
            score = fuzz.token_sort_ratio(name_i.lower(), row_j['VendorName'].lower())
            if score >= threshold:
                group.append(j)
                matched_indices.add(j)
        vendor_groups.append(group)
    
    return vendor_groups

groups = dedupe_vendors(combined)

# Step 4: Assign Canonical ID
canonical_vendors = []
for group in groups:
    canonical_id = str(uuid.uuid4())[:8]
    for idx in group:
        row = combined.loc[idx].to_dict()
        row['CanonicalVendorID'] = canonical_id
        canonical_vendors.append(row)

vendor_master = pd.DataFrame(canonical_vendors)

print("✅ Canonical Vendor Master Table:")
print(vendor_master)

Learnings

Fuzzy matching alone isn't enough. We also use email/domain, phone, or address to boost accuracy.
Threshold tuning is key. For vendor names, we found 85+ to be safe with manual review.
Assigning UUIDs or internal Canonical IDs ensures downstream traceability.

Business Impact

Reduced supplier duplicates by 67%
Enabled better spend visibility and performance scoring
Improved onboarding automation with a single vendor truth

Supplier OnboardingDuplicateData LabelingQuality

ManoloAI Engineering

Using De-duplication, Fuzzy Matching & Schema Unification

Our Approach (Step-by-Step)

High level flow

Sample Python Code – Fuzzy Vendor Deduplication

Learnings

Business Impact

About Us

Industries

Articles

Careers

Request a Consult

Services

News and Insights

Engineering Blog

Using De-duplication, Fuzzy Matching & Schema Unification

Our Approach (Step-by-Step)

High level flow

Sample Python Code – Fuzzy Vendor Deduplication

Learnings

Business Impact

Using Embeddings to Clean and Group Product Catalogs