Using De-duplication, Fuzzy Matching & Schema Unification

In retail supply chains, supplier data is often:

  • Stored across systems (ERP, spreadsheets, portals)

  • Structured differently (e.g., Supplier_ID vs VendorCode)

  • Duplicated with slight variations (Acme Inc. vs Acme, Inc.)

  • Missing contact fields or using inconsistent country codes

Our Approach (Step-by-Step)

We'll walk through:

Step 1 – Ingest data from multiple sources Step 2 – Normalize & unify schema Step 3 – De-duplicate using fuzzy matching Step 4 – Assign canonical IDs Step 5 – Store as a single vendor master table

High level flow

Flow of Supplier Data

Sample Python Code – Fuzzy Vendor Deduplication

import pandas as pd from thefuzz import fuzz from thefuzz import process import uuid # Sample: Fragmented input sources df_a = pd.DataFrame({ 'VendorName': ['Acme Inc.', 'Global Textiles', 'Sun Traders'], 'Country': ['US', 'IN', 'US'], 'Email': ['contact@acme.com', 'info@globaltextile.co', 'sales@suntraders.com'] }) df_b = pd.DataFrame({ 'Supplier_Name': ['Acme Incorporated', 'Global Textiles Ltd.', 'Sun Trading Co.'], 'Country_Code': ['USA', 'IND', 'USA'], 'EmailAddress': ['support@acme.com', None, 'info@suntraders.com'] }) # Step 2: Normalize schema def unify_schema(df, mappings): df = df.rename(columns=mappings) return df[['VendorName', 'Country', 'Email']] df_b['VendorName'] = df_b['Supplier_Name'] df_b['Country'] = df_b['Country_Code'] df_b['Email'] = df_b['EmailAddress'] df_b = df_b[['VendorName', 'Country', 'Email']] combined = pd.concat([df_a, df_b]).reset_index(drop=True) # Step 3: Deduplicate using fuzzy match on VendorName def dedupe_vendors(df, threshold=85): matched_indices = set() vendor_groups = [] for i, row_i in df.iterrows(): if i in matched_indices: continue group = [i] name_i = row_i['VendorName'] for j, row_j in df.iterrows(): if j <= i or j in matched_indices: continue score = fuzz.token_sort_ratio(name_i.lower(), row_j['VendorName'].lower()) if score >= threshold: group.append(j) matched_indices.add(j) vendor_groups.append(group) return vendor_groups groups = dedupe_vendors(combined) # Step 4: Assign Canonical ID canonical_vendors = [] for group in groups: canonical_id = str(uuid.uuid4())[:8] for idx in group: row = combined.loc[idx].to_dict() row['CanonicalVendorID'] = canonical_id canonical_vendors.append(row) vendor_master = pd.DataFrame(canonical_vendors) print("✅ Canonical Vendor Master Table:") print(vendor_master)

Learnings

  • Fuzzy matching alone isn't enough. We also use email/domain, phone, or address to boost accuracy.

  • Threshold tuning is key. For vendor names, we found 85+ to be safe with manual review.

  • Assigning UUIDs or internal Canonical IDs ensures downstream traceability.

Business Impact

  • Reduced supplier duplicates by 67%

  • Enabled better spend visibility and performance scoring

  • Improved onboarding automation with a single vendor truth

Next
Next

Using Embeddings to Clean and Group Product Catalogs