Using De-duplication, Fuzzy Matching & Schema Unification
In retail supply chains, supplier data is often:
Stored across systems (ERP, spreadsheets, portals)
Structured differently (e.g.,
Supplier_ID
vsVendorCode
)Duplicated with slight variations (
Acme Inc.
vsAcme, Inc.
)Missing contact fields or using inconsistent country codes
Our Approach (Step-by-Step)
We'll walk through:
Step 1 – Ingest data from multiple sources
Step 2 – Normalize & unify schema
Step 3 – De-duplicate using fuzzy matching
Step 4 – Assign canonical IDs
Step 5 – Store as a single vendor master table
High level flow
Flow of Supplier Data
Sample Python Code – Fuzzy Vendor Deduplication
import pandas as pd
from thefuzz import fuzz
from thefuzz import process
import uuid
# Sample: Fragmented input sources
df_a = pd.DataFrame({
'VendorName': ['Acme Inc.', 'Global Textiles', 'Sun Traders'],
'Country': ['US', 'IN', 'US'],
'Email': ['contact@acme.com', 'info@globaltextile.co', 'sales@suntraders.com']
})
df_b = pd.DataFrame({
'Supplier_Name': ['Acme Incorporated', 'Global Textiles Ltd.', 'Sun Trading Co.'],
'Country_Code': ['USA', 'IND', 'USA'],
'EmailAddress': ['support@acme.com', None, 'info@suntraders.com']
})
# Step 2: Normalize schema
def unify_schema(df, mappings):
df = df.rename(columns=mappings)
return df[['VendorName', 'Country', 'Email']]
df_b['VendorName'] = df_b['Supplier_Name']
df_b['Country'] = df_b['Country_Code']
df_b['Email'] = df_b['EmailAddress']
df_b = df_b[['VendorName', 'Country', 'Email']]
combined = pd.concat([df_a, df_b]).reset_index(drop=True)
# Step 3: Deduplicate using fuzzy match on VendorName
def dedupe_vendors(df, threshold=85):
matched_indices = set()
vendor_groups = []
for i, row_i in df.iterrows():
if i in matched_indices:
continue
group = [i]
name_i = row_i['VendorName']
for j, row_j in df.iterrows():
if j <= i or j in matched_indices:
continue
score = fuzz.token_sort_ratio(name_i.lower(), row_j['VendorName'].lower())
if score >= threshold:
group.append(j)
matched_indices.add(j)
vendor_groups.append(group)
return vendor_groups
groups = dedupe_vendors(combined)
# Step 4: Assign Canonical ID
canonical_vendors = []
for group in groups:
canonical_id = str(uuid.uuid4())[:8]
for idx in group:
row = combined.loc[idx].to_dict()
row['CanonicalVendorID'] = canonical_id
canonical_vendors.append(row)
vendor_master = pd.DataFrame(canonical_vendors)
print("✅ Canonical Vendor Master Table:")
print(vendor_master)
Learnings
Fuzzy matching alone isn't enough. We also use email/domain, phone, or address to boost accuracy.
Threshold tuning is key. For vendor names, we found 85+ to be safe with manual review.
Assigning UUIDs or internal Canonical IDs ensures downstream traceability.
Business Impact
Reduced supplier duplicates by 67%
Enabled better spend visibility and performance scoring
Improved onboarding automation with a single vendor truth