Entity Resolution

Entity resolution is the process of identifying and merging records that refer to the same real-world entity across multiple datasets using matching algorithms.

CDP.com Staff

Entity resolution is the process of identifying, matching, and merging records from multiple data sources that refer to the same real-world entity — whether a person, company, product, or location. Also known as record linkage, deduplication, or entity matching, it solves the fundamental problem that the same entity appears differently across systems: “John Smith” in the CRM, “J. Smith” in the billing system, and “johnsmith@email.com” in the marketing platform may all be the same person. Entity resolution determines which records belong together and consolidates them into a single, unified representation.

In the context of customer data, entity resolution is the technical foundation that enables customer data unification and the construction of a customer 360 profile. Without reliable entity resolution, organizations operate with fragmented, duplicated records that inflate audience counts, produce contradictory analytics, and deliver inconsistent customer experiences.

How Entity Resolution Works

Entity resolution typically follows a multi-stage pipeline:

1. Data Standardization

Before records can be compared, they must be normalized. This includes:

  • Format standardization: Converting phone numbers to a consistent format (e.g., +1-555-123-4567), standardizing addresses (abbreviations, postal codes), normalizing name casing
  • Data cleansing: Removing invalid entries, correcting typos, and filling missing values where possible
  • Tokenization: Breaking fields into comparable components (first name, last name, street number, city)
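As a rough illustration of the standardization steps above, the following Python sketch normalizes phone numbers, name casing, and name tokens. The function names are hypothetical, and the phone logic assumes US-style 10-digit numbers with a single-digit country code; production systems typically rely on dedicated parsing libraries.

```python
import re
from typing import Dict, Optional


def normalize_phone(raw: str, country_code: str = "1") -> Optional[str]:
    """Return a phone number as +C-AAA-BBB-DDDD, or None if it cannot be parsed.

    Assumption: US-style numbers (10 digits, optional leading country code).
    """
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 10:
        digits = country_code + digits
    if len(digits) != 11 or not digits.startswith(country_code):
        return None
    return f"+{digits[0]}-{digits[1:4]}-{digits[4:7]}-{digits[7:]}"


def normalize_name(raw: str) -> str:
    """Collapse repeated whitespace and apply consistent casing."""
    return " ".join(raw.split()).title()


def tokenize_name(raw: str) -> Dict[str, str]:
    """Split a full name into first and last components (middle parts ignored)."""
    parts = normalize_name(raw).split()
    return {
        "first": parts[0] if parts else "",
        "last": parts[-1] if len(parts) > 1 else "",
    }


print(normalize_phone("(555) 123-4567"))
print(tokenize_name("  john   SMITH "))
```

Returning None for unparseable phone numbers, rather than guessing, keeps invalid values from being treated as matching identifiers later in the pipeline.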

2. Blocking

Comparing every record against every other record is computationally prohibitive at scale: n records require n(n-1)/2 pairwise comparisons, so a database with 10 million records would need roughly 50 trillion. Blocking reduces this by grouping records into candidate sets (blocks) based on shared attributes, such as the same zip code, the same first three letters of the last name, or the same email domain. Only records within the same block are compared, reducing computation by orders of magnitude while accepting a small risk of missing matches across blocks.
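The arithmetic and the blocking idea can be sketched in a few lines of Python. The blocking key used here (zip code plus the first three letters of the last name) and the sample records are illustrative assumptions, not a recommended configuration.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def blocking_key(record: dict) -> tuple:
    """Hypothetical key: zip code plus first three letters of the last name."""
    return (record["zip"], record["last_name"][:3].lower())


def candidate_pairs(records):
    """Yield only the ID pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


records = [
    {"id": 1, "last_name": "Smith", "zip": "10001"},
    {"id": 2, "last_name": "Smyth", "zip": "10001"},
    {"id": 3, "last_name": "Smith", "zip": "10001"},
    {"id": 4, "last_name": "Jones", "zip": "94105"},
]

print(pair_count(10_000_000))          # full comparison count for 10 million records
print(pair_count(len(records)))        # pairs without blocking
print(list(candidate_pairs(records)))  # pairs that survive blocking
```

In this toy data, records 1 and 2 ("Smith" and "Smyth") land in different blocks and are never compared, which is the kind of missed match that blocking trades for speed. Running several blocking passes with different keys is one common way to reduce that risk.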

3. Pairwise Comparison

Within each block, records are compared across multiple fields using similarity functions:

  • Exact match: Email addresses, phone numbers, government IDs
  • String similarity: Jaro-Winkler, Levenshtein distance, or cosine similarity for names and addresses that may contain typos or variations
  • Phonetic encoding: Soundex or Metaphone algorithms that match names with similar pronunciation (“Smith” and “Smyth”)
  • Temporal proximity: Matching records created around the same time or associated with the same transaction date
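Two of the similarity functions listed above can be sketched in plain Python: Levenshtein edit distance (normalized to a 0-to-1 similarity) and American Soundex. These are minimal teaching implementations; the helper names are illustrative, and a production system would normally use a tested library instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance rescaled so that 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate repeated codes
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (encoded + "000")[:4]


print(levenshtein("smith", "smyth"), string_similarity("smith", "smyth"))
print(soundex("Smith"), soundex("Smyth"))
```

"Smith" and "Smyth" differ by one edit and share the same Soundex code, so either function would flag them as a likely match where an exact comparison would not.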

4. Match Decision

Each pairwise comparison produces a similarity score. The system must then decide whether two records represent the same entity. This is where deterministic and probabilistic approaches diverge.

Deterministic vs. Probabilistic Matching

Deterministic matching applies strict rules: two records match if and only if specific fields are identical (e.g., exact email match OR exact phone + last name match). It is precise and transparent — every match can be explained by a specific rule. However, it misses valid matches when data is inconsistent, misspelled, or incomplete.
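The example rule set mentioned above (exact email, or exact phone plus last name) might look like the following Python sketch. The field names and sample records are assumptions for illustration; the function returns the name of the rule that fired so that each match remains explainable.

```python
from typing import Optional


def deterministic_match(a: dict, b: dict) -> Optional[str]:
    """Return the name of the first rule that fires, or None if no rule matches."""
    # Rule 1: identical, non-empty email addresses.
    if a.get("email") and a.get("email") == b.get("email"):
        return "exact_email"
    # Rule 2: identical, non-empty phone number AND identical last name.
    if (a.get("phone") and a.get("phone") == b.get("phone")
            and a.get("last_name") and a.get("last_name") == b.get("last_name")):
        return "exact_phone_and_last_name"
    return None


crm = {"email": "john.smith@example.com", "phone": "+1-555-123-4567", "last_name": "Smith"}
billing = {"email": None, "phone": "+1-555-123-4567", "last_name": "Smith"}
typo = {"email": "jon.smith@example.com", "phone": None, "last_name": "Smith"}

print(deterministic_match(crm, billing))  # phone + last name rule
print(deterministic_match(crm, typo))     # one-letter email typo: no rule fires
```

The second call shows the weakness described above: a single-character typo in the email address is enough to prevent a match.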

Probabilistic matching uses statistical models to calculate the likelihood that two records refer to the same entity based on the weighted similarity of multiple fields. A high similarity score across name, address, and purchase history may produce a match even if no single field is an exact match. Probabilistic methods catch more true matches but introduce the risk of false positives (merging records that are actually different entities).
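One classic way to turn field comparisons into a likelihood is Fellegi-Sunter style scoring, sketched below in Python. Each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability among non-matches. The m and u values, the 0.8 agreement cutoff, and the use of difflib for fuzzy agreement are all assumptions chosen for illustration; real systems estimate these parameters from data.

```python
import math
from difflib import SequenceMatcher

# Hypothetical (m, u) probabilities per field.
FIELD_PARAMS = {
    "name": (0.90, 0.01),
    "address": (0.85, 0.02),
    "city": (0.95, 0.20),
}
AGREEMENT_CUTOFF = 0.8  # assumed similarity needed to count as "agreeing"


def fields_agree(x: str, y: str) -> bool:
    """Treat two values as agreeing if they are sufficiently similar."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio() >= AGREEMENT_CUTOFF


def match_weight(a: dict, b: dict) -> float:
    """Sum of log-likelihood ratios; higher means more likely the same entity."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if not a.get(field) or not b.get(field):
            continue  # a missing value provides no evidence either way
        if fields_agree(a[field], b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


rec_a = {"name": "Jon Smith", "address": "12 Main St", "city": "Newyork"}
rec_b = {"name": "John Smith", "address": "12 Main Street", "city": "New York"}
rec_c = {"name": "Maria Garcia", "address": "98 Oak Avenue", "city": "Chicago"}

print(match_weight(rec_a, rec_b))  # positive: no field is identical, yet all agree
print(match_weight(rec_a, rec_c))  # negative: every field disagrees
```

A threshold on the total weight then decides the match. Setting it lower catches more true matches at the cost of more false positives.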

Modern entity resolution systems often combine both approaches. Deterministic rules handle high-confidence matches (same email or loyalty ID), while probabilistic models resolve ambiguous cases where no single identifier is shared. Machine learning models, particularly gradient-boosted trees and neural networks, have increasingly replaced hand-tuned probabilistic weights by learning match criteria from labeled training data.
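A combined decision function might be structured as in the Python sketch below: deterministic identifiers are checked first, and a scoring function is consulted only when none is shared. The identifier keys, the threshold, and the toy scorer are hypothetical; in practice the scorer could be a probabilistic model or a trained classifier.

```python
from typing import Callable, Optional, Tuple


def shared_identifier(a: dict, b: dict, keys=("email", "loyalty_id")) -> Optional[str]:
    """Return the first strong identifier that both records share, if any."""
    for key in keys:
        if a.get(key) and a.get(key) == b.get(key):
            return key
    return None


def decide(a: dict, b: dict,
           score_fn: Callable[[dict, dict], float],
           threshold: float = 0.85) -> Tuple[str, str]:
    """Return (decision, reason), trying deterministic rules before the scorer."""
    key = shared_identifier(a, b)
    if key is not None:
        return ("match", f"deterministic:{key}")
    score = score_fn(a, b)
    decision = "match" if score >= threshold else "no_match"
    return (decision, f"probabilistic:{score:.2f}")


def toy_scorer(a: dict, b: dict) -> float:
    """Stand-in scorer: fraction of name/zip fields that are identical."""
    fields = ("name", "zip")
    return sum(a.get(f) == b.get(f) for f in fields) / len(fields)


r1 = {"email": "john@example.com", "name": "John Smith", "zip": "10001"}
r2 = {"email": "john@example.com", "name": "J. Smith", "zip": "10001"}
r3 = {"email": None, "name": "John Smith", "zip": "10001"}

print(decide(r1, r2, toy_scorer))
print(decide(r1, r3, toy_scorer))
```

Recording the reason alongside the decision preserves the explainability of deterministic matches while still letting the scorer handle records that share no identifier.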

Entity Resolution in CDPs

Customer data platforms rely on entity resolution as a core capability, though it is often called identity resolution when applied specifically to customer records. The distinction is subtle: identity resolution is entity resolution applied to people, focusing on linking anonymous and known identifiers (cookies, device IDs, email addresses, CRM IDs) into unified customer profiles that form the basis of an identity graph.

Within a CDP, entity resolution operates continuously as new data arrives:

  • Streaming resolution: When a website visitor provides their email address at checkout, the CDP matches this known identifier to the visitor’s anonymous browsing history, merging the records in real time
  • Batch resolution: Overnight jobs process bulk data imports (CRM exports, point-of-sale files) and resolve entities against the existing profile store
  • Cross-device matching: Linking activity across a customer’s phone, laptop, tablet, and in-store interactions to a single profile
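The way identifiers accumulate into a single profile can be illustrated with a union-find (disjoint-set) structure, sketched below in Python. This is one possible data structure for a toy identity graph, not a description of how any particular CDP stores profiles; the class name and identifier strings are made up for the example.

```python
class IdentityGraph:
    """Toy identity graph: identifiers linked together share one profile."""

    def __init__(self):
        self._parent = {}

    def _find(self, x: str) -> str:
        """Return the representative identifier of x's profile."""
        self._parent.setdefault(x, x)
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[x] != root:  # path compression
            self._parent[x], x = root, self._parent[x]
        return root

    def link(self, a: str, b: str) -> None:
        """Record that identifiers a and b belong to the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_profile(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)


graph = IdentityGraph()
# Anonymous browsing: a cookie is seen on a phone.
graph.link("cookie:abc", "device:phone-1")
# Streaming event: the visitor enters an email address at checkout.
graph.link("cookie:abc", "email:john@example.com")
# Batch import: the CRM export ties that email to a CRM ID.
graph.link("email:john@example.com", "crm:1001")

print(graph.same_profile("device:phone-1", "crm:1001"))
print(graph.same_profile("cookie:zzz", "crm:1001"))
```

Each new link merges two groups of identifiers, so the phone and the CRM ID end up in the same profile even though no single event mentioned both. Note that plain union-find cannot undo an incorrect merge, which is one reason real systems keep the underlying link evidence.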

The quality of entity resolution directly determines the accuracy of everything downstream: audience segmentation, personalization, analytics, and AI model training. Poor resolution leads to duplicate profiles (inflated audience counts, redundant messages) or incorrect merges (combining two different customers into one profile, corrupting both).
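Both failure modes can be measured when a labeled sample is available. The Python sketch below computes pairwise precision and recall, a common way to score a resolution run: missed links (duplicate profiles) lower recall, and incorrect merges lower precision. The record IDs and cluster labels are invented for the example, and it assumes a ground-truth assignment exists.

```python
from itertools import combinations


def co_clustered_pairs(assignment: dict) -> set:
    """All unordered record pairs that share a cluster ID."""
    clusters = {}
    for record_id, cluster_id in assignment.items():
        clusters.setdefault(cluster_id, []).append(record_id)
    return {frozenset(pair)
            for members in clusters.values()
            for pair in combinations(members, 2)}


def pairwise_precision_recall(predicted: dict, truth: dict) -> tuple:
    """Precision and recall over record pairs placed in the same profile."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Ground truth: r1-r3 are one customer, r4-r5 another.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
# Prediction: r3 was split from its customer and wrongly merged with r4 and r5.
predicted = {"r1": "p1", "r2": "p1", "r3": "p2", "r4": "p2", "r5": "p2"}

print(pairwise_precision_recall(predicted, truth))
```

In this example half of the predicted pairs are wrong and half of the true pairs are missed, so both metrics come out at 0.5.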

Entity Resolution Beyond Customer Data

While customer identity resolution is the most common use case in marketing technology, entity resolution has broad applications:

  • B2B account matching: Linking company records across CRM, intent data providers, and firmographic databases — essential for B2B CDPs
  • Product catalog reconciliation: Matching the same product listed under different names or SKUs across marketplace platforms
  • Healthcare: Matching patient records across hospitals, insurers, and pharmacies to create longitudinal health records
  • Financial services: Linking accounts, transactions, and entities across institutions for fraud detection and regulatory compliance
  • Government: Connecting citizen records across tax, social services, voter registration, and law enforcement databases

Each domain has unique challenges. B2B entity resolution must handle company hierarchies (subsidiaries, divisions, acquired entities), while healthcare resolution must operate under strict data governance constraints (HIPAA in the US, GDPR in Europe).

Challenges in Entity Resolution

Scale: Enterprise datasets can contain hundreds of millions of records. Efficient blocking strategies and distributed computing are essential to keep resolution tractable. Organizations increasingly rely on real-time data processing to resolve entities as data arrives, at scale and without adding noticeable latency.

Data quality: Garbage in, garbage out. If source data contains widespread errors, abbreviations, or missing fields, even sophisticated matching algorithms will struggle. Investing in upstream data integration and cleansing pays dividends in resolution accuracy.

Privacy regulations: Entity resolution inherently involves linking data about individuals across sources, which raises privacy concerns. GDPR, CCPA, and other regulations require organizations to maintain clear consent trails and provide individuals the right to access, correct, or delete their unified records.

Evolving identities: People change names, addresses, phone numbers, and email addresses over time. Entity resolution must handle these temporal changes without fragmenting or incorrectly merging profiles.

FAQ

What is the difference between entity resolution and identity resolution?

Entity resolution is the general computer science discipline of matching and merging records that refer to the same real-world entity across datasets — applicable to people, companies, products, locations, or any other entity type. Identity resolution is entity resolution applied specifically to people and customer records, typically within marketing and customer data platforms. Identity resolution often emphasizes linking anonymous digital identifiers (cookies, device IDs) with known identifiers (email, phone, CRM ID) to build unified customer profiles, a use case that general entity resolution frameworks may not prioritize.

How do deterministic and probabilistic matching differ in entity resolution?

Deterministic matching uses strict, rule-based criteria — two records match only if specified fields are exactly identical (e.g., same email address or same phone number plus last name). It is precise and explainable but misses matches when data contains typos, format variations, or incomplete fields. Probabilistic matching calculates a statistical likelihood that two records refer to the same entity by weighing similarity scores across multiple fields. It catches more true matches but risks false positives. Most production systems combine both: deterministic rules for high-confidence matches and probabilistic models for ambiguous cases.

How do CDPs use entity resolution to build customer profiles?

CDPs run entity resolution continuously as data arrives from websites, mobile apps, CRMs, point-of-sale systems, and other sources. When a new record enters the system, the CDP compares it against existing profiles using deterministic rules (exact email or phone match) and probabilistic models (name plus address similarity). Matched records are merged into a unified profile that aggregates all known attributes and interaction history. This process enables accurate audience segmentation, personalized experiences, and reliable analytics by ensuring each real customer is represented by a single profile rather than scattered across duplicate records.

Related Terms

  • Golden Record: The single, authoritative profile that entity resolution produces by merging matched records
  • Data Validation: Quality checks applied to source data before and after entity resolution to ensure accuracy
  • Single Customer View (SCV): The unified profile that downstream systems consume, built on resolved entities
  • Data Enrichment: The process of appending additional attributes to resolved entity profiles from external sources