Entity Resolution Across Intelligence Data Sources: Matching Names, Aliases, and Identities at Scale
R. Tanaka

Entity resolution is one of those problems that sounds boring until you're staring at a database where "Mohammed Al-Rashid," "M. Alrashid," "محمد الراشد," and "Muhamed al-Rasheed" are four separate records — and they're all the same person. At scale, across dozens of data sources, this isn't a minor inconvenience. It's the difference between a complete intelligence picture and a dangerously fragmented one.
Traditional approaches lean on deterministic matching: if the name, date of birth, and passport number align, it's a match. Clean, auditable, fast. Also completely inadequate when adversaries deliberately vary spellings, use aliases across platforms, or operate through front entities. The moment data gets noisy — and in intelligence work, data is always noisy — deterministic rules collapse.
What ML Actually Brings to the Problem
Probabilistic matching has existed for decades. What's changed is the depth of signal modern models can exploit.
A well-trained entity resolution pipeline doesn't just compare strings. It encodes contextual features: co-occurrence patterns (who else appears in the same documents?), behavioral fingerprints (communication timing, platform habits, linguistic signatures), geographic trajectories, and network position. Two records that share no name overlap can still resolve to the same entity if their behavioral profiles are statistically indistinguishable.
The core pipeline typically runs three stages:
```mermaid
graph TD
    A[/Raw Records — Multi-Source/] --> B(Blocking & Candidate Generation)
    B --> C{Feature Engineering}
    C --> D[Similarity Scoring Model]
    D --> E{Threshold Decision}
    E --> F[Confirmed Match — Merge]
    E --> G[Rejected — Keep Separate]
    F --> H((Unified Entity Graph))
```
Blocking is where most production systems either succeed or die. You cannot run pairwise comparisons across millions of records: n records means roughly n²/2 candidate pairs, so 10 million records is already on the order of 5 × 10¹³ comparisons. Blocking reduces the candidate space by grouping records that share at least one plausible feature, such as a phonetic name variant, a geographic tag, or a device identifier, and comparing only within those groups. Get blocking wrong and you either miss real matches or flood the scoring model with noise.
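Here's a minimal blocking sketch, assuming each record is a dict with hypothetical "id", "name", and "country" fields. The keys are deliberately crude; production systems use real phonetic encodings, device identifiers, and several redundant blocking passes.

```python
# A toy blocking pass: emit coarse keys per record, bucket records by key,
# and only generate candidate pairs within buckets. Field names and the
# consonant-skeleton key are illustrative, not a production recipe.
import itertools
import re
from collections import defaultdict

def blocking_keys(record):
    """Emit coarse keys; a record lands in every block it shares a key with."""
    keys = set()
    for token in re.findall(r"[a-z]+", record["name"].lower()):
        # Crude phonetic-ish key: first letter plus consonant skeleton.
        skeleton = token[0] + re.sub(r"[aeiou]", "", token[1:])
        keys.add(("name", skeleton))
    if record.get("country"):
        keys.add(("geo", record["country"].lower()))
    return keys

def candidate_pairs(records):
    """Group records by blocking key, then pair only within blocks."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        for a, b in itertools.combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

records = [
    {"id": 1, "name": "Mohammed Al-Rashid", "country": "JO"},
    {"id": 2, "name": "M. Alrashid", "country": "JO"},
    {"id": 3, "name": "Muhamed al-Rasheed", "country": None},
]
print(candidate_pairs(records))
```

Even in this toy run, records 1 and 2 pair through shared geography and records 1 and 3 through the shared consonant skeleton of Rashid/Rasheed, but the 2 and 3 pair is never generated. That is exactly the kind of miss a single naive blocking key produces, and why real systems run multiple redundant passes.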
Feature engineering for intelligence data is genuinely hard. Transliteration inconsistency alone — Arabic, Persian, and Pashto names rendered into Latin characters by different agencies, different countries, different decades — can produce dozens of legitimate variants for a single name. Models trained on standard romanization schemes fail on field-transliterated data. You need training corpora that reflect actual IC data messiness, not cleaned academic datasets.
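As a small illustration of the pairwise name features that feed the scorer, here's a standard-library-only sketch. The feature names are placeholders; a real pipeline layers transliteration-aware comparisons, behavioral signals, and learned embeddings on top.

```python
# Pairwise name features for one candidate pair, standard library only.
from difflib import SequenceMatcher

def name_features(a: str, b: str) -> dict:
    a_l, b_l = a.lower(), b.lower()
    a_tokens, b_tokens = set(a_l.split()), set(b_l.split())
    return {
        # Character-level edit similarity.
        "char_ratio": SequenceMatcher(None, a_l, b_l).ratio(),
        # Whole-token overlap; weak on transliteration variants by design.
        "token_jaccard": len(a_tokens & b_tokens) / len(a_tokens | b_tokens),
        "initials_match": a_l[0] == b_l[0],
        "length_gap": abs(len(a_l) - len(b_l)),
    }

print(name_features("Mohammed Al-Rashid", "Muhamed al-Rasheed"))
```

Note how the token overlap collapses to zero on this pair even though the character-level similarity is high; that gap between features is precisely what the model has to learn from messy training data.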
Scoring is where transformer-based models have made the biggest difference. Fine-tuned cross-encoders, given a pair of entity representations, can assess similarity across name, context, and behavioral features simultaneously — outperforming older feature-concatenation approaches by meaningful margins on recall for aliases and transliteration variants.
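A scoring sketch using the sentence-transformers CrossEncoder class is below. The checkpoint shown is a generic public similarity model standing in for the cross-encoder you would actually fine-tune on labeled match/non-match pairs from your own data; the record serialization format is also an assumption.

```python
# Cross-encoder scoring sketch: serialize each record into one text span
# so the model can attend across name, context, and behavioral fields jointly.
from sentence_transformers import CrossEncoder

# Stand-in checkpoint; in practice, fine-tune your own on labeled entity pairs.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

left = "name: Mohammed Al-Rashid | country: JO | platform: telegram | active_hours: 18-23"
right = "name: M. Alrashid | country: JO | platform: telegram | active_hours: 19-23"

# Higher score = more likely the same entity (after fine-tuning on match labels).
score = model.predict([(left, right)])[0]
print(score)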
The Alias Problem Deserves Its Own Section
Alias detection is not the same as transliteration matching. An alias is a deliberate identity — a separate name used for operational security, a nom de guerre, a pseudonym on a platform. The entity behind it may share zero string similarity with their true identity.
This is where behavioral and network signals matter more than any name-matching heuristic. If two accounts follow the same 40 obscure users, post at the same hours, use the same rare vocabulary, and go silent simultaneously — that's signal. Weighted graph similarity across co-occurrence networks, combined with embedding-space proximity from account text, can surface alias relationships that no name-matching system would ever find.
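A minimal sketch of that idea, assuming each account is a dict with hypothetical "follows" and "active_hours" fields. Production systems replace the hand-set weights with learned ones and add text-embedding proximity and weighted graph similarity.

```python
# Behavioral similarity for alias detection: overlap in who an account
# follows plus overlap in when it is active. Weights are placeholders.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def behavioral_similarity(acct_a: dict, acct_b: dict) -> float:
    follow_sim = jaccard(set(acct_a["follows"]), set(acct_b["follows"]))
    hours_sim = jaccard(set(acct_a["active_hours"]), set(acct_b["active_hours"]))
    return 0.5 * follow_sim + 0.5 * hours_sim

a = {"follows": {"u1", "u2", "u3"}, "active_hours": {21, 22, 23}}
b = {"follows": {"u2", "u3", "u4"}, "active_hours": {22, 23, 0}}
print(behavioral_similarity(a, b))
```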
One approach worth knowing: train a Siamese network on known alias pairs drawn from previous investigations, then apply it to new candidate pairs. The model learns what "same person, different name" looks like in your specific data environment. It won't generalize perfectly to new adversary tradecraft, but it degrades more gracefully than rule-based systems.
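For concreteness, here is a PyTorch sketch of that Siamese setup with a contrastive loss. Input dimensions, layer sizes, and the margin are illustrative, and the random tensors stand in for real per-record feature vectors (name, behavioral, and network features).

```python
# Siamese encoder + contrastive loss trained on known alias pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim: int = 64, emb_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        # Unit-normalized embeddings so distances are comparable across pairs.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(emb_a, emb_b, label, margin: float = 0.5):
    """label = 1 for known alias pairs, 0 for known non-matches."""
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = label * dist.pow(2)                      # pull alias pairs together
    neg = (1 - label) * F.relu(margin - dist).pow(2)  # push non-matches apart
    return (pos + neg).mean()

encoder = SiameseEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One toy training step on random tensors standing in for real features.
x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(x_a), encoder(x_b), labels)
loss.backward()
opt.step()
print(float(loss))
```

At inference, you embed both records once and threshold on the distance, which keeps scoring cheap even when the candidate set is large.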
Where These Systems Break
Honesty matters here. Entity resolution models fail in predictable ways.
They over-merge when names are common — "Wang Wei" or "Ahmed Hassan" appear tens of thousands of times in any large dataset. Without strong non-name features, merging on name similarity alone creates false unifications that corrupt downstream analysis. Precision suffers.
They under-merge on sophisticated adversaries who vary behavior deliberately — changing communication patterns, rotating devices, compartmentalizing aliases across platforms. A well-disciplined operator can defeat behavioral fingerprinting if they know what signals you're collecting.
And they produce confidence scores that analysts tend to treat as more reliable than they are. A 0.87 match probability is not a confirmed identity. Building human review checkpoints into the pipeline — especially for high-consequence merge decisions — isn't optional. It's how you catch the systematic errors before they propagate into the entity graph and corrupt everything downstream.
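One way to make that checkpoint concrete is to route merge decisions by score band rather than using a single cutoff. The thresholds below are placeholders; in practice they're tuned against the cost of a wrong merge and the analyst review capacity you actually have.

```python
# Score-band routing: auto-merge only at very high confidence, send the
# ambiguous middle band to analyst review, and keep the rest separate.
def route(score: float, auto_merge: float = 0.97, review: float = 0.75) -> str:
    if score >= auto_merge:
        return "merge"           # still logged and auditable
    if score >= review:
        return "analyst_review"  # high-consequence calls go to a human
    return "keep_separate"

for s in (0.99, 0.87, 0.40):
    print(s, route(s))
```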
The goal is a system that narrows the problem space enough that skilled analysts can make the final call on ambiguous cases with real evidence in front of them. Machine intelligence doing the heavy lifting on volume; human judgment handling the edge cases that actually matter.