The Data Is Lying to You — And It Looks Perfect

How to deal with synthetic data, how to handle AI-generated answers, and how to reduce fraud in a world where the fakest inputs are the cleanest ones.

There's a specific kind of dread that hits a risk analyst when they open a freshly submitted dataset and everything lines up too well. No typos. No missing fields. No weird edge cases where someone said they were 112 years old or typed their phone number into the email box. Every response consistent, every timestamp rational, every demographic variable cross-validated against every other. That feeling — that quiet, creeping suspicion — used to be paranoia. Now it's due diligence.

We built a system that rewards perfect inputs. Fraudsters noticed.

For years, the fraud problem was a problem of obvious data. Stolen credentials. Mismatched SSNs. IDs with Photoshop artifacts around the edges. The defenses we built were calibrated against messy, visible deception: systems designed to catch people who were trying hard but not quite hard enough. That era is functionally over.

What's replaced it is something the industry is still scrambling to name properly. Call it synthetic contamination. Call it AI-laundered inputs. The core of it is this: the tools that let legitimate organizations clean, enrich, and generate data are the exact same tools being used to fabricate it. And the outputs, increasingly, are indistinguishable from the real thing to systems that were never designed to ask whether a real human produced them.

The numbers are not subtle. Financial services saw fraud losses reach $12.5 billion in 2024, up 25% over the prior year, with synthetic identity schemes driving a growing share of that figure. Lenders carried over $3 billion in direct exposure from accounts opened with fabricated identities. In digital onboarding specifically, more than 8% of attempts were flagged as potentially fraudulent in the first half of 2025. One in five first-party fraud cases detected last year involved a synthetic identity. These aren't rounding errors. This is structural.

What synthetic fraud actually looks like in practice

Let's be specific, because the word "synthetic" gets used in ways that obscure more than they reveal. A synthetic identity isn't a purely fictional character. The most effective ones are composites: real Social Security numbers (often belonging to children, the recently deceased, or people with thin credit files) combined with fabricated names, addresses, and phone numbers that have just enough digital history to pass a surface check. These personas get built slowly. They establish credit, make small payments, cultivate behavioral consistency. Then, when the credit limit is high enough or the account access is valuable enough, they disappear in what the industry calls a "bust-out," the single most common fraud type by case volume, accounting for more than a fifth of all fraud incidents.

AI-generated documents feed the same pipeline. Deepfake video injected directly into verification streams to defeat liveness checks. Synthetic faces with statistically plausible features that defeat template-matching. Identity documents that look like they were issued by a real government because they were generated by a model trained on thousands of real government documents.

And then there's the survey and research layer, where synthetic fraud is less dramatic but arguably more corrosive. Market research firms, public health organizations, academic institutions, and NGOs are all dealing with datasets where a meaningful percentage of responses were generated by LLMs, not human respondents. The responses aren't messy. They're internally consistent, demographically plausible, and empty of the small contradictions that real humans can't help but introduce. A dataset full of AI-generated survey responses will sail through most standard quality checks precisely because it was produced by something optimized to sound credible.

This is the central irony of the moment: completeness is no longer a signal of quality. It might be the opposite.

The three failure modes we keep repeating

Before getting into what actually works, it's worth naming why so many defenses fail.

First failure: Static rules in a dynamic attack environment. Rule-based fraud systems are fundamentally backward-looking. They catch the last attack well. Fraudsters who've spent any time studying detection patterns (and the good ones always have) simply adjust their inputs to avoid known tripwires. A rule that flags applications with mismatched area codes worked fine until fraudsters started using VoIP numbers consistent with their synthetic billing addresses. The rule stays; the fraud moves.

Second failure: Trusting data that looks clean. This is the one that's hardest to rewire culturally. We've spent decades building systems that reward clean data, systems that treat completeness and formatting consistency as proxies for legitimacy. That assumption was always imperfect. In a world where LLMs can produce perfectly formatted, internally coherent datasets on demand, it's actively dangerous. The most dangerous data right now isn't the messy data. It's the data that's too good.

Third failure: Fighting alone. Fraud operations have become coordinated across institutions. A synthetic identity persona might establish accounts at five or six financial institutions before executing the bust-out. Each institution, looking at its own signals in isolation, sees a customer who has been well-behaved for months. The picture only makes sense when you look across the network. Fraudsters collaborate. Defenders, for too long, haven't.
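
The third failure is easy to see in miniature. What follows is a minimal sketch, not a description of any real consortium's scheme: it assumes a hypothetical shared feed of account records, a hypothetical shared salt for hashing SSNs, and an illustrative threshold of five institutions. Every name in it is ours.

```python
import hashlib
from collections import defaultdict

# Hypothetical shared salt. Salted hashing alone is weak protection for SSNs
# (the number space is small), so treat this as illustration only.
SHARED_SALT = b"example-salt"


def identity_key(ssn: str) -> str:
    """Return a salted hash so records can be matched without passing raw SSNs around."""
    return hashlib.sha256(SHARED_SALT + ssn.encode("utf-8")).hexdigest()


def institutions_per_identity(records):
    """Map each identity key to the set of institutions holding an account for it."""
    seen = defaultdict(set)
    for rec in records:
        seen[identity_key(rec["ssn"])].add(rec["institution"])
    return seen


def flag_wide_footprints(records, min_institutions=5):
    """Return identity keys that appear at min_institutions or more distinct institutions."""
    footprint = institutions_per_identity(records)
    return {key for key, insts in footprint.items() if len(insts) >= min_institutions}


# Illustrative data: one identity spread across five lenders, one ordinary customer.
sample_records = [
    {"ssn": "000-00-0001", "institution": name}
    for name in ("bank_a", "bank_b", "bank_c", "bank_d", "bank_e")
] + [{"ssn": "000-00-0002", "institution": "bank_a"}]

flagged = flag_wide_footprints(sample_records)
```

Each lender looking only at its own row sees nothing unusual. The flag only appears once the records are pooled, which is the whole point.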

What Aether thinks actually moves the needle

We've been thinking about this for a while. Here's where we've landed.

Treat behavioral continuity as the primary signal, not the secondary one. A document can be faked. A biometric template can be spoofed. What's significantly harder to fake (not impossible, but harder) is consistent, contextually coherent behavior over time across multiple interaction surfaces. Keystroke cadence. Scroll rhythm. Session timing patterns. The micro-decisions someone makes navigating a form. These aren't individually conclusive, but they build a profile that's expensive to replicate and nearly impossible to maintain consistently across a large-scale synthetic fraud operation. The leading detection platforms have known this for a while. More organizations need to operationalize it.

Stop evaluating data points in isolation; evaluate relationships. The most revealing thing about AI-generated data isn't any single field. It's the relationships between fields. Age and education levels that correlate too cleanly with income brackets. Geographic data that never reflects the small inconsistencies of real mobility. Survey response patterns where the variance within a respondent's answers follows statistical distributions that real humans don't produce. Cross-variable validation isn't new methodology. What's new is the urgency. Manual spot-checks aren't sufficient anymore; this needs to be systematic.

Invest in metadata. When a form is filled out in 90 seconds with no cursor hesitation, no field corrections, and a device fingerprint that doesn't match the stated location, that's a signal. When every respondent in a survey cohort submitted within a narrow timestamp window from similar device environments, that's a signal. The content of a submission and the metadata surrounding its production are increasingly divergent when fraud is involved. Building detection logic around metadata isn't glamorous, but it's where a lot of the actionable signal lives right now.

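
To make the metadata point concrete, here is a rough sketch of the two signals just described: a fast, correction-free submission with a device and location mismatch, and a burst of submissions from one device environment. The dictionary keys, the 90-second cutoff, and the five-minute window are assumptions chosen for illustration, not tuned values.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def submission_flags(sub, fast_seconds=90):
    """Return metadata red flags for one submission (a dict with hypothetical keys)."""
    flags = []
    # Fast completion with zero corrections: real people usually hesitate or retype.
    if sub["duration_seconds"] <= fast_seconds and sub["field_corrections"] == 0:
        flags.append("fast_and_flawless")
    # Device fingerprint region disagrees with the location the submitter stated.
    if sub["device_region"] != sub["stated_region"]:
        flags.append("device_location_mismatch")
    return flags


def max_in_window(timestamps, window):
    """Largest number of timestamps that fall inside any span of length `window`."""
    ordered = sorted(timestamps)
    best, start = 0, 0
    for end, ts in enumerate(ordered):
        # Slide the left edge forward until the span fits inside the window.
        while ts - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def burst_by_device_env(submissions, window=timedelta(minutes=5)):
    """For each device environment, the peak count of submissions within `window`."""
    by_env = defaultdict(list)
    for sub in submissions:
        by_env[sub["device_env"]].append(sub["submitted_at"])
    return {env: max_in_window(times, window) for env, times in by_env.items()}


# Illustrative cohort: four submissions within minutes of each other, one an hour later.
base = datetime(2025, 1, 1, 12, 0)
sample_times = [base + timedelta(minutes=m) for m in (0, 1, 2, 3, 60)]
sample_cohort = [{"device_env": "env_a", "submitted_at": t} for t in sample_times]
peaks = burst_by_device_env(sample_cohort)
```

None of these flags is conclusive on its own. They are inputs to a score or a review queue, not verdicts.
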
Tiered friction, not uniform friction. One of the legitimate criticisms of fraud prevention as it's currently practiced is that it creates friction for everyone in order to stop a small percentage of bad actors. The answer isn't less friction; it's smarter deployment of friction. Low-risk interactions, well-established behavioral profiles, consistent device and location signals: these should flow. High-risk signals should trigger additional verification that's proportionate to the actual risk level. Dynamic risk scoring that adjusts in real time is the infrastructure requirement here. Organizations still running static risk tiers are leaving both security and customer experience on the table.

Rethink what "verified" means for AI-assisted workflows. This is the uncomfortable one. As AI tools become standard in every professional context, the question of whether a human "produced" something becomes philosophically messy. An analyst who used an LLM to draft a response, and then reviewed, edited, and approved it, produced something different from a fully automated submission. The distinction matters for research integrity, for compliance, for legal defensibility. The industry needs better frameworks for what "human-in-the-loop" actually requires, what the minimum attestation standards are, and how to document the chain of production. We don't have those standards yet. Building them is overdue.
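
Returning to the tiered-friction recommendation, the basic shape can be sketched in a few lines. This is an illustration under stated assumptions, not a production scoring model: the signal names, the point weights, and the tier cutoffs are all hypothetical, and a real system would learn or calibrate them rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical per-interaction signals; recomputed on every interaction."""
    known_device: bool
    location_consistent: bool
    behavior_match: float  # 0.0 = no match with the established profile, 1.0 = strong match
    account_age_days: int


def risk_score(s: Signals) -> int:
    """Return an integer risk score from 0 (low) to 100 (high). Weights are illustrative."""
    score = 0
    if not s.known_device:
        score += 30
    if not s.location_consistent:
        score += 30
    # Weaker behavioral continuity adds up to 30 points.
    score += round(30 * (1.0 - s.behavior_match))
    if s.account_age_days < 30:
        score += 10
    return min(score, 100)


def friction_tier(score: int) -> str:
    """Map a score to a proportionate response instead of uniform friction."""
    if score < 30:
        return "allow"          # established, consistent profile: let it flow
    if score < 60:
        return "step_up"        # e.g. an extra verification step
    return "manual_review"      # high risk: route to an analyst


established = Signals(known_device=True, location_consistent=True,
                      behavior_match=1.0, account_age_days=400)
suspicious = Signals(known_device=False, location_consistent=False,
                     behavior_match=0.0, account_age_days=5)
```

The design choice that matters is that the score is recomputed per interaction, so a customer who flowed freely yesterday can still be stepped up today if the signals change.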

On the arms race framing — and why it's partly wrong

The narrative around AI fraud detection tends to default to arms race language: fraudsters get better tools, defenders get better tools, repeat forever. There's truth in that. But it frames the problem as purely technical, and it misses something important.

The organizations that are doing this best aren't just running better models. They're building institutional knowledge about how fraud actually operates, investing in analyst capacity to interpret model outputs rather than just trusting them, and developing the organizational reflexes to update defenses quickly when attack patterns shift. Technology is necessary but not sufficient. The human judgment layer, the analyst who looks at a dataset and feels that quiet nagging doubt, still matters. The goal isn't to automate that away. It's to give that analyst better instruments.

There's also a collaboration dimension that the arms race framing misses. Information sharing between institutions, between sectors, between public and private actors: this is where significant leverage lives. Fraudsters don't respect organizational boundaries. The networks they operate across are more connected than the defenses trying to stop them. Addressing that asymmetry requires coordination that individual institutions can't achieve alone.

Where this leaves us

The short version: the ambient level of synthetic contamination in data across almost every industry is higher than most organizations currently assume. The tools to produce convincing synthetic inputs are widely available, improving rapidly, and increasingly accessible to non-technical actors. The defenses that were calibrated for a previous generation of fraud (messy, obvious, individually identifiable) are structurally mismatched to what's actually hitting systems right now. That's the honest assessment of the situation.

It's not a reason for fatalism. The same sophistication that's enabling better fraud is enabling better detection. Behavioral signals, cross-variable validation, metadata analysis, and real-time risk scoring are all genuinely powerful when implemented well. The organizations that are building that infrastructure now are building durable competitive advantage, not just in security, but in the trust that underlies every downstream business relationship.

But getting there requires being honest about a thing that's uncomfortable to say out loud: a meaningful portion of the data your systems currently treat as clean probably isn't. Not all of it is fraud. Some of it is negligence, some is AI-assisted corner-cutting that doesn't rise to the level of intentional deception. But the category of "inputs that look legitimate and aren't" is large, growing, and not going to self-correct. The first step is admitting that the problem is here. Most organizations are still working on that part.

Aether Opinion covers risk, data intelligence, and the operational realities of fraud in a synthetic-data world. Views expressed represent our current thinking and are subject to revision as the landscape continues to move.