A new benchmark finds that AI-generated images make convincing refund fraud — current multimodal LLMs and specialized detectors routinely miss synthetic 'damaged' evidence and are inconsistent across generators, leaving e-commerce and service platforms exposed.

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

Xinyu Yan, Boyang Chen, Jiaming Zhang, Tiantong Wu, Hong Xi Tae, Yichen He, Tiantong Wang, Yachun Mi, Yurong Hao, Yilei Zhao, Lei Xiao, Longtao Huang, Pengjun Xie, Wei Liu, Wei Yang Bryan Lim · May 09, 2026

arXiv · Paper type: descriptive · Evidence: n/a · Relevance: 7/10

FraudBench, a multimodal benchmark of real review evidence and synthetic fake-damage images, shows current MLLMs and specialized detectors frequently fail to detect AI-generated refund fraud, with many generator-specific fake-damage detection rates below 50% and inconsistent false-positive behavior on genuine damaged images.

Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.

Summary

Main Finding

FraudBench is a new multimodal benchmark that exposes a critical gap in current detection of AI-generated refund fraud: multimodal LLMs (MLLMs) tend to be over-credulous, missing many realistic, claim-conditioned fakes, while specialized detectors tend to be over-sensitive, catching more fakes but also falsely flagging genuine damaged evidence. This “over-credulity vs. over-sensitivity” trade-off creates asymmetric economic risks for platforms: false negatives produce direct financial losses from fraud, while false positives erode customer trust and generate reputational and operational costs.

Key Points

Purpose: Evaluate detection of AI-generated, claim-conditioned refund evidence (photos submitted with refund/claim text), not just generic real/fake classification.
Scope: 822 real review samples across 29 categories, with 7,928 images in total: 2,000 real images (988 real-undamaged, 1,012 real-damaged) and 5,928 synthetic fake-damaged images generated from the real-undamaged images (988 images × 6 generators).
Generative threat model: Six state-of-the-art image editing/generation models, each given the same image-specific prompt for a given source image (examples include GPT Image 2, Nano Banana 2, Grok Imagine, Wan2.7-Image-Pro, Qwen-Image-2.0-Pro, Qwen-Image-Edit-Max).
Evaluation targets: 11 MLLMs, 4 specialized AI-image detectors, and human annotators under identical, multimodal conditions.
Core empirical findings:
- MLLMs: usually accept genuine damaged evidence but have very low true positive rates on many fake-damaged subsets (often below 50%).
- Specialized detectors: higher detection rates for synthetic edits but inconsistent across generators and prone to false positives on genuine damaged images.
- Humans: perform better than most MLLMs but still make non-negligible errors; highly realistic edits remain difficult to verify by eye.
Benchmark design emphasizes realistic, multi-view evidence and claim-conditioned verification, with five evaluation axes: input modality (single vs multi-image), contextual information (review text), multi-step reasoning, prompt sensitivity, and real-image preservation.

Data & Methods

Data sources:
- Amazon Reviews 2023 (curated subset, 590 samples)
- Trip.com and GrabFood (230 service-oriented review samples)
- In-house captured samples (supplemental)
Preprocessing:
- MLLM-assisted screening for relevance, privacy, corrupted files, and image–review alignment.
- Human verification for MLLM-flagged samples.
- Distribution-aware sampling to keep category diversity and multi-image proportions.
- Anonymization and multilingual retention to reflect real platform usage.
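
One way to read "distribution-aware sampling" is proportional stratified sampling over (category, multi-image) strata. The sketch below is an illustration under that assumption, not the authors' procedure; the record fields `category` and `n_images` and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, k, seed=0):
    """Draw about k records while keeping each stratum's share roughly proportional.

    A stratum is (category, has_multiple_images). Both field names are
    hypothetical placeholders, not the benchmark's schema.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["category"], rec["n_images"] > 1)].append(rec)

    total = len(records)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        # Keep at least one record per stratum so rare categories survive.
        quota = max(1, round(k * len(group) / total))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample
```

Because every stratum keeps at least one record, the result can slightly exceed `k` when there are many small strata.
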
Synthetic evidence generation:
- Extract category-specific damage patterns from real-damaged images (scratches, stains, dents, breakage, leakage, etc.).
- Use an MLLM to analyze each real-undamaged image + metadata to select a plausible damage pattern.
- Produce image-specific editing prompts; apply the same prompt across six SOTA image-editing/generation models to produce fake-damaged images.
- Generate matching fake review comments (carefully controlled to avoid text-only cues).
- Human-in-the-loop quality control to remove implausible edits, image–text mismatches, and trivial artifacts.
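
The generation steps above can be sketched as a small loop: one prompt per undamaged image, reused unchanged for every generator, followed by a quality filter. This is a minimal sketch, not the paper's code; `build_fake_samples`, `make_prompt`, `generators`, and `passes_qc` are hypothetical stand-ins for the MLLM prompt writer, the six editing models, and the human quality check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FakeSample:
    source_image: str   # ID or path of the real-undamaged reference image
    generator: str      # name of the editing/generation model
    prompt: str         # image-specific editing prompt
    output_image: str   # ID or path of the edited result

def build_fake_samples(
    undamaged_images: List[str],
    make_prompt: Callable[[str], str],
    generators: Dict[str, Callable[[str, str], str]],
    passes_qc: Callable[[FakeSample], bool],
) -> List[FakeSample]:
    """Apply the same image-specific prompt across all generators, then filter."""
    kept = []
    for image in undamaged_images:
        # One prompt per image, shared by every generator.
        prompt = make_prompt(image)
        for name, edit in generators.items():
            sample = FakeSample(image, name, prompt, edit(image, prompt))
            # Stand-in for the human-in-the-loop quality check.
            if passes_qc(sample):
                kept.append(sample)
    return kept
```

Holding the prompt fixed per image means differences in detection rates across generator subsets reflect the generator rather than the requested damage.
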
Evaluation:
- Compare MLLMs, specialized detectors, and humans across the five evaluation dimensions.
- Measure the detection rate (TPR) on fake-damaged images and the false positive rate on real-damaged images (real-image preservation), along with sensitivity to prompt wording and multi-view evidence.
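
The two headline metrics can be computed as follows. This is a minimal sketch assuming each prediction is a record with a `label`, a `generator` (for fakes), and a boolean `flagged` field; those names are illustrative, not the benchmark's schema.

```python
from collections import defaultdict

def detection_rates(records):
    """Return (per-generator TPR on fake-damaged, FPR on real-damaged).

    Each record is a dict with hypothetical keys:
      label     -- "fake_damaged", "real_damaged", or "real_undamaged"
      generator -- generator name for fake-damaged records
      flagged   -- True if the detector called the image AI-generated
    """
    fake_total = defaultdict(int)
    fake_flagged = defaultdict(int)
    real_damaged_total = 0
    real_damaged_flagged = 0

    for rec in records:
        if rec["label"] == "fake_damaged":
            fake_total[rec["generator"]] += 1
            fake_flagged[rec["generator"]] += int(rec["flagged"])
        elif rec["label"] == "real_damaged":
            real_damaged_total += 1
            real_damaged_flagged += int(rec["flagged"])

    tpr = {g: fake_flagged[g] / n for g, n in fake_total.items()}
    # FPR is undefined when there are no real-damaged records.
    fpr = real_damaged_flagged / real_damaged_total if real_damaged_total else None
    return tpr, fpr
```

Reporting TPR per generator rather than pooled makes cross-generator inconsistency visible, and the real-damaged FPR is what the summary calls real-image preservation.
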

Implications for AI Economics

Direct economic harms and asymmetric risks:
- False negatives (missed fakes) cause direct payouts, inventory replacement costs, and fraud losses—an immediate financial drain on platforms and sellers.
- False positives (flagging genuine claims) damage consumer trust, increase churn, raise customer support and appeal costs, and can harm small sellers’ reputations—indirect but persistent economic costs.
- Platforms face a classical precision–recall trade-off with uneven externalities: choosing thresholds involves weighing monetary loss against reputational and long-term market effects.
Operational and compliance costs:
- Platforms will need more resource-intensive, multimodal verification workflows (human-in-the-loop, multi-angle requirements, higher-quality metadata), increasing operational costs per claim.
- Investment priorities: specialized detectors (higher sensitivity) vs. multimodal reasoning systems (lower false positive tendency) — both require continuous retraining as generative models evolve, raising maintenance costs.
Market design and incentives:
- Platforms may tighten evidence standards (e.g., require time-stamped or provenance-enabled photos, in-app video, or multiple angles), which imposes friction on legitimate users and could reduce conversion or raise complaint resolution times.
- New products and pricing: insurers, fraud-detection-as-a-service, and third-party verification offerings may grow; platforms could shift liability or introduce escrow/deposit mechanisms to mitigate moral hazard.
- Adverse selection: increased verification friction or wrongful flags could drive away honest users and small sellers, concentrating activity among larger actors with more resources—affecting competition and fees.
Policy and regulatory implications:
- Standards for provenance (digital watermarks, content provenance APIs) and liability rules could be economically significant—mandates may reduce fraud but increase compliance costs.
- Regulators may require transparency in automated dispute decisions and appeal processes to limit consumer harm from false positives.
Research and investment priorities:
- Need for robust, cross-generator, multimodal detection methods that optimize for asymmetric loss functions (cost-sensitive detection thresholds).
- Cost–benefit analyses for deploying detection tech vs. alternative controls (escrow, insurance, manual review), to find economically efficient mixes.
- Development of benchmark-driven model-agnostic standards for evaluation, including real-image preservation metrics that capture reputational/externality costs.
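
To make "cost-sensitive detection thresholds" concrete, the sketch below picks the score threshold that minimizes total cost on a labeled validation set when missed fakes and wrongly flagged genuine claims are priced differently. The function name and the cost values are hypothetical; real costs would have to come from a platform's own payout and retention data.

```python
def pick_threshold(scores, is_fake, cost_fn, cost_fp):
    """Return (threshold, total_cost) minimizing asymmetric error cost.

    An image is flagged when its score >= threshold.
    cost_fn: cost of one missed fake (false negative).
    cost_fp: cost of one wrongly flagged genuine image (false positive).
    """
    # Every observed score is a candidate; inf means "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fn = sum(1 for s, y in zip(scores, is_fake) if y and s < t)
        fp = sum(1 for s, y in zip(scores, is_fake) if not y and s >= t)
        cost = fn * cost_fn + fp * cost_fp
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

With expensive false negatives the chosen threshold drops, so more claims are flagged; with expensive false positives it rises. That is the trade-off between fraud payouts and customer trust described above.
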
Practical recommendations for platforms and policymakers:
- Adopt multimodal evidence requirements (e.g., short in-app video or timestamped multi-angle photos) to raise attack costs.
- Use layered defenses: lightweight detectors for triage, human review for high-risk or ambiguous cases, and conservative payout policies with clear appeal paths.
- Implement cost-sensitive detection thresholds and monitor economic outcomes (fraud payouts vs. customer retention).
- Encourage or mandate provenance tools and standardized logging to deter misuse and aid verification.
- Fund or require periodic red-team evaluations against current generative models, and evaluate continuously with benchmarks like FraudBench.
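
The layered-defense recommendation can be sketched as a routing rule: a lightweight detector score triages each claim, and anything high-value, ambiguous, or suspicious goes to a person rather than being rejected automatically. The thresholds, the value cutoff, and the route names are hypothetical placeholders.

```python
def route_claim(score, amount, low=0.2, high=0.8, high_value=200.0):
    """Route a refund claim using a detector score in [0, 1].

    score:  detector's estimate that the evidence is AI-generated
    amount: claimed refund amount
    Returns one of "auto_approve", "human_review", "hold_with_appeal".
    """
    # High-value claims always get a human look, whatever the score.
    if amount >= high_value:
        return "human_review"
    if score < low:
        return "auto_approve"
    if score >= high:
        # Conservative payout: hold the refund but keep an appeal path open.
        return "hold_with_appeal"
    # Ambiguous scores are escalated instead of decided automatically.
    return "human_review"
```

No branch rejects a claim outright, which reflects the benchmark's finding that detectors can falsely flag genuine damaged evidence.
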

Caveat: FraudBench is intended for academic research to improve defenses and platform safeguards; the paper explicitly states it is not meant to facilitate fraud.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This paper introduces a benchmark and reports model/detector performance rather than making causal claims about economic outcomes, so traditional evidence strength for causal inference is not applicable.
Methods Rigor: medium. The authors assemble real-world review evidence, apply MLLM-assisted filtering and human annotation, synthesize fakes with six state-of-the-art image generators, and evaluate across multiple model classes and humans, a robust experimental design for a benchmark; however, the description lacks details on sample sizes, annotation protocols, inter-annotator reliability, and selection criteria that would be needed to consider methodological rigor high.
Sample: Multimodal dataset constructed from real-world user-review evidence in e-commerce, food-delivery, and travel-service scenarios, containing genuine damaged and undamaged images (identified via MLLM-assisted filtering plus human annotation) and synthetic 'fake-damaged' images produced by editing genuine undamaged reference photos using six state-of-the-art image editing/generation models; includes associated review text and product metadata.
Themes: governance innovation
Generalizability:
- Limited domains: data restricted to e-commerce, food delivery, and travel-service contexts and may not generalize to other fraud contexts (e.g., insurance, social media).
- Model coverage: synthetic attacks use six generators; results may not hold for other current or future generative models or novel attack methods.
- Geographic / cultural bias: source platforms, languages, and regional product types may bias image/content distributions.
- Synthetic vs real adversaries: fake samples are generated under controlled procedures and may not capture the full diversity of real-world adversarial behavior.
- Annotation and filtering choices: reliance on MLLM-assisted prefiltering and human annotators introduces potential selection and labeling biases that affect external validity.
- Scale and class balance: if sample sizes or class balances are limited or unreported, performance estimates may be noisy or dataset-specific.

Claims (9)

Claim	Type	Direction	Confidence	Outcome	Details
We introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence.	Other	positive	high	availability of a benchmark dataset for claim-conditioned fraudulent evidence detection	0.3
FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios.	Other	positive	high	coverage of real-world domains in dataset	0.3
We curated real evidence images together with their associated review and product metadata, identified genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation.	Other	positive	high	label quality (genuine damaged vs undamaged) via MLLM-assisted filtering and human annotation	0.3
We synthesized fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models.	Other	positive	high	generation of fake-damaged images via six models	0.3
Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings.	Other	positive	high	comparative detection performance across model classes and humans	0.3
Current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets.	Error Rate	negative	high	true positive rate (TPR) for detecting fake-damaged evidence	TPR far below the 50% baseline 0.18
Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples.	Error Rate	mixed	high	detection accuracy and false positive rate of specialized detectors across generator subsets	0.18
There is a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.	Other	negative	high	reliability/robustness of AI image detectors on claim-conditioned verification	0.18
Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored.	Other	negative	high	coverage of existing benchmarks with respect to claim-conditioned fraudulent evidence detection	0.18