A new benchmark finds that AI-generated images make convincing refund fraud — current multimodal LLMs and specialized detectors routinely miss synthetic 'damaged' evidence and are inconsistent across generators, leaving e-commerce and service platforms exposed.
Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.
Summary
Main Finding
FraudBench is a new multimodal benchmark that exposes a critical gap in detecting AI-generated refund fraud: multimodal LLMs (MLLMs) tend to be over-credulous, missing many realistic, claim-conditioned fakes, while specialized detectors tend to be over-sensitive, catching more fakes but also flagging genuine damaged evidence. This “over-credulity vs. over-sensitivity” trade-off creates asymmetric economic risks for platforms: false negatives produce direct financial losses from fraud, while false positives erode customer trust and generate reputational and operational costs.
Key Points
- Purpose: Evaluate detection of AI-generated, claim-conditioned refund evidence (photos submitted with refund/claim text), not just generic real/fake classification.
- Scope: 822 real review samples across 29 categories and 7,928 images in total: 2,000 real images (988 real-undamaged, 1,012 real-damaged) plus 5,928 synthetic fake-damaged images generated from the real-undamaged images.
- Generative threat model: Six state-of-the-art image editing/generation models used with identical, image-specific prompts (examples include GPT Image 2, Nano Banana 2, Grok Imagine, Wan2.7-Image-Pro, Qwen-Image-2.0-Pro, Qwen-Image-Edit-Max).
- Evaluation targets: 11 MLLMs, 4 specialized AI-image detectors, and human annotators under identical, multimodal conditions.
- Core empirical findings:
  - MLLMs: usually accept genuine damaged evidence but have true positive rates (TPR) far below the 50% baseline on most fake-damaged generator subsets.
  - Specialized detectors: higher detection rates for synthetic edits but inconsistent across generators and prone to false positives on genuine damaged images.
  - Humans: perform better than most MLLMs but still make non-negligible errors; verifying highly realistic edits remains difficult even for people.
- Benchmark design emphasizes realistic, multi-view evidence and claim-conditioned verification, with five evaluation axes: input modality (single vs multi-image), contextual information (review text), multi-step reasoning, prompt sensitivity, and real-image preservation.
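The image counts under Scope can be cross-checked with a few lines of arithmetic. The sketch below assumes one fake-damaged image per (real-undamaged image, generator) pair, which is an inference from the reported totals rather than something the summary states; the function and variable names are illustrative.

```python
# Counts as reported in the Key Points; names are illustrative, not from the paper.
REAL_UNDAMAGED = 988
REAL_DAMAGED = 1_012
NUM_GENERATORS = 6


def composition(real_undamaged: int, real_damaged: int, generators: int) -> dict[str, int]:
    """Tally image counts.

    Assumption: every real-undamaged image is edited once by every generator.
    """
    real = real_undamaged + real_damaged
    fake = real_undamaged * generators
    return {"real": real, "fake_damaged": fake, "total": real + fake}


counts = composition(REAL_UNDAMAGED, REAL_DAMAGED, NUM_GENERATORS)
print(counts)  # {'real': 2000, 'fake_damaged': 5928, 'total': 7928}
```

Under that assumption the tally reproduces the reported 2,000 real, 5,928 fake-damaged, and 7,928 total images.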
Data & Methods
- Data sources:
  - Amazon Reviews 2023 (curated subset, 590 samples)
  - Trip.com and GrabFood (230 service-oriented review samples)
  - In-house captured samples (supplemental)
- Preprocessing:
  - MLLM-assisted screening for relevance, privacy, corrupted files, and image–review alignment.
  - Human verification for MLLM-flagged samples.
  - Distribution-aware sampling to keep category diversity and multi-image proportions.
  - Anonymization and multilingual retention to reflect real platform usage.
- Synthetic evidence generation:
  - Extract category-specific damage patterns from real-damaged images (scratches, stains, dents, breakage, leakage, etc.).
  - Use an MLLM to analyze each real-undamaged image and its metadata to select a plausible damage pattern.
  - Produce image-specific editing prompts; apply the same prompt across six SOTA image-editing/generation models to produce fake-damaged images.
  - Generate matching fake review comments (carefully controlled to avoid text-only cues).
  - Human-in-the-loop quality control to remove implausible edits, mismatches, or trivial artifacts.
- Evaluation:
  - Compare MLLMs, specialized detectors, and humans across the five evaluation dimensions.
  - Measure detection accuracy on fake-damaged images, the false positive rate on real-damaged images (real-image preservation), and sensitivity to prompt wording and multi-view evidence.
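The two headline measurements, detection rate on fake-damaged images per generator and false positive rate on real-damaged images, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the `Prediction` record, the `"real_damaged"` label, the generator names `gen_a` and `gen_b`, and the toy verdicts are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    source: str         # "real_damaged" or a generator name (hypothetical labels)
    flagged_fake: bool  # the detector's verdict for this image


def per_generator_tpr(preds: list[Prediction]) -> dict[str, float]:
    """Share of fake-damaged images flagged as fake, split by generator."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for p in preds:
        if p.source == "real_damaged":
            continue  # real images do not count toward TPR
        totals[p.source] += 1
        hits[p.source] += int(p.flagged_fake)
    return {gen: hits[gen] / totals[gen] for gen in totals}


def real_damaged_fpr(preds: list[Prediction]) -> float:
    """Share of genuine damaged images wrongly flagged as fake."""
    real = [p for p in preds if p.source == "real_damaged"]
    if not real:
        raise ValueError("no real-damaged predictions supplied")
    return sum(p.flagged_fake for p in real) / len(real)


# Toy verdicts, invented purely for illustration.
toy = (
    [Prediction("gen_a", f) for f in (True, False, False, False)]
    + [Prediction("gen_b", f) for f in (True, True, True, False)]
    + [Prediction("real_damaged", f) for f in (True, False, False, False, False)]
)
tpr = per_generator_tpr(toy)
fpr = real_damaged_fpr(toy)
print(tpr)  # {'gen_a': 0.25, 'gen_b': 0.75}
print(fpr)  # 0.2
```

Reporting TPR per generator rather than pooled keeps cross-generator inconsistency visible, and tracking the real-damaged FPR alongside it captures the real-image preservation axis.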
Implications for AI Economics
- Direct economic harms and asymmetric risks:
  - False negatives (missed fakes) cause direct payouts, inventory replacement costs, and fraud losses, an immediate financial drain on platforms and sellers.
  - False positives (flagging genuine claims) damage consumer trust, increase churn, raise customer support and appeal costs, and can harm small sellers’ reputations; these costs are indirect but persistent.
  - Platforms face a classical precision–recall trade-off with uneven externalities: choosing thresholds involves weighing monetary loss against reputational and long-term market effects.
- Operational and compliance costs:
  - Platforms will need more resource-intensive, multimodal verification workflows (human-in-the-loop, multi-angle requirements, higher-quality metadata), increasing operational costs per claim.
  - Investment priorities: specialized detectors (higher sensitivity) vs. multimodal reasoning systems (lower false positive tendency); both require continuous retraining as generative models evolve, raising maintenance costs.
- Market design and incentives:
  - Platforms may tighten evidence standards (e.g., require time-stamped or provenance-enabled photos, in-app video, or multiple angles), which imposes friction on legitimate users and could reduce conversion or raise complaint resolution times.
  - New products and pricing: insurers, fraud-detection-as-a-service, and third-party verification offerings may grow; platforms could shift liability or introduce escrow/deposit mechanisms to mitigate moral hazard.
  - Adverse selection: increased verification friction or wrongful flags could drive away honest users and small sellers, concentrating activity among larger actors with more resources and affecting competition and fees.
- Policy and regulatory implications:
  - Standards for provenance (digital watermarks, content provenance APIs) and liability rules could be economically significant; mandates may reduce fraud but increase compliance costs.
  - Regulators may require transparency in automated dispute decisions and appeal processes to limit consumer harm from false positives.
- Research and investment priorities:
  - Need for robust, cross-generator, multimodal detection methods that optimize for asymmetric loss functions (cost-sensitive detection thresholds).
  - Cost–benefit analyses for deploying detection tech vs. alternative controls (escrow, insurance, manual review), to find economically efficient mixes.
  - Development of benchmark-driven model-agnostic standards for evaluation, including real-image preservation metrics that capture reputational/externality costs.
- Practical recommendations for platforms and policymakers:
  - Adopt multimodal evidence requirements (e.g., short in-app video or timestamped multi-angle photos) to raise attack costs.
  - Use layered defenses: lightweight detectors for triage, human review for high-risk or ambiguous cases, and conservative payout policies with clear appeal paths.
  - Implement cost-sensitive detection thresholds and monitor economic outcomes (fraud payouts vs. customer retention).
  - Encourage or mandate provenance tools and standardized logging to deter misuse and aid verification.
  - Fund or require periodic red-team evaluations against current generative models and include continuous evaluation with benchmarks like FraudBench.
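The cost-sensitive thresholding recommended above can be illustrated with a small sketch. It assumes a detector that emits a fakeness score per image and a platform that can put rough monetary costs on a missed fake and on a wrongly flagged genuine claim; the scores, cost values, and function names below are invented for illustration and are not from the paper.

```python
def expected_cost(
    threshold: float,
    scores_fake: list[float],
    scores_real: list[float],
    cost_fn: float,
    cost_fp: float,
) -> float:
    """Total cost when an image is flagged as fake iff its score >= threshold."""
    false_negatives = sum(s < threshold for s in scores_fake)
    false_positives = sum(s >= threshold for s in scores_real)
    return false_negatives * cost_fn + false_positives * cost_fp


def best_threshold(
    scores_fake: list[float],
    scores_real: list[float],
    cost_fn: float,
    cost_fp: float,
) -> float:
    """Pick the candidate threshold with the lowest total cost on held-out scores."""
    # Every observed score is a candidate; infinity means "never flag anything".
    candidates = sorted(set(scores_fake) | set(scores_real) | {float("inf")})
    return min(
        candidates,
        key=lambda t: expected_cost(t, scores_fake, scores_real, cost_fn, cost_fp),
    )


# Hypothetical detector scores (higher means "more likely fake").
fake_scores = [0.9, 0.8, 0.6, 0.3]
real_scores = [0.7, 0.4, 0.2, 0.1]

# Missed fraud costs more than a wrongful flag: the threshold drops.
print(best_threshold(fake_scores, real_scores, cost_fn=50, cost_fp=20))  # 0.3
# Wrongful flags cost more than missed fraud: the threshold rises.
print(best_threshold(fake_scores, real_scores, cost_fn=20, cost_fp=50))  # 0.8
```

The same detector ends up with different operating points depending on which error the platform considers more expensive, which is why the threshold should be revisited as fraud payouts and retention figures are monitored.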
Caveat: FraudBench is intended for academic research to improve defenses and platform safeguards; the paper explicitly states it is not meant to facilitate fraud.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. [Other] | positive | high | availability of a benchmark dataset for claim-conditioned fraudulent evidence detection | 0.3 |
| FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. [Other] | positive | high | coverage of real-world domains in dataset | 0.3 |
| We curated real evidence images together with their associated review and product metadata, identified genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation. [Other] | positive | high | label quality (genuine damaged vs undamaged) via MLLM-assisted filtering and human annotation | 0.3 |
| We synthesized fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. [Other] | positive | high | generation of fake-damaged images via six models | 0.3 |
| Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. [Other] | positive | high | comparative detection performance across model classes and humans | 0.3 |
| Current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. [Error Rate] | negative | high | true positive rate (TPR) for detecting fake-damaged evidence | TPR far below the 50% baseline; 0.18 |
| Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples. [Error Rate] | mixed | high | detection accuracy and false positive rate of specialized detectors across generator subsets | 0.18 |
| There is a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification. [Other] | negative | high | reliability/robustness of AI image detectors on claim-conditioned verification | 0.18 |
| Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. [Other] | negative | high | coverage of existing benchmarks with respect to claim-conditioned fraudulent evidence detection | 0.18 |