Evaluation metrics push image generators toward brighter, exaggerated colors; a new 1.3M‑image benchmark, a learned color‑fidelity metric, and a training‑free refinement method restore authentic color and let developers improve photorealism without costly retraining.

Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei · March 11, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

The paper shows that existing evaluators bias T2I models toward over-vivid colors, and introduces a 1.3M+ image benchmark (CFD), a learned Color Fidelity Metric (CFM) that localizes color errors, and a training-free Color Fidelity Refinement (CFR) that uses CFM to produce more photorealistic colors at inference time.

Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.

Summary

Main Finding

The authors identify a systematic bias in current T2I evaluation: human ratings and preference-trained metrics reward visually vivid but exaggerated color/contrast, which makes outputs less photorealistic when realism is the goal. They introduce the Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) to measure color realism objectively, and a training-free Color Fidelity Refinement (CFR) procedure that uses CFM’s learned attention to adaptively modulate spatial-temporal guidance during generation, improving perceived color authenticity in realistic-style T2I outputs.
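To make "exaggerated color" concrete, the sketch below computes a crude vividness proxy: the mean HSV saturation of a set of RGB pixels. It is only an illustration and is not CFM, which the paper describes as a learned metric rather than a raw saturation statistic. The function name `mean_saturation` and the sample pixel values are hypothetical.

```python
import colorsys


def mean_saturation(pixels):
    """Average HSV saturation of an iterable of (r, g, b) tuples in 0..255."""
    total = 0.0
    count = 0
    for r, g, b in pixels:
        # colorsys expects channel values in the range 0..1.
        _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        total += s
        count += 1
    if count == 0:
        raise ValueError("no pixels given")
    return total / count


# Hypothetical pixel samples: a muted, photo-like patch and an over-vivid one.
muted = [(150, 120, 110), (130, 140, 125)]
vivid = [(255, 40, 20), (20, 230, 40)]

print(mean_saturation(muted) < mean_saturation(vivid))
```

A statistic like this can flag extreme saturation, but it cannot say whether a saturated region is plausible for the scene (a red sports car versus a red face), which is the gap a learned, localized metric is meant to fill.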

Repository: https://github.com/ZhengyaoFang/CFM

Key Points

Problem: Existing evaluators favor vividness (high saturation/contrast), producing images that look “too vivid” to be real when realism is requested.
CFD: A large-scale dataset (over 1.3M images) containing both real and synthetic images organized with ordered levels of color realism to support objective evaluation.
CFM: A multimodal encoder-based metric trained to capture human perceptual notions of color fidelity, producing an attention map that identifies where color fidelity errors occur.
CFR: A training-free refinement method that uses CFM attention to adaptively adjust spatial-temporal guidance scales during the generation process, improving color authenticity without retraining the base T2I model.
Pipeline: CFD trains/validates CFM; CFM scores and localizes color fidelity issues; CFR uses that localization to refine generations, forming a closed loop for assessing and improving color fidelity.
Availability: Dataset and code are publicly released.

Data & Methods

Dataset (CFD)
- Size: >1.3 million images.
- Content: Mixture of real photographs and synthetic T2I outputs with annotated/ordered levels of color realism (supports relative judgments of fidelity).
- Purpose: Provide ground-truth signal for color realism distinct from general image quality or “vividness.”
Metric (CFM)
- Architecture: Multimodal encoder (likely similar in spirit to image-text encoders) trained to predict perceptual color fidelity; produces both scalar fidelity scores and spatial attention maps.
- Training signal: Uses CFD’s ordered realism labels to learn human-consistent judgments of color authenticity.
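One common way to learn from ordered labels is a pairwise margin ranking loss: whenever one image has a higher realism level than another, its predicted score should be higher by at least a margin. The summary does not state which loss CFM actually uses, so the sketch below is an assumption about the general technique, not the authors' training objective. The name `pairwise_ranking_loss` and its parameters are hypothetical.

```python
def pairwise_ranking_loss(scores, levels, margin=1.0):
    """Mean hinge loss over all ordered pairs whose realism levels differ.

    scores: predicted fidelity scores (higher = more realistic color).
    levels: ordered realism labels (higher = more realistic color).
    """
    total = 0.0
    pairs = 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if levels[i] > levels[j]:
                # Penalize when item i does not beat item j by at least `margin`.
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


# Scores that respect the label order with enough margin incur no loss.
print(pairwise_ranking_loss([3.0, 1.5, 0.0], [2, 1, 0]))
# Scores that invert the label order are penalized.
print(pairwise_ranking_loss([0.0, 1.5, 3.0], [2, 1, 0]))
```

A ranking objective only needs relative judgments, which matches the dataset description: ordered levels of color realism rather than absolute fidelity values.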
Refinement (CFR)
- Approach: Training-free; during generation it adaptively modulates spatial-temporal guidance scale (i.e., scales that control conditioning strength across space and time/denoising steps).
- Guidance: Modulation is driven by CFM’s attention — locations flagged as low color fidelity receive adjusted guidance to steer generation toward more authentic color.
- Benefit: Improves color authenticity without re-training the T2I model, reducing compute and development cost.
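The idea of a spatial-temporal guidance scale can be sketched with standard classifier-free guidance, where the guided noise estimate is eps_uncond + w * (eps_cond - eps_uncond) and w becomes a per-pixel, per-step value instead of a single constant. The modulation rule below is an assumption made for illustration: it lowers w linearly where the attention map flags low color fidelity and ramps the effect up over denoising steps. The paper's actual rule, the direction of the adjustment, and the names `guidance_map`, `guided_noise`, `w_base`, and `w_min` are not taken from the source.

```python
def guidance_map(attention, step_frac, w_base=7.5, w_min=3.0):
    """Per-pixel guidance scale for one denoising step (hypothetical rule).

    attention: 2D list in [0, 1]; 1 marks a region flagged for low color fidelity.
    step_frac: progress through denoising in [0, 1] (0 = first step, 1 = last step).
    """
    return [
        [w_base - (w_base - w_min) * a * step_frac for a in row]
        for row in attention
    ]


def guided_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance combination with a per-pixel scale w."""
    return [
        [u + wi * (c - u) for u, c, wi in zip(u_row, c_row, w_row)]
        for u_row, c_row, w_row in zip(eps_uncond, eps_cond, w)
    ]


# Toy 1x2 "image": the second pixel is flagged by the attention map.
attention = [[0.0, 1.0]]
w = guidance_map(attention, step_frac=1.0)
print(w)  # unflagged pixel keeps w_base, flagged pixel drops to w_min
print(guided_noise([[0.0, 0.0]], [[1.0, 1.0]], w))
```

Because the base model's weights are never touched, a scheme of this shape stays training-free: only the way the two noise predictions are combined changes at inference time.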
Evaluation: Empirical results show improved alignment with color realism judgments compared to preference-trained metrics and human ratings that favor vividness (paper reports qualitative and quantitative gains; see repo/paper for numbers).
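Alignment between a metric and ordered realism labels can be checked with a simple pairwise ordering accuracy: the fraction of image pairs with different labels that the metric ranks in the same order. This is a generic sketch, not the paper's evaluation protocol, which this summary does not describe. The function name and the toy scores are hypothetical.

```python
def pairwise_order_accuracy(scores, levels):
    """Fraction of pairs with different levels that `scores` orders correctly."""
    correct = 0
    pairs = 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if levels[i] == levels[j]:
                continue  # ties in the labels carry no ordering information
            pairs += 1
            # Same sign means the metric agrees with the label order.
            if (scores[i] - scores[j]) * (levels[i] - levels[j]) > 0:
                correct += 1
    if pairs == 0:
        raise ValueError("need at least two items with different levels")
    return correct / pairs


levels = [0, 1, 2, 3]                # ordered color-realism levels
faithful = [0.1, 0.4, 0.6, 0.9]      # a metric that tracks realism
vivid_biased = [0.9, 0.6, 0.4, 0.1]  # a metric that rewards vividness instead

print(pairwise_order_accuracy(faithful, levels))
print(pairwise_order_accuracy(vivid_biased, levels))
```

Under this measure, a metric biased toward vividness would score poorly on realistic-style data even if it correlates well with general preference ratings.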

Implications for AI Economics

Benchmark design shapes incentives: Preference-trained or subjective metrics that reward vividness incentivize models to produce stylized, less authentic outputs. Introducing objective, task-specific metrics (like CFM) can realign R&D toward desired product attributes (photorealism vs. stylization).
Product differentiation and competition: Public CFD/CFM lowers barriers for firms to measure and improve color fidelity. Smaller teams can use CFR (training-free) to enhance outputs without costly retraining, potentially intensifying competition in photorealistic image services and reducing entry costs.
Cost and speed trade-offs: CFR provides a low-cost way (no retraining) to improve fidelity, shifting product development trade-offs away from expensive model retraining toward smarter inference-time techniques. This can reduce time-to-market for higher-fidelity features.
Labor and service markets: More authentic T2I outputs reduce the need for manual color correction and retouching, potentially displacing some editing jobs but also creating demand for higher-level creative roles (direction, curation). Platforms offering automated photorealistic generation could capture value previously held by retouching services.
Monetization and valuation: Better alignment between generated imagery and user expectations of realism can increase commercial utility in advertising, e-commerce, and stock photography, changing pricing power and business models for image providers.
Externalities and policy risks: Improving photorealism raises risks around misinformation and undetectable manipulation (deepfakes). Objective fidelity metrics and refinement techniques make synthetic images harder to distinguish from real ones, increasing the need for provenance, watermarking, and regulatory scrutiny.
Benchmark gaming and market signaling: As industry adopts CFD/CFM, firms may optimize for that metric. Economists and policymakers should watch for metric overfitting and consider multiple, complementary metrics (e.g., provenance, semantic fidelity).
Research & public-good effects: Public release of CFD and CFM can accelerate research on honest evaluation and on reducing evaluation-induced biases, a public good that affects downstream market structure and innovation trajectories.


Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a technical ML paper introducing a dataset, a learned metric, and a refinement procedure, not an empirical economics study seeking causal identification; its empirical claims concern method performance (evaluations and human studies) rather than causal effects on economic outcomes.
Methods Rigor: high. The authors construct a very large dataset (>1.3M images) with ordered color-realism labels, train a multimodal encoder to predict perceptual color fidelity and produce attention maps, compare against preference-trained metrics, and validate improvements with human judgments; they also release code and data, which supports reproducibility. Remaining limitations include potential dataset labeling biases, limited reporting (in the summary) on annotation procedures and inter-rater reliability, possible model-dependence of attention localization, and the need for broader robustness checks across many T2I architectures and domains.
Sample: CFD comprises over 1.3 million images mixing real photographs and synthetic text-to-image outputs, annotated or organized into ordered levels of color realism to provide a ground truth signal for color fidelity; CFM is trained on CFD to output scalar fidelity scores and spatial attention maps, and CFR is evaluated by applying CFM-guided inference refinements to base T2I models with human and metric-based comparisons.
Themes: adoption, innovation, labor_markets
Generalizability:
- Constructed and validated primarily for photorealistic style; does not target artistic/stylized outputs where vividness may be desirable.
- Performance and attention localization may depend on the T2I architectures and training data used to generate the synthetic examples.
- CFD labeling and perceived fidelity may reflect cultural or scene-content biases in the dataset and annotator pool.
- Focuses only on color fidelity; other realism dimensions (geometry, lighting, semantics) are not addressed.
- May not generalize to domain-specific applications (medical, satellite imagery) without additional calibration.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Human ratings and preference-trained metrics reward visually vivid but exaggerated color and contrast, which leads to outputs that are less photorealistic when photorealism is the intended objective.	Output Quality	negative	medium	perceived photorealism / alignment with color realism (human preference and preference-trained metric scores vs. color-fidelity ground truth)	preference-trained metrics/human ratings favor vividness over photorealism (reported comparisons) 0.02
The Color Fidelity Dataset (CFD) is a large-scale dataset of over 1.3 million images containing both real photographs and synthetic T2I outputs, organized with ordered levels of color realism to support objective evaluation.	Other	positive	high	dataset size and composition; presence of ordered color-realism labels enabling relative fidelity judgments	n=1300000 0.03
The Color Fidelity Metric (CFM) is a multimodal encoder–based metric trained on CFD to predict human-consistent judgments of color fidelity and to produce spatial attention maps that localize color-fidelity errors.	Output Quality	positive	medium	color-fidelity scalar scores and spatial attention maps (localization of color errors)	CFM produces scalar fidelity scores and spatial attention maps (reported) 0.02
CFM aligns better with objective color realism judgments than existing preference-trained metrics and human ratings that favor vividness.	Output Quality	positive	medium	alignment with color-realism judgments / correlation with CFD ground truth	CFM shows improved alignment with color-realism labels vs. baselines (reported) 0.02
Color Fidelity Refinement (CFR) is a training-free inference-time procedure that uses CFM attention maps to adaptively modulate spatial-temporal guidance scales during generation, thereby improving color authenticity of realistic-style T2I outputs without retraining the base model.	Output Quality	positive	medium	perceived color authenticity of generated images; requirement (or not) to retrain base T2I models	improves perceived color authenticity at inference without retraining (reported) 0.02
The proposed pipeline (CFD -> CFM -> CFR) forms a closed loop that can assess and improve color fidelity in T2I systems.	Output Quality	positive	medium	end-to-end improvement in measured color fidelity when applying CFD-trained CFM and CFR to T2I generation	end-to-end measured improvement in color fidelity when applying CFD->CFM->CFR (reported) 0.02
Dataset and code (CFD, CFM, CFR) are publicly released.	Other	positive	high	public availability of dataset and code	public release of dataset and code (reported) 0.03
Using CFR avoids the computational and development costs of retraining T2I models to improve color fidelity, providing a lower-cost path to better color authenticity.	Organizational Efficiency	positive	low	compute/development cost required to improve color fidelity (inference-only CFR vs. retraining)	avoids retraining costs (qualitative cost advantage) 0.01
Improving photorealism with objective color-fidelity metrics and refinement reduces the need for manual color correction and retouching in downstream workflows.	Job Displacement	negative	speculative	demand for manual color correction / retouching services	reduced need for manual color correction/retouching (projected demand reduction) 0.0