Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

Summary

Main Finding

Incorporating estimated marketplace demand directly into multimodal contrastive learning (via a Utility-Aware InfoNCE loss) produces image representations that steer text-to-image generation toward images that both preserve semantic fidelity and increase downstream demand. Models fine-tuned with this Utility-Aware CLIP (and used in a Utility-Aware Generator) outperform state-of-the-art baselines on Amazon and Airbnb tasks (generation and editing), maintain text–image consistency and realism, and preserve empirically observed inverted-U demand relationships (e.g., for aesthetics and uniqueness). Human-subject choice experiments confirm improved purchase/booking likelihood for the generated images.

Key Points

Problem addressed: standard CLIP-based generation optimizes semantic alignment and aesthetics, not marketplace outcomes; this leads to images that may look good but do not maximize sales/bookings.
Core idea: augment CLIP similarity with a demand-driven visual utility term hv(v) to form a Utility-Aware CLIP score:
- USθ,w(v,t) = αv · hv(v) + βs · sθ(v,t)
- Used in a Utility-Aware InfoNCE loss (bidirectional) with temperature τ.
Theoretical interpretation:
- InfoNCE = negative log-likelihood of a conditional logit; adding hv(v) corresponds to adding observable utility features into that discrete-choice utility.
- Authors derive mutual-information-style bounds and provide occlusion-based interpretability showing representations attend more to demand-relevant visual cues while retaining semantic alignment.
Implementation:
- Start from pretrained CLIP encoders (image encoder example: ResNet-50; text encoder: DistilBERT), project to shared embedding space, and fine-tune with Utility-Aware InfoNCE.
- hv(v) is constructed from an estimated demand model (weighted index of domain-specific visual features like aesthetics, uniqueness, lighting, colorfulness, etc.).
- The Utility-Aware CLIP is plugged into a text-to-image pipeline (Utility-Aware Generator) for both image generation and editing.
Empirical results:
- Applications on Amazon (product images) and Airbnb (listing photos) show consistent gains on demand-based scores and related metrics versus baselines (GPT-Image, Stable Diffusion, Flux).
- The approach improves demand while preserving product identity and realism; avoids over-stylized outputs typical of general-purpose generators.
- Preserves inverted-U patterns (moderate aesthetics/uniqueness are best), i.e., steers images toward optimal points rather than simply maximizing a single aesthetic score.
- Human-subject experiments (choice tasks for likely-purchase / likely-book) validate commercial effectiveness.
Practical claim: the method is modular and can be embedded into emerging generative systems to make them marketplace-aware.
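
The score and its discrete-choice reading in the points above can be written out explicitly. The following is a sketch rather than the authors' exact formulation: the batch size N, the pair indices i, j, k, and the placement of the temperature τ as a divisor of the whole score are assumptions made for illustration, and the symbols hv, αv, βs, and sθ from the text appear here with explicit subscripts.

```latex
% Utility-aware score, as stated in the summary
\[
US_{\theta,w}(v,t) \;=\; \alpha_v\, h_v(v) \;+\; \beta_s\, s_\theta(v,t)
\]
% Text-to-image direction over N matched pairs (v_i, t_i); N and the indices are assumed notation
\[
\mathcal{L}_{t \to v}
  \;=\; -\frac{1}{N}\sum_{i=1}^{N}
  \log \frac{\exp\!\big(US_{\theta,w}(v_i,t_i)/\tau\big)}
            {\sum_{j=1}^{N}\exp\!\big(US_{\theta,w}(v_j,t_i)/\tau\big)}
\]
% Conditional-logit reading: image j is "chosen" for text t_i with utility U_{ij}
\[
P(j \mid t_i) \;=\; \frac{\exp(U_{ij})}{\sum_{k=1}^{N}\exp(U_{ik})},
\qquad U_{ij} \;=\; US_{\theta,w}(v_j,t_i)/\tau,
\qquad \mathcal{L}_{t \to v} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\log P(i \mid t_i)
\]
```

Read this way, the loss is the average negative log-likelihood of the observed pairing under a conditional logit, and h_v(v) enters as an observable utility component of each candidate image. The image-to-text direction swaps the roles of v and t; how the two directions are weighted in the total loss is not stated in this summary.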

Data & Methods

Data sources:
- Two empirical domains: Amazon product listings and Airbnb property listings (images paired with textual descriptions and observed marketplace outcomes: purchases/bookings).
- Visual features for hv(v) are derived from interpretable computer-vision measures (aesthetics, uniqueness, brightness/colorfulness, warmth, etc.) estimated to predict demand in each context.
Model architecture and objective:
- Base encoders: pretrained CLIP components (image encoder such as ResNet-50; text encoder such as DistilBERT) with projection heads to a shared embedding space; cosine similarity sθ(v,t) used.
- Utility-Aware similarity: linear combination of demand-driven scalar hv(v) and CLIP similarity, weighted by αv and βs.
- Loss: Utility-Aware InfoNCE (bidirectional: text→image and image→text), i.e., negative log-softmax over utility-aware scores across sampled negatives; temperature τ used.
- Training: fine-tune pretrained CLIP weights (not training from scratch) to obtain Utility-Aware CLIP.
Generation/editing pipeline:
- Utility-Aware CLIP guides a text-to-image model (the Utility-Aware Generator). Base generators used for comparison include Flux (open-source SOTA), GPT-Image, and Stable Diffusion.
- Both full generation (from a prompt) and targeted editing (modifying an existing image while preserving identity and realism) are evaluated.
Evaluation:
- Quantitative: demand-based metrics (predicted purchase/booking likelihood using estimated demand model), fidelity and realism metrics, semantic consistency (CLIP-like measures), and specific checks for preserving inverted-U attribute relationships.
- Baselines: Flux, Stable Diffusion, GPT-Image, and an unmodified CLIP-guided generator.
- Human-subjects: preference/choice experiments where participants select images they'd be most likely to purchase or book, used to validate economic effect.
Interpretability & theory:
- Discrete-choice interpretation: InfoNCE ↔ conditional logit; utility term maps naturally into this framework.
- Mutual information bounds for the Utility-Aware InfoNCE are provided to clarify how the objective preserves semantic alignment while embedding demand signals.
- Occlusion-based attribution analyses show the fine-tuned model attends more to demand-relevant visual regions/features.
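
The objective described under "Model architecture and objective" can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default values of alpha_v, beta_s, and tau, and the equal averaging of the two directions are all hypothetical, and the demand scalars h are taken as given rather than estimated from a demand model.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def utility_aware_scores(img_emb, txt_emb, h, alpha_v, beta_s):
    # Cosine similarity s(v_i, t_j) between every image i and every text j.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T  # shape (n_images, n_texts)
    # US[i, j] = alpha_v * h(v_i) + beta_s * s(v_i, t_j); h depends on the image only.
    return alpha_v * h[:, None] + beta_s * sim


def utility_aware_info_nce(img_emb, txt_emb, h, alpha_v=1.0, beta_s=1.0, tau=0.07):
    # Assumption: the temperature divides the full utility-aware score.
    scores = utility_aware_scores(img_emb, txt_emb, h, alpha_v, beta_s) / tau
    idx = np.arange(scores.shape[0])  # matched pairs sit on the diagonal
    # text -> image: for each text (column), softmax over candidate images (rows).
    loss_t2i = -log_softmax(scores, axis=0)[idx, idx].mean()
    # image -> text: for each image (row), softmax over candidate texts (columns).
    loss_i2t = -log_softmax(scores, axis=1)[idx, idx].mean()
    # Assumption: the two directions are averaged with equal weight.
    return 0.5 * (loss_t2i + loss_i2t), loss_t2i, loss_i2t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8))   # stand-in image embeddings
    txt = rng.normal(size=(4, 8))   # stand-in text embeddings
    h = np.array([0.5, -0.2, 0.1, 0.9])  # stand-in demand scalars h_v(v)
    total, t2i, i2t = utility_aware_info_nce(img, txt, h)
    print(total, t2i, i2t)
```

Under this literal reading of the score, h_v(v) is constant across the candidate texts for a given image, so it cancels in the image-to-text softmax and only affects the text-to-image direction, where candidate images compete. Whether the paper's formulation behaves the same way depends on details not given in this summary.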

Implications for AI Economics

Aligning generative models with economic objectives:
- Demonstrates a practical mechanism to move generative AI from "style imitation" toward direct commercial value by embedding demand estimates into learning objectives.
- Opens a path for platform- and seller-level automation: automatic suggestion or creation of secondary images that are more likely to convert, reducing reliance on prompt engineering and professional photography.
Platform strategy and product design:
- Platforms can integrate such modules to boost marketplace performance (recommend image edits, A/B test counterfactual images, surface low-quality listings).
- Sellers with limited resources can gain conversion benefits at scale; implications for competition (smaller sellers may be able to match the visual quality of larger sellers).
Measurement and evaluation:
- Suggests complementing standard semantic/fidelity metrics (e.g., CLIP score) with utility-aware evaluation aligned to business KPIs (clicks, conversions, bookings).
- Shows the value of interpretable visual features for connecting model outputs to economic outcomes.
Economic dynamics and risks:
- Potential for homogenization: if many sellers follow demand-optimal cues, visual differentiation may compress; however, the model can preserve inverted-U preferences (so extreme homogenization may be limited).
- Incentive effects: platforms might prioritize listings edited by their own utility-aware tools, raising questions about platform neutrality and fairness.
- Overfitting to historical demand: models depend on accurate demand estimation; if preferences shift or data reflect biases, generated images could misalign with future consumer tastes or perpetuate biases.
Privacy, regulation, and trust:
- Using observed marketplace behavior to guide content raises privacy and transparency considerations (how demand models are trained and applied).
- Trust and authenticity trade-offs: the method claims to preserve realism, but economic incentives might push toward subtly manipulative edits—regulatory oversight or disclosure policies may be needed.
Research and commercialization directions:
- The modular utility-aware component can be generalized to other objective functions (lifetime value, retention, reviews) and to personalization (user-segment-specific hv(v)).
- Future work should study dynamic effects (how generated images affect long-run demand, returns, and platform welfare) and cross-category generalization.
Overall: embedding econometric demand signals into multimodal representation learning is a promising route to make generative AI directly actionable for market outcomes, but it requires careful handling of incentives, model robustness, fairness, and transparency.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper measures direct marketplace outcomes (demand metrics) and reports improvements across two major platforms plus human-subject validation and a theoretical bound, all strong indicators of practical effect. However, the abstract does not document large-scale randomized field experiments on live listings, report sample sizes or robustness checks, nor fully rule out platform confounding (e.g., listing heterogeneity, recommendation algorithms), so causal claims are plausible but not fully established at the highest standard.
Methods Rigor: medium. The work combines a clear methodological innovation (utility-aware loss), theoretical analysis, cross-platform empirical tests, and human-subject experiments, which indicates solid methodological effort. Missing/unclear elements in the abstract include details on randomization procedures in marketplace tests, sample sizes, pre-registration, controls for listing and temporal confounders, and sensitivity/heterogeneity analyses, which limits assessment of rigor.
Sample: Downstream evaluations on Amazon and Airbnb product/listing datasets (images and text prompts) comparing generated/edited images from the utility-aware model versus state-of-the-art generative models; metrics reported include marketplace demand outcomes (e.g., clicks, conversions, bookings or other platform engagement measures) and fidelity/semantic-consistency measures; additional randomized human-subject experiments to validate commercial effectiveness. The abstract does not provide sample sizes, category breakdowns, geographic coverage, or full details on which demand metrics were used.
Themes: adoption, innovation
Identification: Comparison of outcomes for images generated/edited by the utility-aware model versus state-of-the-art baselines in downstream marketplace tasks (Amazon, Airbnb), supplemented by randomized human-subject experiments; theoretical argument showing how the Utility-Aware InfoNCE loss shifts representation space toward demand-relevant cues. No clear statement of large-scale randomized field A/B tests on live platforms in the abstract, so identification relies on model-to-model comparisons, platform-specific performance metrics, and lab-style randomization for human subjects.
Generalizability:
- Results may be platform-specific (Amazon and Airbnb) and not generalize to other marketplaces with different consumer behavior or ranking algorithms.
- Performance could vary across product categories, price tiers, and cultural/geographic markets not represented in the sample.
- Human-subject lab results may not fully translate to real-world marketplace dynamics (external validity).
- Relies on quality and structure of textual prompts; different prompt practices could change outcomes.
- Dependent on the underlying generative model architecture and training data; results may not transfer to future or proprietary models.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Product images strongly influence consumer decision-making in online marketplaces.	Consumer Welfare	positive	high	consumer decision-making	0.48
Existing generative AI models do not directly optimize marketplace performance.	Firm Revenue	negative	high	marketplace performance (sales / demand)	0.48
We propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss.	Other	positive	high	incorporation of consumer demand into representation learning (method-level outcome)	0.08
Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing.	Adoption Rate	positive	high	semantic coherence and demand (sales/engagement)	0.48
The effect arises from a shift in the learned image-text representation space toward demand-driven visual cues, which we validate through a theoretical bound on the proposed objective.	Other	positive	high	shift in image-text representation toward demand-driven visual cues	0.48
In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency.	Adoption Rate	positive	high	increase in demand; image fidelity; text-image consistency	0.48
The utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency.	Adoption Rate	mixed	high	demand pattern (inverse U-shaped) across attribute values like aesthetics and uniqueness; demand performance and fidelity	0.48
Human-subject experiments further validate the commercial effectiveness of the utility-aware method.	Adoption Rate	positive	high	commercial effectiveness (presumably purchase intent / preference in human-subject tests)	0.48
The utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.	Adoption Rate	positive	medium	ability to embed component into generative models to improve commercial outcomes	0.05
Multimodal contrastive learning enables generative AI to output images that closely align with text prompts.	Output Quality	positive	high	text-image alignment (semantic coherence)	0.48

A utility-aware contrastive loss steers generative images to sell better: applying the loss to image-generation models raises measured demand on Amazon and Airbnb versus state-of-the-art baselines while keeping images faithful to prompts and visually high-quality.