BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Summary

Main Finding

BATQuant is a post‑training quantization (PTQ) method tailored to microscaling floating‑point (MXFP) formats, especially MXFP4. It (1) prevents cross‑block outlier propagation by restricting transformations to MXFP block granularity, (2) learns affine (not strictly orthogonal) block‑wise transforms to better shape activation distributions, and (3) compresses those transforms via a Global and Private Kronecker (GPK) decomposition. On large multimodal and language models (Qwen3‑VL‑8B and Qwen3‑8B), BATQuant achieves state‑of‑the‑art accuracy under aggressive MXFP quantization (e.g., W4A4KV16), recovering up to ~96.4% of full‑precision performance, and more than 99% in the less aggressive W4A8KV16 setting.

Key Points

Problem addressed
- Existing rotation‑based PTQ methods that work well for INT4 fail on MXFP4 because global orthogonal rotations move outlier energy across MXFP blocks (block size = 32), creating new outliers and often producing bimodal activation distributions that underutilize the floating‑point grids.
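The cross‑block energy transfer described above can be illustrated with a toy example. The sketch below uses a normalized Hadamard transform as a stand‑in for a global orthogonal rotation (an assumption; the rotations used by the individual baselines may differ) and compares it with the same transform applied independently to each 32‑element block. The helper names `hadamard` and `block_amax` are hypothetical. It shows only the energy transfer, not the bimodal distributions.

```python
import math

def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in x]

def block_amax(x, g=32):
    """Largest magnitude inside each block of g elements (drives the block scale)."""
    return [max(abs(v) for v in x[i:i + g]) for i in range(0, len(x), g)]

G = 32
# Toy activation: a single outlier in block 0, block 1 is all zeros.
x = [64.0] + [0.0] * (2 * G - 1)

global_rot = hadamard(x)                       # one rotation across both blocks
block_rot = hadamard(x[:G]) + hadamard(x[G:])  # independent rotation per block

print("original :", block_amax(x))
print("global   :", block_amax(global_rot))
print("blockwise:", block_amax(block_rot))
```

In this toy case the global rotation raises the maximum of the previously empty block from 0 to 8, so its shared scale is now set by energy that originated in another block. The block‑wise variant leaves that block untouched.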
Methodological innovations
- Block‑wise Affine Transformation (BAT): P matrices are block‑diagonal with each block Pi (size g×g, g aligned to MXFP block size, e.g., 32) learned to minimize layer‑wise quantization error. Restricting transforms to blocks prevents cross‑block outlier transfer.
- Relax orthogonality: Affine (not orthogonal) transforms are learned so distribution shaping is optimized for MXFP quantization characteristics.
- Global and Private Kronecker (GPK) decomposition: each Pi is modeled as Bi ⊗ A, where A is shared (global) and Bi is block‑specific (private). This drastically reduces parameter/storage overhead compared with naive per‑block matrices.
- Block‑wise Learnable Clipping: per‑block learnable clipping thresholds (parameterized via sigmoid of α) to suppress residual outliers within blocks.
Integration & deployment
- Weight‑side transforms are fused offline into linear layers; activation‑side transforms are applied online during inference.
- Works with standard Transformer components (MLP, self‑attention) and KV cache quantization (separate transforms for keys/values).
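The offline/online split relies on a standard identity: for an invertible P, (XP)(P⁻¹W) = XW, so the inverse can be folded into the weights ahead of time and only XP is computed at inference. The sketch below checks this on a 2×2 toy case; the matrices and the helper names `matmul` and `inverse_2x2` are illustrative assumptions, and the exact fused form used in the paper is not specified in this summary.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("transform must be invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[1.0, 2.0], [3.0, 4.0]]      # activations (tokens x features)
W = [[0.5, -1.0], [2.0, 0.25]]    # weights (features x outputs)
P = [[2.0, 1.0], [0.0, 0.5]]      # invertible but NOT orthogonal (toy values)

W_fused = matmul(inverse_2x2(P), W)  # done once, offline
X_t = matmul(X, P)                   # done online, per inference call

Y_ref = matmul(X, W)
Y_new = matmul(X_t, W_fused)
print(Y_ref)
print(Y_new)
```

Without quantization the two outputs match. The point of learning P is that XP and P⁻¹W are easier to quantize than X and W themselves.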
Empirical results (high level)
- Near‑lossless W4A8KV16: >99% recovery vs BF16.
- Aggressive W4A4KV16: up to 96.43% recovery, outperforming baselines (QuaRot, SpinQuant, BRQ, FlatQuant, SmoothQuant, GPTQ variants).
- Robust across multimodal benchmarks (MME, OCR‑Bench, DocVQA, RealWorldQA, VLMBlind) and standard LLM reasoning/non‑reasoning suites.
Practical calibration/training
- Small calibration sets suffice: 128 samples (self‑generated text for LLM, 128 image‑text pairs for MLLM).
- Training setup: AdamW, lr=2e‑3, cosine schedule, ~5 epochs, small batch sizes.

Data & Methods

Target format and problem specifics
- MXFP4 (E2M1): block‑wise microscaling floating point; block size g = 32; one shared UE8M0 (power‑of‑two) scaling exponent per block; each element is a 4‑bit E2M1 value, so only eight magnitudes are representable (0, 0.5, 1, 1.5, 2, 3, 4, 6) plus a sign.
- Outliers within/among blocks dominate block scaling and cause severe quantization error if energy is not properly handled.
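To make the effect of a single outlier on the shared block scale concrete, here is a minimal simulated MXFP4 quantizer for one block. It is a sketch under simplifying assumptions: the shared exponent is taken as floor(log2(max|x|)) minus the largest E2M1 exponent (2), ties go to the smaller magnitude rather than round‑to‑nearest‑even, and the UE8M0 exponent range and special values are ignored. The function name `quantize_block_mxfp4` is hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes
E2M1_EMAX = 2                                          # exponent of largest power of two (4.0)

def quantize_block_mxfp4(block):
    """Fake-quantize one block; returns (dequantized values, shared scale)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    # Shared power-of-two scale derived from the block maximum.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])           # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))     # nearest grid point
        out.append(math.copysign(q * scale, v))
    return out, scale

clean = [0.1] * 32
with_outlier = [0.1] * 31 + [8.0]

clean_q, clean_scale = quantize_block_mxfp4(clean)
outlier_q, outlier_scale = quantize_block_mxfp4(with_outlier)
print(clean_scale, clean_q[0])
print(outlier_scale, outlier_q[0], outlier_q[-1])
```

In the block with the outlier, the scale jumps to 2 and every 0.1 entry rounds to zero, while the clean block keeps its small values close to 0.1.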
Optimization objective
- For each linear layer l, learn block‑wise P (and clipping α) to minimize the squared L2 error between the full‑precision output and the quantized output over a small calibration set: Theta_l* = argmin_{Theta_l} E_X[ ||F_l(X) − F̂_l(X; Theta_l)||^2 ].
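A sketch of how such a layer‑wise reconstruction error could be evaluated for one candidate transform is given below. Everything here is an illustrative assumption: the function `layer_recon_error` is hypothetical, the quantizer is a toy round‑to‑nearest‑0.5 stand‑in rather than MXFP4, and a real calibration loop would minimize this quantity with a gradient‑based optimizer instead of just evaluating it.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def layer_recon_error(X, W, P, P_inv, quant):
    """Mean (over calibration rows) squared error between X W and Q(X P) Q(P^-1 W)."""
    Y = matmul(X, W)                                             # full-precision output
    Xq = [[quant(v) for v in row] for row in matmul(X, P)]       # quantized activations
    Wq = [[quant(v) for v in row] for row in matmul(P_inv, W)]   # quantized fused weights
    Yq = matmul(Xq, Wq)
    sq = sum((a - b) ** 2 for ry, rq in zip(Y, Yq) for a, b in zip(ry, rq))
    return sq / len(Y)

def toy_quant(v):
    """Toy quantizer: round to the nearest multiple of 0.5 (NOT MXFP4)."""
    return round(v * 2) / 2

I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[0.3, 1.2]]
W = [[0.7], [-0.4]]

err_exact = layer_recon_error(X, W, I2, I2, lambda v: v)  # no quantization
err_quant = layer_recon_error(X, W, I2, I2, toy_quant)    # identity transform, toy quantizer
print(err_exact, err_quant)
```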
Block structure and GPK
- Pi ∈ R^{g×g}, P = block_diag(P1,...,Pk). To avoid storing k·g^2 params per layer, apply Kronecker structure: Pi = Bi ⊗ A, where A (size g1×g1) is global/shared, Bi (g2×g2) is private, and g = g1·g2.
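- Example (paper): for hidden dim N=4096 and g=32 (k = 128 blocks), with g1=8 and g2=4, GPK needs 2,112 parameters per transform: 8×8 = 64 for the shared A plus 128 × (4×4) = 2,048 for the private Bi. A naive block‑diagonal P would need k·g^2 = 131,072. The paper reports a >74% reduction compared to FlatQuant/Naive Kronecker.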
- Computation: vectorization tricks preserve efficient matmul complexity; activation transforms XP implemented efficiently.
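The parameter arithmetic and the structured product can be checked with a short sketch. The helper names are hypothetical, and the reshape‑based routine is one standard way to apply a Kronecker‑structured matrix without materializing it; whether the paper uses exactly this formulation is an assumption.

```python
def gpk_param_count(n, g, g1, g2):
    """One shared g1 x g1 matrix A plus one private g2 x g2 matrix per block."""
    assert g1 * g2 == g and n % g == 0
    k = n // g
    return g1 * g1 + k * g2 * g2

def dense_block_param_count(n, g):
    """Naive block-diagonal transform: k dense g x g blocks."""
    return (n // g) * g * g

def kron(B, A):
    """Explicit Kronecker product B (x) A, used only as a reference."""
    g2, g1 = len(B), len(A)
    return [[B[i][j] * A[p][q] for j in range(g2) for q in range(g1)]
            for i in range(g2) for p in range(g1)]

def apply_kron(x, B, A):
    """Compute x (B (x) A) as B^T X A with X = x reshaped to g2 x g1 (row-major)."""
    g2, g1 = len(B), len(A)
    X = [x[i * g1:(i + 1) * g1] for i in range(g2)]
    T = [[sum(X[i][p] * A[p][q] for p in range(g1)) for q in range(g1)]
         for i in range(g2)]                                  # X A, about g*g1 MACs
    Y = [[sum(B[i][j] * T[i][q] for i in range(g2)) for q in range(g1)]
         for j in range(g2)]                                  # B^T (X A), about g*g2 MACs
    return [v for row in Y for v in row]

print(gpk_param_count(4096, 32, 8, 4))    # 64 + 128 * 16
print(dense_block_param_count(4096, 32))  # 128 * 1024

# Toy check of the structured product against the explicit Kronecker matrix.
B = [[1, 2], [3, 4]]
A = [[0, 1], [1, 1]]
x = [1, 2, 3, 4]
K = kron(B, A)
dense = [sum(x[r] * K[r][c] for r in range(4)) for c in range(4)]
print(dense, apply_kron(x, B, A))
```

For g=32 with g1=8 and g2=4, the structured product costs roughly g·(g1+g2) = 384 multiply‑accumulates per block instead of g² = 1,024 for a dense block.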
Clipping
- Per block clipping bounds βmin_i = sigmoid(αmin_i)·min(x_i), βmax_i = sigmoid(αmax_i)·max(x_i); learn α via calibration.
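A minimal sketch of these bounds for a single block follows. The function name `clip_block` and the example values are assumptions; in the method itself the α parameters are learned during calibration rather than set by hand.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def clip_block(x, alpha_min, alpha_max):
    """Clip one block to [sigmoid(alpha_min) * min(x), sigmoid(alpha_max) * max(x)]."""
    lo = sigmoid(alpha_min) * min(x)
    hi = sigmoid(alpha_max) * max(x)
    return [min(max(v, lo), hi) for v in x]

block = [-4.0, -1.0, 0.0, 2.0, 8.0]

# alpha = 0 gives sigmoid = 0.5, so the range is halved on both sides.
print(clip_block(block, 0.0, 0.0))

# A large alpha pushes sigmoid toward 1, which leaves the block almost unclipped.
print(clip_block(block, 20.0, 20.0))
```

Because the sigmoid lies in (0, 1), the learned bounds can only shrink the block's observed range, never extend it.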
Experimental setup
- Models: Qwen3‑8B and Qwen3‑VL‑8B‑Instruct.
- Benchmarks: multimodal VQA/Doc/OCR/STEM and LLM non‑reasoning + reasoning (PIQA, Winogrande, GSM8K, MATH500, etc.).
- Quantization configs reported: W{bits}A{bits}KV{bits}, evaluated W4A8KV16, W4A4KV16, W4A8KV8, W4A8KV4, etc.
- Comparison baselines: RTN, QuaRot, SpinQuant, BRQ, FlatQuant, SmoothQuant, GPTQ variants. In many experiments GPTQ integration is used for weight quantizers.
- Calibration data: 128 sequences or image‑text pairs (low calibration cost).
- Training: 5 epochs, small batch size, single‑layer/per‑layer optimization (PTQ).
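The W{bits}A{bits}KV{bits} naming used for the configurations above encodes weight, activation, and KV‑cache bit‑widths in that order. A small parser, with hypothetical names `QuantConfig` and `parse_quant_config`, makes the convention explicit:

```python
import re
from typing import NamedTuple

class QuantConfig(NamedTuple):
    weight_bits: int
    act_bits: int
    kv_bits: int

_TAG = re.compile(r"W(\d+)A(\d+)KV(\d+)")

def parse_quant_config(tag):
    """Parse a tag such as 'W4A4KV16' into its three bit-widths."""
    m = _TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a W/A/KV tag: {tag!r}")
    return QuantConfig(*(int(g) for g in m.groups()))

for tag in ["W4A8KV16", "W4A4KV16", "W4A8KV8", "W4A8KV4"]:
    print(tag, parse_quant_config(tag))
```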

Implications for AI Economics

Deployment cost reduction
- Memory & bandwidth: Aggressive MXFP quantization (W4A4KV16 and W4A8KV16) with near‑lossless accuracy reduces model parameter footprint and activation memory, lowering DRAM pressure and on‑chip memory needs—translates directly to lower per‑inference cost on MXFP‑capable accelerators.
- Compute & energy: Lower bit‑width arithmetic (MXFP4/8) reduces energy per operation and may increase throughput on supported hardware, improving energy efficiency and lowering cloud inference bills and datacenter TCO.
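A back‑of‑the‑envelope estimate of the weight‑memory saving is sketched below. It assumes a nominal 8e9 parameters (taken from the "8B" model names, not a measured count), that every weight is stored in MXFP4 with one 8‑bit shared scale per 32 elements, and that nothing is kept in higher precision. Real deployments usually keep some tensors in 16 bits, so this is an upper bound on the saving rather than a reported result.

```python
def mx_bits_per_element(elem_bits=4, scale_bits=8, block_size=32):
    """Effective storage per element: payload bits plus the amortized shared scale."""
    return elem_bits + scale_bits / block_size

def weight_gib(num_params, bits_per_element):
    """Weight storage in GiB for a given effective bit-width."""
    return num_params * bits_per_element / 8 / 2**30

NUM_PARAMS = 8e9  # nominal assumption based on the "8B" model name

bf16 = weight_gib(NUM_PARAMS, 16)
mxfp4 = weight_gib(NUM_PARAMS, mx_bits_per_element())
print(f"BF16 : {bf16:.2f} GiB")
print(f"MXFP4: {mxfp4:.2f} GiB")
print(f"ratio: {bf16 / mxfp4:.2f}x")
```

The shared scale adds 8/32 = 0.25 bits per element, so the effective width is 4.25 bits and the ideal reduction relative to BF16 is about 3.76x rather than 4x.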
Edge and real‑time applications
- By recovering performance under aggressive quantization, BATQuant makes it more feasible to deploy large LLM/MLLM capabilities on constrained devices or lower‑cost accelerators, expanding addressable markets (mobile, embedded systems, on‑prem solutions).
Reduced need for retraining / lower engineering cost
- BATQuant is a PTQ method with small calibration needs (≈128 samples) and modest per‑layer optimization (few epochs), reducing the compute and data costs compared with full re‑training or expensive finetuning runs—this lowers operational and calendar costs for model owners updating or compressing models.
Hardware & ecosystem considerations
- Economic benefits assume availability of MXFP‑supporting hardware. Adoption depends on next‑gen accelerator support for block‑scaled 4‑bit formats (the OCP MXFP formats, or NVIDIA's NVFP4). If such hardware is widely adopted, demand for these accelerators and supporting toolchains will rise; otherwise, benefits are limited.
- BATQuant incurs some online activation transforms (runtime cost) and storage for compressed transforms (GPK), though GPK dramatically reduces storage. Deployers must weigh modest extra compute/storage against quantization gains.
Market and competition effects
- Lower inference costs and easier edge deployment can accelerate productization of multimodal LLM features, intensifying competition and possibly lowering pricing for AI services.
- Smaller operational costs may also enable more models to be deployed concurrently (multi‑model services), increasing utility per datacenter dollar.
Limitations & cautions (economic perspective)
- Gains are hardware‑dependent: without MXFP support, benefits are reduced.
- Additional engineering: integrating BATQuant (calibration, per‑layer transforms, GPTQ integration) introduces tooling complexity; initial engineering investment may be needed.
- Evaluation scope: results are shown for Qwen3 variants; real‑world economic impact should be validated across more architectures and production workloads.

Summary: BATQuant materially improves MXFP4 quantization robustness and efficiency, enabling aggressive FP4 deployments with small calibration budgets. For organizations and edge/accelerator vendors that can adopt MXFP‑capable hardware, BATQuant can reduce memory, compute, and energy costs of serving LLMs/MLLMs—lowering inference spend and enabling new deployment scenarios—while requiring modest additional engineering for integration and calibration.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides extensive empirical evaluations (multiple MLLMs and LLMs), head-to-head comparisons with prior PTQ baselines, and ablation studies that support the method-level claims; however, evidence is limited to model-level performance recovery under a specific MXFP4/W4A4KV16 quantization regime and lacks end-to-end hardware cost/throughput measurements, broad cross-hardware validation, and public replication assets.

Methods Rigor: medium. Methodologically sound within ML research norms: clear algorithmic contributions (block-wise affine transforms, orthogonality relaxation, GPK decomposition, learnable clipping), baseline comparisons, and ablations; but missing or unclear elements reduce rigor: potential lack of diverse hardware experiments, limited disclosure of hyperparameters and tuning sensitivity for each model, and no direct measurement of inference latency/energy or production integration costs.

Sample: Post-Training Quantization experiments on several multimodal large language models (MLLMs) and standard LLMs evaluated on multimodal benchmarks and language tasks under aggressive W4A4KV16 quantization; comparisons to rotation-based PTQ baselines and other PTQ methods; ablation studies isolating block-wise transforms, GPK compression, and block-wise clipping; reported metric is percentage of full-precision performance recovered (e.g., up to 96.43%).

Themes: adoption, productivity

Generalizability
- Depends on MXFP4-compatible accelerators: benefits require hardware that supports MXFP block formats.
- Evaluated on a set of MLLMs/LLMs and benchmarks; may not generalize to all architectures (e.g., very different layer/block tilings or small models).
- Results tied to a specific aggressive quantization configuration (W4A4KV16); other quantization schemes may behave differently.
- Reported metric is model-quality recovery; real-world throughput, latency, energy, and cost gains on production hardware are not directly measured.
- Potential sensitivity to per-model tuning and implementation details that could limit out-of-the-box adoption.

Claims (11)

Claim	Direction	Confidence	Outcome	Details
BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs. Output Quality	positive	high	Percentage of full-precision performance recovered (model quality/accuracy on multimodal and language tasks)	96.43% of full-precision performance recovered 0.18
BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization. Output Quality	positive	high	Task-specific accuracy/quality metrics and percent recovery relative to full-precision across multimodal benchmarks and language tasks	0.18
Rotation-based PTQ methods (designed for integer formats) fail on MXFP4 because global orthogonal rotations move outlier energy across quantization blocks, creating new outliers and often producing bimodal activations that underutilize the limited MXFP range. Output Quality	negative	medium	Activation distribution characteristics (outlier propagation, bimodality) and resulting model performance (collapse) under rotation-based PTQ	0.11
Aligning transforms to MXFP block granularity using block-wise affine transformations prevents cross-block outlier propagation and avoids the severe collapse seen with rotation-based integer quantization techniques. Output Quality	positive	medium	Activation distribution (outlier propagation) and downstream task performance / accuracy after quantization	0.11
Relaxing orthogonality constraints on transforms (i.e., using non-strictly-orthogonal transforms) improves distribution shaping and better fits activations to the limited MXFP quantization range. Output Quality	positive	medium	Quantization fit (activation distribution shape) and resulting task accuracy/quality after applying the transforms	0.11
Global and Private Kronecker (GPK) decomposition compresses transform parameters, keeping storage and runtime overhead low compared to dense per-block transforms. Organizational Efficiency	positive	medium	Storage footprint and runtime overhead of transform parameterization (memory and compute overhead for transforms)	0.11
Block-wise learnable clipping suppresses residual outliers locally and contributes to robustness under aggressive MXFP4 quantization. Output Quality	positive	medium	Residual outlier statistics and downstream task performance after applying learnable clipping	0.11
Ablation analyses show that each BATQuant component (block-wise transforms, orthogonality relaxation, GPK decomposition, block-wise clipping) contributes to robustness and efficiency. Output Quality	positive	medium	Task performance (accuracy/quality) and efficiency metrics (storage/runtime) with and without each component	0.11
BATQuant establishes new state-of-the-art results across multimodal benchmarks for MXFP4-aware PTQ under aggressive quantization. Output Quality	positive	medium	Benchmark performance (accuracy/quality) on multimodal tasks relative to prior PTQ methods	0.11
Deploying BATQuant with reliable 4-bit weight/activation quantization for MXFP-capable accelerators reduces memory footprint and memory-bandwidth pressure, enabling higher throughput and lower per-token inference costs. Firm Productivity	positive	low	Inferred system-level outcomes: memory footprint, memory-bandwidth usage, throughput, and per-token inference cost (not directly measured in production in the summary)	0.05
Practical caveats: benefits depend on accelerators supporting MXFP formats; despite up to 96% recovery, residual quality gaps may remain for some task-specific or safety-critical cases; integration and tuning cost is required to apply BATQuant. Adoption Rate	mixed	high	Dependency on hardware support (binary), residual accuracy gap relative to full-precision for specific tasks, and engineering effort (qualitative)	0.18