BATQuant makes aggressive 4-bit MXFP4 quantization practical for large multimodal and language models, recovering up to 96% of full-precision quality by respecting hardware block granularity and compressing transform parameters; the technique promises lower inference costs on MXFP-capable accelerators but its market impact depends on hardware adoption and real-world throughput gains.
Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
Summary
Main Finding
BATQuant is a post‑training quantization (PTQ) method tailored to microscaling floating‑point (MXFP) formats, especially MXFP4. It (1) prevents cross‑block outlier propagation by restricting transformations to MXFP block granularity, (2) learns affine (not strictly orthogonal) block‑wise transforms to better shape activation distributions, and (3) compresses those transforms via a Global and Private Kronecker (GPK) decomposition. On large multimodal and language models (Qwen3‑VL‑8B and Qwen3‑8B), BATQuant achieves state‑of‑the‑art accuracy under aggressive MXFP quantization (e.g., W4A4KV16), recovering up to ~96.4% of full‑precision performance, and more than 99% in the less aggressive W4A8KV16 setting.
Key Points
- Problem addressed
  - Existing rotation‑based PTQ methods that work well for INT4 fail on MXFP4 because global orthogonal rotations move outlier energy across MXFP blocks (block size = 32), creating new outliers and often producing bimodal activation distributions that underutilize the floating‑point grids.
- Methodological innovations
  - Block‑wise Affine Transformation (BAT): the transform P is block‑diagonal, with each block Pi (size g×g, g aligned to the MXFP block size, e.g., 32) learned to minimize layer‑wise quantization error. Restricting transforms to blocks prevents cross‑block outlier transfer.
  - Relaxed orthogonality: the learned transforms are affine rather than orthogonal, so distribution shaping can be optimized for MXFP quantization characteristics.
  - Global and Private Kronecker (GPK) decomposition: each Pi is modeled as Bi ⊗ A, where A is shared (global) and Bi is block‑specific (private). This drastically reduces parameter/storage overhead compared with naive per‑block matrices.
  - Block‑wise Learnable Clipping: per‑block learnable clipping thresholds (parameterized via sigmoid of α) suppress residual outliers within blocks.
- Integration & deployment
  - Weight‑side transforms are fused offline into linear layers; activation‑side transforms are applied online during inference.
  - Works with standard Transformer components (MLP, self‑attention) and KV cache quantization (separate transforms for keys/values).
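The offline/online split can be checked numerically. The following is a minimal sketch that assumes the usual equivalence for transform‑based PTQ, Y = X·Wᵀ = (X·P)(P⁻¹·Wᵀ); the summary does not spell out the exact fused form, so this form, the toy sizes, the random near‑identity blocks, and the helper names `block_diag` and `fuse_weight` are all illustrative assumptions. Quantization itself is left out so that the two outputs should agree up to floating‑point error.

```python
import numpy as np

def block_diag(blocks):
    """Assemble equally sized square blocks into one block-diagonal matrix."""
    g = blocks[0].shape[0]
    out = np.zeros((g * len(blocks), g * len(blocks)))
    for i, b in enumerate(blocks):
        out[i * g:(i + 1) * g, i * g:(i + 1) * g] = b
    return out

def fuse_weight(W, P):
    """Offline step (assumed form): return W' with W'.T = P^{-1} @ W.T."""
    # W has shape (out_features, in_features).
    return np.linalg.solve(P, W.T).T

rng = np.random.default_rng(0)
g, k, d_out = 4, 3, 5                      # toy block size, block count, outputs
# Near-identity blocks keep P comfortably invertible (hypothetical values).
blocks = [np.eye(g) + 0.1 * rng.standard_normal((g, g)) for _ in range(k)]
P = block_diag(blocks)
W = rng.standard_normal((d_out, g * k))
X = rng.standard_normal((2, g * k))

W_fused = fuse_weight(W, P)                # done once, offline
X_t = X @ P                                # done per request, online
Y_ref = X @ W.T
Y_new = X_t @ W_fused.T
max_abs_diff = float(np.max(np.abs(Y_ref - Y_new)))
```

Because P is block‑diagonal, the online product X·P only mixes values inside each block, which is the property the method relies on to avoid cross‑block outlier transfer.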
- Empirical results (high level)
  - Near‑lossless W4A8KV16: >99% recovery vs BF16.
  - Aggressive W4A4KV16: up to 96.43% recovery, outperforming baselines (QuaRot, SpinQuant, BRQ, FlatQuant, SmoothQuant, GPTQ variants).
  - Robust across multimodal benchmarks (MME, OCR‑Bench, DocVQA, RealWorldQA, VLMBlind) and standard LLM reasoning/non‑reasoning suites.
- Practical calibration/training
  - Small calibration sets suffice: 128 samples (self‑generated text for LLM, 128 image‑text pairs for MLLM).
  - Training setup: AdamW, lr=2e‑3, cosine schedule, ~5 epochs, small batch sizes.
Data & Methods
- Target format and problem specifics
  - MXFP4 (E2M1): block‑wise microscaling floating point with block size g = 32 and one shared UE8M0 scaling exponent per block; each element can represent only a small discrete set of magnitudes.
  - Outliers dominate the shared scale of their block, so they cause severe quantization error if their energy is not properly handled.
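To make the outlier problem concrete, here is a minimal fake‑quantization sketch of one MXFP4 block. It assumes the E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} and a shared power‑of‑two scale chosen as floor(log2(max|x|)) minus the largest element exponent; the scale rule is one common convention and is an assumption here, as are the function names and the toy data. Clamping of the UE8M0 exponent range is omitted.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4.0 == 2**2)

def quantize_block_mxfp4(block):
    """Fake-quantize one block: shared power-of-two scale + E2M1 elements."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumed convention: shared exponent = floor(log2(amax)) - emax.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])          # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda q: abs(q - mag))  # ties -> smaller
        out.append(math.copysign(nearest * scale, v))
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# A block of 32 small values, and the same block with one large outlier.
calm = [0.1 * ((i % 7) - 3) for i in range(32)]
spiky = calm[:-1] + [48.0]

# Error on the 31 shared (non-outlier) positions only.
err_calm = sq_error(calm[:-1], quantize_block_mxfp4(calm)[:-1])
err_spiky = sq_error(spiky[:-1], quantize_block_mxfp4(spiky)[:-1])
```

In this toy case the outlier forces a block scale of 8, so every small value rounds to zero, while the same values survive with small error when the outlier is absent. A transform that moves outlier energy into a previously calm block would trigger exactly this effect.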
- Optimization objective
  - For each linear layer l, learn the block‑wise transform P (and clipping parameters α) to minimize the L2 error between the full‑precision output and the quantized output over a small calibration set: Θ_l* = argmin_{Θ_l} E_X ||F_l(X) − F̂_l(X; Θ_l)||², where Θ_l collects the learnable parameters of layer l.
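The layer‑wise objective can be evaluated for any candidate transform as follows. This is a sketch under stated assumptions: the quantized layer is taken to be Q(X·P)·Q(P⁻¹·Wᵀ), `toy_quant` is a uniform stand‑in quantizer rather than MXFP4, clipping is omitted, and all names and sizes are hypothetical. In the setup described above, the parameters would then be updated with a gradient‑based optimizer (AdamW) to reduce this value.

```python
import numpy as np

def layer_error(X, W, P, quant):
    """Mean squared output error of one linear layer under transform P."""
    Y_ref = X @ W.T                               # full-precision output F_l(X)
    X_q = quant(X @ P)                            # online: transform, then quantize
    W_q = quant(np.linalg.solve(P, W.T).T)        # offline: fuse P^{-1}, then quantize
    return float(np.mean(np.sum((Y_ref - X_q @ W_q.T) ** 2, axis=-1)))

def toy_quant(t, step=0.25):
    """Stand-in uniform quantizer (NOT MXFP4); only makes the loss nonzero."""
    return np.round(t / step) * step

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 8))   # 128 calibration rows, toy hidden size 8
W = rng.standard_normal((4, 8))
identity = np.eye(8)

loss_no_quant = layer_error(X, W, identity, quant=lambda t: t)
loss_identity = layer_error(X, W, identity, quant=toy_quant)
```

Without quantization the loss is zero for any invertible P, which is why the transform is "free" in full precision and only matters through its effect on the quantizer's input distribution.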
- Block structure and GPK
  - Pi ∈ R^{g×g}, P = block_diag(P1,...,Pk). To avoid storing k·g^2 params per layer, apply Kronecker structure: Pi = Bi ⊗ A, where A (size g1×g1) is global/shared, Bi (g2×g2) is private, and g = g1·g2.
  - Example (paper): for hidden dim N=4096, g=32, and chosen g1=8, g2=4, there are k = N/g = 128 blocks, so GPK stores one shared 8×8 matrix plus 128 private 4×4 matrices: 64 + 128·16 = 2,112 parameters (the paper reports a >74% reduction compared to FlatQuant/Naive Kronecker).
  - Computation: the paper uses vectorization tricks to keep the matmul cost of the online activation transform XP low.
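The parameter count and the structured product can both be verified in a few lines. This sketch uses the sizes from the example (N=4096, g=32, g1=8, g2=4) with random factors; the einsum formulation is one standard way to exploit the Kronecker identity and is an assumption, not necessarily the implementation used in the paper.

```python
import numpy as np

N, g, g1, g2 = 4096, 32, 8, 4        # sizes from the example
k = N // g                           # number of blocks per layer (128)

dense_params = k * g * g             # one free g x g matrix per block
gpk_params = g1 * g1 + k * g2 * g2   # one shared A plus k private B_i

rng = np.random.default_rng(0)
A = rng.standard_normal((g1, g1))    # global factor
B = rng.standard_normal((k, g2, g2)) # private factors
x = rng.standard_normal(N)           # one activation row

# Reference: materialise each P_i = kron(B_i, A) and multiply block by block.
y_dense = np.concatenate(
    [x[i * g:(i + 1) * g] @ np.kron(B[i], A) for i in range(k)]
)

# Structured: view each block as a g2 x g1 matrix X_i and compute B_i^T X_i A,
# which never forms a g x g matrix.
X = x.reshape(k, g2, g1)
y_fast = np.einsum("kpj,kpq,ql->kjl", B, X, A).reshape(N)

max_abs_diff = float(np.max(np.abs(y_dense - y_fast)))
```

Here `gpk_params` evaluates to 2,112 against 131,072 for unconstrained per‑block matrices, and the structured path reproduces the dense result up to rounding.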
- Clipping
  - Per‑block clipping bounds: βmin_i = sigmoid(αmin_i)·min(x_i), βmax_i = sigmoid(αmax_i)·max(x_i); the α parameters are learned during calibration.
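The clipping rule is small enough to write out directly. This is a minimal sketch of the forward computation only, assuming each block contains both negative and positive values so that scaling min and max by a sigmoid in (0, 1) pulls both bounds toward zero; the function names and sample values are hypothetical, and learning α is not shown.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def clip_block(block, alpha_min, alpha_max):
    """Clip one block to [sigmoid(alpha_min)*min(x), sigmoid(alpha_max)*max(x)]."""
    lo = sigmoid(alpha_min) * min(block)
    hi = sigmoid(alpha_max) * max(block)
    return [min(max(v, lo), hi) for v in block]

block = [-8.0, -1.0, 0.5, 2.0, 16.0]

# Large alpha: sigmoid is close to 1, so the block is almost unchanged.
nearly_untouched = clip_block(block, alpha_min=20.0, alpha_max=20.0)
# alpha = 0: sigmoid is exactly 0.5, so both bounds are halved.
halved = clip_block(block, alpha_min=0.0, alpha_max=0.0)
```

With α = 0 the bounds become [-4, 8], so only the two extreme values are clipped and the interior values pass through.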
- Experimental setup
  - Models: Qwen3‑8B and Qwen3‑VL‑8B‑Instruct.
  - Benchmarks: multimodal VQA/Doc/OCR/STEM and LLM non‑reasoning + reasoning (PIQA, Winogrande, GSM8K, MATH500, etc.).
  - Quantization configs are written W{bits}A{bits}KV{bits} (weight, activation, and KV‑cache bit‑widths); evaluated settings include W4A8KV16, W4A4KV16, W4A8KV8, W4A8KV4, etc.
  - Comparison baselines: RTN, QuaRot, SpinQuant, BRQ, FlatQuant, SmoothQuant, GPTQ variants. In many experiments GPTQ integration is used for weight quantizers.
  - Calibration data: 128 sequences or image‑text pairs (low calibration cost).
  - Training: 5 epochs, small batch size, single‑layer/per‑layer optimization (PTQ).
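For readers unfamiliar with the W/A/KV shorthand, the following small parser shows how the configuration tags decompose. The `QuantConfig` type and `parse_config` function are hypothetical helpers for illustration and do not come from the paper or any library.

```python
import re
from typing import NamedTuple

class QuantConfig(NamedTuple):
    weight_bits: int
    act_bits: int
    kv_bits: int

_PATTERN = re.compile(r"W(\d+)A(\d+)KV(\d+)")

def parse_config(tag: str) -> QuantConfig:
    """Split a tag such as 'W4A8KV16' into its three bit-widths."""
    m = _PATTERN.fullmatch(tag)
    if m is None:
        raise ValueError(f"not a W{{w}}A{{a}}KV{{kv}} tag: {tag!r}")
    return QuantConfig(*(int(x) for x in m.groups()))

configs = [parse_config(t) for t in ("W4A8KV16", "W4A4KV16", "W4A8KV8", "W4A8KV4")]
```

For example, W4A4KV16 means 4‑bit weights, 4‑bit activations, and a 16‑bit (unquantized) KV cache.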
Implications for AI Economics
- Deployment cost reduction
  - Memory & bandwidth: MXFP quantization that preserves accuracy (near‑lossless at W4A8KV16, up to ~96% recovery at W4A4KV16) reduces the model parameter footprint and activation memory, lowering DRAM pressure and on‑chip memory needs. On MXFP‑capable accelerators this should translate into lower per‑inference cost.
  - Compute & energy: Lower bit‑width arithmetic (MXFP4/8) reduces energy per operation and may increase throughput on supported hardware, which could improve energy efficiency and lower cloud inference bills and datacenter TCO.
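A back‑of‑the‑envelope estimate shows the size of the weight‑memory saving. This sketch assumes 4‑bit elements with one 8‑bit shared scale per 32‑element block and a nominal 8 billion parameters, all stored in MXFP4; real checkpoints typically keep some tensors (for example embeddings or norms) in higher precision, so the result is an upper bound on the saving rather than a measured figure.

```python
def mxfp4_bits_per_element(block_size=32, element_bits=4, scale_bits=8):
    """Average storage per element: 4-bit value plus an amortised shared scale."""
    return element_bits + scale_bits / block_size

def weight_gib(num_params, bits_per_element):
    """Weight storage in GiB for a given average bit-width."""
    return num_params * bits_per_element / 8 / 2**30

NUM_PARAMS = 8e9  # nominal "8B" parameter count, an assumption for illustration

bpe = mxfp4_bits_per_element()           # 4 + 8/32 = 4.25 bits
bf16_gib = weight_gib(NUM_PARAMS, 16)
mxfp4_gib = weight_gib(NUM_PARAMS, bpe)
ratio = bf16_gib / mxfp4_gib             # 16 / 4.25, roughly 3.76x smaller
```

The shared scale adds only 0.25 bits per element, so the effective compression relative to BF16 is about 3.76x rather than the nominal 4x.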
- Edge and real‑time applications
  - By recovering performance under aggressive quantization, BATQuant makes it more feasible to deploy large LLM/MLLM capabilities on constrained devices or lower‑cost accelerators, expanding addressable markets (mobile, embedded systems, on‑prem solutions).
- Reduced need for retraining / lower engineering cost
  - BATQuant is a PTQ method with small calibration needs (≈128 samples) and modest per‑layer optimization (a few epochs), so it requires far less compute and data than full re‑training or expensive finetuning runs. This lowers operational and calendar costs for model owners updating or compressing models.
- Hardware & ecosystem considerations
  - Economic benefits assume availability of MXFP‑supporting hardware. Adoption depends on next‑gen accelerator support for block‑scaled low‑precision formats (the OCP MXFP formats, alongside vendor formats such as NVIDIA's NVFP4). If such hardware is widely adopted, demand for these accelerators and supporting toolchains will rise; otherwise, benefits are limited.
  - BATQuant incurs some online activation transforms (runtime cost) and storage for the compressed transforms, though GPK dramatically reduces that storage. Deployers must weigh the modest extra compute/storage against the quantization gains.
- Market and competition effects
  - Lower inference costs and easier edge deployment can accelerate productization of multimodal LLM features, intensifying competition and possibly lowering pricing for AI services.
  - Smaller operational costs may also enable more models to be deployed concurrently (multi‑model services), increasing utility per datacenter dollar.
- Limitations & cautions (economic perspective)
  - Gains are hardware‑dependent: without MXFP support, benefits are reduced.
  - Additional engineering: integrating BATQuant (calibration, per‑layer transforms, GPTQ integration) introduces tooling complexity; initial engineering investment may be needed.
  - Evaluation scope: results are shown for Qwen3 variants; real‑world economic impact should be validated across more architectures and production workloads.
Summary: BATQuant materially improves MXFP4 quantization robustness and efficiency, enabling aggressive FP4 deployments with small calibration budgets. For organizations and edge/accelerator vendors that can adopt MXFP‑capable hardware, BATQuant can reduce the memory, compute, and energy costs of serving LLMs/MLLMs, lowering inference spend and enabling new deployment scenarios. In exchange, it requires modest additional engineering for integration and calibration.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs. | Output Quality | positive | high | Percentage of full-precision performance recovered (model quality/accuracy on multimodal and language tasks) | 96.43% of full-precision performance recovered; 0.18 |
| BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization. | Output Quality | positive | high | Task-specific accuracy/quality metrics and percent recovery relative to full-precision across multimodal benchmarks and language tasks | 0.18 |
| Rotation-based PTQ methods (designed for integer formats) fail on MXFP4 because global orthogonal rotations move outlier energy across quantization blocks, creating new outliers and often producing bimodal activations that underutilize the limited MXFP range. | Output Quality | negative | medium | Activation distribution characteristics (outlier propagation, bimodality) and resulting model performance (collapse) under rotation-based PTQ | 0.11 |
| Aligning transforms to MXFP block granularity using block-wise affine transformations prevents cross-block outlier propagation and avoids the severe collapse seen with rotation-based integer quantization techniques. | Output Quality | positive | medium | Activation distribution (outlier propagation) and downstream task performance / accuracy after quantization | 0.11 |
| Relaxing orthogonality constraints on transforms (i.e., using non-strictly-orthogonal transforms) improves distribution shaping and better fits activations to the limited MXFP quantization range. | Output Quality | positive | medium | Quantization fit (activation distribution shape) and resulting task accuracy/quality after applying the transforms | 0.11 |
| Global and Private Kronecker (GPK) decomposition compresses transform parameters, keeping storage and runtime overhead low compared to dense per-block transforms. | Organizational Efficiency | positive | medium | Storage footprint and runtime overhead of transform parameterization (memory and compute overhead for transforms) | 0.11 |
| Block-wise learnable clipping suppresses residual outliers locally and contributes to robustness under aggressive MXFP4 quantization. | Output Quality | positive | medium | Residual outlier statistics and downstream task performance after applying learnable clipping | 0.11 |
| Ablation analyses show that each BATQuant component (block-wise transforms, orthogonality relaxation, GPK decomposition, block-wise clipping) contributes to robustness and efficiency. | Output Quality | positive | medium | Task performance (accuracy/quality) and efficiency metrics (storage/runtime) with and without each component | 0.11 |
| BATQuant establishes new state-of-the-art results across multimodal benchmarks for MXFP4-aware PTQ under aggressive quantization. | Output Quality | positive | medium | Benchmark performance (accuracy/quality) on multimodal tasks relative to prior PTQ methods | 0.11 |
| Deploying BATQuant with reliable 4-bit weight/activation quantization for MXFP-capable accelerators reduces memory footprint and memory-bandwidth pressure, enabling higher throughput and lower per-token inference costs. | Firm Productivity | positive | low | Inferred system-level outcomes: memory footprint, memory-bandwidth usage, throughput, and per-token inference cost (not directly measured in production in the summary) | 0.05 |
| Practical caveats: benefits depend on accelerators supporting MXFP formats; despite up to 96% recovery, residual quality gaps may remain for some task-specific or safety-critical cases; integration and tuning cost is required to apply BATQuant. | Adoption Rate | mixed | high | Dependency on hardware support (binary), residual accuracy gap relative to full-precision for specific tasks, and engineering effort (qualitative) | 0.18 |