dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Yingzi Ma1, Yulong Cao2, Wenhao Ding2, Shuibai Zhang1, Yan Wang2,
Boris Ivanovic2, Ming Jiang1, Marco Pavone2,3, Chaowei Xiao2,4

1 University of Wisconsin–Madison, 2 NVIDIA, 3 Stanford University, 4 Johns Hopkins University

✉ Corresponding authors: yma382@wisc.edu, chaoweixiao@jhu.edu
Paper Code (Coming Soon)

Challenges in driving VLMs

Challenge 1: Reasoning–Action Inconsistency. The predicted trajectory often contradicts the model’s stated reasoning.

Challenge 2: Uncontrollable Generation. Structured reasoning can be bypassed or corrupted by prompt-level perturbations.

Abstract


The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision–language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs and vision–language–action models (VLAs) for driving are built upon autoregressive (AR) models.

In this paper, we observe that existing AR-based VLMs—limited by causal attention and sequential token generation—often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision–language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving.

Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning–action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone. It outperforms AR-based baselines with a 9% improvement in behavior–trajectory consistency and a 6% increase in rater feedback score (RFS) on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway toward scalable end-to-end driving.

Our Approach


dVLM-AD Reasoning Pipeline

⚙️ Unified Diffusion-based Planner
dVLM-AD starts from a structured chain-of-thought template that includes critical objects, causal explanations, future meta behavior, and a sequence of waypoints. Instead of left-to-right decoding, a diffusion denoiser iteratively refines all reasoning and action tokens jointly, conditioned on multi-camera views, navigation commands, and ego state.
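To make the joint refinement concrete, here is a minimal decoding sketch in the spirit of confidence-based iterative denoising: all slots start masked, and at each round the most confident proposals are committed while the rest stay masked for later refinement. The denoiser stub, vocabulary, template markers, and step count are illustrative assumptions, not the released dVLM-AD implementation.

# Minimal sketch of joint iterative denoising over a structured template.
# The stand-in `denoiser` is a random stub; all names are assumptions.
import numpy as np

MASK = "<mask>"
VOCAB = ["<obj:pedestrian>", "<why:crossing>", "<behavior:yield>", "<wp:(1.2,0.4)>"]

def denoiser(tokens, conditioning):
    """Stand-in for the diffusion backbone: proposes (token, confidence) for
    every masked position, conditioned on images, command, and ego state."""
    rng = np.random.default_rng(0)
    proposals = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            probs = rng.dirichlet(np.ones(len(VOCAB)))
            j = int(probs.argmax())
            proposals[i] = (VOCAB[j], float(probs[j]))
    return proposals

def denoise(template, conditioning, steps=4):
    tokens = list(template)
    for _ in range(steps):
        proposals = denoiser(tokens, conditioning)
        if not proposals:
            break
        # Commit only the most confident half of the masked slots each round,
        # so reasoning and waypoint tokens are refined jointly over several passes.
        k = max(1, len(proposals) // 2)
        for i, (tok, _) in sorted(proposals.items(),
                                  key=lambda kv: kv[1][1], reverse=True)[:k]:
            tokens[i] = tok
    return tokens

# Template: fixed schema markers stay literal; only the slots start masked.
template = ["<objects>", MASK, "<why>", MASK, "<behavior>", MASK, "<waypoints>", MASK]
print(denoise(template, conditioning={"cmd": "turn_left"}))

Because every masked position is scored in parallel under bidirectional attention, the waypoints can be revised in light of the behavior tokens (and vice versa) within the same denoising pass, which is the property left-to-right decoding lacks.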

🧩 Controllable Structured Reasoning
During decoding, only editable slots in the template are masked and updated, so the schema itself enforces safety and format constraints. A dynamic denoising strategy with a special reduce token allows variable-length phrases inside fixed windows, avoiding length-matching bias and preserving semantic consistency between behavior and trajectory.
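The sketch below illustrates the slot-constrained masking idea: schema markers are kept literal, a fixed window of editable positions is reserved after each marker, and positions filled with a reduce token are dropped at the end so shorter phrases fit. The window size, marker names, and the REDUCE symbol are assumptions for illustration only.

# Minimal sketch of slot-constrained masking with a reduce token (illustrative).
MASK, REDUCE = "<mask>", "<reduce>"
WINDOW = 4  # fixed number of editable positions reserved per slot (assumed)

schema = ["<objects>", "<why>", "<behavior>", "<waypoints>"]

def build_template(schema):
    """Keep schema markers literal; reserve a fixed masked window after each."""
    tokens, editable = [], []
    for marker in schema:
        tokens.append(marker)
        start = len(tokens)
        tokens.extend([MASK] * WINDOW)
        editable.append(range(start, start + WINDOW))
    return tokens, editable

def finalize(tokens):
    """Drop reduce tokens so variable-length phrases fit fixed-size windows."""
    return [t for t in tokens if t != REDUCE]

tokens, editable = build_template(schema)
# A denoiser would only ever write into `editable` positions; here we fill the
# "<behavior>" window by hand with a 2-token phrase padded by reduce tokens.
for pos, tok in zip(editable[2], ["yield", "to_pedestrian", REDUCE, REDUCE]):
    tokens[pos] = tok
print(finalize(tokens))

Keeping the schema markers out of the editable set is what makes the generation controllable: a prompt-level perturbation can change what fills a slot, but it cannot delete or reorder the reasoning structure itself.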

🪜 Two Training Stages
Stage I aligns the diffusion backbone to the driving domain using about 145k driving-related QA pairs from existing datasets, grounding perception and prediction in realistic scenes. Stage II supervises structured reasoning–action pairs on nuScenes and WOD-E2E (23k + 30k samples), so that detected objects, explanations, meta behaviors, and trajectories remain mutually consistent.
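As a summary of the recipe, a small configuration sketch is shown below. Only the data sources and sample counts come from the description above; the field names and objective strings are assumed placeholders, not the paper's actual training configuration.

# Illustrative two-stage training recipe (field names are assumptions).
TRAINING_STAGES = {
    "stage_1_domain_alignment": {
        "data": "driving-related QA pairs from existing datasets",
        "num_samples": 145_000,
        "objective": "ground perception and prediction in realistic scenes",
    },
    "stage_2_structured_supervision": {
        "data": {"nuScenes": 23_000, "WOD-E2E": 30_000},
        "objective": "keep objects, explanations, meta behaviors, and trajectories consistent",
    },
}

for name, cfg in TRAINING_STAGES.items():
    print(name, "->", cfg["objective"])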

Demonstration


BibTeX


@misc{ma2025dvlmadenhancediffusionvisionlanguagemodel,
      title={dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning}, 
      author={Yingzi Ma and Yulong Cao and Wenhao Ding and Shuibai Zhang and Yan Wang and Boris Ivanovic and Ming Jiang and Marco Pavone and Chaowei Xiao},
      year={2025},
      eprint={2512.04459},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04459}, 
}