CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning

WACV 2026
Zeyuan Chen 1    Xiang Zhang 1    Haiyang Xu 1    Jianwen Xie 2    Zhuowen Tu 1   
1 UC San Diego      2 Lambda, Inc.
CVP teaser

Human vision combines central vision, for focused, high-acuity perception, with peripheral vision, for broader contextual awareness. Our proposed model, CVP, mimics this dual process with a target-affinity token that guides attention to target-relevant objects and regions, and an allocentric grid that captures global spatial context. Experimental results show that CVP achieves state-of-the-art results across multiple 3D scene understanding benchmarks.

Abstract

We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields -- central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: a target-affinity token, analogous to central vision, that guides the model's attention toward query-relevant objects; and an allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.

Method Overview

pipeline overview

Illustration of CVP. Given visual tokens from multi-view images with 3D positional embeddings and a user question as input, we (1) incorporate a text-based allocentric grid to provide global scene context in an allocentric frame; and (2) introduce a special target-affinity token that guides the model to focus on target-related objects. During output generation, in addition to producing a language response, the representation of the target-affinity token is passed through an MLP and optimized with a contrastive loss against 3D object embeddings back-projected from multi-view 2D features. Positive samples correspond to ground-truth objects relevant to the question, while negative samples correspond to irrelevant objects. This contrastive supervision helps the model attend more effectively to semantically relevant targets.
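The contrastive supervision on the target-affinity token can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, tensor shapes, temperature value, and the softmax-over-objects formulation are all assumptions.

```python
import numpy as np

def target_affinity_loss(token_repr, object_embs, positive_mask, temperature=0.07):
    """Contrastive loss pulling the (MLP-projected) target-affinity token
    toward embeddings of question-relevant objects and away from irrelevant
    ones. Shapes and formulation are illustrative assumptions.

    token_repr:    (D,)  projected hidden state of the target-affinity token.
    object_embs:   (N, D) 3D object embeddings back-projected from 2D features.
    positive_mask: (N,)  True for ground-truth question-relevant objects.
    """
    q = token_repr / np.linalg.norm(token_repr)
    k = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    logits = k @ q / temperature          # cosine similarity to each object
    logits = logits - logits.max()        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    # negative log-likelihood averaged over the positive objects
    return -log_prob[positive_mask].mean()
```

Minimizing this loss raises the similarity between the token and relevant objects relative to irrelevant ones, which is the intended attention-steering effect.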

Qualitative Results

3D dense captioning comparison

3D Dense Captioning: The model is tasked to generate a description for a specific object in the 3D scene. The baseline Video-3D-LLM is drawn to the larger, more visually salient sink and fails to describe the actual target; our target-affinity token steers the model to the soap dispenser itself, enabling precise localization and an object-specific caption.

3D visual grounding comparison

3D Visual Grounding: The model is tasked to locate an object in the 3D scene given a description. In this example, the description involves multiple spatial relationships, such as “to the left of” and “closest to the left side of the room.” The target-affinity token in CVP successfully retrieves the objects based on the question.

3D question answering with allocentric grid

3D Question Answering: The allocentric grid helps CVP answer questions that require reasoning over both proximity and relative position within the scene.
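As a rough illustration of how a text-based allocentric grid could be constructed, the sketch below bins object centers on the floor plane into a coarse grid and renders each occupied cell as a text line. The cell resolution, coordinate convention, and output format are assumptions for illustration, not the paper's actual design.

```python
def allocentric_grid(objects, grid_size=5):
    """Render a coarse top-down text grid of the scene (illustrative sketch).

    objects: list of (label, x, y) tuples with floor-plane coordinates.
    Returns one text line per occupied cell, e.g. "(0, 0): bed".
    """
    xs = [x for _, x, _ in objects]
    ys = [y for _, _, y in objects]
    min_x, min_y = min(xs), min(ys)
    span_x = max(max(xs) - min_x, 1e-6)   # avoid division by zero
    span_y = max(max(ys) - min_y, 1e-6)
    cells = {}
    for label, x, y in objects:
        # map each object center to a grid cell index, clamped to the grid
        i = min(int((x - min_x) / span_x * grid_size), grid_size - 1)
        j = min(int((y - min_y) / span_y * grid_size), grid_size - 1)
        cells.setdefault((i, j), []).append(label)
    return [f"({i}, {j}): {', '.join(names)}"
            for (i, j), names in sorted(cells.items())]
```

A serialization like this can be prepended to the question as plain text, giving the language model an explicit, allocentric map of where objects sit relative to one another.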

Quantitative Results

| Method | ScanRefer Acc@0.25 (↑) | ScanRefer Acc@0.5 (↑) | Multi3DRefer F1@0.25 (↑) | Multi3DRefer F1@0.5 (↑) |
|---|---|---|---|---|
| *Expert models* | | | | |
| ScanRefer | 37.3 | 24.3 | - | - |
| MVT | 40.8 | 33.3 | - | - |
| ViL3DRel | 47.9 | 37.7 | - | - |
| BUTD-DETR | 52.2 | 39.8 | - | - |
| 3DVG-Trans | 45.9 | 34.5 | - | 25.5 |
| 3DJCG | 49.6 | 37.3 | - | 26.6 |
| M3DRef-CLIP | 51.0 | 44.7 | 42.8 | 38.4 |
| *3D LMMs* | | | | |
| 3D-LLM | 30.3 | - | - | - |
| Chat-3D v2 | 35.9 | 30.4 | - | - |
| Grounded 3D-LLM | 47.9 | 44.1 | 45.2 | 40.6 |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 |
| LLaVA-3D | 50.1 | 42.7 | 49.8 | 43.6 |
| Video-3D-LLM | 58.1 | 51.7 | 58.0 | 52.7 |
| CVP (Ours) | **62.0** | **55.4** | **60.2** | **54.7** |

Quantitative comparisons with SOTA models for 3D Visual Grounding on ScanRefer and Multi3DRefer.

BibTeX

If you find our paper useful for your research or applications, please consider citing it using this BibTeX:

@InProceedings{chen2025cvp,
        title     = {CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning},
        author    = {Chen, Zeyuan and Zhang, Xiang and Xu, Haiyang and Xie, Jianwen and Tu, Zhuowen},
        booktitle = {WACV},
        year      = {2026},
}