CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning

WACV 2026
Zeyuan Chen 1    Xiang Zhang 1    Haiyang Xu 1    Jianwen Xie 2    Zhuowen Tu 1   
1 UC San Diego      2 Lambda, Inc.
CVP teaser

Human vision combines central vision, for focused, high-acuity perception, with peripheral vision, for broader contextual awareness. Our proposed model, CVP, mimics this dual process with a target-affinity token that guides attention to target-relevant objects and regions, and an allocentric grid that captures global spatial context. Experimental results show that CVP achieves state-of-the-art results across multiple 3D scene understanding benchmarks.

Abstract

We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields -- central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: a target-affinity token, analogous to central vision, that guides the model's attention toward query-relevant objects; and an allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.

Method Overview

pipeline overview

Illustration of CVP. Given visual tokens from multi-view images with 3D positional embeddings and a user question as input, we (1) incorporate a text-based allocentric grid to provide global scene context in an allocentric frame; and (2) introduce a special target-affinity token that guides the model to focus on target-related objects. During output generation, in addition to producing a language response, the representation of the target-affinity token is passed through an MLP and optimized with a contrastive loss against 3D object embeddings back-projected from multi-view 2D features. Positive samples correspond to ground-truth objects relevant to the question, while negative samples correspond to irrelevant objects. This contrastive supervision helps the model attend more effectively to semantically relevant targets.
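The contrastive supervision on the target-affinity token can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, tensor shapes, temperature value, and the softmax-over-objects formulation are all assumptions.

```python
import numpy as np

def target_affinity_loss(token_repr, object_embs, positive_mask, temperature=0.07):
    """Contrastive loss pulling the (MLP-projected) target-affinity token
    toward embeddings of question-relevant objects and away from irrelevant
    ones. Shapes and formulation are illustrative assumptions.

    token_repr:    (D,)  projected hidden state of the target-affinity token.
    object_embs:   (N, D) 3D object embeddings back-projected from 2D features.
    positive_mask: (N,)  True for ground-truth question-relevant objects.
    """
    q = token_repr / np.linalg.norm(token_repr)
    k = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    logits = k @ q / temperature          # cosine similarity to each object
    logits = logits - logits.max()        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    # negative log-likelihood averaged over the positive objects
    return -log_prob[positive_mask].mean()
```

Minimizing this loss raises the similarity between the token and relevant objects relative to irrelevant ones, which is the intended attention-steering effect.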

Qualitative Results

3D dense captioning comparison

3D Dense Captioning: The model is tasked to generate a description for a specific object in the 3D scene. The baseline Video-3D-LLM is drawn to the larger, more visually salient sink and fails to describe the actual target; our target-affinity token steers the model to the soap dispenser itself, enabling precise localization and an object-specific caption.

3D visual grounding comparison

3D Visual Grounding: The model is tasked to locate an object in the 3D scene given a description. In this example, the description involves multiple spatial relationships, such as “to the left of” and “closest to the left side of the room.” The target-affinity token in CVP successfully retrieves the objects based on the question.

3D question answering with allocentric grid

3D Question Answering: The allocentric grid helps CVP answer questions that require reasoning over both proximity and relative position within the scene.
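As a rough illustration of how a text-based allocentric grid could be constructed, the sketch below bins object centers on the floor plane into a coarse grid and renders each occupied cell as a text line. The cell resolution, coordinate convention, and output format are assumptions for illustration, not the paper's actual design.

```python
def allocentric_grid(objects, grid_size=5):
    """Render a coarse top-down text grid of the scene (illustrative sketch).

    objects: list of (label, x, y) tuples with floor-plane coordinates.
    Returns one text line per occupied cell, e.g. "(0, 0): bed".
    """
    xs = [x for _, x, _ in objects]
    ys = [y for _, _, y in objects]
    min_x, min_y = min(xs), min(ys)
    span_x = max(max(xs) - min_x, 1e-6)   # avoid division by zero
    span_y = max(max(ys) - min_y, 1e-6)
    cells = {}
    for label, x, y in objects:
        # map each object center to a grid cell index, clamped to the grid
        i = min(int((x - min_x) / span_x * grid_size), grid_size - 1)
        j = min(int((y - min_y) / span_y * grid_size), grid_size - 1)
        cells.setdefault((i, j), []).append(label)
    return [f"({i}, {j}): {', '.join(names)}"
            for (i, j), names in sorted(cells.items())]
```

A serialization like this can be prepended to the question as plain text, giving the language model an explicit, allocentric map of where objects sit relative to one another.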

Quantitative Results

| Method | ScanRefer Acc@0.25 (↑) | ScanRefer Acc@0.5 (↑) | Multi3DRefer F1@0.25 (↑) | Multi3DRefer F1@0.5 (↑) |
|---|---|---|---|---|
| *Expert models* | | | | |
| ScanRefer | 37.3 | 24.3 | - | - |
| MVT | 40.8 | 33.3 | - | - |
| ViL3DRel | 47.9 | 37.7 | - | - |
| BUTD-DETR | 52.2 | 39.8 | - | - |
| 3DVG-Trans | 45.9 | 34.5 | - | 25.5 |
| 3DJCG | 49.6 | 37.3 | - | 26.6 |
| M3DRef-CLIP | 51.0 | 44.7 | 42.8 | 38.4 |
| *3D LMMs* | | | | |
| 3D-LLM | 30.3 | - | - | - |
| Chat-3D v2 | 35.9 | 30.4 | - | - |
| Grounded 3D-LLM | 47.9 | 44.1 | 45.2 | 40.6 |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 |
| LLaVA-3D | 50.1 | 42.7 | 49.8 | 43.6 |
| Video-3D-LLM | 58.1 | 51.7 | 58.0 | 52.7 |
| CVP (Ours) | **62.0** | **55.4** | **60.2** | **54.7** |

Quantitative comparisons with SOTA models for 3D Visual Grounding on ScanRefer and Multi3DRefer.

BibTeX

If you find our paper useful for your research or applications, please consider citing it using this BibTeX:

@InProceedings{chen2025cvp,
        title     = {CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning},
        author    = {Chen, Zeyuan and Zhang, Xiang and Xu, Haiyang and Xie, Jianwen and Tu, Zhuowen},
        booktitle = {WACV},
        year      = {2026},
}