X-Dancer: Expressive Music to Human Dance Video Generation

ICCV 2025

Zeyuan Chen 1,2    Hongyi Xu 2    Guoxian Song 2    You Xie 2    Chenxu Zhang 2   
Xin Chen 2    Chao Wang 2    Di Chang 2,3    Linjie Luo 2   
1 UC San Diego      2 ByteDance      3 University of Southern California

Abstract

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse, long-range, and lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism.
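To make the compositional, confidence-aware tokenization described above concrete, here is a minimal PyTorch sketch: each body part (e.g., upper body, lower body, head, hands) is encoded and vector-quantized independently from its (x, y, confidence) keypoints, and a single shared decoder merges the part tokens back into a full-body 2D pose. All class names, keypoint splits, codebook sizes, and dimensions below are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, num_codes=512, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)     # (B, num_codes)
        idx = dists.argmin(dim=-1)                       # discrete token ids
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                     # straight-through gradient
        return z_q, idx


class PartEncoder(nn.Module):
    """Encodes the (x, y, confidence) keypoints of one body part into a latent."""

    def __init__(self, num_kpts, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_kpts * 3, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, kpts):                             # kpts: (B, num_kpts, 3)
        return self.net(kpts.flatten(1))


class CompositionalPoseTokenizer(nn.Module):
    """Per-part encoders and codebooks, merged by one shared full-body decoder."""

    def __init__(self, part_kpts):                       # e.g. {"upper": 8, "lower": 6, "head": 5, "hands": 42}
        super().__init__()
        self.encoders = nn.ModuleDict({k: PartEncoder(n) for k, n in part_kpts.items()})
        self.quantizers = nn.ModuleDict({k: VectorQuantizer() for k in part_kpts})
        self.total_kpts = sum(part_kpts.values())
        self.decoder = nn.Sequential(
            nn.Linear(128 * len(part_kpts), 512), nn.ReLU(),
            nn.Linear(512, self.total_kpts * 2),         # full-body (x, y) keypoints
        )

    def forward(self, parts):                            # dict of (B, num_kpts, 3) tensors
        latents, tokens = [], {}
        for name in self.encoders:
            z_q, idx = self.quantizers[name](self.encoders[name](parts[name]))
            latents.append(z_q)
            tokens[name] = idx
        pose = self.decoder(torch.cat(latents, dim=-1)).view(-1, self.total_kpts, 2)
        return pose, tokens                              # reconstructed pose + per-part token ids


# Toy usage: tokenize a batch of two noisy poses split into four parts.
split = {"upper": 8, "lower": 6, "head": 5, "hands": 42}
tokenizer = CompositionalPoseTokenizer(split)
parts = {k: torch.rand(2, n, 3) for k, n in split.items()}
pose, tokens = tokenizer(parts)                          # pose: (2, 61, 2)

With part-wise codebooks, each video frame reduces to a small tuple of discrete token ids per body part, which is the kind of sequence the autoregressive transformer in the Method section predicts.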

Method


Overview of X-Dancer. We propose a cross-conditional transformer model to generate 2D human poses synchronized with input music, followed by a diffusion model to produce high-fidelity videos from a single reference image. First, we develop a compositional tokenization method for 2D poses, encoding each body part independently with confidence-aware quantization; a shared decoder merges these tokens into a full-body pose. Next, a GPT-based transformer autoregressively predicts future pose tokens with causal attention, conditioned on past poses and aligned music embeddings, as well as global music style and prior motion context. A learnable motion decoder then generates multi-scale pose guidance by upsampling a feature map that integrates the generated motion tokens over a temporal window (16 frames) via AdaIN. By co-training the motion decoder and temporal modules, our diffusion model synthesizes temporally smooth, high-fidelity video frames consistent with the reference image, supported by a trained reference network.
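The sketch below illustrates, under stated assumptions, how the music-to-motion stage described above could be wired: a GPT-style transformer with causal attention predicts the next pose token from previously generated tokens summed with per-frame music embeddings, and a standard AdaIN helper shows how motion-derived scale/bias could modulate diffusion UNet features. The class MusicToMotionTransformer, the music_dim feature size, the additive conditioning scheme, and the adain signature are hypothetical placeholders, not the paper's code.

import torch
import torch.nn as nn


class MusicToMotionTransformer(nn.Module):
    """GPT-style causal transformer over discrete pose tokens (illustrative)."""

    def __init__(self, vocab=2048, dim=512, heads=8, layers=6, max_len=1024, music_dim=35):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.music_proj = nn.Linear(music_dim, dim)      # per-frame music features -> model dim
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, pose_tokens, music_feats):
        # pose_tokens: (B, T) int64; music_feats: (B, T, music_dim), frame-aligned
        B, T = pose_tokens.shape
        pos = torch.arange(T, device=pose_tokens.device)
        x = self.tok_emb(pose_tokens) + self.pos_emb(pos) + self.music_proj(music_feats)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.blocks(x, mask=causal)                  # attend only to past context
        return self.head(h)                              # next-token logits: (B, T, vocab)

    @torch.no_grad()
    def sample(self, music_feats, bos=0, temperature=1.0):
        # Autoregressively roll out one pose token per music frame.
        B, T, _ = music_feats.shape
        tokens = torch.full((B, 1), bos, dtype=torch.long, device=music_feats.device)
        for _ in range(T):
            logits = self(tokens, music_feats[:, : tokens.shape[1]])[:, -1]
            probs = torch.softmax(logits / temperature, dim=-1)
            tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
        return tokens[:, 1:]                             # (B, T) generated pose tokens


def adain(content, scale, bias, eps=1e-5):
    """Re-normalize a UNet feature map with scale/bias derived from motion features."""
    # content: (B, C, H, W); scale, bias: (B, C)
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True) + eps
    return (content - mean) / std * scale[:, :, None, None] + bias[:, :, None, None]

At inference, sample() rolls out one pose token per music frame; the full system presumably handles separate token streams for body, head, and hands (decoded by the shared pose decoder) rather than the single stream shown here, before the resulting pose guidance modulates the diffusion backbone through AdaIN.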

Results

Single Reference, Multiple Music

Single Music, Multiple References

Single Music, Single Reference

Long Video Generation

Fine-tuning for Characterized Dances (Music: Subject 3)

Fine-tuning for Characterized Dances (Music: Go Stronger)

Baseline Comparisons

Results with X-Dancer-AIST

BibTeX

If you find X-Dancer useful for your research or applications, please cite it using the following BibTeX entry:

@InProceedings{Chen_ICCV_2025,
    author    = {Chen, Zeyuan and Xu, Hongyi and Song, Guoxian and Xie, You and Zhang, Chenxu and Chen, Xin and Wang, Chao and Chang, Di and Luo, Linjie},
    title     = {X-Dancer: Expressive Music to Human Dance Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
}

Ethics Concerns

The images and music used in the demos come from public sources or were generated by models, and are used solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (zec016@ucsd.edu) and we will remove the content promptly.