X-Dancer: Expressive Music to Human Dance Video Generation

ICCV 2025

Zeyuan Chen 1,2    Hongyi Xu 2    Guoxian Song 2    You Xie 2    Chenxu Zhang 2   
Xin Chen 2    Chao Wang 2    Di Chang 2,3    Linjie Luo 2   
1 UC San Diego      2 ByteDance      3 University of Southern California

Abstract

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse, long-range, and lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism.
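To make the compositional, confidence-aware tokenization described above concrete, here is a minimal PyTorch sketch: each body part (e.g., upper body, lower body, head, hands) is encoded and vector-quantized independently from its (x, y, confidence) keypoints, and a single shared decoder merges the part tokens back into a full-body 2D pose. All class names, keypoint splits, codebook sizes, and dimensions below are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, num_codes=512, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)     # (B, num_codes)
        idx = dists.argmin(dim=-1)                       # discrete token ids
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                     # straight-through gradient
        return z_q, idx


class PartEncoder(nn.Module):
    """Encodes the (x, y, confidence) keypoints of one body part into a latent."""

    def __init__(self, num_kpts, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_kpts * 3, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, kpts):                             # kpts: (B, num_kpts, 3)
        return self.net(kpts.flatten(1))


class CompositionalPoseTokenizer(nn.Module):
    """Per-part encoders and codebooks, merged by one shared full-body decoder."""

    def __init__(self, part_kpts):                       # e.g. {"upper": 8, "lower": 6, "head": 5, "hands": 42}
        super().__init__()
        self.encoders = nn.ModuleDict({k: PartEncoder(n) for k, n in part_kpts.items()})
        self.quantizers = nn.ModuleDict({k: VectorQuantizer() for k in part_kpts})
        self.total_kpts = sum(part_kpts.values())
        self.decoder = nn.Sequential(
            nn.Linear(128 * len(part_kpts), 512), nn.ReLU(),
            nn.Linear(512, self.total_kpts * 2),         # full-body (x, y) keypoints
        )

    def forward(self, parts):                            # dict of (B, num_kpts, 3) tensors
        latents, tokens = [], {}
        for name in self.encoders:
            z_q, idx = self.quantizers[name](self.encoders[name](parts[name]))
            latents.append(z_q)
            tokens[name] = idx
        pose = self.decoder(torch.cat(latents, dim=-1)).view(-1, self.total_kpts, 2)
        return pose, tokens                              # reconstructed pose + per-part token ids


# Toy usage: tokenize a batch of two noisy poses split into four parts.
split = {"upper": 8, "lower": 6, "head": 5, "hands": 42}
tokenizer = CompositionalPoseTokenizer(split)
parts = {k: torch.rand(2, n, 3) for k, n in split.items()}
pose, tokens = tokenizer(parts)                          # pose: (2, 61, 2)

With part-wise codebooks, each video frame reduces to a small tuple of discrete token ids per body part, which is the kind of sequence the autoregressive transformer in the Method section predicts.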

Method


Overview of X-Dancer. We propose a cross-conditional transformer model to generate 2D human poses synchronized with input music, followed by a diffusion model to produce high-fidelity videos from a single reference image. First, we develop a compositional tokenization method for 2D poses, encoding each body part independently with confidence-aware quantization; a shared decoder merges these tokens into a full-body pose. Next, a GPT-based transformer autoregressively predicts future pose tokens with causal attention, conditioned on past poses and aligned music embeddings, as well as global music style and prior motion context. A learnable motion decoder then generates multi-scale pose guidance by upsampling a feature map that integrates the generated motion tokens over a temporal window (16 frames) via AdaIN. By co-training the motion decoder and temporal modules, our diffusion model synthesizes temporally smooth, high-fidelity video frames consistent with the reference image, supported by a trained reference network.
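The sketch below illustrates, under stated assumptions, how the music-to-motion stage described above could be wired: a GPT-style transformer with causal attention predicts the next pose token from previously generated tokens summed with per-frame music embeddings, and a standard AdaIN helper shows how motion-derived scale/bias could modulate diffusion UNet features. The class MusicToMotionTransformer, the music_dim feature size, the additive conditioning scheme, and the adain signature are hypothetical placeholders, not the paper's code.

import torch
import torch.nn as nn


class MusicToMotionTransformer(nn.Module):
    """GPT-style causal transformer over discrete pose tokens (illustrative)."""

    def __init__(self, vocab=2048, dim=512, heads=8, layers=6, max_len=1024, music_dim=35):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.music_proj = nn.Linear(music_dim, dim)      # per-frame music features -> model dim
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, pose_tokens, music_feats):
        # pose_tokens: (B, T) int64; music_feats: (B, T, music_dim), frame-aligned
        B, T = pose_tokens.shape
        pos = torch.arange(T, device=pose_tokens.device)
        x = self.tok_emb(pose_tokens) + self.pos_emb(pos) + self.music_proj(music_feats)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.blocks(x, mask=causal)                  # attend only to past context
        return self.head(h)                              # next-token logits: (B, T, vocab)

    @torch.no_grad()
    def sample(self, music_feats, bos=0, temperature=1.0):
        # Autoregressively roll out one pose token per music frame.
        B, T, _ = music_feats.shape
        tokens = torch.full((B, 1), bos, dtype=torch.long, device=music_feats.device)
        for _ in range(T):
            logits = self(tokens, music_feats[:, : tokens.shape[1]])[:, -1]
            probs = torch.softmax(logits / temperature, dim=-1)
            tokens = torch.cat([tokens, torch.multinomial(probs, 1)], dim=1)
        return tokens[:, 1:]                             # (B, T) generated pose tokens


def adain(content, scale, bias, eps=1e-5):
    """Re-normalize a UNet feature map with scale/bias derived from motion features."""
    # content: (B, C, H, W); scale, bias: (B, C)
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True) + eps
    return (content - mean) / std * scale[:, :, None, None] + bias[:, :, None, None]

At inference, sample() rolls out one pose token per music frame; the full system presumably handles separate token streams for body, head, and hands (decoded by the shared pose decoder) rather than the single stream shown here, before the resulting pose guidance modulates the diffusion backbone through AdaIN.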

Results

Single Reference, Multiple Music

Single Music, Multiple References

Single Music, Single Reference

Long Video Generation

Fine-tuning for Characterized Dances (Music: Subject 3)

Fine-tuning for Characterized Dances (Music: Go Stronger)

Baseline Comparisons

Results with X-Dancer-AIST

BibTeX

If you find X-Dancer useful for your research or applications, please cite it using the following BibTeX entry:

@InProceedings{Chen_ICCV_2025,
    author    = {Chen, Zeyuan and Xu, Hongyi and Song, Guoxian and Xie, You and Zhang, Chenxu and Chen, Xin and Wang, Chao and Chang, Di and Luo, Linjie},
    title     = {X-Dancer: Expressive Music to Human Dance Video Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
}

Ethics Concerns

The images and music used in the demos come from public sources or were generated by models, and are used solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (zec016@ucsd.edu) and we will remove the content promptly.