Soft Tail-dropping for Adaptive Visual Tokenization

ECCV 2026

Zeyuan Chen¹* Kai Zhang² Zhuowen Tu¹ Yuanjun Xiong²

¹ University of California San Diego ² Adobe

* Work done during an internship at Adobe.

STAT (a) enables high-quality text-conditional image generation and (b) achieves the best gFID in class-conditional ImageNet generation among visual tokenizers. STAT learns to allocate tokens adaptively based on image complexity, as validated in (c), where the predicted token count strongly correlates with JPEG file size.

Abstract

We present Soft Tail-dropping Adaptive Tokenizer (STAT), a discrete tokenizer that learns adaptive visual representations. STAT adjusts the number of tokens allocated to each image according to its perceptual complexity. Specifically, it encodes an image into discrete tokens together with token-wise keep probabilities indicating whether each token is necessary for faithful reconstruction or can be safely dropped. Through this learned adaptivity, STAT achieves state-of-the-art reconstruction quality while using fewer tokens on average. When integrated with vanilla causal autoregressive (AR) modeling, STAT enables a content-aware generative model with adaptive-length sampling. The model achieves competitive or superior visual generation quality compared with other generative model families while exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.

Method

Overview of STAT. STAT encodes an image into 1D discrete tokens together with token-wise keep probabilities. During training, Bernoulli sampling is applied to each token to decide whether it is kept or dropped. Two priors guide the learning of the keep-probability profiles: a content-adaptive prior encourages the sum of predicted probabilities to correlate with the perceptual complexity of the image, and a decreasing importance prior encourages probabilities to decrease from the head to the tail of the 1D token sequence, forming a soft tail-dropping pattern. At inference, tokens whose keep probabilities exceed a fixed threshold are retained, producing representations with adaptive token counts.
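The two keep/drop modes described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token ids, the keep probabilities, the threshold value of 0.5, and all function names are assumptions chosen for illustration. The text only states that Bernoulli sampling is used during training and a fixed threshold at inference.

```python
import random

def sample_keep_mask(keep_probs, rng):
    """Training-time sketch: one independent Bernoulli draw per token."""
    return [rng.random() < p for p in keep_probs]

def threshold_keep_mask(keep_probs, threshold=0.5):
    """Inference-time sketch: keep tokens whose probability exceeds the threshold.

    The value 0.5 is an assumption; the source only says the threshold is fixed.
    """
    return [p > threshold for p in keep_probs]

def select_tokens(tokens, mask):
    """Return the tokens whose mask entry is True, preserving order."""
    return [t for t, keep in zip(tokens, mask) if keep]

# Made-up token ids and a made-up, head-to-tail decreasing probability profile.
tokens = [17, 4, 92, 33, 8, 51]
keep_probs = [0.99, 0.95, 0.80, 0.40, 0.10, 0.02]

rng = random.Random(0)
train_mask = sample_keep_mask(keep_probs, rng)        # stochastic during training
kept_at_inference = select_tokens(tokens, threshold_keep_mask(keep_probs))
```

With these made-up values, thresholding keeps the first three tokens and drops the tail, which is the tail-dropping pattern the decreasing importance prior is meant to encourage.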

Learned Keep-probability Priors

The content-adaptive prior and the decreasing importance prior together induce keep-probability profiles that align token allocation with image complexity while concentrating information in the head tokens of the 1D sequence.
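The page does not give the exact form of either prior. As a rough intuition only, each can be pictured as a penalty on the keep-probability profile. The two functions below are hypothetical stand-ins: `target_count` is an assumed complexity-derived token budget, and the squared-gap and hinge forms are assumptions, not the losses used in STAT.

```python
def decreasing_importance_penalty(keep_probs):
    """Hypothetical penalty: total amount by which probabilities rise from one
    token to the next. It is zero exactly when the profile is non-increasing."""
    return sum(max(0.0, nxt - cur) for cur, nxt in zip(keep_probs, keep_probs[1:]))

def content_adaptive_penalty(keep_probs, target_count):
    """Hypothetical penalty: squared gap between the expected number of kept
    tokens (the sum of probabilities) and an assumed complexity-based target."""
    return (sum(keep_probs) - target_count) ** 2

monotone = [0.9, 0.7, 0.3, 0.1]        # made-up profile that decreases head to tail
non_monotone = [0.9, 0.3, 0.7, 0.1]    # same values, but with a rise in the middle

penalty_monotone = decreasing_importance_penalty(monotone)
penalty_non_monotone = decreasing_importance_penalty(non_monotone)
gap_to_target = content_adaptive_penalty(monotone, target_count=2.0)
```

Here the decreasing profile incurs no monotonicity penalty, while the profile with a rise in the middle is penalized by the size of that rise (about 0.4).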

Results

Text-conditional Image Generation

Samples generated by a vanilla autoregressive model with STAT. The prompt for each sample is listed below.

A pineapple surfing on a wave

A painting of a fox in the style of starry night

A colorful coral reef bustling with marine life

A snowy mountain peak with blue sky

A steampunk airship floating above a Victorian-era city

A painting of a sport car in the style of Monet

The word “START” written on a street surface

A watercolor painting of a small European village by a river

A spaceship descending into a volcanic alien landscape

A glowing magical sword floating above an ancient altar

A medieval alchemist's laboratory filled with mysterious potions

A cozy cabin interior with a fireplace during a snowstorm

A wolf standing on a snowy ridge during golden hour

A pickup truck driving through a desert environment

A vintage red sports car parked on a coastal highway at sunset

A cup of cappuccino with latte art

A high-speed photograph of a splash forming a crown shape

A woman in a minimalistic studio portrait

An illustration of a teapot

Yin-yang

Adaptive Reconstruction

The same image can be decoded with more or fewer tokens. At STAT's predicted token count (green), the reconstruction already matches the full-length reconstruction: simple images need fewer tokens, while complex, high-frequency images need more.

Adaptive Generation

Integrated with a vanilla causal autoregressive model, STAT generates with adaptive-length sampling. The model emits an end-of-sequence token (green) when the content is complete, well before the full 256-token budget.
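Adaptive-length sampling can be sketched as an ordinary autoregressive loop that stops at the end-of-sequence token or at the token budget, whichever comes first. This is a toy illustration, not the released model: the `EOS` id, the `make_toy_model` stand-in, and its `stop_after` parameter are hypothetical. Only the 256-token budget comes from the text.

```python
import random

EOS = -1            # hypothetical end-of-sequence id
MAX_TOKENS = 256    # full token budget mentioned in the text

def sample_adaptive(next_token, max_tokens=MAX_TOKENS):
    """Generate tokens until `next_token` returns EOS or the budget is used up."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok == EOS:
            break                      # content is complete; stop early
        tokens.append(tok)
    return tokens

def make_toy_model(seed=0, vocab_size=1024, stop_after=40):
    """Stand-in for an AR model: emits random ids, then EOS after `stop_after` tokens.

    A real model would predict EOS from the content generated so far.
    """
    rng = random.Random(seed)
    def next_token(prefix):
        if len(prefix) >= stop_after:
            return EOS
        return rng.randrange(vocab_size)
    return next_token

short_sample = sample_adaptive(make_toy_model(stop_after=40))
full_sample = sample_adaptive(lambda prefix: 0)    # never emits EOS
```

In this toy setup the first call stops after 40 tokens, while a model that never emits EOS runs to the full 256-token budget.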

BibTeX

If you find STAT useful for your research or applications, please cite it using this BibTeX:

@InProceedings{chen2026soft,
    author    = {Chen, Zeyuan and Zhang, Kai and Tu, Zhuowen and Xiong, Yuanjun},
    title     = {Soft Tail-dropping for Adaptive Visual Tokenization},
    booktitle = {ECCV},
    year      = {2026},
}