Soft Tail-dropping for Adaptive Visual Tokenization

ECCV 2026

1 University of California San Diego      2 Adobe
* Work done during an internship at Adobe.
STAT teaser

STAT (a) enables high-quality text-conditional image generation and (b) achieves the best gFID in class-conditional ImageNet generation among visual tokenizers. STAT learns to allocate tokens adaptively based on image complexity, as validated in (c), where the predicted token count correlates strongly with JPEG file size.

Abstract

We present the Soft Tail-dropping Adaptive Tokenizer (STAT), a discrete tokenizer that learns adaptive visual representations. STAT adjusts the number of tokens allocated to each image according to its perceptual complexity. Specifically, it encodes an image into discrete tokens together with token-wise keep probabilities indicating whether each token is necessary for faithful reconstruction or can be safely dropped. Through this learned adaptivity, STAT achieves state-of-the-art reconstruction quality while using fewer tokens on average. When integrated with vanilla causal autoregressive (AR) modeling, STAT enables a content-aware generative model with adaptive-length sampling. The model achieves competitive or superior visual generation quality compared with other generative model families while exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.

Method

STAT pipeline

Overview of STAT. STAT encodes an image into 1D discrete tokens together with token-wise keep probabilities. During training, Bernoulli sampling is applied to each token to decide whether it is kept or dropped. Two priors guide the learning of the keep-probability profiles: a content-adaptive prior encourages the sum of predicted probabilities to correlate with the perceptual complexity of the image, and a decreasing importance prior encourages probabilities to decrease from the head to the tail of the 1D token sequence, forming a soft tail-dropping pattern. At inference, tokens whose keep probabilities exceed a fixed threshold are retained, producing representations with adaptive token counts.
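
The keep/drop step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy token ids and probabilities, and the threshold value of 0.5 are all assumptions made for illustration, and the sketch does not model how dropped tokens are presented to the decoder.

```python
import numpy as np

def sample_keep_mask(keep_probs, rng):
    """Training-time sketch: one independent Bernoulli draw per token."""
    return rng.random(keep_probs.shape) < keep_probs

def threshold_keep_mask(keep_probs, threshold=0.5):
    """Inference-time sketch: keep tokens whose probability exceeds a fixed threshold.

    The value 0.5 is an assumed placeholder, not a value taken from the paper.
    """
    return keep_probs > threshold

rng = np.random.default_rng(0)

# Toy 1D token sequence with keep probabilities that decay toward the tail.
tokens = np.array([17, 4, 250, 93, 8, 61])
keep_probs = np.array([0.99, 0.95, 0.80, 0.40, 0.10, 0.02])

train_mask = sample_keep_mask(keep_probs, rng)   # stochastic, varies per draw
infer_mask = threshold_keep_mask(keep_probs)     # deterministic
kept = tokens[infer_mask]                        # adaptive-length representation
```

Because the toy probabilities decrease from head to tail, thresholding keeps a prefix of the sequence, which is the soft tail-dropping pattern the overview describes.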

Learned Keep-probability Priors

Learned priors

The content-adaptive prior and the decreasing importance prior together induce keep-probability profiles that align token allocation with image complexity while concentrating information in the head tokens of the 1D sequence.
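
One way to picture the two priors is as penalty terms on a keep-probability profile. The sketch below is purely illustrative: the page does not give the loss formulation, so both penalties, the normalized `complexity` score in [0, 1], and the `max_tokens` scaling are hypothetical stand-ins rather than the paper's objectives.

```python
import numpy as np

def decreasing_penalty(keep_probs):
    """Hypothetical penalty: total amount by which probabilities rise from head to tail."""
    rises = np.diff(keep_probs)            # p[i+1] - p[i]
    return float(np.maximum(rises, 0.0).sum())

def content_penalty(keep_probs, complexity, max_tokens):
    """Hypothetical penalty: squared gap between the expected token count
    (sum of keep probabilities) and a complexity-derived target count."""
    target_count = complexity * max_tokens  # assumed linear mapping
    return float((keep_probs.sum() - target_count) ** 2)

monotone = np.array([0.9, 0.7, 0.3])
bumpy = np.array([0.2, 0.5, 0.4])

pen_monotone = decreasing_penalty(monotone)   # no rises, so no penalty
pen_bumpy = decreasing_penalty(bumpy)         # one rise of 0.3 is penalized

profile = np.array([0.5, 0.5, 0.5, 0.5])      # expected count = 2.0
pen_content = content_penalty(profile, complexity=0.5, max_tokens=4)
```

A squared-error target is a stronger constraint than the correlation the text describes; it is used here only because it is the simplest term that ties the probability sum to a complexity signal.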

Results

Text-conditional Image Generation

Samples generated by a vanilla autoregressive model with STAT.



Adaptive Reconstruction

The same image can be decoded with more or fewer tokens. At STAT's predicted token count (green), the reconstruction already matches the full-length reconstruction: simple images need fewer tokens, while complex, high-frequency images need more.


Adaptive Generation

Integrated with a vanilla causal autoregressive model, STAT generates with adaptive-length sampling. The model emits an end-of-sequence token (green) when the content is complete, well before the full 256-token budget.
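
Adaptive-length sampling of this kind can be sketched as a standard autoregressive loop that stops at the end-of-sequence token. This is a generic illustration under stated assumptions, not the released model: `next_logits` is a hypothetical callable standing in for the AR model, the EOS id of 0 and the toy vocabulary are placeholders, and only the 256-token budget comes from the text above.

```python
import numpy as np

EOS_ID = 0        # hypothetical end-of-sequence id
MAX_TOKENS = 256  # full token budget mentioned in the text

def sample_adaptive(next_logits, rng, max_tokens=MAX_TOKENS, eos_id=EOS_ID):
    """Sample tokens one at a time and stop early when EOS is drawn."""
    tokens = []
    for _ in range(max_tokens):
        logits = next_logits(tokens)
        probs = np.exp(logits - logits.max())   # numerically stable softmax
        probs /= probs.sum()
        tok = int(rng.choice(len(probs), p=probs))
        if tok == eos_id:
            break                               # content judged complete
        tokens.append(tok)
    return tokens

def toy_logits(prefix, vocab_size=8):
    """Stand-in model: emits tokens 1..5 in order, then EOS."""
    logits = np.full(vocab_size, -1e9)
    target = EOS_ID if len(prefix) >= 5 else len(prefix) + 1
    logits[target] = 0.0
    return logits

rng = np.random.default_rng(0)
generated = sample_adaptive(toy_logits, rng)
```

With the toy model, sampling ends after five tokens instead of running to the full budget, which mirrors the early-stopping behavior described for STAT-based generation.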

BibTeX

If you find STAT useful for your research or applications, please cite it using this BibTeX:

@InProceedings{chen2026soft,
    author    = {Chen, Zeyuan and Zhang, Kai and Tu, Zhuowen and Xiong, Yuanjun},
    title     = {Soft Tail-dropping for Adaptive Visual Tokenization},
    booktitle = {ECCV},
    year      = {2026},
}