Cola DLM

Continuous Latent Diffusion Language Model

Hierarchical latent prior modeling for text — a bridge to unified continuous-modality generation.

Hongcan Guo1,2, Qinyu Zhao3,†, Yian Zhao1,4, Shen Nie1,5, Rui Zhu1, Qiushan Guo1, Feng Wang1, Tao Yang1, Hengshuang Zhao2, Guoqiang Wei1, Yan Zeng1,✉
1ByteDance Seed 2The University of Hong Kong 3The Australian National University 4Peking University 5Renmin University of China
†Work done during an internship at ByteDance Seed · ✉Corresponding author
Seedance Team Research Project
Abstract

Hierarchical latent prior modeling for language

Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling.

We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization.

Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B AR and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. The results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, while suggesting a concrete path toward unified modeling across discrete text and continuous modalities.

Method

An overview of Cola DLM

Two training stages and one inference stage realize the joint factorization p(x, z0) = pθ(x|z0)·pψ(z0): global semantics live in the continuous latent space, while local text is realized by the decoder.

The overall workflow of Cola DLM. Training Stage 1: Text VAE pretraining with reconstruction, BERT, and KL losses. Training Stage 2: joint Text VAE + Text DiT training with gradient control; the DiT uses a block-causal attention mechanism. Inference Stage: prefix encoding, block-wise prior transport in latent space, and conditional decoding with KV cache.
Stage 1

Text VAE pretraining

Strictly causal encoder/decoder learn a stable text↔latent mapping. Trained with reconstruction NLL, KL to a base prior, and a BERT-style mask loss to prevent decoder shortcutting.
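As a concrete reference point, here is a minimal sketch of the Stage 1 objective in PyTorch. It assumes a per-token diagonal-Gaussian posterior, a standard-normal base prior, and a BERT-style head that predicts masked tokens from the latents; the module names (`encoder`, `decoder`, `bert_head`) and the loss weights are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def stage1_vae_loss(encoder, decoder, bert_head, tokens, mask_id,
                    beta_kl=1e-3, beta_bert=0.1, p_mask=0.15):
    """Stage 1 sketch: reconstruction NLL + KL to the base prior + BERT loss.

    All interfaces and weights are assumptions for illustration only.
    """
    mu, logvar = encoder(tokens)                          # (B, L, d) each
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

    # Reconstruction NLL: strictly causal decoder conditioned on the latents.
    logits = decoder(tokens, z)                           # (B, L, V)
    rec = F.cross_entropy(logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())

    # KL(q(z|x) || N(0, I)): keeps the latents close to the base prior.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()

    # BERT-style mask loss: predict randomly masked tokens from the latents,
    # so the decoder cannot shortcut around z.
    masked = tokens.clone()
    mask = torch.rand(tokens.shape, device=tokens.device) < p_mask
    masked[mask] = mask_id
    bert = F.cross_entropy(bert_head(masked, z)[mask], tokens[mask])

    return rec + beta_kl * kl + beta_bert * bert
```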

Stage 2

Block-causal DiT prior

A block-causal DiT learns the latent prior pψ(z0) via Flow Matching. The VAE keeps adapting under reconstruction, masking, and a reference KL regularizer to control latent drift.
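A sketch of the corresponding Stage 2 step, under the assumption of rectified-flow (linear-interpolation) flow matching with a logit-normal timestep of location `loc`, and with the reference KL taken against a frozen copy of the Stage 1 encoder; `dit`, `ref_encoder`, and the weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stage2_prior_step(encoder, ref_encoder, dit, tokens, loc=1.0, beta_ref=0.1):
    """Stage 2 sketch: flow-matching loss for the latent prior + reference KL.

    The Stage 1 reconstruction and mask losses are also kept during joint
    training; they are omitted here for brevity.
    """
    mu, logvar = encoder(tokens)
    z0 = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    # Logit-normal timestep: larger `loc` shifts mass toward the high-noise end.
    t = torch.sigmoid(loc + torch.randn(z0.size(0), 1, 1, device=z0.device))

    eps = torch.randn_like(z0)
    z_t = (1.0 - t) * z0 + t * eps       # linear interpolation path
    v_pred = dit(z_t, t)                 # block-causal DiT predicts velocity
    fm = F.mse_loss(v_pred, eps - z0)    # rectified-flow velocity target

    # Reference KL between the evolving posterior and a frozen Stage 1
    # encoder, limiting latent drift while the space keeps adapting.
    with torch.no_grad():
        mu_r, logvar_r = ref_encoder(tokens)
    var, var_r = logvar.exp(), logvar_r.exp()
    ref_kl = 0.5 * (logvar_r - logvar + (var + (mu - mu_r) ** 2) / var_r - 1.0)
    return fm + beta_ref * ref_kl.sum(-1).mean()
```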

Inference

Prefix → blocks → text

Encode the prefix to clean latents; transport noise blocks ε(b) ~ N(0, I) into clean latent blocks conditioned on the clean history; the conditional decoder then produces the response. Block-causal KV caching keeps inference efficient.
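Under the same rectified-flow assumptions, the inference stage can be sketched as follows. The `dit(z, t, context=...)` signature, the Euler integrator, and `decoder.generate` are illustrative stand-ins; in the actual system the clean history is served through the block-causal KV cache rather than re-passed at every step.

```python
import torch

@torch.no_grad()
def generate(encoder, dit, decoder, prompt_tokens, num_blocks,
             block_size=16, steps=10, cfg=5.0):
    """Inference sketch: block-wise prior transport, then conditional decoding."""
    mu, _ = encoder(prompt_tokens)            # clean prefix latents
    history = mu
    ts = torch.linspace(1.0, 0.0, steps + 1)

    for _ in range(num_blocks):
        z = torch.randn(mu.size(0), block_size, mu.size(-1), device=mu.device)
        for i in range(steps):
            t = ts[i].expand(z.size(0), 1, 1)
            # Classifier-free guidance on the velocity field (assumed form).
            v_cond = dit(z, t, context=history)
            v_unc = dit(z, t, context=None)
            v = v_unc + cfg * (v_cond - v_unc)
            z = z + (ts[i + 1] - ts[i]) * v   # Euler step from t=1 toward t=0
        history = torch.cat([history, z], dim=1)  # commit the clean block

    return decoder.generate(history)          # conditional text realization
```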

A unified perspective

Where does Cola DLM differ from AR, LLaDA, and Plaid?

Under a unified Markov-path view, the essential question is the state space and the role of the path: an observation path recovers text, a prior path only transports a latent.

| Method | State Space | Path Role | Generative Factorization | Explicit Latent (Where Continuity Appears) |
|---|---|---|---|---|
| AR | Prefix tokens | Direct generation path | ∏_i p(x_i \| x_{<i}) | None |
| LLaDA | Discrete masked sequences | Discrete observation–recovery path | p(s_T) · ∏_t p_θ(s_{t-1} \| s_t) | Discrete token space |
| Plaid | Continuous token-aligned representations | Continuous observation–recovery path | p(h_T) · ∏_t p_θ(h_{t-1} \| h_t) | Continuous token space |
| Cola DLM | Compressed latent sequences | Prior-transport path | ∫ p_θ(x \| z_0) p_ψ(z_0) dz_0 | Latent space |

The diffusion in Cola DLM learns a flexible continuous prior, not a left-to-right inductive bias on text. The latent variable explicitly separates global semantic organization from local token realization.

Key findings

Four research questions, one consistent story

Across 8 benchmarks, strictly matched ~2B AR and LLaDA baselines, and scaling curves up to ~2000 EFLOPs, Cola DLM is competitive everywhere and shows the strongest late-stage gains.

RQ1 · Existence

Global semantic structure exists in latent space

The optimal timeshift drifts systematically with latent dimension, and the same drift appears across multiple semantic tasks — directly contradicting the separable-null hypothesis.

Optimal timeshift drifts with latent dimension across multiple semantic metrics.
  • Best loc shifts from ~1.0 (d=16) → ~1.7 (d=64) → ~2.3 (d=128).
  • Same drift on LAMBADA, MMLU, SIQA, and Task Avg.
  • Empirical peaks match the theoretical predictions (dashed lines).
RQ2 · Latent space

Joint VAE–DiT evolution from a stable init wins

Neither freezing the VAE nor training it from scratch is optimal. The latent space should keep evolving strongly together with the DiT — but starting from a meaningful pretrained init.

Task-average scaling: Joint DiT x1 vs Fix VAE, All Scratch, Joint DiT x0.01, Interval.
  • Joint DiT x1 scales best across Task Avg, LAMBADA, MMLU, SIQA.
  • All Scratch collapses the latent geometry; Interval & weak updates lag.
  • Adding a BERT-style loss further improves quality under active updates.
RQ3 · Diffusion process

Block size 16, loc=1 schedule, ~10–32 denoise steps

Moderate block size and proper noise calibration matter most. Inference-time hyperparameters show clear sweet spots — ~10 denoising steps already recover most of the gain.

Noise schedule loc=1.0 achieves the best Task Average at 40k steps.
  • DiT block size 16 beats both fully causal (1) and large blocks (64, 128).
  • Logit-normal schedule with loc=1.0 is best, especially for Joint DiT.
  • ~8–10 denoising steps already match the saturated Task Avg; CFG ≈ 3–7 is best.
RQ4 · Headline result

Strongest late-stage scaling vs. AR and LLaDA across 8 benchmarks

Under a strictly matched ~2B-parameter setup and a unified generative evaluation protocol — running scaling curves up to about 2000 EFLOPs — Cola DLM reaches the best final Task Average and shows the most encouraging headroom on reasoning-heavy benchmarks.

Scaling across 8 benchmarks plus Task Average — Cola DLM (red) vs AR (blue) and LLaDA (orange).
1. Best Task Average at ~2000 EFLOPs, with the curve still rising.
2. Clear lead on reasoning-heavy MMLU, RACE, Story Cloze, and OBQA.
3. On SQuAD, Cola DLM eventually surpasses AR and approaches LLaDA's strong region.
4. The result is conservative — d=16, no extended training, room to scale further.
Detailed experiments

More results across the four research questions


RQ2 · Fixed vs. evolving latent space — full benchmark grid
Task Average. Joint DiT x1 keeps improving while Fix VAE saturates.
LAMBADA. Joint DiT x1 ends with the strongest score.
MMLU. Joint DiT x1 wins late; All Scratch lags, and Interval/x0.01 plateau early.
SIQA. Same picture — joint co-adaptation is the right strategy.
RQ2 · Latent-space visualization explains why
All Scratch, d=16. Collapsed latent, dominated by simple outward drift.
All Scratch, d=128. Higher dimension partially mitigates the collapse but stays unstructured.
Joint DiT, d=16. A stable init yields heterogeneous, semantically organized trajectories.
RQ2 · Latent dimensionality and VAE logSNR ablations

Table 1 · Latent dimensionality at 117 EFLOPs (All Scratch, loc=1).

| Method | LAMBADA | MMLU | SIQA | Avg. |
|---|---|---|---|---|
| All Scratch, d=16, loc=1 | 14.3 | 6.9 | 4.9 | 8.7 |
| All Scratch, d=64, loc=1 | 20.9 | 5.4 | 7.6 | 11.3 |
| All Scratch, d=128, loc=1 | 18.5 | 8.1 | 8.9 | 11.8 |

Larger latent dimensions improve overall semantic capacity, with the strongest gains on MMLU and SIQA.


Table 2 · VAE logSNR setting under two compute budgets.

Left block: EFLOPs = 77.86; right block: EFLOPs = 116.78.

| Method | LAMBADA | MMLU | SIQA | Avg. | LAMBADA | MMLU | SIQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Fixed VAE logSNR = 1.0 | 27.1 | 5.7 | 11.3 | 14.70 | 30.4 | 7.7 | 18.4 | 18.83 |
| Fixed VAE logSNR = 1.5 | 29.5 | 7.8 | 17.5 | 18.27 | 33.8 | 8.0 | 23.6 | 21.80 |
| Fixed VAE logSNR = 2.0 | 30.9 | 5.1 | 14.3 | 16.77 | 32.7 | 9.7 | 19.5 | 20.63 |
| Learnable VAE logSNR (≈ 4.5) | 32.6 | 7.9 | 16.2 | 18.90 | 34.6 | 10.1 | 21.6 | 22.10 |

A learnable VAE logSNR is best overall; fixed logSNR = 1.5 is the strongest fixed alternative.

RQ2 · Semantic smoothness (BERT-loss) helps especially under active updates
Task Average. BERT loss + lr=1 gives the strongest curve.
LAMBADA. Same trend — semantic guidance plus active updates.
MMLU. Strong latent evolution requires semantic guidance.
SIQA. Trainability alone is not enough.
RQ3 · DiT block size — moderate (16) is best
30k steps. Block size 16 leads on Task Avg and most tasks.
40k steps. Same picture — too-large blocks hurt SIQA and MMLU.
RQ3 · Noise schedule — loc = 1 is consistently best
Task Average. loc=1 + Joint DiT scales best across compute.
LAMBADA. Mismatched schedules underperform Fix VAE.
MMLU. Joint DiT beats Fix VAE only under proper calibration.
SIQA. Noise calibration acts on the semantic-information axis.
RQ3 · Inference — denoising steps & CFG
Denoising steps. Accuracy saturates around 16–32 steps, and ~8–10 already recover most of the gain; block size 16 gives an idealized 1.6–2.0× shorter sequential depth than AR.
Classifier-Free Guidance. Moderate values around CFG ≈ 3–7 work best; very large values clearly distort the denoising trajectory.
Discussion

Beyond the core framework

Four findings that shape how Cola DLM should be evaluated and extended.

Likelihood ≠ Generation Quality

Generation only requires the prior mass to reach decoder-valid regions, while likelihood-oriented PPL additionally demands precise local probability calibration around the gold posterior. The two metrics target different properties.

Top: local latent geometry around representative ground-truth tokens. Bottom: corresponding prior-density landscapes. High decoder-probe success and posterior-hit rates contrast with sharply varying prior-hit rates and density alignment.
Generation quality reflects semantic smoothness of the latent space, while likelihood-derived PPL is more sensitive to probability-space smoothness shaped by the VAE logSNR — so we evaluate scaling under a unified generative protocol.

First-block conditioning — clean repaint wins

The first generation block contains both known prompt latents and unknown latents to be generated. Strong, persistent conditioning is much more effective than partial noisy correction or positional layout alone.

Four conditioning/padding strategies for the first-block mixed denoising problem: clean repaint, partial repaint, and left/right padding.
| Task | Partial repaint (t=1), m=1.0 | m=0.7 | m=0.3 | Partial repaint (t=3), m=1.0 | m=0.7 | m=0.3 | Clean cond. | Left pad. | Right pad. |
|---|---|---|---|---|---|---|---|---|---|
| LAMBADA | 8.5 | 8.5 | 6.6 | 7.0 | 7.3 | 5.6 | 37.1 | 24.6 | 24.7 |
| MMLU | 7.9 | 7.9 | 7.8 | 7.6 | 6.7 | 7.0 | 11.9 | 8.4 | 11.5 |
| SIQA | 8.8 | 8.7 | 8.2 | 13.3 | 13.0 | 12.0 | 24.8 | 14.9 | 13.8 |
| Avg. | 8.4 | 8.4 | 7.5 | 9.3 | 9.0 | 8.2 | 24.6 | 16.0 | 16.7 |
Clean condition repaint dominates by a large margin. Repeated early corrections cannot compensate for weak conditioning.
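A sketch of the winning clean-repaint strategy, reusing the rectified-flow conventions above: at every denoising step the known prompt positions are overwritten with their clean latents, so the conditioning signal persists along the whole trajectory. `known_mask` and the `dit` interface are illustrative.

```python
import torch

@torch.no_grad()
def denoise_first_block(dit, clean_latents, known_mask, steps=10):
    """Clean-condition repaint for a block mixing known prompt latents
    (known_mask = True) with positions still to be generated.

    Partial repaint would instead re-noise the known positions to the
    current timestep; the table above shows that this is far weaker.
    """
    z = torch.randn_like(clean_latents)
    ts = torch.linspace(1.0, 0.0, steps + 1)
    cond = known_mask.unsqueeze(-1)
    for i in range(steps):
        z = torch.where(cond, clean_latents, z)   # persistent clean condition
        v = dit(z, ts[i].expand(z.size(0), 1, 1))
        z = z + (ts[i + 1] - ts[i]) * v
    return torch.where(cond, clean_latents, z)
```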

Latent compression — toward both better abstraction and faster generation

Should the Text VAE compress the sequence? We train two VAEs with the same latent dimension (d=128), differing only in patch size: p1 maps each token to one latent, while p2 compresses every two tokens into one latent. All other settings match the headline run (DiT block size 16, logit-normal schedule with loc=1, 16 inference steps, CFG = 7).

At first glance, p2 looks much weaker overall — but splitting samples by whether the prompt length is divisible by the patch size tells a very different story. The gap is dominated almost entirely by Mod1 (odd-length prompts). On Mod0, where the latent grouping aligns with the text sequence, p2 actually surpasses p1 on average.

Mod0/Mod1: prompt length mod patch size (0 = aligned with the patch grid, 1 = misaligned).

| Task | Overall (p1) | Overall (p2) | Mod0 (p1) | Mod0 (p2) | Mod1 (p1) | Mod1 (p2) |
|---|---|---|---|---|---|---|
| LAMBADA | 31.10 | 17.40 | 32.11 | 34.55 | 30.12 | 0.79 |
| MMLU | 5.40 | 3.90 | 6.89 | 7.68 | 3.86 | 0.00 |
| SIQA | 11.10 | 6.10 | 12.92 | 12.13 | 9.26 | 0.00 |
| Avg. | 15.87 | 9.13 | 17.31 | 18.12 | 14.41 | 0.26 |

Patch size 2 fails on Mod1 not because compression itself is harmful, but because odd-length prompts force the encoder to produce semantically shifted latents at the patch boundary — and in Cola DLM that latent is the clean condition for subsequent block-wise prior generation, so the bias propagates into denoising and decoding. Once boundary handling is robust, the case is open for using larger patch sizes.

Under the same DiT block size, one denoising block covers patch_size × block_size text tokens after decoding. With block size 16, p1 covers 16 tokens per block while p2 covers 32. So once the boundary issue is fixed, latent compression promises both stronger semantic abstraction and faster generation — and the Mod0 result already shows that compressed latents do not have to hurt quality. This is consistent with the core idea of Cola DLM: the latent is not a token-aligned recovery code, but a lower-rate representation for global semantic organization.
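The coverage arithmetic and one plausible boundary fix can be made concrete in a few lines; the left-padding choice below is an assumption, not the paper's prescription.

```python
def tokens_per_block(patch_size: int, block_size: int) -> int:
    """One denoising block covers patch_size * block_size text tokens."""
    return patch_size * block_size

def pad_to_patch_boundary(tokens: list[int], patch_size: int, pad_id: int) -> list[int]:
    """Pad a prompt to a multiple of patch_size so latent grouping aligns
    with the text, avoiding the shifted boundary latent seen on Mod1."""
    remainder = len(tokens) % patch_size
    if remainder:
        tokens = [pad_id] * (patch_size - remainder) + tokens  # one option
    return tokens

assert tokens_per_block(1, 16) == 16   # p1: 16 tokens per block
assert tokens_per_block(2, 16) == 32   # p2: 32 tokens per block
```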

VAE latent reconstruction is robust, not a fragile code

Cola DLM relies on the Text VAE's latent space being a stable intermediate representation — semantic information should degrade gracefully under perturbation rather than collapse abruptly. We measure reconstruction accuracy as a function of the diffusion timestep (i.e., increasing perturbation strength).

Robustness of VAE latent reconstruction. Near-perfect reconstruction at t=0; accuracy stays high (≈ 0.96) through the low-noise regime, remains around 0.92 at t=250, and degrades only under heavier noise.
The graceful-degradation pattern means that small or moderate latent perturbations do not destroy semantic information abruptly. The Text VAE latent space is therefore sufficiently robust to serve as the semantic interface for downstream prior modeling — exactly the property hierarchical latent prior modeling requires.
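The probe itself is simple. Here is a sketch under the same interpolation convention, with the figure's timesteps mapped to t/1000 (an assumption about the schedule's discretization) and a hypothetical `greedy_decode` interface:

```python
import torch

@torch.no_grad()
def reconstruction_accuracy(encoder, decoder, tokens, t: float) -> float:
    """Token accuracy after perturbing clean latents to diffusion time t in [0, 1]."""
    mu, _ = encoder(tokens)
    z_t = (1.0 - t) * mu + t * torch.randn_like(mu)  # forward perturbation
    pred = decoder.greedy_decode(z_t)                # (B, L) predicted token ids
    return (pred == tokens).float().mean().item()
```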
Beyond text

Towards unified text–image modeling (preliminary)

Modality-specific encoders/decoders share a common block-causal MMDiT prior over a joint latent state. The same hierarchical decomposition extends naturally from text to vision.

Unified text–image qualitative samples. Left: text-only continuation and image-conditioned text generation. Middle: text-to-image results (pretraining only). Right: schematic — modality-specific VAE encoders/decoders interface with a shared block-causal MMDiT that organizes the joint latent.
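One plausible instantiation of the shared interface, assuming modality-specific encoders that project to a common latent width so a single block-causal prior can process the interleaved sequence; names and ordering are illustrative.

```python
import torch

def joint_latent_state(z_text: torch.Tensor, z_image: torch.Tensor) -> torch.Tensor:
    """Concatenate modality latents into one sequence for the shared
    block-causal MMDiT prior (text block first; ordering is a design choice)."""
    assert z_text.size(-1) == z_image.size(-1), "modalities must share width d"
    return torch.cat([z_text, z_image], dim=1)  # (B, L_text + L_image, d)
```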

All text-to-image samples

Generated from text prompts under in-house pretraining only (256 → 640 resolution, no SFT, no high-quality data curation).

The hierarchical latent-prior formulation of Cola DLM extends beyond text-only generation: a shared prior organizes global cross-modal semantics, while modality-specific decoders handle final realization. This is intentionally early-stage; comprehensive unified multimodal training is left for future work.

Reflections · Afterword

Why a hierarchical latent-space view matters

We reproduce the paper's afterword in full below — the broader picture of learning that Cola DLM is in dialogue with. Three threads — representation, objective, and environment — are not separate, but three faces of one systematic question about how a learning system should organize information.

From the paper

Afterword: Research Objectives and Significance

Viewed from a broader perspective, this study is not only concerned with proposing an alternative architecture for text generation, but also with clarifying a more general picture of learning in which representation, objective, and environment must be understood jointly. From this perspective, the three themes of this work are closely connected rather than independent. The first concerns how text should be represented and generated. The second concerns what kinds of objectives and evaluation criteria are genuinely aligned with such representations. The third concerns the kind of environment in which a model should ultimately learn if the goal is more general multimodal intelligence.

A useful starting point is to formalize learning itself as a model–environment interaction system. Let the environment be

\[ \mathcal{E} = (\Omega,\, \mathcal{O},\, \mathcal{A},\, \mathcal{T},\, \mathcal{F},\, \mathcal{G}), \tag{1} \]

where \(\Omega\) is the environment state space, \(\mathcal{O}\) is the observation space, \(\mathcal{A}\) is the action or output space, \(\mathcal{T}\) is the state transition mechanism, \(\mathcal{F}\) is the feedback generation mechanism, and \(\mathcal{G}\) is the rule that converts feedback into optimization signals. Importantly, the notion of environment is understood here in a broad sense: it includes not only the external world, but also the data distribution presented to the model, task formats, supervision protocols, and even the loss rules by which feedback is transformed into gradients.

Let the model be denoted by \(M_\theta\), with internal state space \(\mathcal{H}\), state update map \(U_\theta\), and policy or generation map \(\Pi_\theta\). At interaction step \(t\), the closed-loop system can be written as

\[ \begin{aligned} o_t &\sim P_{\mathcal{E}}(\cdot \mid \omega_t), \\ h_t &= U_\theta(h_{t-1}, o_t), \\ a_t &\sim \Pi_\theta(\cdot \mid h_t), \\ \xi_t &\sim \mathcal{F}(\cdot \mid \omega_t, o_t, a_t), \\ \omega_{t+1} &\sim \mathcal{T}(\cdot \mid \omega_t, o_t, a_t, \xi_t), \\ \ell_t &= \mathcal{G}(\omega_t, o_t, a_t, \xi_t). \end{aligned} \]

The overall learning objective is therefore

\[ \mathcal{J}(\theta;\, \mathcal{E}) \;=\; \mathbb{E}_{\tau \sim P(\tau \mid \theta, \mathcal{E})} \!\left[\sum_{t=1}^{T} \gamma^{t-1}\, \ell_t \right], \tag{2} \]

where \(\tau\) denotes a complete interaction trajectory and \(\gamma\) is the discount factor.

This formalization shows directly that learning is never an isolated question of model structure alone. Rather, it is jointly determined by three factors: first, the state space in which the model absorbs and organizes information; second, the kind of feedback through which the environment defines improvement; and third, the actual structure that generates observations, transitions, and feedback. In this work, these three aspects correspond precisely to the three recurring themes of the paper: how text should be represented, which metrics are aligned with the true learning objective, and what kind of environment unified models are ultimately meant to enter.
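For concreteness, the closed loop of Eq. (1) and Eq. (2) can be rendered as a minimal executable skeleton, with toy callables standing in for \(P_{\mathcal{E}}, U_\theta, \Pi_\theta, \mathcal{F}, \mathcal{T}, \mathcal{G}\); this is purely schematic.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Environment:
    observe: Callable      # o_t ~ P_E(. | omega_t)
    feedback: Callable     # xi_t ~ F(. | omega_t, o_t, a_t)
    transition: Callable   # omega_{t+1} ~ T(. | omega_t, o_t, a_t, xi_t)
    loss_rule: Callable    # l_t = G(omega_t, o_t, a_t, xi_t)

@dataclass
class Model:
    update: Callable       # h_t = U_theta(h_{t-1}, o_t)
    policy: Callable       # a_t ~ Pi_theta(. | h_t)

def trajectory_objective(env: Environment, model: Model, omega: Any, h: Any,
                         T: int, gamma: float = 0.99) -> float:
    """Discounted objective J(theta; E) of Eq. (2) along one trajectory."""
    J = 0.0
    for t in range(T):
        o = env.observe(omega)
        h = model.update(h, o)
        a = model.policy(h)
        xi = env.feedback(omega, o, a)
        J += gamma ** t * env.loss_rule(omega, o, a, xi)
        omega = env.transition(omega, o, a, xi)
    return J
```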

1 · Rethinking text modeling: from state space to hierarchical generation

From a system-level perspective, the central question of text modeling is not merely which generation order to adopt, but rather in what kind of state text should be represented within the learning system. Mainstream autoregressive language models bind the state tightly to the surface token prefix, and generation is therefore written as

\[ p_{\mathrm{AR}}(x) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t}). \tag{3} \]

This factorization is highly effective, but it fundamentally corresponds to a strong modeling assumption: both global semantics and local realization are propagated through the same token-level conditional chain. In other words, it assumes that the surface string itself is the most natural and primary state space.

The route explored in this paper instead reconsiders text generation from the level of the state space itself. If text indeed contains a low-dimensional yet sufficiently useful global semantic structure, then a more natural approach is not to place the entire burden of generation on a token-level chain factorization, but to introduce latent variables explicitly and model high-level semantic organization separately from local textual realization. Correspondingly, the core factorization of Cola DLM is

\[ p(x, z_0) = p_\theta(x \mid z_0)\, p_\psi(z_0), \qquad p(x) = \int p_\theta(x \mid z_0)\, p_\psi(z_0)\, dz_0, \tag{4} \]

where \(z_0\) is a continuous latent variable, \(p_\psi(z_0)\) is the latent prior, and \(p_\theta(x \mid z_0)\) is the conditional decoder. The crucial change here is not merely the introduction of latent variables, but the redefinition of the role of state in the system: the path no longer acts directly on observation recovery, but instead organizes global semantics in latent space first, after which the decoder carries out local textual realization.

This point can be stated compactly through the information decomposition of the average ELBO. Let \(q(x, z_0) := p_{\mathrm{data}}(x)\, q_\phi(z_0 \mid x)\); then

\[ \mathbb{E}_{p_{\mathrm{data}}(x)}\!\big[\mathcal{L}_{\mathrm{ELBO}}(x)\big] = \mathbb{E}_{q(x, z_0)}\!\big[\log p_\theta(x \mid z_0)\big] \,-\, I_q(X;\, Z_0) \,-\, \mathrm{KL}\!\big(\bar q_\phi(z_0)\,\|\,p_\psi(z_0)\big), \tag{5} \]

where \(\bar q_\phi(z_0)\) is the aggregated posterior. This decomposition shows that hierarchical latent-space modeling breaks the text problem into three coupled but analytically distinguishable components: conditional realization, information compression, and prior matching. The latent variable is therefore not merely a continuous surrogate for discrete tokens, but an explicit intermediate state through which global semantic organization can be separated from local textual realization and modeled on its own terms.
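For completeness, Eq. (5) follows from a standard identity: averaging the per-example KL term of the ELBO over the data and inserting the aggregated posterior \(\bar q_\phi(z_0) = \mathbb{E}_{p_{\mathrm{data}}(x)}[q_\phi(z_0 \mid x)]\) gives

\[ \mathbb{E}_{p_{\mathrm{data}}(x)}\!\big[\mathrm{KL}\big(q_\phi(z_0 \mid x)\,\|\,p_\psi(z_0)\big)\big] \;=\; I_q(X;\, Z_0) \;+\; \mathrm{KL}\!\big(\bar q_\phi(z_0)\,\|\,p_\psi(z_0)\big), \]

and subtracting this from the expected reconstruction term \(\mathbb{E}_{q(x,z_0)}[\log p_\theta(x \mid z_0)]\) yields exactly the decomposition above.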

From this perspective, compression must also be reconsidered. Prior work has emphasized the connection between compression and intelligence (Huang et al., 2024), while recent explorations of generation closer to raw data forms in images and videos, such as pixel-space modeling (Deng et al., 2026), further suggest that compression should not be equated with harmful information deletion. The key question is not whether every local detail is preserved, but whether the model can extract and organize structural information that is genuinely effective and generalizable. If text indeed admits a hierarchical structure in which high-level semantics and low-level realization are relatively separable, then reinterpreting text generation through informational hierarchy is not merely a change of method, but a theoretical re-evaluation of text modeling itself.

Accordingly, the first theme of this paper is not to reject autoregression, but to point out that autoregression occupies only one self-consistent, rather than unique, corner of the design space. If the data truly contains a hierarchy between low-dimensional global semantics and high-dimensional local realization, then organizing semantics first in a latent state and realizing text through conditional decoding may be closer to the true generative mechanism. Text generation should therefore not be understood solely as next-token fitting over discrete strings, but more generally as a systematic problem of how information is represented, compressed, and organized hierarchically.

2 · Continuous extension of discrete text: a shift in evaluation emphasis

Once the state space of the system is changed, the issue at the objective level changes accordingly. For conventional autoregressive language models, the training objective and evaluation quantities are naturally well aligned: maximum likelihood training directly corresponds to probability fitting over text, and likelihood and perplexity therefore have a clear and stable interpretation. In hierarchical continuous latent-space models, however, the actual training path is no longer direct token-level maximum likelihood, but a hierarchical objective jointly composed of reconstruction, latent prior learning, and representation regularization.

This can be seen from the relation between the ELBO and the true marginal likelihood:

\[ -\mathcal{L}_{\mathrm{ELBO}}(x) \;=\; -\log p_{\theta,\psi}(x) \;+\; \mathrm{KL}\!\big(q_\phi(z_0 \mid x)\,\|\,p_{\theta,\psi}(z_0 \mid x)\big). \tag{6} \]

This shows that even at the level of the ELBO, the training objective is already separated from the true log-likelihood by a variational inference gap. Furthermore, in the actual training of Cola DLM, the model must jointly learn latent reconstruction, continuous prior fitting, and representation stabilization. The quantity being optimized is therefore not a single token-level likelihood in the classical sense.

For this reason, the mismatch should not be interpreted as a failure to learn, but rather as evidence that the model is learning something different. For autoregressive models and other paradigms that directly fit discrete distributions, likelihood and perplexity remain highly informative because they are naturally aligned with the training objective. For hierarchical continuous latent-space models, by contrast, the central issue is no longer whether local discrete distributions are fitted as sharply as possible, but whether higher-level semantic structures are effectively organized, whether the latent prior is well learned, and whether the final generations satisfy the actual task requirements.

From the perspective of systematic modeling, this phenomenon is in fact expected: when the state space expands from surface tokens to hierarchical latent variables, the optimization target correspondingly shifts from precise fitting of local discrete distributions to the organization of higher-level semantic structure, stable latent prior learning, and satisfaction of the true generative objective. For this route, generation-oriented metrics are therefore often more closely aligned with what the model is actually trained to do than perplexity. More importantly, model potential is often reflected more clearly in scaling behavior than in any single static likelihood value: what matters is whether capability continues to improve steadily as model size, data, and compute increase, rather than whether local fit under a particular pointwise metric is better.

This can also be connected to the perspective of the three governing curves developed in the theoretical analysis of the paper. For Cola DLM, the applicability of this route is not determined by a single likelihood value, but by whether three conditions hold simultaneously: the representation rate–distortion curve is already favorable at relatively low rate, the approximation error of the latent prior continues to decrease, and the inference gap remains controllable. In other words, the advantage of this route is not guaranteed automatically by latent variables or flow-based modeling themselves; it depends on whether the data truly contains a compressible global semantic structure, and whether the model can learn, fit, and realize that structure in a stable manner.

The second theme of this paper is therefore not merely that perplexity is inadequate, but that evaluation language itself must change once representation and objective have changed. For this class of models, generation quality and scaling behavior are often closer to the model's true capability and long-term potential than traditional perplexity.

3 · Unified models, model–environment interaction, and the value of multimodal unification

If we return again to the model–environment formalization in Eq. (1) and Eq. (2), the third theme becomes more natural. The importance of unified models does not lie merely in placing multiple modalities within a single parameterized network, but in changing the structure of the environment in which the model learns. In the real world, observations, transitions, and feedback are usually not generated independently across modalities; rather, they are often jointly determined by a shared latent state. A more general learning system therefore requires not a set of isolated modality interfaces placed side by side, but unified representations that can enter the same interaction state and share the same dynamical constraints.

This is closely related to two broader views of intelligence. One influential view understands intelligence as a collection of skills across tasks (Chollet, 2019). Under this view, a system becomes more capable because it can solve problems across more domains and under more diverse forms of supervision and interaction. The recent development of large language models partly reflects this tendency. A representative example is the progress of code agents that can operate within command-line environments. In such environments, the observation space, action space, and feedback mechanism are unusually well aligned with discrete symbolic representations. Interaction trajectories are easy to record, and correctness is often straightforward to verify, so these environments provide dense and precise learning signals.

Another view, closer to the world-model perspective, holds that intelligence consists in acquiring an internal model of the structure and dynamics of the world. Recent work on world models (Tu et al., 2025) moves in this direction by seeking to learn richer environmental dynamics, thereby supporting stronger generalization and more realistic interaction. From this perspective, the question is not only how many tasks a model can solve, but whether it learns in an environment whose structure is rich enough to induce the right abstractions. The environment therefore becomes central: a model can only internalize the regularities that are actually present in the observations, transitions, and feedback it encounters.

This can also be written more formally. Let the observation at step \(t\) be multimodal,

\[ o_t = \big(o_t^{(1)},\, o_t^{(2)},\, \dots,\, o_t^{(M)}\big), \qquad o_t^{(m)} \in \mathcal{O}^{(m)}, \tag{7} \]

and suppose there exists a joint latent state

\[ z_t = \Phi\!\left(o_t^{(1)},\, \dots,\, o_t^{(M)}\right), \tag{8} \]

such that feedback and transition depend primarily on this joint state rather than on marginal factorizations over modalities:

\[ \xi_t,\, \omega_{t+1} \;\sim\; p\!\left(\xi_t,\, \omega_{t+1} \mid z_t,\, a_t\right). \tag{9} \]

If the true environmental dynamics satisfy

\[ p\!\left(\xi_t,\, \omega_{t+1} \mid o_t,\, a_t\right) \;\neq\; \prod_{m=1}^{M} p_m\!\left(\xi_t^{(m)},\, \omega_{t+1}^{(m)} \mid o_t^{(m)},\, a_t^{(m)}\right), \tag{10} \]

then the learning problem is structurally non-separable across modalities. In such a case, treating each modality as an independent channel and only combining them superficially is generally insufficient. The theoretical significance of unified models lies precisely in the fact that the environment itself is non-separable in the sense of Eq. (10): the regularities that determine useful feedback are joint regularities rather than regularities defined on the marginal distribution of each modality.

This clarifies why multimodal unification is not merely an engineering convenience. Its purpose is not simply to process multiple data types with one backbone, but to allow the model to learn in an environment whose observation, transition, and supervision structure more faithfully reflects the coupled regularities of the real world. In such an environment, both inputs and outputs may be multimodal; useful feedback may depend on how different modalities constrain each other jointly; and the learned internal state should ideally reflect these joint constraints.

This also explains why text has long been the most difficult component in unified models. Images and videos naturally operate in continuous spaces, whereas text is a prototypically discrete modality. If they are to enter a common interaction state and share latent dynamics, a severe representational mismatch immediately arises. This is precisely one of the central obstacles repeatedly identified in recent unified-model research (Deng et al., 2025). In this sense, the significance of Cola DLM lies not only in proposing another text generator, but in providing a natural interface through which discrete text can enter a continuous latent space.

If discrete text is mapped into a continuous latent variable through

\[ z^{\mathrm{text}} \sim q_\phi(z \mid x^{\mathrm{text}}), \qquad x^{\mathrm{text}} \sim p_\eta(x \mid z^{\mathrm{text}}), \tag{11} \]

then text acquires an interface compatible with other continuous modalities. One may then define a unified interaction state

\[ \tilde z_t \;=\; \Psi\!\left(z_t^{\mathrm{text}},\, z_t^{\mathrm{img}},\, z_t^{\mathrm{vid}},\, \dots \right), \tag{12} \]

and perform state evolution, decision making, and feedback modeling at this level. Equations (11)–(12) formalize why Cola DLM may matter beyond text generation itself: its role is not only to generate text through a different path, but to provide a bridge through which an intrinsically discrete modality can participate in a continuous multimodal interaction state. In other words, it reduces the structural mismatch that otherwise prevents text from naturally entering a shared continuous environment.

This is why the broader significance of Cola DLM is better understood through model–environment interaction than through single-modality benchmarks alone. If learning is viewed as the optimization of Eq. (2) in richer and more realistic environments, then unified models matter because they expand the environments in which the model can learn. If text is to participate fully in such environments, then a bridge such as that in Eq. (11) becomes especially desirable. In this sense, Cola DLM is not merely an alternative text generator; it can also be understood as a candidate mechanism for aligning discrete text with continuous multimodal learning systems.

4 · The three themes under a unified perspective

In summary, the three themes of this paper are not separate supplementary discussions, but three manifestations of the same systematic problem. The first concerns the representation level: whether text should be modeled entirely on the token surface, or whether higher-level semantics can be organized in an independent latent state. The second concerns the objective level: once the model is trained through latent transport, reconstruction, and regularization rather than direct token-level maximum likelihood, which metrics remain genuinely aligned with the learning problem. The third concerns the environment level: if learning is ultimately model–environment interaction, then what kind of environment future models should inhabit, and what representational interfaces are needed for different modalities to become compatible within it.

From this perspective, autoregressive language modeling occupies a self-consistent corner of the design space: representation is tightly bound to surface tokens, the training objective is direct likelihood maximization, and the environment is largely symbolic and text-centered. The route explored in this work changes all three assumptions simultaneously. It introduces a hierarchical latent-variable representation for text, thereby changing the representational assumption; it moves optimization away from direct token-level likelihood, thereby weakening the central interpretive role of perplexity; and it provides a continuous interface for discrete text, thereby making text potentially more compatible with multimodal environments that are more naturally expressed in continuous latent space.

We therefore hope that the contribution of this work is not only a viable alternative path for text generation, but also a more systematic way of thinking that jointly considers representation, objective alignment, and environment design. More broadly, we hope it encourages future research to rethink text, images, videos, and other modalities not as isolated domains that must be solved separately, but as components of a larger learning system in which unified representation, unified objectives, and unified environments may become increasingly central to the development of more general multimodal intelligence.

Cite

BibTeX

@article{guo2026cola,
  title   = {Cola DLM: Continuous Latent Diffusion Language Model},
  author  = {Guo, Hongcan and Zhao, Qinyu and Zhao, Yian and Nie, Shen
             and Zhu, Rui and Guo, Qiushan and Wang, Feng and Yang, Tao
             and Zhao, Hengshuang and Wei, Guoqiang and Zeng, Yan},
  journal = {arXiv preprint arXiv:2605.06548},
  year    = {2026}
}