Shared Space Is Not Enough
As far as I can tell, one of the main challenges in making truly intelligent VLMs boils down to how we handle the differences between text and images inside the model. We've been scaling models and feeding them tons of data, and yes, they improve on certain tasks. But they still feel fragile, more like they're guessing based on correlations than really understanding.
1 The Core Tension
Text and images are fundamentally different. Their data structures and demands don't line up neatly, yet current models try to cram them into the same shared space.
The problem is that each modality ends up competing for the limited representational capacity of that space to serve its own needs, and the resulting compromise serves neither modality particularly well.
Consider the final hidden state in a transformer:
token → layers → final hidden state
By design, it compresses everything into a representation optimized for predicting the next text token. Expecting it to also serve as a rich spatial descriptor for images feels like a fundamental mismatch.
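To make the mismatch concrete, here is a minimal PyTorch sketch of how that final hidden state is typically consumed: it goes straight into a linear LM head that produces next-token logits, and nothing in that objective asks it to remain a useful spatial descriptor. (The sizes and module names are illustrative assumptions, not any specific model.)

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000  # illustrative sizes

class TinyDecoderLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        h = self.embed(token_ids)
        # causal mask: True marks future positions that may not be attended
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.layers(h, mask=causal)     # final hidden states
        return self.lm_head(h)              # trained purely to produce next-token logits

logits = TinyDecoderLM()(torch.randint(0, vocab_size, (1, 16)))
```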
2 Modality Separation at the Boundaries
The solution may start with separation. Instead of merging inputs too early and splitting outputs too late in a single backbone, we should:
- Insert dedicated transformer blocks at the modality boundaries
- Allow each modality to process its inputs before entering the shared space
- Decode outputs using separate blocks, not just projection layers
Why transformer blocks? Because mapping between raw modalities and an abstract shared space is complex. It's not a linear transformation; it's computation-heavy and needs deep context mixing.
This gives us a structure like:
text/image → modality-specific blocks → shared core → decoder-specific blocks → output
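A minimal PyTorch sketch of that layout, with all block counts and dimensions as illustrative assumptions rather than a reference design:

```python
import torch
import torch.nn as nn

def blocks(n_layers, d=512, heads=8):
    layer = nn.TransformerEncoderLayer(d, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class BoundarySeparatedVLM(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        # modality-specific blocks at the input boundary
        self.text_in = blocks(2, d)
        self.image_in = blocks(2, d)
        # shared abstract core
        self.core = blocks(8, d)
        # decoder-specific blocks at the output boundary (not just projection layers)
        self.text_out = blocks(2, d)
        self.image_out = blocks(2, d)

    def forward(self, text_emb, image_emb):
        t = self.text_in(text_emb)       # text enters through its own blocks
        v = self.image_in(image_emb)     # images enter through their own blocks
        shared = self.core(torch.cat([t, v], dim=1))
        n_text = t.size(1)
        return self.text_out(shared[:, :n_text]), self.image_out(shared[:, n_text:])
```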
3 The Role of the Shared Core
If we build a middle "abstract" space, what should it actually do?
For image tokens, a few architectural shifts are essential:
3.1 Bidirectional Attention on Image Tokens
Applying causal attention masks to image patches makes little sense. Vision isn't sequential.
Instead:
- Let every image token attend to every other token
- Avoid directional bias
- Treat image understanding as spatial reasoning
This turns image processing into an active reasoning task where each token gathers context from everywhere else, building a global understanding rather than just passing features forward.
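One way to express this is a block attention mask that keeps text causal while letting image tokens attend to each other freely. The sketch below assumes image tokens precede text and uses PyTorch's boolean mask convention; both choices are assumptions, not requirements:

```python
import torch

def mixed_attention_mask(n_image: int, n_text: int) -> torch.Tensor:
    """Boolean mask in PyTorch convention (True = position may NOT be attended).
    Image tokens attend to each other in both directions; text stays causal
    but can see the whole image."""
    n = n_image + n_text
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:n_image, :n_image] = True                   # bidirectional over image tokens
    allowed[n_image:, :n_image] = True                   # text attends to the full image
    allowed[n_image:, n_image:] = torch.tril(            # causal over text tokens
        torch.ones(n_text, n_text, dtype=torch.bool))
    return ~allowed

print(mixed_attention_mask(n_image=4, n_text=3).int())
```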
3.2 Masked Training for Active Image Understanding
Masked Autoencoders showed that hiding parts of an image forces models to build richer representations. Instead of passive encoders feeding image features to a text loss, we need to:
- Replace patches with learned mask tokens
- Train the model to decode masked regions
- Encourage deeper image-specific context learning
This gives vision its own task-agnostic training signal, parallel to language modeling.
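A minimal MAE-style sketch of that signal, where the patch size, mask ratio, and pixel-reconstruction loss are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskedImageObjective(nn.Module):
    def __init__(self, d=512, patch_dim=16 * 16 * 3, mask_ratio=0.5):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))   # learned mask token
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(d, patch_dim)                  # predict raw patch content
        self.mask_ratio = mask_ratio

    def forward(self, patches):                                # patches: [B, N, patch_dim]
        x = self.embed(patches)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)  # replace patches with mask tokens
        pred = self.decode(self.encoder(x))
        # loss only on the masked regions: vision gets its own prediction task
        return ((pred - patches) ** 2)[mask].mean()
```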
3.3 Multiscale Vision Tokens
Images operate across scales:
- Local texture
- Mid-level shapes
- Global structure
A robust model must predict masked patches at multiple resolutions simultaneously, training the encoder end-to-end.
Input image → patch + scale → masked token prediction → context learning
This avoids reliance on pretrained encoders and still lets CLIP-like image–text alignment emerge inside the shared space.
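As a rough sketch, multiscale tokens can be produced by patchifying the same image at several resolutions before masking; the scales, patch size, and bilinear resizing below are assumptions:

```python
import torch
import torch.nn.functional as F

def multiscale_patches(image: torch.Tensor, patch: int = 16, scales=(1.0, 0.5, 0.25)):
    """image: [B, C, H, W] -> one patch tensor per scale, from fine texture to global structure."""
    outputs = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        # unfold into non-overlapping patch vectors: [B, num_patches, C * patch * patch]
        patches = F.unfold(resized, kernel_size=patch, stride=patch).transpose(1, 2)
        outputs.append(patches)
    return outputs

tokens = multiscale_patches(torch.randn(1, 3, 224, 224))
print([t.shape for t in tokens])  # finer scales keep local texture, coarser ones global layout
```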
4 Connecting Understanding to Generation
Here’s where things get exciting.
By using the same shared core and masked prediction objective, we can unify image understanding and image generation.
- Add a diffusion model head to the shared space
- Use the same latent features for both tasks
- Train on both masked understanding and image generation
This symmetry helps the model learn representations that serve both perception and synthesis.
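A toy sketch of that symmetry: one shared core whose latents feed both a masked-patch head and a noise-prediction head. The single linear "denoiser" is only a stand-in for a real diffusion head, and every size and pooling choice here is an assumption:

```python
import torch
import torch.nn as nn

class SharedCoreTwoHeads(nn.Module):
    """One shared latent space serving both masked understanding and generation."""
    def __init__(self, d=512, patch_dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=6)
        self.understand_head = nn.Linear(d, patch_dim)   # reconstruct masked patches
        self.noise_embed = nn.Linear(patch_dim, d)
        self.generate_head = nn.Linear(d, patch_dim)     # predict the added noise (toy diffusion head)

    def forward(self, tokens, noisy_patches):
        latents = self.core(tokens)                      # the same features serve both tasks
        recon = self.understand_head(latents)            # masked-understanding objective
        cond = latents.mean(dim=1, keepdim=True)         # pooled shared latents as conditioning
        eps = self.generate_head(self.noise_embed(noisy_patches) + cond)
        return recon, eps
```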
5 The Trick: Dual Input During Training
Feed in both the masked and the unmasked image at every scale. Yes, it doubles the token count. But:
- It gives the model full ground truth to learn from
- It allows comparison between prediction and reality
- It enables self-correction by detecting inconsistencies
During generation, the model can feed its own output into the next resolution step. That feedback loop allows refinement.
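A small sketch of the dual-input assembly (zero-filling stands in for a learned mask token, and the 50% mask ratio is an assumption):

```python
import torch

def dual_input_tokens(scale_patches, mask_ratio=0.5):
    """At every scale, concatenate the unmasked tokens (full ground truth) with a masked
    copy (the prediction target). Token count doubles, but the model can compare its
    predictions against reality within a single forward pass."""
    views = []
    for patches in scale_patches:                               # each: [B, N, D], one tensor per scale
        mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
        masked = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # stand-in for a learned mask token
        views.append(torch.cat([patches, masked], dim=1))       # unmasked + masked, side by side
    return torch.cat(views, dim=1)

tokens = dual_input_tokens([torch.randn(1, 196, 512), torch.randn(1, 49, 512)])
print(tokens.shape)   # torch.Size([1, 490, 512]) — doubled at every scale
```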
6 Benefits of This Architecture
What does all this buy us?
- Stronger understanding from active training, not passive feature reuse
- Better generation by tying perception to synthesis
- Surprisal-based reasoning by masking and comparing outputs
- Visual chain-of-thought from deeply aligned modality interaction
Instead of guessing from correlations, models may start reasoning across images and text more robustly.
7 Open Challenges
This isn't free. Several technical challenges remain:
- How to interleave text and image streams efficiently
- How to scale bidirectional attention across long image sequences
- How to initialize special tokens meaningfully in this architecture
Hybrid approaches like state space models or routing networks might help.
Modality-Aware Architecture Is Key
To move beyond correlation, we need architectures that respect the differences between modalities while enabling shared reasoning.
This means:
- Separation at the boundaries
- Rich context building in the core
- Coupled training for perception and generation
- Multiscale, bidirectional, active vision modeling
Real progress in VLMs will come not from brute scale alone, but from architectural clarity.
Questions for Reflection
- Are we optimizing our architectures for understanding or just for benchmarks?
- What new tasks could push models to demonstrate real reasoning, not correlation?