Shared Space Is Not Enough
As far as I can tell, one of the main challenges in making truly intelligent VLMs boils down to how we handle the differences between text and images inside the model. We've been scaling models and feeding them tons of data, and yes, they improve on certain tasks. But they still feel fragile, more like they're guessing based on correlations than really understanding.
1 The Core Tension
Text and images are fundamentally different. Their data structures and demands don't line up neatly, yet current models try to cram them into the same shared space.
The problem is that each modality ends up competing for the limited representational capacity of that space to serve its own needs, and the resulting compromise serves neither modality particularly well.
Consider the final hidden state in a transformer:
token → layers → final hidden state
By design, it compresses everything into a representation optimized for predicting the next text token. Expecting it to also serve as a rich spatial descriptor for images feels like a fundamental mismatch.
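To make the mismatch concrete, here is a minimal PyTorch sketch of how that final hidden state is typically consumed: it goes straight into a linear LM head that produces next-token logits, and nothing in that objective asks it to remain a useful spatial descriptor. (The sizes and module names are illustrative assumptions, not any specific model.)

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000  # illustrative sizes

class TinyDecoderLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        h = self.embed(token_ids)
        # causal mask: True marks future positions that may not be attended
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.layers(h, mask=causal)     # final hidden states
        return self.lm_head(h)              # trained purely to produce next-token logits

logits = TinyDecoderLM()(torch.randint(0, vocab_size, (1, 16)))
```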
2 Modality Separation at the Boundaries
The solution may start with separation. Instead of merging inputs too early and splitting outputs too late in a single backbone, we should:
- Insert dedicated transformer blocks at the modality boundaries
- Allow each modality to process its inputs before entering the shared space
- Decode outputs using separate blocks, not just projection layers
Why transformer blocks? Because mapping between raw modalities and an abstract shared space is complex. It's not a linear transformation; it's computation-heavy and needs deep context mixing.
This gives us a structure like:
text/image → modality-specific blocks → shared core → decoder-specific blocks → output
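A minimal PyTorch sketch of that layout, with all block counts and dimensions as illustrative assumptions rather than a reference design:

```python
import torch
import torch.nn as nn

def blocks(n_layers, d=512, heads=8):
    layer = nn.TransformerEncoderLayer(d, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class BoundarySeparatedVLM(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        # modality-specific blocks at the input boundary
        self.text_in = blocks(2, d)
        self.image_in = blocks(2, d)
        # shared abstract core
        self.core = blocks(8, d)
        # decoder-specific blocks at the output boundary (not just projection layers)
        self.text_out = blocks(2, d)
        self.image_out = blocks(2, d)

    def forward(self, text_emb, image_emb):
        t = self.text_in(text_emb)       # text enters through its own blocks
        v = self.image_in(image_emb)     # images enter through their own blocks
        shared = self.core(torch.cat([t, v], dim=1))
        n_text = t.size(1)
        return self.text_out(shared[:, :n_text]), self.image_out(shared[:, n_text:])
```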
3 The Role of the Shared Core
If we build a middle "abstract" space, what should it actually do?
For image tokens, a few architectural shifts are essential:
3.1 Bidirectional Attention on Image Tokens
Applying causal attention masks to image patches makes little sense. Vision isn't sequential.
Instead:
- Let every image token attend to every other token
- Avoid directional bias
- Treat image understanding as spatial reasoning
This turns image processing into an active reasoning task where each token gathers context from everywhere else, building a global understanding rather than just passing features forward.
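One way to express this is a block attention mask that keeps text causal while letting image tokens attend to each other freely. The sketch below assumes image tokens precede text and uses PyTorch's boolean mask convention; both choices are assumptions, not requirements:

```python
import torch

def mixed_attention_mask(n_image: int, n_text: int) -> torch.Tensor:
    """Boolean mask in PyTorch convention (True = position may NOT be attended).
    Image tokens attend to each other in both directions; text stays causal
    but can see the whole image."""
    n = n_image + n_text
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:n_image, :n_image] = True                   # bidirectional over image tokens
    allowed[n_image:, :n_image] = True                   # text attends to the full image
    allowed[n_image:, n_image:] = torch.tril(            # causal over text tokens
        torch.ones(n_text, n_text, dtype=torch.bool))
    return ~allowed

print(mixed_attention_mask(n_image=4, n_text=3).int())
```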
3.2 Masked Training for Active Image Understanding
Masked Autoencoders showed that hiding parts of an image forces models to build richer representations. Instead of passive encoders feeding image features to a text loss, we need to:
- Replace patches with learned mask tokens
- Train the model to decode masked regions
- Encourage deeper image-specific context learning
This gives vision its own task-agnostic training signal, parallel to language modeling.
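A minimal MAE-style sketch of that signal, where the patch size, mask ratio, and pixel-reconstruction loss are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class MaskedImageObjective(nn.Module):
    def __init__(self, d=512, patch_dim=16 * 16 * 3, mask_ratio=0.5):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))   # learned mask token
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(d, patch_dim)                  # predict raw patch content
        self.mask_ratio = mask_ratio

    def forward(self, patches):                                # patches: [B, N, patch_dim]
        x = self.embed(patches)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)  # replace patches with mask tokens
        pred = self.decode(self.encoder(x))
        # loss only on the masked regions: vision gets its own prediction task
        return ((pred - patches) ** 2)[mask].mean()
```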
3.3 Multiscale Vision Tokens
Images operate across scales:
- Local texture
- Mid-level shapes
- Global structure
A robust model must predict masked patches at multiple resolutions simultaneously, training the encoder end-to-end.
Input image → patch + scale → masked token prediction → context learning
This avoids reliance on pretrained encoders and still lets CLIP-like image–text alignment emerge inside the shared space.
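As a rough sketch, multiscale tokens can be produced by patchifying the same image at several resolutions before masking; the scales, patch size, and bilinear resizing below are assumptions:

```python
import torch
import torch.nn.functional as F

def multiscale_patches(image: torch.Tensor, patch: int = 16, scales=(1.0, 0.5, 0.25)):
    """image: [B, C, H, W] -> one patch tensor per scale, from fine texture to global structure."""
    outputs = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        # unfold into non-overlapping patch vectors: [B, num_patches, C * patch * patch]
        patches = F.unfold(resized, kernel_size=patch, stride=patch).transpose(1, 2)
        outputs.append(patches)
    return outputs

tokens = multiscale_patches(torch.randn(1, 3, 224, 224))
print([t.shape for t in tokens])  # finer scales keep local texture, coarser ones global layout
```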
4 Connecting Understanding to Generation
Here’s where things get exciting.
By using the same shared core and masked prediction objective, we can unify image understanding and image generation.
- Add a diffusion model head to the shared space
- Use the same latent features for both tasks
- Train on both masked understanding and image generation
This symmetry helps the model learn representations that serve both perception and synthesis.
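A toy sketch of that symmetry: one shared core whose latents feed both a masked-patch head and a noise-prediction head. The single linear "denoiser" is only a stand-in for a real diffusion head, and every size and pooling choice here is an assumption:

```python
import torch
import torch.nn as nn

class SharedCoreTwoHeads(nn.Module):
    """One shared latent space serving both masked understanding and generation."""
    def __init__(self, d=512, patch_dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=6)
        self.understand_head = nn.Linear(d, patch_dim)   # reconstruct masked patches
        self.noise_embed = nn.Linear(patch_dim, d)
        self.generate_head = nn.Linear(d, patch_dim)     # predict the added noise (toy diffusion head)

    def forward(self, tokens, noisy_patches):
        latents = self.core(tokens)                      # the same features serve both tasks
        recon = self.understand_head(latents)            # masked-understanding objective
        cond = latents.mean(dim=1, keepdim=True)         # pooled shared latents as conditioning
        eps = self.generate_head(self.noise_embed(noisy_patches) + cond)
        return recon, eps
```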
5 The Trick: Dual Input During Training
Feed in both the masked and the unmasked image at every scale. Yes, it doubles the token count. But:
- It gives the model full ground truth to learn from
- It allows comparison between prediction and reality
- It enables self-correction by detecting inconsistencies
During generation, the model can feed its own output into the next resolution step. That feedback loop allows refinement.
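A small sketch of the dual-input assembly (zero-filling stands in for a learned mask token, and the 50% mask ratio is an assumption):

```python
import torch

def dual_input_tokens(scale_patches, mask_ratio=0.5):
    """At every scale, concatenate the unmasked tokens (full ground truth) with a masked
    copy (the prediction target). Token count doubles, but the model can compare its
    predictions against reality within a single forward pass."""
    views = []
    for patches in scale_patches:                               # each: [B, N, D], one tensor per scale
        mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
        masked = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # stand-in for a learned mask token
        views.append(torch.cat([patches, masked], dim=1))       # unmasked + masked, side by side
    return torch.cat(views, dim=1)

tokens = dual_input_tokens([torch.randn(1, 196, 512), torch.randn(1, 49, 512)])
print(tokens.shape)   # torch.Size([1, 490, 512]) — doubled at every scale
```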
6 Benefits of This Architecture
What does all this buy us?
- Stronger understanding from active training, not passive feature reuse
- Better generation by tying perception to synthesis
- Surprisal-based reasoning by masking and comparing outputs
- Visual chain-of-thought from deeply aligned modality interaction
Instead of guessing from correlations, models may start reasoning across images and text more robustly.
7 Open Challenges
This isn't free. Several technical challenges remain:
- How to interleave text and image streams efficiently
- How to scale bidirectional attention across long image sequences
- How to initialize special tokens meaningfully in this architecture
Hybrid approaches like state space models or routing networks might help.
Modality-Aware Architecture Is Key
To move beyond correlation, we need architectures that respect the differences between modalities while enabling shared reasoning.
This means:
- Separation at the boundaries
- Rich context building in the core
- Coupled training for perception and generation
- Multiscale, bidirectional, active vision modeling
Real progress in VLMs will come not from brute scale alone, but from architectural clarity.
Questions for Reflection
- Are we optimizing our architectures for understanding or just for benchmarks?
- What new tasks could push models to demonstrate real reasoning, not correlation?