When will multimodal generations truly bamboozle me?

In other words, when will those AI-generated videos trick me so thoroughly that I can watch something longer than five minutes and think it was made with a real script written by people, real actors, real scenes, and the typical non-AI computer-generated imagery on top?

TLDR?

Video generation is held back for these reasons:

  • We don't have a way to extract, preserve, and selectively reproduce context from the latent space.
  • We don't have a way to transform specific attributes within the latent space over time.
  • Video models are gaming their reward algorithms to make the video equivalent of textual fluff.
  • We need a breakthrough in context and attention.
  • Tools don't synchronize with each other.

I have honestly been bamboozled by short-form content, and that will likely continue for a very long time. Some of it is believably entertaining — like a squirrel eating spicy ramen, gasping, and going back for more — while other clips are comically absurd, like an elder opening a door to a "cute little tank" that blows their house up. Or the worst kind: literal misinformation about wars, conflicts, disasters, and emergencies.

It has been easy to bail out of long-form generated videos that fall into these categories:

  • Slop "edutainment" videos that talk about nonsense like the Egyptian pyramids generating electricity from the sun, often targeting children. 🤮
  • Engagement farming content that skirts or violates fair use by telling a compressed story on top of movie or television show clips.
  • A podcast with one or more robot personalities discussing something so new and recent that there are few search results for the topic. There's no real graphical content; it's just voices over a logo.
  • An "educational" video of slides with a robot speaker. The slides are generated with Nano Banana Pro which has a distinct default style. And there's a new one every forty minutes for days by that account.

Edutainment and other forms of slop video targeting children were already a problem before AI affordability and capability democratized the creation of vast amounts of slop. The rest follow similarly, now that anyone can generate a book on Amazon within hours or days to compete with the original's sales.

The common signals for these cases are low engagement numbers and a high volume of content. But those are meta-signals; let's focus on what makes the content itself so easy to spot.

The most obvious signal to everyone is how much fluff the language has, in both written and verbal form. When reading or listening, it's like there's activation in the head without consistent meaning being received and understood.

For anyone who's dabbled with agents, you learn to skim past fluff quickly. Speech, however, cannot be skimmed, and the same goes for video. As a viewer, I am burdened with processing information without synthesizing significant meaning from the experience. It drains me.

Reasoning models do better at this. Some of that fluff is kept within the <|reasoning|> tokens.

Videos, for whatever reason, consistently skimp on writing quality, especially if the video is concurrently generating the script it speaks. Which leads to another identifier: consistency over time, in both scripting and visuals.

Until we have composable latent spaces, I think we'll continue to experience Harry Potter Balenciaga from 2023:

This specific case was likely made with a combination of Midjourney, LivePortrait, and ElevenLabs. It's easy to spot today, after seeing enough robotic content that the novelty and the worry over deepfakes died down.

In fact, it became a certain YouTuber's style, except with Unreal Engine's live-capture capabilities locked to just the head.

Except in 2025, Sora and Veo actually delivered coherent short-form video. That is, more than a puppeted head on a still image, for a short range of time.

Videos by Veo, Sora, Seedance, Kling O3, and Grok Imagine Video all have the same problem: they can only stay consistent within their context window. Veo can extend one video sample into another, or extend within a single sample, but the only technique seems limited to matching artifacts at the start and/or end where it must stay coherent with the adjoining shot. The latent space of a character's design and constraints is lost with each generation.

In the 2026 remake, which I suspect used Veo and Suno, Harry Potter Balenciaga is far more visually consistent from shot to shot with the character reference (first seen in the original 2023 video). However, as you can see in the background, the photographers are not consistent from shot to shot. There's no latent space of things and places that is shared yet flexible when the environment should change.

I believe the next advancement that will make video more convincing is consistency on details of the world.

In addition to the world details, a few more specific examples can be heard and seen in the 2026 remake.

For example, the generated voices vary between scenes for the same character. One can say "This character is British, an old guy, kind of like Dumbledore" and get a facsimile of Dumbledore. Yet from scene to scene, even with the same character reference, the voice can change entirely. The words "kind of like Dumbledore" don't have the depth to reliably recreate whatever parameters go into the voice generation, and I don't expect we will ever have consistent text-to-vocal-parameter generation this way. The accent of a character shouldn't drift over time.

The voices also sounded robotic, though I expect that to improve over the next year given what we're seeing from places like Qwen-TTS. The lip syncing was especially bad too, which makes me think ElevenLabs was layered on top again to get the "like Harry Potter actors" voices.

Vocal consistency and environmental consistency are related; to improve on both, we need latent space extraction and composition. In the same way that games have 3D models, acoustic models (for environments), and light models (or at least approximations through shaders), video generation needs an intermediate way to encode, preserve, and reproduce its own assets.
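To make the idea concrete, here's a toy sketch of what extracting and composing such assets might look like. Everything here is hypothetical — the names, the `extract`/`compose` API, and the random vectors standing in for real latents are all illustration, not any model's actual interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LatentAsset:
    """A reusable, versionable chunk of latent space: a character, a place, a voice."""
    name: str
    kind: str              # e.g. "character", "environment", "voice"
    embedding: np.ndarray  # fixed-size latent vector extracted from a generation

def extract(embedding: np.ndarray, name: str, kind: str) -> LatentAsset:
    """Freeze a latent region so later shots can reuse it instead of re-sampling."""
    return LatentAsset(name, kind, embedding.copy())

def compose(scene: np.ndarray, assets: list, weight: float = 0.8) -> np.ndarray:
    """Blend preserved assets into a new scene latent: the assets anchor identity,
    while the scene latent supplies everything that *should* change shot to shot."""
    anchor = np.mean([a.embedding for a in assets], axis=0)
    return weight * anchor + (1 - weight) * scene

rng = np.random.default_rng(0)
harry = extract(rng.normal(size=512), "harry", "character")

# Two different shots share the same anchored character,
# so they stay close in latent space despite fresh scene noise.
shot1 = compose(rng.normal(size=512), [harry])
shot2 = compose(rng.normal(size=512), [harry])
```

The point isn't the arithmetic; it's that the asset is a first-class object that can be stored, diffed, and reused across generations instead of being re-derived (and lost) every time.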

While this may be in part DemonFlyingFox's artistic vision, constant zoom-in is video generation's analog to text-generation's fluff. Real video does have zoom-in — just not this much.

Video models have reward models that guide them toward generating coherent graphical changes over monotonic time. Zooming in is low risk and likely satisfies another reward for novelty over the course of the video, while zooming out or panning requires generating new information in the scene that must stay consistent as the scene progresses. That is fundamentally harder to train and reward.
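A toy illustration of the asymmetry, under an assumed (and deliberately naive) novelty reward. The 1-D "frames" and the reward function are made up for the sketch; the point is that a zoom scores novelty while synthesizing nothing new, whereas a pan must pay for fresh content:

```python
import numpy as np

rng = np.random.default_rng(42)
WIDTH = 256
frame = rng.normal(size=WIDTH)  # stand-in for one frame of latent "pixels"

def zoom_in(frame, factor=1.05):
    """Crop the center and stretch it back out: every output pixel is
    interpolated from content the model already committed to."""
    keep = int(len(frame) / factor)
    start = (len(frame) - keep) // 2
    crop = frame[start:start + keep]
    out = np.interp(np.linspace(0, keep - 1, len(frame)), np.arange(keep), crop)
    return out, 0  # zero newly synthesized pixels

def pan_right(frame, shift=16):
    """Slide the window: the model must synthesize `shift` brand-new pixels
    that stay consistent with the rest of the scene."""
    fresh = rng.normal(size=shift)
    return np.concatenate([frame[shift:], fresh]), shift

def novelty_reward(prev, cur):
    """A naive reward: 'something changed between frames'."""
    return float(np.abs(cur - prev).mean())

zoomed, zoom_cost = zoom_in(frame)
panned, pan_cost = pan_right(frame)
# Both moves earn novelty reward, but only the pan pays a new-content cost.
```

Under a reward like this, zooming is the dominant strategy: full novelty credit at zero generation risk.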

The next step in video generation is to figure out a cost model that penalizes cinematic specification gaming.

This last problem affects text models too, especially code over longer time horizons. The further along, the less connected the pieces feel, until it is truly just a jumble of distracted pieces. The scripts drift. Perhaps, like code, a harness for script writing could be designed, though it must be amenable to judicious tweaks over time and out of order.

Attention is quadratic, and context rot is still a problem. Something about context and attention must improve to handle contexts large enough to stay consistently paced, consistently interesting, and intentional (which is hard to argue when it comes to an algorithm) in navigating the highs and lows of telling a story.

Uncle from the Jackie Chan cartoon: "One more thing."

One more thing... Video will always be a process that composes many simultaneous graphical and auditory inputs over one another. We will never have one model to rule them all. A workflow of tools that expect to work with and synchronize with each other is necessary for long-form video to overcome the uncanny valley.

But also: should we put more into this technology? The people of Hollywood have made their stance clear. Entertainment is fundamentally a human activity, bringing joy to other humans. AI-generated material only brings engagement in the disguise of entertainment.

I go through this exercise because it's an interesting perspective shift for realizing why Claude Opus keeps making the same mistakes again and again. Everything is markdown. There's no latent space preservation that can be extracted and version controlled over time. Maybe it will be a thing in two more years. Who knows.

Footnotes

  1. Reward models guide training or reinforcement learning without active human involvement towards a measurable goal.