Wait, can AI models actually generate video with audio together now?

I just heard about some new AI technology that can create videos with sound at the same time. Is this actually real? I thought you needed separate models for video generation and audio generation, and then you had to combine the results afterwards.

Can someone explain how this works? Are there really AI systems now that can produce both the visual and audio parts of a video in one go? This seems pretty crazy if it’s true. What kind of quality are we talking about here?

I’m curious about which companies or research groups are working on this stuff. Has anyone tried these tools yet?

Yeah, this tech definitely exists and it’s moving fast. I’ve been tracking it for about a year now, and there are models that generate video and audio together instead of making them separately and trying to sync them up later. The breakthrough is in multimodal training - the AI learns how visuals and audio connect during training rather than treating them as completely different things. It gets that when you see something happen, there’s usually a specific sound that goes with it. Meta’s doing a lot of work here, plus other big tech companies.

Quality really depends on what you’re making and how long it is. Some short clips look genuinely impressive, but longer videos still drift out of sync. Works best with stuff like music performances or nature footage where the audio-visual connections are pretty predictable. Complex dialogue and detailed sound design? Still tough. But honestly, the progress just in the last six months has been huge.

I actually played around with these systems last month when we were exploring automated content generation options for production work.

The concept’s pretty straightforward. These models use joint embedding spaces - they process video frames and audio spectrograms simultaneously during training, learning how they relate statistically. When you generate content, both parts get created together instead of being stitched afterward.
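To make that concrete, here’s a toy sketch of the joint-embedding idea in PyTorch. This is not any vendor’s actual architecture - the encoder classes, dimensions, and the random tensors standing in for frames and spectrograms are all made up for illustration - but it shows the basic mechanism: matching video/audio pairs get pulled toward each other in a shared space during training.

```python
# Toy sketch of a joint audio-video embedding space (illustrative only,
# not a real production model). Random tensors stand in for real clips.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Collapse a stack of frame features into one embedding vector."""
    def __init__(self, frame_dim=512, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames):                 # frames: (batch, time, frame_dim)
        return self.proj(frames.mean(dim=1))   # average-pool over time

class AudioEncoder(nn.Module):
    """Collapse an audio spectrogram into one embedding vector."""
    def __init__(self, n_mels=64, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, spec):                   # spec: (batch, time, n_mels)
        return self.proj(spec.mean(dim=1))

def contrastive_loss(v_emb, a_emb, temperature=0.07):
    """Pull matching video/audio pairs together, push mismatched pairs apart."""
    v = F.normalize(v_emb, dim=-1)
    a = F.normalize(a_emb, dim=-1)
    logits = v @ a.t() / temperature           # similarity of every video to every audio
    targets = torch.arange(len(v))             # the diagonal holds the true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 clips, each with 16 "frames" and a 32-step mel spectrogram.
frames = torch.randn(8, 16, 512)
spec = torch.randn(8, 32, 64)
loss = contrastive_loss(VideoEncoder()(frames), AudioEncoder()(spec))
print(loss.item())
```

A real generator obviously adds a lot on top of this, but an alignment objective along these lines is roughly where the “it learns which sounds go with which visuals” behaviour comes from.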

Runway and Stability AI have decent implementations you can use right now. Results vary wildly though. I generated about 50 test clips for our team.

What works: basic environmental sounds, simple instruments, generic speech. The model gets that rain makes specific sounds or guitar strings create predictable audio.

What breaks: anything needing precise timing, complex layered audio, or specific voice characteristics. I tried generating someone typing and the key presses were completely out of sync with the finger movements.

Biggest limitation I noticed is duration. Anything over 10-15 seconds shows serious drift - the model loses coherence between what you see and hear.
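If anyone wants to put a number on that drift, one rough trick is to cross-correlate per-frame motion energy against the audio loudness envelope and see where the correlation peaks. Here’s a numpy sketch - the function names and the synthetic “clip” are mine, purely for illustration; on real output you’d decode the frames and samples from the file with whatever tool you normally use.

```python
# Rough sketch of checking audio-video sync by cross-correlating motion
# energy with the audio loudness envelope. Synthetic data only.
import numpy as np

def motion_energy(frames):
    """Mean absolute difference between consecutive frames (one value per frame)."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return np.concatenate([[0.0], diffs.mean(axis=(1, 2))])

def audio_envelope(samples, sample_rate, fps):
    """Average absolute amplitude over windows that match the video frame rate."""
    window = sample_rate // fps
    n = (len(samples) // window) * window
    return np.abs(samples[:n]).reshape(-1, window).mean(axis=1)

def estimate_offset_frames(frames, samples, sample_rate, fps, max_lag=30):
    """Lag (in frames) at which motion energy and loudness line up best."""
    m = motion_energy(frames)
    a = audio_envelope(samples, sample_rate, fps)
    n = min(len(m), len(a))
    m, a = m[:n], a[:n]
    m = (m - m.mean()) / (m.std() + 1e-8)
    a = (a - a.mean()) / (a.std() + 1e-8)
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.correlate(m, np.roll(a, lag))[0] for lag in lags]
    return lags[int(np.argmax(scores))]

# Fake 10-second clip at 24 fps: a visual "flash" once per second plus a matching click.
fps, sr = 24, 16000
frames = np.zeros((fps * 10, 8, 8))
frames[::fps] = 1.0                  # visual event once per second
samples = np.zeros(sr * 10)
samples[::sr] = 1.0                  # audio click once per second
print(estimate_offset_frames(frames, samples, sr, fps))  # best-match lag, close to 0 here
```

It’s crude - it only catches sounds tied to visible motion - but it’s enough to see whether the lag grows over the length of a clip.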

We ended up sticking with our traditional pipeline for client work. But for rapid prototyping or placeholder content, these tools are actually pretty useful now.

I work in post-production and, yes, this has been on my radar lately too. The tech exists, but it’s nowhere near as smooth as people think. These models are trained on huge datasets that pair video with audio, allowing them to pick up on connections between what you see and what you hear. OpenAI and Google have both showcased this capability in their newest models. Instead of generating video first and then adding audio, these systems attempt to predict both simultaneously.

However, there are significant issues currently. The audio tends to be quite bland, consisting of basic ambient noise, simple music, or very straightforward speech. I’ve experimented with some research demos, and while the sync is quite impressive for simple sounds like footsteps or water, anything requiring precise timing or intricate audio remains well below the professional quality needed. Most of us are still reliant on separate specialized models because the current quality isn’t adequate for client work.
