After LLMs and agents, the next AI frontier: video language models
Tesla’s viral videos show its Optimus humanoid robot serving drinks to guests, a glimpse of AI in the real world that a new AI innovation, called world models, is expected to make more reliable. (For one, humanoid robots will do a better job of navigating rooms and serving guests their custom drinks.)
World models — which some refer to as video language models — are the new frontier in AI, following in the footsteps of the iconic ChatGPT and, more recently, AI agents. Current AI tech largely affects digital outcomes, but world models will allow AI to improve physical outcomes.
World models are designed to help robots understand the physical world around them, allowing them to track, identify and memorize objects. On top of that, just like humans planning their future, world models allow robots to determine what comes next — and plan their actions accordingly.
“If you think about how generative AI started…, the difference with world models is that it needs to know what is actually possible,” said TJ Galda, Nvidia’s senior director of product management for Cosmos, a world model.
Beyond robotics, world models simulate real-world scenarios. They could be used to improve safety features for autonomous cars or simulate a factory floor to train employees.
World models pair human experiences with AI in the real world, said Deepak Seth, director analyst at Gartner. “This human experience and what we see around us, what’s going on around us, is part of that world model, which language models are currently lacking,” Seth said.
Though today’s AI models and large language models (LLMs) can’t operate beyond the digital realm, world models will make human and AI collaboration possible in the physical world. (The humanoid robot population could reach 1 billion by 2050, Nvidia said, citing a recent Morgan Stanley study.)
In addition to Nvidia’s Cosmos, Google DeepMind has developed a world model called Genie 3. World models use complex mathematics and physical simulations to help robots comprehend, anticipate, and plan real-world actions, such as navigating a room or loading a dishwasher.
Cameras and sensors provide robots with raw visual and physical information about their surroundings. World models can then blend with multimodal systems to interpret visual or image-based commands before getting to work.
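In code, that hand-off might look like the toy sketch below: raw camera frames and sensor readings are bundled into an observation, and a multimodal step turns a natural-language or image-based command into a structured goal that a world-model planner could act on. The `Observation`, `Goal`, and `interpret_command` names and data shapes are illustrative assumptions, not any real robot stack.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Raw inputs a robot gathers each timestep (shapes are illustrative)."""
    camera_frames: list                                   # e.g. RGB frames from one or more cameras
    joint_positions: dict = field(default_factory=dict)   # proprioceptive sensor readings

@dataclass
class Goal:
    """Structured goal produced by a multimodal interpreter."""
    target_object: str
    action: str
    constraints: list = field(default_factory=list)

def interpret_command(obs: Observation, command: str) -> Goal:
    """Hypothetical multimodal step: ground a natural-language or image-based
    command in the current observation. A real system would call a
    vision-language model here instead of keyword matching."""
    if "drink" in command.lower():
        return Goal(target_object="cup", action="deliver",
                    constraints=["avoid_people", "keep_upright"])
    return Goal(target_object="unknown", action="explore")

# The Observation/Goal pair is what a downstream world-model planner would consume.
obs = Observation(camera_frames=[[0, 0, 0]])              # stand-in for pixel data
goal = interpret_command(obs, "Bring the guest their drink")
print(goal)
```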
“In physical AI, this model would have to capture the 3D visual geometry and physical laws — gravity, friction, collisions, etc. — involved in interacting with all types of objects in arbitrary environments,” said Kenny Siebert, AI research engineer at Standard Bots.
World models then help robots understand and evaluate the consequences of the actions they may take. Some world models generate short, video-like simulations of possible outcomes at each step, which helps robots choose the best action.
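Conceptually, that selection step resembles model-predictive planning: sample candidate actions, roll each one forward through the world model, score the imagined outcomes, and keep the best. The sketch below is a minimal illustration of that loop under those assumptions; the `WorldModel` class, `score_outcome` function, and action names are hypothetical placeholders, not any vendor’s API.

```python
# Minimal sketch of world-model planning: simulate candidate actions,
# score the imagined outcomes, and pick the best one.

class WorldModel:
    def imagine(self, state: dict, action: str, steps: int = 8) -> list:
        """Return an imagined rollout. A real model would predict future
        frames or latent states; this is a placeholder."""
        return [{"t": t, "state": dict(state), "action": action} for t in range(steps)]

def score_outcome(rollout: list) -> float:
    """Hypothetical scoring: penalize any imagined step that ends in a collision."""
    return -sum(1.0 for step in rollout if step["state"].get("collision"))

def choose_action(model: WorldModel, state: dict, candidates: list) -> str:
    """Run a 'thought experiment' per candidate action and keep the best one."""
    scored = [(score_outcome(model.imagine(state, a)), a) for a in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

model = WorldModel()
state = {"position": (0, 0), "collision": False}
print(choose_action(model, state, ["move_forward", "turn_left", "turn_right", "stop"]))
```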
“I think the difference with world models is [that] it’s not enough just to predict words on a sign or the pixels that might happen next, but it has to actually understand what might happen,” Galda said. For example, a robot could read signs such as “stop” or “dangerous zone” on a factory floor or the road and understand it has to be extra cautious moving forward.
“If you’re building a car or a robot or something that has to take AI into the physical space amongst people, you need to be extremely sure it’s safe and understand what it will do,” Galda said.
World models are one of several tools that will be used to deploy robots in the real world, and they will continue to improve, Siebert said.
But world models suffer from problems similar to those that affect the likes of ChatGPT and video generators: hallucinations and degradation. Moving hallucinations into the physical world could cause harm, so researchers are working to solve those issues.
A new general world model called PAN helps robots run “thought experiments,” testing many action sequences in a safe, controlled simulation. PAN builds an internal memory and maintains coherence in how scenes should change.
Robotics isn’t the only game in town for PAN, which was created by researchers at the Mohamed bin Zayed University of Artificial Intelligence. It could also be used in autonomous driving, safety simulations, and long real-world simulations that “predict, and reason about how the world evolves in response to actions,” the researchers said in the paper detailing PAN.
PAN takes a cue from human behavior: it first imagines, then visualizes, then plans actions, working out the cause and effect of an action before seeing how it looks in a video. Its typical inputs are visual frames and natural-language instructions.
PAN then generates longer and more coherent video simulations and is designed so the simulated scenes stay consistent over time rather than drifting into unrealistic outcomes.
In contrast, current video-generation models don’t track cause and effect or hold their structure steady over time, so they lose consistency over long simulated sequences.
“Existing video generation models typically produce single, non-interactive video segments,” the researchers said in the paper.
Video-generation models in that category include Google’s Veo 3 and OpenAI’s Sora, which OpenAI views as a “world simulator.”
“In contrast, PAN shows superior capacity to precisely simulate action-driven world evolution” compared to other video generators and open-source world models, the researchers said.
The key PAN breakthroughs include a Generative Latent Prediction (GLP) capability, which allows the model to imagine and visualize future states. Structural upgrades — which researchers call Causal Swin-DPM — keep videos coherent over time, while reducing noise and uncertainty.
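A rough way to picture the generate-in-latent-space idea: encode the current frames into a compact latent state, predict the next latent conditioned on an action or instruction, then decode that latent back into frames, keeping a sliding window of recent latents so the scene stays coherent. The sketch below is a conceptual paraphrase of GLP, not PAN’s published architecture; the encoder, predictor, and decoder are untrained placeholders, and the deque only loosely echoes the rolling-context idea behind Causal Swin-DPM, which in PAN is a diffusion-based mechanism not shown here.

```python
import collections
import torch
import torch.nn as nn

LATENT_DIM, FRAME_DIM, ACTION_DIM, WINDOW = 64, 3 * 32 * 32, 8, 4

class LatentWorldModel(nn.Module):
    """Untrained placeholder for the encode -> predict -> decode loop."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(FRAME_DIM, LATENT_DIM)                    # frames -> latent state
        self.predictor = nn.GRUCell(LATENT_DIM + ACTION_DIM, LATENT_DIM)   # latent + action -> next latent
        self.decoder = nn.Linear(LATENT_DIM, FRAME_DIM)                    # latent -> imagined frames

    def step(self, latent, action):
        """Predict the next latent given the current latent and an action embedding."""
        return self.predictor(torch.cat([latent, action], dim=-1), latent)

model = LatentWorldModel()
memory = collections.deque(maxlen=WINDOW)   # sliding window of recent latents for coherence

frame = torch.randn(1, FRAME_DIM)           # stand-in for an observed camera frame
latent = model.encoder(frame)
memory.append(latent)

for t in range(6):                          # imagine six steps forward
    action = torch.randn(1, ACTION_DIM)     # stand-in for an action/instruction embedding
    latent = model.step(latent, action)
    memory.append(latent)                   # older latents fall out of the window
    imagined_frame = model.decoder(latent)  # decode the imagined state back to pixels
    print(t, imagined_frame.shape)
```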
World models will only get better over time, Standard Bots’ Siebert said. “We see several potential use cases including evaluation in simulation, long-tail training data generation, and distillation to smaller hardware-constrained models. As world models progress, we expect the list of use cases to grow beyond what we can foresee today.”