Wednesday, June 10, 2026

Robotics will not have a clean Llama moment



On a bench not long ago, a small quadruped turned cleanly to the right. The mirrored left turn dragged and lost contact. The legs had landed in different servo regions and loaded the body differently, so the same command did two different things. The code was symmetric; the contact mechanics were not.

The Llama analogy works until the model has to move hardware. The original Llama paper gave software teams a reusable starting point. A team that did not pay for the training run could adapt the model, shrink it, and serve it through a familiar software path. The weights were useful because other teams already had the tools to turn them into running software.

Robot model weights can spread the same way, but a robot policy does not travel on its own. A local control stack converts policy output into motion on the installed robot through its controller, within the cell’s safety envelope. Model access will expand what robots attempt. The advantage will come from turning that behavior into supported work on installed systems, with a fault record a technician can use months later.

Robot policies are getting easier to download

Google DeepMind’s Open X-Embodiment project pooled robot data across institutions and robot bodies. Its RT-X results found that training across embodiments improves transfer in some settings, so a system does not have to learn only from its own narrow dataset.

DeepMind’s newer releases split the work across the robot stack. Gemini Robotics 1.5 is a vision-language-action model that takes visual information and instructions and turns them into motor commands. Gemini Robotics-ER 1.6 sits higher in the stack, handling spatial reasoning and task planning while supporting progress checks and tool calls.

NVIDIA has pushed distribution in the same direction, with GR00T releases and Isaac models moving into developer channels such as Hugging Face’s LeRobot. On distribution alone, the Llama analogy holds: capable robot policies are becoming easier for developers to obtain.

Against Crunchbase’s count of nearly $14 billion in robotics venture funding in 2025, the individual rounds stack up fast. Skild AI raised $1.4 billion for an omnibodied robotics model, while Physical Intelligence is reportedly in talks for another $1 billion at a valuation above $11 billion. Yann LeCun’s Advanced Machine Intelligence raised $1.03 billion around a different approach to world modeling, and Wayve closed a $1.2 billion Series D for autonomous driving. Those rounds assume robot intelligence becomes reusable before the industry has proved that the release path works across systems.

OpenVLA is a 7B-parameter open vision-language-action model trained on 970,000 robot manipulation episodes from Open X-Embodiment. Physical Intelligence works the action side through FAST, which converts robot action chunks into tokens. Its openpi repository shows the work that remains once a model is available. A team runs inference, tunes on its own robot data, and then validates the result on the target machine. Even that path carries a hardware bill. The repository lists more than 8GB of GPU memory for inference, 22.5GB for LoRA fine-tuning, and 70GB for full fine-tuning.




Where transfer actually breaks

A robot cell can pass acceptance and run cleanly through most cycles. The harder problem lives in the remaining misses, where small physical changes create a different task than the one the policy saw during tuning.

At customer sites, embodiment transfer usually breaks through ordinary changes. Camera geometry and end-effector compliance change after sign-off, fixture datums move with the customer process, and contamination builds over weeks of shifts before recovery behavior becomes unreliable. Site drift is the mismatch between the robot that passed acceptance and the robot operating within the customer’s process.

Domain randomization trains across many simulated variations, but the floor keeps introducing new ones every day. A command can preserve the same high-level intent and produce a different result when contact moves through another load path. One side of a mechanism can push through the frame differently, so a motion that works in one direction can create drag, rocking, or lost contact in the other. When that happens, smoothing the command will not fix behavior whose real failure is timing.

Embodiment-aware models reduce one source of the issue by representing a robot’s hardware through kinematics, joint attributes, prompts, or tokens. A policy that accounts for joint limits and actuator dynamics starts from a better description of the system. Some unknowns become measured parameters, but the measurement starts aging as soon as the robot goes into production. Friction changes, tooling wears, and loads vary by process. Recovery motions can also create states that the original calibration did not account for. Better hardware models make a rollout more diagnosable without making it generic.

On a real line, the first check is often mundane. The team compares the last good cycle with the faulted one before blaming the policy. The change shows up in pose, in current draw, or in the fixture datum around the task. The model may be producing exactly what it produced during acceptance testing, while the local task has moved away from the data that trained it.
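That first check can be sketched in a few lines of Python. This is a minimal illustration, not a description of any vendor's tooling: the CycleSnapshot fields, the tolerance values, and the sample numbers are all hypothetical, chosen only to show a last-good cycle being compared with a faulted one across pose, current draw, and fixture datum.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CycleSnapshot:
    """Hypothetical per-cycle summary; field names are illustrative."""
    tool_pose_mm: Vec3       # measured tool position at contact
    peak_current_a: float    # peak motor current during the cycle
    fixture_datum_mm: Vec3   # measured fixture reference point


def max_abs_delta(a: Vec3, b: Vec3) -> float:
    """Largest per-axis difference between two points."""
    return max(abs(x - y) for x, y in zip(a, b))


def compare_cycles(good: CycleSnapshot, fault: CycleSnapshot,
                   pose_tol_mm: float = 0.5,
                   current_tol_a: float = 0.3,
                   datum_tol_mm: float = 0.2) -> List[str]:
    """Name the quantities that moved beyond tolerance (assumed tolerances)."""
    flags = []
    if max_abs_delta(good.tool_pose_mm, fault.tool_pose_mm) > pose_tol_mm:
        flags.append("pose")
    if abs(good.peak_current_a - fault.peak_current_a) > current_tol_a:
        flags.append("current")
    if max_abs_delta(good.fixture_datum_mm, fault.fixture_datum_mm) > datum_tol_mm:
        flags.append("fixture_datum")
    return flags


# Made-up numbers: the fixture moved 0.6 mm, everything else is in tolerance.
good = CycleSnapshot((100.0, 50.0, 20.0), 2.1, (0.0, 0.0, 0.0))
fault = CycleSnapshot((100.1, 50.0, 20.1), 2.2, (0.6, 0.0, 0.0))
flags = compare_cycles(good, fault)
print(flags)  # ['fixture_datum']
```

In this made-up case the policy is not the first suspect, because only the fixture datum has moved.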

The useful data comes after the miss

Robot data carries a different burden from language data. Bessemer Venture Partners has estimated the total global robot manipulation data at roughly 300,000 hours, compared with roughly 1 billion hours of internet video and 300 trillion tokens of text. Language models could draw on the internet. Robots have to build most of their corpus from deployed machines.

NVIDIA is trying to widen that corpus from another direction. It says GR00T N1.7 was pre-trained on more than 20,000 hours of human egocentric video rather than robot teleoperation, betting that first-person human footage carries useful manipulation priors.

An equally important part of the dataset is the failure context, which includes the controller state, recovery action, and physical cause. A camera may show that the robot missed, but it may not explain why the gripper lost the part or why the safety stop fired. It may also miss which recovery move got the cell running again. Logs fail another way when they separate from the physical event. A log can show progress against a narrow control metric while the robot is visibly dragging into the task. It can accumulate the number software wants while producing behavior that would be unacceptable to a customer. Logs earn their keep only when the team can match them to what happened in the cell.
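One way to picture that failure context is as a record that refuses to look complete until the physical side is filled in. The sketch below is an assumption-heavy illustration: the FaultRecord fields mirror the three items named above, while the state names, the timestamps, and the two-second matching window are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultRecord:
    """Hypothetical fault entry; the fields follow the context named in the text."""
    timestamp_s: float
    controller_state: Optional[str] = None
    recovery_action: Optional[str] = None
    physical_cause: Optional[str] = None

    def missing_context(self) -> List[str]:
        """List the context fields nobody has filled in yet."""
        names = ("controller_state", "recovery_action", "physical_cause")
        return [name for name in names if getattr(self, name) is None]


def logs_near_event(log_times_s: List[float], event_time_s: float,
                    window_s: float = 2.0) -> List[float]:
    """Keep only log entries close enough in time to the physical event."""
    return [t for t in log_times_s if abs(t - event_time_s) <= window_s]


record = FaultRecord(timestamp_s=1042.5,
                     controller_state="protective_stop",
                     recovery_action="retract_and_regrasp")
missing = record.missing_context()
nearby = logs_near_event([1000.0, 1041.0, 1043.9, 1050.0], record.timestamp_s)
print(missing)  # ['physical_cause']
print(nearby)   # [1041.0, 1043.9]
```

The record above knows what the controller did and how the cell recovered, but it still cannot say why the fault happened, which is the part a camera or a log alone tends to leave out.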

Teleoperation and simulation can generate data before a system reaches the floor, but the best record comes from instrumented robots running customer processes with enough context to diagnose failures afterward. A company that turns fault history into safer recovery motions learns more from each installation than one that stores clean success videos. The technician has to separate a policy failure from a slipped tool, a shifted fixture, or a recovery path that worsened the next cycle.

Simulated futures meet contact

World models are intended to test decisions before hardware is at risk. World Labs’ Marble builds 3D worlds from prompts or visual inputs and exports them into formats for simulation and design review. In autonomous driving, Wayve’s GAIA-3 follows a similar path as a 15-billion-parameter world model for realistic, controllable offline evaluation of self-driving AI.

World Action Models pull world modeling closer to control. DreamZero defines the architecture as a model that predicts future world states and actions from video. NVIDIA previewed GR00T N2, which builds on that research, claiming it succeeds at new tasks in new environments more than twice as often as leading VLA models and ranks first on the MolmoSpaces and RoboArena benchmarks. NVIDIA says N2 is expected later this year.

The generated action has to pass through the controller before it becomes motion. Driving is constrained by road geometry and vehicle dynamics. Manipulation introduces direct contact, and contact introduces failure modes that are harder to capture cleanly in simulation. Force closure can be wrong, seals wear, and calibration can drift slowly enough that the line keeps running until it stops repeating.

Simulation becomes more useful when friction, actuator response, center of mass, and rate limits are measured rather than assumed. Even then, the team keeps the simulator calibrated against the hardware and watches for the point at which the real system has drifted beyond the model. A measured simulator shrinks the search space before anyone touches hardware, though it cannot substitute for the check against the system performing real tasks.
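Watching for that drift can be as plain as comparing the simulator's parameters with fresh measurements. The following Python sketch assumes a 10 percent relative tolerance and uses invented parameter names and values; a real cell would pick its own parameters and thresholds.

```python
from typing import Dict


def drifted_parameters(sim_params: Dict[str, float],
                       measured_params: Dict[str, float],
                       rel_tol: float = 0.10) -> Dict[str, float]:
    """Return parameters whose measured value differs from the simulator's
    value by more than rel_tol (relative error, rounded to 3 decimals)."""
    drifted = {}
    for name, sim_value in sim_params.items():
        rel_err = abs(measured_params[name] - sim_value) / abs(sim_value)
        if rel_err > rel_tol:
            drifted[name] = round(rel_err, 3)
    return drifted


# Invented values for illustration only.
sim = {"friction_coeff": 0.40, "actuator_delay_ms": 8.0, "com_offset_mm": 12.0}
measured = {"friction_coeff": 0.52, "actuator_delay_ms": 8.4, "com_offset_mm": 12.5}
drift = drifted_parameters(sim, measured)
print(drift)  # {'friction_coeff': 0.3}
```

Here friction has moved 30 percent from the simulated value while the other two parameters are still within tolerance, so the simulator needs recalibrating before its predictions about contact are trusted again.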

The controller is where the claim gets tested

Model output reaches the world through the controller. Agility Robotics has described a whole-body control model for Digit. The model is a small LSTM with fewer than 1 million parameters, trained in NVIDIA Isaac Sim on decades of simulated time compressed into several days of training.

Many VLA policies operate at the cadence of task-level actions or action chunks. A common industrial servo loop closes around 1 kHz. A model output becomes useful only after the controller turns it into executable motion within the robot’s limits. The motion architecture can decide the outcome before the controller rejects a command. A path built from clean poses can carry pauses or bad contact timing. In cyclic motion, continuous phase can do more for the result than pose polish, and a recovery move that looks conservative in command space can arrive late at the contact point.
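The gap between policy cadence and servo cadence can be made concrete with a small Python sketch. The 10 Hz policy rate, the single-joint waypoints, and the 2 rad/s rate limit are assumptions picked for illustration; the interpolation and clamp are a simplified stand-in for what a real controller does.

```python
from typing import List


def upsample_chunk(waypoints: List[float], policy_hz: float = 10.0,
                   servo_hz: float = 1000.0) -> List[float]:
    """Linearly interpolate policy waypoints into servo-rate setpoints."""
    steps = int(round(servo_hz / policy_hz))
    setpoints = []
    for start, end in zip(waypoints, waypoints[1:]):
        for i in range(steps):
            setpoints.append(start + (end - start) * i / steps)
    setpoints.append(waypoints[-1])
    return setpoints


def rate_limit(setpoints: List[float], max_step: float) -> List[float]:
    """Clamp the change per servo tick; returns the positions actually commanded."""
    executed = [setpoints[0]]
    for target in setpoints[1:]:
        delta = max(-max_step, min(max_step, target - executed[-1]))
        executed.append(executed[-1] + delta)
    return executed


# Hypothetical action chunk: one joint, positions in radians, 10 Hz policy.
chunk = [0.0, 0.1, 0.5]
setpoints = upsample_chunk(chunk)
executed = rate_limit(setpoints, max_step=0.002)  # 2 rad/s at 1 kHz
lag = setpoints[-1] - executed[-1]
print(len(setpoints), round(lag, 3))  # 201 0.2
```

The second waypoint asks for twice the speed the joint is allowed, so the executed motion ends the chunk 0.2 rad behind a path that looked clean in pose space.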

Even a conservative post-processing step can open a new failure by shifting contact timing or delaying a recovery into a worse part of the robot’s local dynamics. Filtering can make the command look cleaner while placing the foot or tool late, much like the quadruped’s mirrored turn looked symmetric in code and dragged in contact. For industrial systems, the safety layer also defines what the learned layer may do when the model is uncertain or the machine state has changed.
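A first-order low-pass filter is enough to show the effect. The Python sketch below assumes a ramp command, a smoothing factor of 0.05, and a contact threshold at 90 percent of travel, all chosen for illustration rather than taken from any particular robot.

```python
from typing import List, Optional


def low_pass(samples: List[float], alpha: float) -> List[float]:
    """First-order exponential smoothing of a command stream."""
    out = [samples[0]]
    for x in samples[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out


def first_crossing(samples: List[float], threshold: float) -> Optional[int]:
    """Index of the first sample at or above threshold, or None."""
    for i, x in enumerate(samples):
        if x >= threshold:
            return i
    return None


# Hypothetical command: the foot descends from 0 to 1 over 100 ticks, then holds.
command = [min(1.0, i / 100) for i in range(400)]
filtered = low_pass(command, alpha=0.05)

raw_tick = first_crossing(command, 0.9)
filtered_tick = first_crossing(filtered, 0.9)
delay_ticks = filtered_tick - raw_tick
print(raw_tick, filtered_tick, delay_ticks)
```

The filtered command is smoother, and it also reaches the contact threshold later than the raw command does. In a cyclic gait that delay is the kind of timing shift that shows up as drag.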

Reuters reported that Skild AI’s model would run on Foxconn assembly lines in Houston, where NVIDIA Blackwell GPU server racks are built. NVIDIA has also described Skild’s work with ABB Robotics and Universal Robots. The test is broader than a startup model on a single line, because a shared intelligence layer must span established robot portfolios without sacrificing the service and safety discipline those platforms require.

The open question is not only which model performs best. It is whether reuse comes first from the action representation, the controller interface, the simulation path, or the diagnostic trail.

What cannot be downloaded

A real robotics Llama moment would need a team to take a model, adapt it to its action representation, validate it on its robot, and deploy useful behavior without support from the original model team. Driving has shown that fleet learning can run across many vehicles, with Waymo reporting safety analyses covering over 170 million fully autonomous miles. Manipulation has more friction because workcells and products vary more than road geometry, and customer data rights limit what anyone can collect and reuse.

The platform that wins distribution could try to pool fault data across every install, the way driving fleets pool miles. Robot manipulation data is harder to pool because each customer’s process is unique, contracts fence off the data, and a fault on one line often does not generalize to the next. The corpus stays fragmented even when the model does not.

The Llama moment in robotics will not be the day a policy becomes downloadable. It will be the day another team can take that policy, adapt it to its robot, release it into a customer process, and still know what failed weeks later when the line stops repeating.

Deepak Jayaraj.

About the author

Deepak Jayaraj is the vice president of hardware engineering and manufacturing at Four Growers, an agricultural robotics company based in Pittsburgh. With over 15 years of experience spanning space robotics, medical devices, and AgTech, he specializes in guiding robotics companies through the critical transition from prototype to scaled deployment and the economics of hardware business models.

The post Robotics will not have a clean Llama moment appeared first on The Robot Report.
