Robots get their ‘ChatGPT moment’

Nvidia unveiled a new platform at CES called Cosmos. It’s a world foundation model (WFM) development platform designed to advance and accelerate Physical AI for robots and self-driving vehicles (which are also, in fact, robots).

Understanding digital twins and physical AI

I’ve written before about Physical AI in general and Nvidia’s initiatives in that space specifically. 

The “Physical AI” concept involves creating complex virtual environments that simulate real-world physics, where digital replicas of robots and systems can learn and optimize their performance. 

For factory robots, as an example, an Omniverse customer would create a “digital twin” of the factory in a virtual reality space. Every detail of the factory floor would be replicated, with the distances between objects exactly the same as in the real, physical factory. Internet of Things (IoT) sensors in the real factory feed data into the twin, keeping it in an identical state.

Crucially, the virtual twin in Omniverse is programmatically endowed with physics: gravity, inertia, friction, and other physical qualities that are applied to anything happening in the twin. Companies can design, simulate, operate, and maintain their factories virtually through twins. And they can train robots and robotic systems in Omniverse.

The newly announced Cosmos works in conjunction with Omniverse, and dramatically amplifies its robot-training capabilities, through the creation and use of World Foundation Models (WFMs).

What in the world are ‘World Foundation Models’?

If you’re unfamiliar with the phrase “World Foundation Models,” that makes sense, because it’s pretty new and most likely coined by Nvidia. It conjoins the existing (but also recent) concepts of “world models” (AI systems that create internal representations of their environment to simulate and predict complex scenarios) and “foundation models” (AI systems trained on vast datasets that can be adapted for a wide range of tasks).  

According to Nvidia, WFMs are an easy way to generate massive amounts of photoreal, physics-based artificial data for training existing models or building custom models. Robot developers can add their own data, such as videos captured in their own factory, then let Cosmos multiply and expand the basic scenario into thousands more, giving the robot’s software the ability to choose the correct or best movements for the task at hand.

The Cosmos platform includes generative WFMs, advanced tokenizers, guardrails, and an accelerated video processing pipeline. Developers can use Nvidia’s Omniverse to create geospatially accurate scenarios that account for the laws of physics. Then, they can output these scenarios into Cosmos, creating photorealistic videos that provide the data for robotic reinforcement learning feedback. 

Again, a great way to understand this is to compare it with the LLM-based ChatGPT. 

I recently wrote about how Google’s LLM-based tool, NotebookLM, is fantastic for learning something complex. At the time, I described the following use case: 

“Rather than reading advanced material, it’s far faster and more engaging to let NotebookLM’s ‘Audio Overviews’ feature create a life-like podcast for you to listen to. It will create a ‘study guide,’ a FAQ, a ‘briefing guide,’ and a timeline, enabling you to quickly look at dense content from multiple angles, perspectives, and levels. You can start by asking the chatbot to explain it to you like you’re a sixth-grader, then a high school senior, then an undergrad, and on up until you’ve mastered the material.”

In this scenario, you’re “training” your brain by taking an existing data set and telling the chatbot to give you that same data sliced, diced, and re-formatted in eight or more ways. 

This is also how WFMs work, in outline. The developer takes existing training data and feeds it into Cosmos, which creates more training scenarios that are as usable as the original set. They can turn 30 scenarios into 30,000, which the robot uses as if actual trial-and-error learning had taken place. 
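To make the “30 scenarios into 30,000” arithmetic concrete, here is a minimal sketch in Python. It is not Cosmos or any Nvidia API; the Scenario fields, the expand function, and the perturbation ranges are all hypothetical, chosen only to illustrate the idea of multiplying a small seed set into many varied training cases.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """A hypothetical, simplified training scenario (not a Cosmos type)."""
    name: str
    friction: float     # surface friction coefficient
    payload_kg: float   # mass of the object the robot handles
    light_level: float  # 0.0 (dark) to 1.0 (bright)


def expand(seeds, variants_per_seed, rng):
    """Create perturbed copies of each seed scenario."""
    variants = []
    for seed in seeds:
        for i in range(variants_per_seed):
            variants.append(replace(
                seed,
                name=f"{seed.name}-v{i}",
                # Vary physical parameters around the recorded values.
                friction=seed.friction * rng.uniform(0.8, 1.2),
                payload_kg=seed.payload_kg * rng.uniform(0.5, 1.5),
                light_level=rng.uniform(0.0, 1.0),
            ))
    return variants


rng = random.Random(0)  # fixed seed so runs are repeatable
seeds = [Scenario(f"seed{i}", 0.6, 2.0, 0.8) for i in range(30)]
synthetic = expand(seeds, 1000, rng)
print(len(synthetic))  # 30 seeds x 1000 variants each = 30000
```

A real WFM generates photorealistic, physics-aware video rather than tweaking a few numbers, but the multiplication principle is the same: a small amount of captured data fans out into a much larger training set.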

Cosmos’s output looks like real-world training data, but because it is generated rather than recorded, it can rapidly train robots in thousands of scenarios.

Robotics’ ChatGPT moment

Nvidia implies that Cosmos will usher in a “ChatGPT moment” for robotics. The company means that although the basic technology of neural networks had existed for many years, it was Google’s Transformer model that enabled the radically accelerated training that led to LLM chatbots like ChatGPT, and Cosmos is positioned to play a similar accelerating role for robots.

In the more familiar world of LLMs, we’ve come to understand how the size of the training data set and the speed of training shape a model’s resulting performance and accuracy.

Elon Musk pointed out recently that AI companies have exhausted human-generated data for training AI models. “We’ve now exhausted basically the cumulative sum of human knowledge…in AI training,” he said. 

Data for training robots is also limited, but for a different reason. Gathering training data in the real, physical world is simply slow and expensive. Unlike human-generated text, which has already been produced at scale over centuries, robot-training data has to be generated from scratch.

Likewise, robots and self-driving cars can essentially “learn” how to do their jobs and navigate complex and unfamiliar terrain. Cosmos (working with Omniverse) should dramatically increase the amount of training that can take place in a much shorter time frame.

Driving safety

The idea of testing autonomous vehicles with massive sets of physics-aware data is a vast improvement over how self-driving cars and trucks have historically been trained — which is that they drive around in the real world with a safety driver. 

Driving in the real world with a person as backup is time-consuming, expensive, and sometimes dangerous — especially when you consider that autonomous vehicles need to be trained to respond to dangerous situations.

Using Cosmos to train autonomous vehicles would involve the rapid creation of huge numbers of simulated scenarios. For example, imagine the simulation of every kind of animal that could conceivably cross a road (bears, deer, dogs, cats, lizards, etc.) in tens of thousands of different weather and lighting conditions. By the end of all this training, the car’s digital twin in Omniverse would be able to recognize and navigate scenarios of animals on the road regardless of the animal and the weather or time of day. That learning would then be transferred to thousands of real cars, which would also know how to navigate those situations (with no animals harmed).
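The way these combinations pile up can be sketched in a few lines of Python. This is an illustration only, not how Cosmos or Omniverse enumerate scenarios; the scenario_grid function and the short weather and lighting lists are assumptions made up for the example.

```python
from itertools import product

# Hypothetical, deliberately short lists; a real simulation would use far
# more values (and continuous ranges) for each dimension.
animals = ["bear", "deer", "dog", "cat", "lizard"]
weather = ["clear", "rain", "fog", "snow"]
lighting = ["dawn", "noon", "dusk", "night"]


def scenario_grid(animals, weather, lighting):
    """Yield one scenario description per animal/weather/lighting combination."""
    for animal, sky, light in product(animals, weather, lighting):
        yield {"animal": animal, "weather": sky, "lighting": light}


scenarios = list(scenario_grid(animals, weather, lighting))
print(len(scenarios))  # 5 animals x 4 weather x 4 lighting = 80
print(scenarios[0])
```

Even with these tiny lists, three dimensions already produce 80 distinct cases. Adding more animals, finer-grained conditions, and further dimensions such as road type or vehicle speed is what pushes the count into the tens of thousands.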

If Nvidia is right, and we have arrived at a “ChatGPT moment” for robotics, then the pace of robotics advances should start accelerating, driving major efficiencies and mainstreaming autonomous vehicles on public roads globally for many companies (not just Waymo in a few cities). 

One fascinating aspect of the new generative AI world in which we live is that predictions are futile. Nobody knows how all this will develop. 

And this appears to be true of predictions about how long it will take for everything to become extremely robotic. It’s probably all going to happen much faster than anyone thinks.
