# A Drive in the Office

## Useful Robots

It’s 2024 and the only useful robots you can buy are still coffee machines and Roombas (also a third secret thing). Hardware is becoming cheap and ubiquitous, while capabilities are still largely limited by software. The comma body is a simple robotics devkit; can we make it do something useful? Our office security system cost ~$17k, and a body is $999. Can we make it watch over our office?

A few months back I flew from Paris to warm San Diego to attend COMMA_HACK 4. Participants had to get the body to navigate around the office.

My team chose a sim2real approach. Simulation allows for cheap and fast data collection, as well as in-the-loop testing. However, it requires prior knowledge about the robot and the world, and hand-coding every possible scenario. Policies learned in simulation are also notoriously hard to make robust to real-world noise. See how our sim doesn’t capture the robot wobbling on uneven ground in the video above?

We want a general algorithm that can learn from real data on any robot, one that scales with compute.

## Learn Everything

At comma we’re learning a driving simulator and a controls simulator. Let’s use the same techniques and learn a body/office simulator.

We drove the body around the office using bodyjim and collected about three hours’ worth of (image, action) pairs, some of which were released in this dataset. We trained a ~90M-parameter vanilla GPT to autoregressively predict the next (VQ-GAN tokenized) video frames, wheel speeds, and actions. Since we want the model to plan ahead, we added an MLP head to predict a plan of future actions. This is partly inspired by the ALOHA [1] architecture.
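The training sequence interleaves each timestep’s observation tokens with its action tokens, so the GPT predicts all three streams jointly. A minimal sketch of that layout, with hypothetical token counts (the real tokenizer and quantization may differ):

```python
# Hypothetical per-timestep token budget; the actual VQ-GAN grid size and
# wheel-speed/action quantization are assumptions, not the real config.
FRAME_TOKENS = 256   # VQ-GAN tokens per camera frame (assumed)
SPEED_TOKENS = 2     # left/right wheel speeds, quantized (assumed)
ACTION_TOKENS = 2    # left/right wheel commands, quantized (assumed)

def build_sequence(frames, speeds, actions):
    """Interleave per-timestep tokens into one flat GPT training sequence."""
    seq = []
    for f, s, a in zip(frames, speeds, actions):
        assert len(f) == FRAME_TOKENS
        assert len(s) == SPEED_TOKENS
        assert len(a) == ACTION_TOKENS
        seq.extend(f)   # observation: image tokens
        seq.extend(s)   # observation: wheel speed tokens
        seq.extend(a)   # action tokens
    return seq
```

With this layout, standard next-token prediction already forces the model to predict the next frame given past frames and actions.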

Looking at these rollouts, the model has learned the existence of walls and objects, as well as basic dynamics. It has also learned subtler things, such as the robot rocking forward when stopping. Important real-world noise has been learned.

## Test & Deploy

We can use our GPT world model both as a simulator and a policy. Give it a sequence of $k+1$ past observations $o$ and actions $a$ and get the next observation; give it the past observations and actions plus the current observation and get the desired action; or let it imagine a whole rollout (actions + observations):

$$\textit{simulator mode: } (o_{t-k}, a_{t-k}), \ldots, (o_{t}, a_{t}) \rightarrow o_{t+1}$$

$$\textit{policy mode: } (o_{t-k}, a_{t-k}), \ldots, (o_{t},\ \cdot\ ) \rightarrow a_{t}$$

$$\textit{rollout mode: } (o_{t-k}, a_{t-k}), \ldots, (o_{t}, a_{t}) \rightarrow (o_{t+1}, a_{t+1})$$
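The three modes are just different query patterns against the same next-token model. A sketch, assuming a hypothetical `model.sample(context)` that returns the next chunk (observation or action) given the interleaved context:

```python
def simulator_step(model, history, o_t, a_t):
    # (o_{t-k}, a_{t-k}), ..., (o_t, a_t) -> o_{t+1}
    return model.sample(history + [o_t, a_t])      # next observation

def policy_step(model, history, o_t):
    # (o_{t-k}, a_{t-k}), ..., (o_t, _) -> a_t
    return model.sample(history + [o_t])           # next action

def rollout_step(model, history, o_t, a_t):
    # (o_{t-k}, a_{t-k}), ..., (o_t, a_t) -> (o_{t+1}, a_{t+1})
    ctx = history + [o_t, a_t]
    o_next = model.sample(ctx)
    a_next = model.sample(ctx + [o_next])
    return o_next, a_next
```

Rollout mode is just simulator and policy mode alternated, feeding each prediction back into the context.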

Also note that these observations can be multimodal; here they are images and wheel speed measurements.

Using the model as a policy, we assume we want to clone the behaviour demonstrated in the data. To evaluate how it would perform in the real world, we collect test cases where the robot should perform known actions (e.g. avoiding a wall or going straight). The model then imagines rollouts, and we compare the imagined actions to the known good actions.
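A minimal sketch of that open-loop comparison, scoring an imagined action plan against the known-good one with a mean squared error (the metric choice here is an assumption, not necessarily the one used internally):

```python
def plan_error(imagined, reference):
    """Mean squared error between two equal-length action plans.

    Each plan is a sequence of (left, right) wheel commands.
    """
    assert len(imagined) == len(reference)
    total = 0.0
    for a, b in zip(imagined, reference):
        total += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return total / len(imagined)
```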

The action plan is fed to a simple differential drive model to compute a geometric trajectory. This lets us visualize where the robot wants to go without baking prior knowledge into the model.
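The differential drive step is standard kinematics: average the wheel speeds for forward velocity, difference them for yaw rate, and integrate. A minimal sketch, with an assumed wheel base and timestep rather than the body’s actual geometry:

```python
import math

WHEEL_BASE = 0.4  # meters between wheels (assumed, not the body's spec)
DT = 0.2          # seconds per plan step, i.e. 5 Hz (assumed)

def rollout_trajectory(plan, x=0.0, y=0.0, theta=0.0):
    """Integrate (v_left, v_right) wheel speeds into a list of (x, y) points."""
    points = [(x, y)]
    for v_l, v_r in plan:
        v = (v_l + v_r) / 2.0          # forward velocity
        w = (v_r - v_l) / WHEEL_BASE   # yaw rate
        x += v * math.cos(theta) * DT
        y += v * math.sin(theta) * DT
        theta += w * DT
        points.append((x, y))
    return points
```

Equal wheel speeds trace a straight line; opposite speeds spin the robot in place, leaving the position unchanged.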

Using the t0 action of the plan does not work well to control the robot. The best action to execute likely lies somewhere further along the trajectory, compensating for data and controls lag. We simulated more rollouts and empirically found the t+2 action to work best.

To deploy, we run this model on an RTX 3080 at 5 Hz and use bodyjim to teleop the body with low latency.
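Putting it together, the deployment loop queries the model in policy mode and executes the t+2 action from the predicted plan. A hedged sketch: the env follows bodyjim’s gym-style `reset`/`step` interface, but `predict_plan` and the constructor details are hypothetical names, not the actual API:

```python
import time

CONTROL_HZ = 5
PLAN_INDEX = 2  # execute the t+2 action, not t0, to compensate for lag

def control_loop(env, model, history, steps=100):
    """Run the model as a policy against a gym-style body environment."""
    obs, _ = env.reset()
    for _ in range(steps):
        t0 = time.monotonic()
        plan = model.predict_plan(history, obs)  # hypothetical policy call
        action = plan[PLAN_INDEX]
        obs, _, done, truncated, _ = env.step(action)
        history.append((obs, action))
        if done or truncated:
            break
        # hold the loop at ~5 Hz
        time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - t0)))
```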

We release the model (GPT+tokenizer) as part of bodyjim. Try it out!

## We’re hiring!

Like what you see here? We’re hiring! You might also like our world-modelling challenge.

Armand du Parc Locmaria,
Research Engineer @comma.ai

[1] Tony Z. Zhao, Vikash Kumar, Sergey Levine, Chelsea Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” RSS 2023.
