Vision-language-action (VLA) model

A vision-language-action (VLA) model takes camera images and a natural-language instruction as input and outputs robot actions, extending vision-language models from text output to physical control. VLAs are trained on large collections of demonstration episodes, so their performance depends heavily on the volume, diversity and annotation quality of that data.
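To make the input/output shape concrete, here is a minimal sketch of one demonstration step and of one way a continuous action could be turned into discrete tokens that a language-model head can predict. The `Step` class, the field names, the 7-value action layout, the [-1, 1] range and the 256-bin scheme are all assumptions chosen for illustration, not a description of any particular model's format.

```python
from dataclasses import dataclass
from typing import List

N_BINS = 256  # assumed number of bins per action dimension


@dataclass
class Step:
    """One hypothetical timestep of a demonstration episode."""
    image: List[List[int]]  # placeholder for a camera frame
    instruction: str        # natural-language task description
    action: List[float]     # assumed: normalized motion deltas plus gripper


def discretize(action: List[float], low: float = -1.0, high: float = 1.0,
               n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action value to the nearest of n_bins bin indices."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clip to the assumed range
        tokens.append(int((a - low) / (high - low) * (n_bins - 1) + 0.5))
    return tokens


def undiscretize(tokens: List[int], low: float = -1.0, high: float = 1.0,
                 n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to continuous values in [low, high]."""
    return [low + t / (n_bins - 1) * (high - low) for t in tokens]


step = Step(
    image=[[0, 0], [0, 0]],
    instruction="pick up the red block",
    action=[0.3, -0.1, 0.0, 0.0, 0.0, 0.0, 1.0],
)
tokens = discretize(step.action)
recovered = undiscretize(tokens)
```

Under these assumptions the round trip loses at most half a bin width per dimension, which is the kind of trade-off a token-based action head accepts in exchange for reusing a text decoder.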

Notable examples include Google DeepMind’s RT-2 and the open-source OpenVLA. Human egocentric demonstrations are increasingly used to pre-train or augment VLA policies because they are cheaper to collect and more diverse than robot teleoperation data.

VLA models and the data they need
