Embodied AI with Two Arms: Zero-shot Learning, Safety and Modularity
IROS · Apr 4, 2024 · Best RoboCup Paper
We present an embodied AI system that receives open-ended natural language
instructions from a human and controls two arms to collaboratively accomplish
potentially long-horizon tasks over a large workspace. Our system is modular:
it deploys state-of-the-art Large Language Models for task planning,
Vision-Language Models for semantic perception, and Point Cloud Transformers
for grasping. With semantic and physical safety in mind, these
modules are interfaced with a real-time trajectory optimizer and a compliant
tracking controller to enable human-robot proximity. We demonstrate performance
on three tasks: bi-arm sorting, bottle opening, and trash disposal. All tasks
are performed zero-shot: none of the models were trained on real-world data
from this bi-arm robot, its scenes, or its workspace. Composing
both learning- and non-learning-based components in a modular fashion with
interpretable inputs and outputs allows the user to easily debug points of
failure and fragility. One may also swap modules in place to improve the
robustness of the overall platform, for instance with imitation-learned
policies. Please see https://sites.google.com/corp/view/safe-robots .
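
To make the modular composition concrete, below is a minimal Python sketch of
how such a pipeline might be wired together. This is not the paper's code:
every class, method, and label name here is a hypothetical assumption, chosen
only to illustrate that each module boundary carries interpretable data (text
plans, labeled detections, grasp poses), which is what makes individual modules
easy to debug or swap.

    # Hypothetical sketch (not the paper's implementation) of a modular
    # bi-arm pipeline: LLM planner -> VLM perception -> point-cloud grasping
    # -> safe trajectory execution. All names are illustrative assumptions.
    from __future__ import annotations

    from dataclasses import dataclass
    from typing import Optional, Protocol


    @dataclass
    class Detection:
        """Interpretable perception output: a labeled 3D point in the workspace."""
        label: str                            # e.g. "bottle", from the VLM
        position: tuple[float, float, float]  # meters, robot base frame


    class TaskPlanner(Protocol):
        """LLM wrapper: natural-language instruction -> ordered skill strings."""
        def plan(self, instruction: str, scene: list[Detection]) -> list[str]: ...


    class Perceiver(Protocol):
        """VLM wrapper: open-vocabulary detection over the current scene."""
        def detect(self) -> list[Detection]: ...


    class GraspPlanner(Protocol):
        """Point-cloud model: target object -> grasp pose (x, y, z, qw, qx, qy, qz)."""
        def grasp(self, target: Detection) -> tuple[float, ...]: ...


    class SafeController(Protocol):
        """Trajectory optimizer plus compliant tracking controller, behind one call."""
        def execute(self, skill: str, grasp_pose: Optional[tuple[float, ...]]) -> bool: ...


    def run_task(instruction: str, planner: TaskPlanner, perceiver: Perceiver,
                 grasper: GraspPlanner, controller: SafeController) -> None:
        scene = perceiver.detect()
        for skill in planner.plan(instruction, scene):
            # Skills are plain strings (e.g. "pick(bottle)"), so a failed
            # episode can be traced to a specific planning, perception, or
            # control step rather than an opaque end-to-end policy.
            target = next((d for d in scene if d.label in skill), None)
            pose = grasper.grasp(target) if target is not None else None
            if not controller.execute(skill, pose):
                print(f"step failed: {skill}")  # point of failure is explicit
                break

Because each interface is plain, human-readable data, swapping a module, for
instance replacing the grasp planner with an imitation-learned policy, only
requires honoring the same interface, leaving the rest of the system untouched.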