Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

Author Names Omitted for Anonymous Review.

Overview of RoboDual. Our objective is to develop a synergistic dual-system framework that supplements the generalizability of a large-scale pre-trained generalist with the efficient, task-specific adaptation of a specialist. (a) The fast specialist policy achieves real-time, accurate control with the aid of the slow yet generalized output of the generalist, which is trained on large-scale data. (b) RoboDual shows significant improvements in performance and efficiency over either system used alone and surpasses the previous state of the art in the real-robot setting.


Real-world Visuomotor Control Tasks

Success Rates of Real-World Experiments

All real-world experiments are conducted with an AIRBOT Play robotic arm featuring a 7-DoF action space and a third-view RGB camera. We evaluate different policies on both single-instruction tasks ("Lift the Pod Lid", "Pour Shrimp into Bowl", and "Push the Block Left") and multi-instruction tasks ("Put [object] into Basket" and "Knock [object] Over").

Single-instruction Tasks

Push block left

Pour shrimp into bowl

Lift the pod lid

Multi-instruction Tasks

RoboDual shows strong instruction-following ability and excels at multi-instruction tasks.

Task 1: Put [object] into the basket

Put banana into basket

Put carrot into basket

Put eggplant into basket

Put banana into basket

Put carrot into basket

Put eggplant into basket

Task 2: Knock the [object] over

Knock the stuffed bear over

Knock the stuffed egg over

Knock the stuffed kitten over

Generalization Experiment Setting

Generalizability Evaluation in the Real World

Position Variation

*Regrasping* at position #1

*Regrasping* at position #2

Position #3

Position #4

Visual Distractor

Original

Changed

Unseen Background

Original

Changed

Novel Object

Put banana into plate

Put eggplant into plate

More Generalization Experiments

The following experiments are conducted on an NVIDIA RTX 4060 laptop GPU with only 8 GB of memory. We apply 4-bit quantization to OpenVLA and to our generalist model so that they fit on the device. The specialist of RoboDual can still run at full precision.
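To illustrate why quantization is needed on an 8 GB device, the following is a rough back-of-the-envelope sketch rather than a measurement. It estimates the memory taken by model weights alone at 16-bit and 4-bit precision. The parameter count of 7 billion is an assumption made for illustration (it is not stated on this page), and the estimate ignores activations, caches, and quantization overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory used by model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption for illustration only: a roughly 7B-parameter generalist model.
ASSUMED_PARAMS = 7e9
# From the text: an 8 GB laptop GPU (treated as 8 GiB here for simplicity).
DEVICE_GIB = 8.0

for bits in (16, 4):
    gib = weight_memory_gib(ASSUMED_PARAMS, bits)
    verdict = "fits" if gib < DEVICE_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: {gib:5.2f} GiB -> {verdict} in {DEVICE_GIB:.0f} GiB")
```

Under these assumptions, 16-bit weights alone exceed the device memory, while 4-bit weights leave headroom for the full-precision specialist and runtime buffers.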

Put Block into Bowl (Original)

Human Intervention

Various Distractions + Novel Object

Video Playing

Inference Efficiency Comparison with OpenVLA

RoboDual achieves a control frequency of 15 Hz in our real-world setup using NVIDIA A5000 Ada GPUs, which facilitates deployment in more dexterous tasks. Notably, inference latency is a primary factor in the performance degradation of OpenVLA: operating at only 3.9 Hz within our system, it significantly alters the system dynamics relative to the 20 Hz non-blocking controller used in our real-world tasks.
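The gap between these frequencies can be made concrete with a small arithmetic sketch. It converts each policy frequency into a per-inference period and counts how many 20 Hz controller ticks elapse while one action is being computed. The function names are hypothetical, and the calculation assumes the stated frequencies are steady-state averages.

```python
def period_ms(freq_hz: float) -> float:
    """Time between consecutive policy outputs, in milliseconds."""
    return 1000.0 / freq_hz


def controller_ticks_per_inference(policy_hz: float, controller_hz: float = 20.0) -> float:
    """Number of low-level controller ticks that elapse during one policy inference."""
    return controller_hz / policy_hz


# Frequencies as reported in the text.
for name, hz in (("RoboDual", 15.0), ("OpenVLA", 3.9)):
    print(
        f"{name}: {period_ms(hz):.1f} ms per action, "
        f"{controller_ticks_per_inference(hz):.2f} controller ticks per inference"
    )
```

At 3.9 Hz, roughly five controller ticks pass before a new action arrives, compared with between one and two ticks at 15 Hz, which is one way to see why the slower policy acts on staler observations.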