Moving Out: Physically-grounded Human-AI Collaboration

{xuhui, dcs3zc, srs8rh, ntc8tt, ylkuo}@virginia.edu
University of Virginia

Moving Out investigates physically-grounded human-AI collaboration with diverse constraints and interaction modes.

Abstract

The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. Our experiments show that BASS outperforms state-of-the-art models in AI-AI and human-AI collaboration.

The Moving Out Environment


The Moving Out environment requires two agents to collaboratively move objects to the blue goal regions. Each map includes movable objects with varying shapes and sizes. An agent can move a small item quickly, but as the object size increases, the agent needs the other's help to move it. To succeed, the agents need to demonstrate diverse collaboration behaviors, including (a) recognizing when help is needed, (b) avoiding collisions, (c) passing objects, (d) moving items together, (e) aligning actions, and (f) organizing objects in the goal region.
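The size-based carrying rule above can be sketched as follows. This is a minimal illustration with hypothetical names and a hypothetical size threshold, not the environment's actual implementation:

```python
# Sketch of the carrying rule: small items can be moved by one agent,
# while large items require both agents to hold them together.
# `Item`, `SOLO_LIMIT`, and the normalized size scale are assumptions.
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    size: float  # normalized size in [0, 1]

SOLO_LIMIT = 0.5  # assumed threshold; the real environment tunes this

def agents_required(item: Item) -> int:
    """Return how many agents must hold the item to move it."""
    return 1 if item.size <= SOLO_LIMIT else 2

def can_move(item: Item, holders: int) -> bool:
    """An item moves only when enough agents are holding it."""
    return holders >= agents_required(item)
```

Under this rule, a single agent can carry a small item alone, while a large item stays put until the partner helps, which is what forces the agents to recognize when help is needed.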

Dataset Visualization

We designed two tasks in Moving Out: (1) adapting to diverse human behaviors and (2) generalizing to unseen physical constraints, and we collected a human-human interaction dataset to enable model training and evaluation. You can explore this dataset by selecting different tasks, maps, and episodes below. The dataset contains collaborative episodes across twelve map configurations, with each map presenting unique collaboration challenges.


BASS: Behavior Augmentation, Simulation, and Selection

Our method, BASS (Behavior Augmentation, Simulation, and Selection), consists of two main components designed to address the challenges of human behavioral diversity and physical constraints.

Method Overview

The first part is behavior augmentation. It enriches the training data through two techniques, allowing the model to better adapt to diverse human behaviors:

• Perturbing the Partner's Pose: We generate new states by adding random noise to the partner's pose data. This improves the model's robustness to subtle variations in the partner's movements.

• Recombination of Sub-Trajectories: We create new and temporally consistent interaction sequences by identifying and swapping segments of a partner's behavior between different recorded trajectories.
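The two augmentation techniques can be sketched as below. Trajectory steps are modeled as hypothetical (ego_state, partner_pose, action) tuples, and the noise scale is an assumption for illustration:

```python
# Sketch of the two behavior-augmentation techniques described above.
# Step format and noise scale are illustrative assumptions.
import random

def perturb_partner_pose(step, noise=0.02):
    """Add small Gaussian noise to the partner's (x, y, theta) pose,
    leaving the ego state and the recorded action unchanged."""
    ego, partner, action = step
    noisy = tuple(p + random.gauss(0.0, noise) for p in partner)
    return (ego, noisy, action)

def recombine_subtrajectories(traj_a, traj_b, t):
    """Swap in the partner's behavior from traj_b after step t,
    keeping the ego states and actions of traj_a, to produce a new
    but temporally consistent interaction sequence."""
    swapped = [(ego, partner, action)
               for (ego, _, action), (_, partner, _) in
               zip(traj_a[t:], traj_b[t:])]
    return traj_a[:t] + swapped
```

Both functions only rewrite the partner's side of the data, so the model sees a wider variety of partner behaviors without altering its own demonstrated actions.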

The second part is simulation and selection. We train a dynamics model to predict the state that results from an action. During inference, our model generates multiple candidate actions and uses this dynamics model to simulate their respective future outcomes. It then selects the optimal action based on an evaluation function (for example, the distance of an object to its goal). This allows our model to anticipate the consequences of its actions and make the best choice, even without access to a physical simulator.
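The simulation-and-selection loop can be sketched in a few lines. Here `policy`, `dynamics`, and `evaluate` are hypothetical stand-ins for the learned policy, the learned dynamics model, and the evaluation function (e.g., negative distance of an object to its goal):

```python
# Sketch of simulation and selection: sample candidate actions,
# simulate each outcome with the learned dynamics model (no physical
# simulator needed), and keep the best-scoring action.
def select_action(state, policy, dynamics, evaluate, n_candidates=8):
    candidates = [policy(state) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: evaluate(dynamics(state, a)))
```

For instance, with a toy 1-D state where `dynamics(s, a) = s + a` and `evaluate` rewards proximity to a goal position, the loop picks the candidate action that moves the state closest to the goal.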


Overview of our Simulation and Action Selection components. (Left) The latent dynamics model encodes the state and advances the latent state from t to t+1 to enable next-state prediction. (Right) The action selection pipeline: the policy first generates candidate actions, the dynamics model then estimates the resulting future states, and finally the best action is selected based on state evaluation.

AI-AI Collaboration Videos

The videos below compare AI-AI collaboration across methods. In our experiments, BASS outperforms the baseline models in both AI-AI and human-AI collaboration.

MLP

DP

BASS (Ours)

Human-AI Collaboration Videos

Here we demonstrate human-AI collaboration, where the red robot is controlled by a human and the blue robot is our AI agent. The comparison shows how BASS coordinates and collaborates with human partners better than the baseline DP method. Our failure case analysis shows that the baseline model (DP) often fails when adapting to human behaviors (e.g., a partner who does not assist) or objects (e.g., failing to grasp) that were not in the training data. In contrast, BASS roughly halves the occurrence rate of these primary failure types, demonstrating superior adaptability.

DP

BASS (Ours)

BibTeX


@misc{kang2025movingoutphysicallygroundedhumanai,
  title={Moving Out: Physically-grounded Human-AI Collaboration},
  author={Xuhui Kang and Sung-Wook Lee and Haolin Liu and Yuyan Wang and Yen-Ling Kuo},
  year={2025},
  eprint={2507.18623},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.18623},
}