A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery

1 DANiLab, School of Computing and Mathematical Sciences, University of Leicester, Leicester  2 School of Engineering, University of Leicester

Abstract

We propose a novel system for robot-to-human object handover that emulates human coworker interactions. Unlike most existing studies, which focus primarily on grasping strategies and motion planning, our system focuses on 1) inferring human handover intent and 2) imagining the spatial handover configuration. The first component integrates multimodal perception, combining visual and verbal cues, to infer human intent. The second uses a diffusion-based model to generate the handover configuration, capturing the spatial relationship among the robot's gripper, the object, and the human hand, thereby mimicking the cognitive process of motor imagery. Experimental results demonstrate that our approach effectively interprets human cues and achieves fluent, human-like handovers, offering a promising solution for collaborative robotics.

Video

Methods

Our method is structured into three key stages: handover intent inference, handover configuration imagery, and handover pose matching.


- Handover Intent Inference

We propose a handover intent recognition method that uses multimodal large language models (MLLMs) to interpret verbal instructions and visual gestures. The model processes a text description and an image of the receiving hand to identify the tool and the hand type, generating structured outputs such as "Pass the game controller to the right hand of the human."
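The structured output above can be consumed downstream only if it is parsed reliably. A minimal sketch of that step, assuming (hypothetically) that the MLLM is prompted to reply with a JSON object such as `{"object": ..., "hand": ...}`; the field names and the canonical instruction template are illustrative, not the paper's exact interface:

```python
import json
import re

def parse_intent(mllm_output: str) -> dict:
    """Extract a structured handover intent from an MLLM response.

    Assumes the model was prompted to answer with a JSON object such as
    {"object": "game controller", "hand": "right"} (hypothetical format).
    """
    match = re.search(r"\{.*\}", mllm_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    intent = json.loads(match.group(0))
    if intent.get("hand") not in ("left", "right"):
        raise ValueError(f"unexpected hand type: {intent.get('hand')}")
    return intent

def to_instruction(intent: dict) -> str:
    # Render the structured intent as a canonical instruction string.
    return f"Pass the {intent['object']} to the {intent['hand']} hand of the human."

reply = 'Sure. {"object": "game controller", "hand": "right"}'
print(to_instruction(parse_intent(reply)))
# → Pass the game controller to the right hand of the human.
```

Validating the `hand` field early keeps malformed model replies from propagating into the configuration-generation stage.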


- Handover Configuration Imagery

We introduce a diffusion-based generative model for 3D handover configurations, conditioned on text prompts derived from the recognized intent. This approach generates the spatial relationships among the robotic gripper, the object, and the human hand.


- Handover Pose Matching

We propose a matching approach that aligns the generated hand with the real-world hand while preventing contact between the hand and the robotic gripper, ensuring a safe handover.
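Aligning a generated hand with an observed one is, at its core, a rigid registration problem. A plausible building block is the Kabsch/Procrustes algorithm on hand keypoints, paired with a minimum-clearance check against the gripper; this is a sketch under those assumptions, not the paper's exact matching code:

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) aligning point set P onto Q (both N x 3)."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(0) - R @ P.mean(0)
    return R, t

def min_clearance(hand_pts, gripper_pts):
    # Smallest hand-gripper distance; a match below a safety margin is rejected.
    diffs = hand_pts[:, None, :] - gripper_pts[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).min()

rng = np.random.default_rng(1)
gen_hand = rng.standard_normal((21, 3))            # generated hand keypoints
real_hand = gen_hand + np.array([0.1, 0.0, 0.2])   # observed hand (translated)
R, t = kabsch(gen_hand, real_hand)
aligned = gen_hand @ R.T + t
print(np.allclose(aligned, real_hand, atol=1e-6))  # True
```

The 21-keypoint hand model and the safety-margin threshold are assumptions; the registration-then-clearance structure is what matters for transferring the imagined configuration onto the live scene safely.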


Experiments

Tests were conducted both in simulation and on real hardware. The proposed robotic handover system achieves success rates of 80%–93% across real and simulated environments. Total execution time is 9–10 seconds, with roughly one third spent on configuration generation and two thirds on execution.


- Generated Results

We applied the proposed generation method to a variety of everyday household items, producing results that define the spatial relationship among the robotic gripper, the object, and the human hand. Visualizations of these results are shown below.

BibTeX

@misc{zhang2025generativerobottohumanhandoversintent,
    title         = {A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery}, 
    author        = {Hanxin Zhang and Abdulqader Dhafer and Zhou Daniel Hao and Hongbiao Dong},
    year          = {2025},
    eprint        = {2503.03579},
    archivePrefix = {arXiv},
    primaryClass  = {cs.RO},
    url           = {https://arxiv.org/abs/2503.03579}, 
}