Learning from Physical Human Feedback:
An Object-Centric One-Shot Adaptation Method

Alvin Shek    Bo Su    Rui Chen    Changliu Liu
Carnegie Mellon University
Video Submission to ICRA 2023

For robots to be effectively deployed in novel environments and tasks, they must be able to understand the feedback that humans express during intervention. Such feedback can either correct undesirable behavior or indicate additional preferences. Existing methods either require repeated episodes of interaction or assume reward features are known in advance, which is data-inefficient and transfers poorly to new tasks. We relax these assumptions by describing human tasks in terms of object-centric sub-tasks and interpreting physical interventions in relation to specific objects. Our method, Object Preference Adaptation (OPA), is composed of two key stages: 1) pre-training a base policy to produce a wide variety of behaviors, and 2) updating only certain weights in the model online according to human feedback. The key to our fast yet simple adaptation is that general interaction dynamics between agents and objects are fixed, and only object-specific preferences are updated. Our adaptation occurs online, requires only one human intervention (one-shot), and produces new behaviors never seen during training. Trained on cheap synthetic data instead of expensive human demonstrations, our policy demonstrates impressive adaptation to human perturbations on challenging, realistic tasks in our user study. More details are provided in the paper.
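The split between fixed interaction dynamics and adaptable object preferences can be illustrated with a toy 2D sketch. This is not the OPA model: the frozen `interaction` function, the scalar `pref`, and the `adapt` routine are hypothetical stand-ins (the paper updates learned weights inside a network), chosen only to show how one correction can update an object-specific quantity while everything else stays fixed.

```python
import math

def interaction(agent, obj):
    """Frozen stand-in for pre-trained interaction dynamics (an assumption):
    a unit vector pointing from the object toward the agent."""
    dx, dy = agent[0] - obj[0], agent[1] - obj[1]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm)

def policy(agent, obj, pref):
    """Action = preference * frozen interaction term.
    pref > 0 pushes the agent away from the object, pref < 0 pulls it closer."""
    ux, uy = interaction(agent, obj)
    return (pref * ux, pref * uy)

def adapt(pref, agent, obj, corrected_action, lr=0.5, steps=20):
    """Gradient descent on 0.5 * ||action - corrected_action||^2,
    updating only the object-specific preference `pref`."""
    ux, uy = interaction(agent, obj)
    for _ in range(steps):
        ax, ay = pref * ux, pref * uy
        grad = (ax - corrected_action[0]) * ux + (ay - corrected_action[1]) * uy
        pref -= lr * grad
    return pref

# One intervention: the policy initially pulls the agent toward the object,
# but the human pushes the agent away from it.
agent, obj = (0.0, 0.0), (1.0, 0.0)
pref = adapt(-1.0, agent, obj, corrected_action=(-1.0, 0.0))

# The updated preference carries over to the same object placed elsewhere.
new_action = policy((0.0, 0.0), (0.0, 2.0), pref)
```

In this sketch the single correction flips the preference from attraction to repulsion, and because the interaction term is untouched, the new preference applies wherever the object is placed.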

Ablation Studies

Zero-shot Transfer to Different Environment Settings

A major advantage of defining our policy relative to objects and learning features for object-centric behavior is that our policy naturally handles different environment settings. This is crucial: robots should adapt their behavior when objects are shifted or rotated. The video below shows an extension of task 3 from our paper that examines the behavior of the policy as the scanner's pose changes. Given human correction in an initial setting A, the policy should still behave correctly in different settings B and C without additional supervision (zero-shot).

Four episodes are shown: (1) human correction in setting A, (2) updated robot policy in setting A, (3) same policy transferred to setting B, and (4) same policy transferred to setting C.
We can see that the robot's behavior can change dramatically to handle changes to the scanner's pose. This shows the power of learning behavior for specific objects. While this object-centric paradigm may not solve all tasks, we argue this is crucial for successful generalization.
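One way to see why object-relative behavior transfers across poses is a small planar sketch. It assumes a 2D pose `(x, y, theta)` for the scanner and a single corrected waypoint; the helper names `to_local` and `to_world` are hypothetical and are not taken from the paper, which works with learned features rather than explicit waypoints.

```python
import math

def to_local(pose, world_pt):
    """Express a world-frame point in the frame of an object with pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy = world_pt[0] - x, world_pt[1] - y
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy)

def to_world(pose, local_pt):
    """Map a point from the object's frame back to the world frame."""
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * local_pt[0] - s * local_pt[1],
            y + s * local_pt[0] + c * local_pt[1])

# Setting A: the human correction is observed in world coordinates
# and stored relative to the scanner.
pose_a = (0.0, 0.0, 0.0)
correction_world = (1.0, 0.0)
correction_local = to_local(pose_a, correction_world)

# Setting B: the scanner is shifted and rotated by 90 degrees.
# The stored object-relative correction moves with it, with no new supervision.
pose_b = (2.0, 1.0, math.pi / 2)
target_b = to_world(pose_b, correction_local)
```

Here the correction, stored one unit in front of the scanner, ends up at roughly (2, 2) in setting B: the world-frame behavior changes substantially even though the object-relative behavior is identical.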

"Singularities" of Potential Fields

In Section 3-B of the paper, we described how the position relation network outputs an overall push-pull force on the agent. We allow the direction of this force to be freely determined by the network. Potential field methods also use push-pull forces, but constrain each force's direction to be parallel to the corresponding agent-object vector, so that only its magnitude and sign can change. This may seem like an intuitive way to enforce structure in the network and reduce complexity, but the constraint fails at “singularities”, where no orthogonal component is available to avoid an obstacle lying in the same direction as the goal.

Forced vs free force contributions
The video above compares "forced" (left) and "free" (right) direction behavior. As the blue agent moves to the yellow goal in both cases, pay attention to the "force" contributions of the goal and the repel object, shown as yellow and red arrows respectively. Notice how in the "forced" version, the model correctly predicts a force vector pointing away from the red repel object. However, since this force has no orthogonal component, the goal force dominates and the blue agent moves straight through the repel object. In the "free" version on the right, the force contributed by the repel object has an orthogonal component, allowing the agent to steer around it.
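The singularity can be reproduced numerically with a hand-written potential field. This is a sketch under assumptions: the gains `k_goal` and `k_repel` are arbitrary, and the "free" direction is imitated by rotating the repulsion by a fixed, hypothetical angle, whereas in our model the direction is predicted by the network.

```python
import math

def unit(v):
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def net_force(agent, goal, obstacle, repel_angle=0.0, k_goal=1.0, k_repel=0.5):
    """Sum of goal attraction and obstacle repulsion.
    repel_angle = 0 is the "forced" case: repulsion is parallel to the
    agent-object vector. A nonzero angle stands in for a "free" direction."""
    g = unit((goal[0] - agent[0], goal[1] - agent[1]))
    r = rotate(unit((agent[0] - obstacle[0], agent[1] - obstacle[1])), repel_angle)
    return (k_goal * g[0] + k_repel * r[0], k_goal * g[1] + k_repel * r[1])

def lateral(force, agent, goal):
    """Component of `force` orthogonal to the agent-goal direction."""
    g = unit((goal[0] - agent[0], goal[1] - agent[1]))
    return -g[1] * force[0] + g[0] * force[1]

# Singular configuration: the obstacle lies directly between agent and goal.
agent, obstacle, goal = (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)

forced = net_force(agent, goal, obstacle)
free = net_force(agent, goal, obstacle, repel_angle=-math.pi / 3)

forced_lateral = lateral(forced, agent, goal)  # zero: no way to sidestep
free_lateral = lateral(free, agent, goal)      # nonzero: agent can steer around
```

In the forced case the net force still points at the goal with no sideways component, so the agent is driven straight into the obstacle; the rotated repulsion yields a sideways component that lets it go around.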

Scalability with Objects

Graph-based models and neural networks have the advantage of being invariant to the order and number of objects in a scene. We trained our model's position relation network only in scenarios with two other objects: one repel and one attract. Ideally, our model should be able to generalize to more objects. Here, we examine this generalization ability. The video below shows our model in scenarios with 4 objects: two repel and two attract. The text at the center of each object displays the attention value placed on that object in the format att: value.
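The invariance to object order and count can be sketched as an attention-weighted sum of per-object forces. The softmax normalization and the names `softmax` and `aggregate` are assumptions made for illustration; the exact attention mechanism is described in the paper, not here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(forces, scores):
    """Combine any number of per-object 2D forces using attention weights."""
    att = softmax(scores)
    fx = sum(a * f[0] for a, f in zip(att, forces))
    fy = sum(a * f[1] for a, f in zip(att, forces))
    return (fx, fy), att

# Four objects (two repel, two attract) with made-up forces and scores.
forces = [(1.0, 0.0), (0.0, 1.0), (-0.5, 0.5), (0.2, -0.8)]
scores = [2.0, 0.5, 1.0, -1.0]
total, att = aggregate(forces, scores)

# Reordering the objects leaves the aggregated force unchanged.
order = [2, 0, 3, 1]
total_shuffled, _ = aggregate([forces[i] for i in order],
                              [scores[i] for i in order])

# The same function also accepts a scene with only two objects.
total_two, att_two = aggregate(forces[:2], scores[:2])
```

Because the sum runs over whatever objects are present, the architecture accepts larger scenes without modification; whether the learned attention values remain sensible in those scenes is a separate, empirical question.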

Model behavior with 4 objects
Overall, our model seems to behave well. Also, looking at the first few seconds, we see that our model correctly places high attention on objects directly "ahead" of it and decreases attention once "past" them. This is because of our state-based feature Goal-Relative Distance (Section 3-C). Next, we increase the number of objects to 6:
Model behavior with 6 objects
Clearly, our model begins to suffer from issues such as getting stuck in local optima when the force contributions of objects equalize. This shows that local optima are still a problem when a method lacks the ability to plan ahead. Also, our model seems to lose its ability to prioritize nearby objects and gets repelled by objects very far away. We can see that both close and faraway objects are given similar attention values. This is strange, since one of the state-based features that we provide to the model is size-relative distance (Section 3-C), so ideally this issue would not arise. We are investigating this, but one possible workaround is to introduce a training curriculum during our model's pre-training phase: initially train only with 2-object scenarios, then gradually increase the number of objects as training progresses.
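The curriculum idea could be scheduled as follows. This is only a sketch of one possible schedule: the function name `num_objects`, the equal-length stages, and the 2-to-6 range are assumptions, and we have not yet evaluated such a curriculum.

```python
def num_objects(step, total_steps, start=2, end=6):
    """Number of objects per training scene at a given step.
    Training is split into equal-length stages, one per object count,
    beginning with `start` objects and finishing with `end` objects."""
    n_stages = end - start + 1
    stage = min(n_stages - 1, step * n_stages // total_steps)
    return start + stage

# Object counts at a few points of a hypothetical 1000-step pre-training run.
schedule = [num_objects(s, 1000) for s in (0, 199, 200, 500, 999, 1000)]
```

Integer arithmetic keeps the stage boundaries exact, and the clamp keeps the count at `end` if training runs to or past `total_steps`.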

Tuning the Adaptation Learning Rate

Coming Soon: Comparison of Stochastic Gradient Descent vs Learned Optimizer (Learn2Learn) vs Recursive Least Squares