.: Affective Engagement with an Emotionally Expressive Robot :.
Adam Setapen
Personal Robots Group, MIT Media Lab
MAS.630 Affective Computing
Fall 2011
I really feel like many of the tricks we try to use as animators for creating appeal and life could be directly applied to robots to great effect.
--Doug Dooley
Motivation
A social robot must exhibit emotional responses to be a believable character. A robot capable of long-term social interaction should not only react with emotional responses, but also store all interactions and allow the mapping from internal state to external action to adapt dynamically. Kombusto the DragonBot, seen in the upper right-hand corner, is a new robot platform I have built over the past year for investigating long-term human-robot interaction (HRI). The primary purpose of my project was to build a refined emotion system for DragonBot capable of supporting social relationships through long-term human-robot interaction.
This work was largely inspired by a presentation given by Doug Dooley at the MIT Media Lab. In his presentation, Dooley focused on applying the animation principles used for robot characters to robots in the real world. His thesis was that combining a robot's motivation and emotional state leads to a compelling character. Similar to how actors consider their character's intentions, motivation forms the foundation of a believable character. An agent trying to guilt someone will look very different from an agent trying to command.
However, motivation is not enough to make a compelling creature. The emotions of a character are what the viewer notices most. Integrating different emotional states with a character's motivation should lead to noticeably different behavior. Using the command example, different characters will want to command in varying ways. Some characters will command others with an emotion of happiness, while other characters will command with an emotion of anger or nervousness. How a character's motivations combine with their emotions defines who that character is. An example of a robot interaction with a flat emotional response vs. one with a multimodal response can be seen in Figure 1.
Figure 1 - Robot with a flat emotional response vs. robot with multimodal emotional responses
The primary contribution of this project is a combined motivation-emotion system for a robot that integrates facial expressions, body language, and sounds. This system logs all data and can dynamically alter the robot's behavior through its emotional state without recompiling any code. Furthermore, I present a qualitative analysis of data from a pilot study evaluating the viability of a task that compares interactions with emotional responses to interactions with flat responses.
DragonBot
The appeal of a robot is analogous to charisma in an actor. An appealing character is unique, believable, interesting, and engaging. Taking inspiration from the pioneering animation techniques of Disney's Thomas and Johnston, I kept the twelve principles of animation in mind when deciding how emotions and motivations should map to actions. Just as these principles paved the way for advances in modern computer animation, they are also vital for embedding believable actions into a socially-embodied robot.
The robot used in this project, DragonBot, was created as a long-term learning companion for children. The design of DragonBot was informed and driven by classical and modern animation principles, as a long-term companion must be a believable character. DragonBot supports long-term interactions through its core functionality, which runs on an Android cellphone. DragonBot can run as a physical robot (with the phone "inside" the face of the robot), or the phone can be removed and the user can interact with a virtual avatar of DragonBot. As all high-level code runs on the phone (including motor control, vision and audio processing, and the animated virtual face), we can provide continuity of character between a physical and a virtual interaction. Furthermore, the internet connection of a modern smartphone allows data to be pushed to and pulled from the cloud at runtime. Finally, we can also offload computationally intensive tasks to the cloud, receiving and integrating results as they become available.
Mechanically, DragonBot has 5 degrees of freedom (DOFs). Four DOFs (Cartesian X, Y, and Z + head rotation) come from a novel parallel manipulator that connects the stationary base of the robot to a mobile head platform. The fifth degree of freedom is a neck-tilt on the head platform, allowing the robot to look down at objects on a table or gaze up at people. The face of DragonBot consists of over 200 animated 3D nodes which are displayed on the screen of the phone through a two-dimensional orthogonal projection. Such a detailed facial rig gives the robot a wide range of facial expressions.
DragonBot's actions are entirely controlled through a procedurally blended animation framework. This is paramount for providing an extensible method of creating novel motions and expressions without recompiling any code. Furthermore, because the robot's virtual and physical nodes are holistically controlled using animation techniques, we can procedurally blend between different facial expressions and body poses. For example, we can combine two expressions and dynamically alter their weights, leading to robot responses that can be evolved and improved through each interaction.
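As a rough illustration of the weighted blending described above, here is a minimal Python sketch. It is not the DragonBot animation framework: the node names, the pose values, and the `blend_poses` helper are hypothetical, and a simple linear blend between two poses is assumed.

```python
def blend_poses(pose_a, pose_b, weight_b):
    """Linearly blend two poses given as {node_name: value} dicts.

    weight_b = 0.0 returns pose_a, weight_b = 1.0 returns pose_b.
    (Hypothetical helper; the real framework's blending may differ.)
    """
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must be in [0, 1]")
    if pose_a.keys() != pose_b.keys():
        raise ValueError("both poses must define the same nodes")
    return {
        node: (1.0 - weight_b) * pose_a[node] + weight_b * pose_b[node]
        for node in pose_a
    }


# Hypothetical two-node "face"; the actual rig has over 200 nodes.
smile = {"mouth_corner": 1.0, "brow_raise": 0.25}
surprise = {"mouth_corner": 0.0, "brow_raise": 1.0}

# The weight is plain data, so it can be changed between interactions
# without touching (or recompiling) any code.
blended = blend_poses(smile, surprise, 0.5)
print(blended)
```

Because the weight is just a number passed in at call time, it could be stored alongside the interaction logs and adjusted after each session.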
The emotion system
This project marks the beginning of my research on robots that are emotionally aware and responsive. This project focuses on synthesizing the robot's emotional responses and building the back-end for a dynamic and scalable integration of emotion into a robot's behavioral system. There are many pieces of the emotion system still missing - namely the ability to pick which emotional state the robot is in and to make inferences about the emotional state of a user. These issues are discussed in the future work section.
The emotion-based action system of the robot is summarized by the following diagram:
Figure 6 - flow-chart illustrating the connections of the emotion system
The system was implemented in Java, and at the top level lies the motivation manager. This hierarchical structure handles the robot's high-level motivations, such as finding faces or looking at an object. Motivations can also be triggered by external stimuli. The top level of the dynamically-sized hierarchy is always active, and it acts as a simplified sympathetic nervous system. It makes the robot jump away from objects that are about to collide with it and makes the robot want to both breathe and blink. It also makes the robot respond to people "poking its face", utilizing the touchscreen on the Android phone. Multiple layers can sit under this top level, and the motivation manager parses motivations from the top down.
Under this level comes the action selector, which takes input from the motivation manager and external stimuli and maps them to action primitives - essentially a naive planner. The action selector considers stimuli first and then the robot's motivations, mapping to basic actions like "blink", "jump forward", "look at X, Y, Z". These primitive actions are sent into the emotion manager, which regulates a graph of the robot's emotional state over time. The emotion manager uses the Emotion Annotation and Representation Language (EARL) XML schema, and it supports both categorized emotions (like "happiness" or "boredom") and emotions based on valence, arousal, and power.
Next comes action modification, which takes the selected primitive action and maps it to a new action based on the current emotion. This step applies an "emotion trajectory" to the primitive action, blending the primitive action with expressivity. For example, if the primitive action is "lean towards X, Y, Z", the robot will move much more quickly if it is "excited" than if it is "bored". This is also the stage that adds facial expressions and sounds to an action based on the robot's emotional state. At this stage, we have the following information: perform primitive action X with emotion Y by applying MAP(X, Y). We log all data about motivations, stimuli, action selection, etc. and commit to performing the action. The sound is time-synchronized with the motion and the action is performed by the robot. The MAP(X, Y) function contains blending weights that can be changed at runtime to alter the mapping in real time.
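The top-down parse of the motivation hierarchy can be sketched roughly as follows. This is an illustrative Python sketch rather than the Java implementation; `MotivationLayer`, `active_motivations`, and the layer contents are names assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MotivationLayer:
    """One level of the hierarchy (hypothetical structure)."""
    name: str
    motivations: list = field(default_factory=list)
    always_active: bool = False  # True only for the top "reflex" level
    active: bool = False


def active_motivations(layers):
    """Walk the layers from the top down and collect the motivations
    of every layer that is currently active, in priority order."""
    result = []
    for layer in layers:
        if layer.always_active or layer.active:
            result.extend(layer.motivations)
    return result


# The hierarchy is an ordinary list, so layers can be added or removed
# at runtime (a stand-in for "dynamically-sized").
layers = [
    MotivationLayer("reflexes", ["avoid collision", "breathe", "blink"],
                    always_active=True),
    MotivationLayer("social", ["find faces"], active=True),
    MotivationLayer("exploration", ["look at an object"], active=False),
]

print(active_motivations(layers))
```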
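A naive planner of this kind might look like the sketch below. The lookup tables, the stimulus and motivation names, and `select_action` are assumptions made for illustration; only the ordering (stimuli first, then motivations) comes from the description above.

```python
# Hypothetical lookup tables from stimuli / motivations to primitives.
STIMULUS_ACTIONS = {
    "incoming_object": "jump away",
    "face_poked": "react to poke",
}
MOTIVATION_ACTIONS = {
    "find faces": "look for a face",
    "breathe": "breathe",
    "blink": "blink",
}


def select_action(stimuli, motivations):
    """Return one primitive action name.

    External stimuli are checked first; motivations are consulted only
    when no stimulus maps to an action. Falls back to "idle".
    """
    for stimulus in stimuli:
        if stimulus in STIMULUS_ACTIONS:
            return STIMULUS_ACTIONS[stimulus]
    for motivation in motivations:
        if motivation in MOTIVATION_ACTIONS:
            return MOTIVATION_ACTIONS[motivation]
    return "idle"


print(select_action(["face_poked"], ["find faces"]))
print(select_action([], ["find faces"]))
```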
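To make MAP(X, Y) more concrete, here is a small Python sketch under stated assumptions. The table contents, the `ExpressiveAction` type, and `apply_emotion` are hypothetical; the real system is written in Java and blends full animation trajectories rather than scaling a single speed value.

```python
from dataclasses import dataclass


@dataclass
class ExpressiveAction:
    primitive: str    # e.g. "lean towards X, Y, Z"
    speed: float      # playback speed multiplier
    expression: str   # facial expression to overlay
    sound: str        # sound to play in sync with the motion


# Hypothetical mapping parameters, kept as plain data so they can be
# edited while the program is running.
emotion_map = {
    "excited": {"speed": 1.5, "expression": "wide_smile", "sound": "chirp"},
    "bored": {"speed": 0.5, "expression": "droopy_eyes", "sound": "sigh"},
}


def apply_emotion(primitive, emotion, base_speed=1.0):
    """MAP(X, Y): combine primitive action X with emotion Y."""
    params = emotion_map[emotion]  # looked up on every call
    return ExpressiveAction(
        primitive=primitive,
        speed=base_speed * params["speed"],
        expression=params["expression"],
        sound=params["sound"],
    )


excited = apply_emotion("lean towards X, Y, Z", "excited")
bored = apply_emotion("lean towards X, Y, Z", "bored")

# Changing a weight at runtime alters the mapping for later actions.
emotion_map["bored"]["speed"] = 0.25
slower = apply_emotion("lean towards X, Y, Z", "bored")
print(excited, bored, slower, sep="\n")
```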
Evaluation
To evaluate the impact of the robot's emotional responses in human-robot interactions, I am planning to use the 20Q task as a benchmark. 20Q is a computer-based game drawing inspiration from Twenty Questions, a spoken parlor game popularized in the 1940s that encourages deductive reasoning and creativity. In 20Q, a human player is asked to think of an object and the computer attempts to guess what they are thinking of with twenty yes-or-no questions. If it fails to correctly guess the object in 20 questions, the computer asks an additional 5 questions. If the computer cannot guess within 25 questions, the human player is declared the winner. The AI of 20Q uses an artificial neural network to pick which questions to ask and to handle the object guesses.
The highly-specialized AI of 20Q is impressive. If a person plays more than 2 or 3 games, there is a good chance 20Q will guess at least one of the user's objects. Not only does this domain-specific program work remarkably well, but the entire neural network has been stuffed into an integrated circuit and appears in a popular children's toy, seen in Figure 7.
Figure 7 - the 20Q handheld toy
20Q is being used as a benchmark because it is a straightforward task that can be completed with a social or non-social interaction. In this project, I am attempting to find out whether we can distinguish any differences in a subject's electrodermal activity (EDA) between a web-based session of 20Q and a human-human interaction of the same task. If differences exist between these conditions, a comprehensive study will be run that compares the following four conditions:
- C - computer
- H - human
- RE - robot with emotional responses
- RF - robot with flat emotion
The evaluation that follows investigates any differences between the human interaction scenario (H) and the computer interaction scenario (C). Six subjects performed each condition, with half of the subjects getting C first and half initially receiving H. Each subject played two games of 20Q under each condition, for a total of 4 games per participant.
The sources of data collected in this experiment consist of video and electrodermal activity (EDA). A full-frame video was taken that captured the participant's posture and facial expression. Furthermore, when the participant was doing the task with the computer, the screen of the computer was recorded along with a webcam stream aimed at the user's face. Finally, electrodermal activity was obtained through the Q Sensor from Affectiva, seen in Figure 8. After putting the Q Sensor on their dominant hand (with electrodes on the palm side of the wrist), the subjects were prompted to run up and down a set of stairs to improve the connection between the Q Sensor and their skin. After this, the subject stepped into the study room and pushed the button on their sensor to aid in synchronizing video and EDA signals.
Figure 8 - Affectiva's Q Sensor Palm for measuring Electrodermal Activity (EDA)
Results
Out of the six participants, one subject inadvertently turned off their sensor early in the experiment, rendering their data unusable. Several different filters were run on each subject's data to provide insight into its trends. The filters that worked best were exponential smoothing (with alpha = 0.03) and low-pass filters with 30-second and 1-minute window sizes. The subjects' EDA data can be seen in Figures 9 - 13, with filtered signals overlaid and the class separator identified.
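For readers unfamiliar with these filters, here is a minimal Python sketch of both. It is not the MATLAB analysis code: the low-pass filter is assumed to be a trailing moving average, and `sample_rate_hz` is a parameter you must supply, since the sensor's sampling rate is not stated here.

```python
def exponential_smoothing(signal, alpha=0.03):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not signal:
        return []
    smoothed = [signal[0]]
    for x in signal[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


def moving_average(signal, window_seconds, sample_rate_hz):
    """Trailing moving average over a window given in seconds.

    The first few outputs average over however many samples exist so
    far, so the result has the same length as the input.
    """
    window = max(1, int(round(window_seconds * sample_rate_hz)))
    out = []
    running = 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            running -= signal[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


# Toy data only; real EDA traces are much longer.
toy = [0.0, 2.0, 4.0, 6.0]
print(exponential_smoothing(toy))
print(moving_average(toy, window_seconds=2, sample_rate_hz=1))
```

With alpha = 0.03 each new sample contributes only 3% to the smoothed value, so the output tracks slow tonic changes and suppresses fast fluctuations.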
Two metrics were used for analyzing the differences between the two classes of data. The first, peak ratio, simply compares the number of peaks in the human scenario to the computer scenario. A simple peak-detection algorithm that I wrote many years ago for analyzing accelerometer data worked quite well for the EDA signal. An example of the peak detection results can be seen in Figure 14. The other metric, mean ratio, compares the overall means of the two time-separated segments. Results of these analytics (per subject) can be found in Table 1.
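The two metrics can be sketched in Python as follows. The peak detector here is a stand-in (local maxima that rise at least `min_rise` above the preceding trough), not the accelerometer-derived algorithm mentioned above; `count_peaks`, `min_rise`, and the toy signals are assumptions for illustration.

```python
from statistics import mean


def count_peaks(signal, min_rise=0.05):
    """Count local maxima that rise at least min_rise above the lowest
    value seen since the previous peak (a simple stand-in detector)."""
    peaks = 0
    trough = signal[0] if signal else 0.0
    for i in range(1, len(signal) - 1):
        trough = min(trough, signal[i])
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if is_local_max and signal[i] - trough >= min_rise:
            peaks += 1
            trough = signal[i]  # start tracking the next trough
    return peaks


def peak_ratio(human, computer, min_rise=0.05):
    """Peaks in the human segment divided by peaks in the computer
    segment. Raises ZeroDivisionError if the computer segment has none."""
    return count_peaks(human, min_rise) / count_peaks(computer, min_rise)


def mean_ratio(human, computer):
    """Mean of the human segment divided by mean of the computer segment."""
    return mean(human) / mean(computer)


# Toy segments, not real EDA data.
human_segment = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
computer_segment = [1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(peak_ratio(human_segment, computer_segment))
print(mean_ratio(human_segment, computer_segment))
```

Note that the two ratios can disagree: the toy human segment has more peaks but a lower mean than the toy computer segment.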
Table 1 - per-subject peak counts, peak ratio, and mean ratio
Subject | Number of peaks (human) | Number of peaks (computer) | Peak ratio (human / computer) | Mean ratio (human / computer)
Subject #1 | 19 | 6 | 3.167 | 1.257
Subject #2 | 19 | 6 | 3.167 | 0.825
Subject #3 | 78 | 18 | 4.333 | 1.548
Subject #4 | 33 | 70 | 0.471 | 2.065
Subject #5 | 52 | 48 | 1.083 | 0.993
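As a quick arithmetic check, the peak ratios in Table 1 can be recomputed from the reported peak counts. The snippet below only uses numbers from the table; the variable names are arbitrary.

```python
# (human peaks, computer peaks) per subject, copied from Table 1.
peak_counts = {1: (19, 6), 2: (19, 6), 3: (78, 18), 4: (33, 70), 5: (52, 48)}

# Peak ratio = human peaks / computer peaks, rounded to 3 decimals.
peak_ratios = {
    subject: round(human / computer, 3)
    for subject, (human, computer) in peak_counts.items()
}

# Subjects with more peaks in the human condition than the computer one.
more_peaks_with_human = [s for s, r in peak_ratios.items() if r > 1.0]

print(peak_ratios)
print(more_peaks_with_human)
```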
Qualitatively, these results are quite interesting. Visually, there is a noticeable difference in the EDA response between human interaction and computer interaction in the 20Q task. All but one of the subjects (Subject #4) exhibited more peaks in their EDA signal during condition H than during condition C. However, this individual's mean EDA was roughly twice as high during condition H as during condition C (mean ratio 2.065).
The peaks in the EDA signals with the biggest magnitude tended to come from the beginning and end of games. The process of "thinking of an object" tended to get people's EDA to increase, as did losing a match. Also, there were a few peaks that occurred when the user became frustrated that the AI was asking redundant questions. For example, question 4 was "Does it have four legs?" and the response was "yes", followed by "Does it move?" on question 8. This was particularly noticeable in the H condition, where users seemed to attribute the computer's common-sense errors to the human experimenter.
Conclusions
These initial results suggest there are trends in an individual's EDA response between these two conditions that are noticeable by human inspection. However, these differences don't seem to be confined to a single axis of measurement. Furthermore, because two people's EDA responses can vary so much, writing an algorithm to pick "Human-or-computer" purely through EDA would prove to be difficult. 20Q successfully extended the same task across computer and human interactions, but it's not obvious that this is the right task for evaluating the robot's emotional system.
While the emotion system I implemented has not been evaluated, it will be essential for a framework of scalable and believable characters. One particular problem I encountered in designing a study to evaluate emotional responses is that a robot with flat responses seems "dead". A challenge I faced with evaluating this system was defining a task that would allow for a fair comparison between a robot with emotions and one without. One idea I recently had is to remove individual axes of expressivity (facial expression, motion, sound) to analyze which is most important, but this doesn't assess the overall impact of the emotional responses.
Lastly, as a morally-conscious scientist, I should mention a few possible sources of bias in my pilot study. I was the person who acted as a conduit from 20q.net to the human in the H condition, and it should have been someone who didn't know the motivations of the experiment. Also, I already had social rapport with many of the subjects, as I advertised my study internally at the Media Lab. These factors could have skewed the data, and the same human-human interaction with an unfamiliar person might lead to very different EDA responses.
Future work
As mentioned previously, this project marks the beginning of my work on emotionally capable social robots. I now have a framework for expressive action, but there are many steps before the robot can understand people's emotions and autonomously modulate its own affective state. One important next step is to design a way for the emotions of the robot to be controlled autonomously based on external stimuli. Another big task is emotion recognition of users, based on facial expressions, body pose, and prosody. I look forward to refining DragonBot's emotion system in the coming months!
References
F. Thomas and O. Johnston. The Illusion of Life: Disney Animation. Abbeville Press New York, 1981.
J. Lasseter. Principles of traditional animation applied to 3d computer animation. ACM SIGGRAPH Computer Graphics, 21(4):35–44, 1987.
R. Picard. Affective computing. The MIT Press, 2000.
B. Reeves and C. Nass. The Media Equation: How people treat computers, television, and new media like real people and places. CSLI Publications and Cambridge University Press, 2003.
R. Gockley, A. Bruce, J. Forlizzi, M. Michalowski, A. Mundell, S. Rosenthal, B. Sellner, R. Simmons, K. Snipes, A.C. Schultz, et al. Designing robots for long-term social interaction. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1338–1343. IEEE, 2005.
J. Merlet. Parallel robots. Springer-Verlag New York Inc., 2006.
Cory Kidd. Designing for long-term human-robot interaction and application to weight loss. PhD thesis, Massachusetts Institute of Technology, 2008.
U. Hess, R. Adams, and R. Kleck. The face is not an empty canvas: how facial expressions interact with facial appearance. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3497, 2009.
R. Wistort and C. Breazeal. Tofu: a socially expressive robot character for child interaction. In Proceedings of the 8th International Conference on Interaction Design and Children, 292–293. ACM, 2009.
Jesse Gray, Guy Hoffman, Sigurdur Orn Adalgeirsson, Matt Berlin, and Cynthia Breazeal. Expressive, interactive robots: Tools, techniques, and insights based on collaborations. In HRI 2010 Workshop: What do collaborations with the arts have to say about HRI?, 2010.
L. Takayama, D. Dooley, and W. Ju. Expressing thought: improving robot readability with animation principles. In Proceedings of the 6th international conference on Human-robot interaction, 69–76. ACM, 2011.
Supplemental Materials
Class presentation
Here is my class presentation in both PDF and Keynote formats.
Teleoperator interface
A specialized teleoperator interface was made for the 20Q task, seen in Figure 15. This application runs on an Android-based tablet and allows a person to remotely control DragonBot. The teleoperator interface receives a video stream from DragonBot, performs facial recognition (and sends any results back to DragonBot), and saves all images (either to disk or to Amazon Web Services). In addition, real-time audio streams between the phone controlling DragonBot and the teleoperator tablet. Motions, facial expressions, and sounds can be triggered from the tablet-based interface, and the controller has no low-level control over the robot. Restricting the operator to high-level actions in this way is meant to let them focus the majority of their cognitive effort on the interaction.
Figure 15 - 20Q teleoperator interface
Videos
Two videos showing DragonBot in action are below. Figure 16 compares the robot's flat responses with its multimodal emotional responses. Figure 17 is an older video describing the overall goals of the platform.
Figure 16 - A comparison of multimodal emotional responses vs flat responses
Figure 17 - An introduction video outlining the research motivations
EDA analysis code
Here is my MATLAB code for parsing EDA data, running analytics on it, and printing/plotting results. Anonymized EDA data is included in the archive.