Kai-yuh Hsiao
I'm a grad student at the Media Lab at the Massachusetts Institute
of Technology, in the Cognitive Machines Group, under Professor Deb
Roy.
Our group works on combining multimodal inputs in computer
systems, especially as applied to language comprehension. Right now
I'm mainly involved with our robot project. Here is our robot's
head: [image] and here is the whole robot: [image]
We call our robot "Ripley", because it looks like the alien in
"Alien", and Ripley was the character who fought the aliens. Ripley
has cameras, microphones, and other sensors on it so it can take input
from the outside world, process it, and then learn to manipulate that
world.
Eventually, there will basically be four parts to Ripley. First
there's the motor control part. The microcontroller in charge of the
robot's motion is a PC104, which has a Java virtual machine running on
it. This is linked via serial cable to an RT-Linux system which runs
a GUI and a PVM interface so the robot can be controlled smoothly and
(hopefully) intelligently. I've been working on all that stuff. Here
are some videos of what Ripley can do today: 1) Ripley copying some motions; and 2) Ripley deciding whether objects are
heavy or light.
The next part is the vision system. The robot has to be able to
process its visual input and extract features, such as the position,
shape, and color of objects in its field of view. Niloy has been
working on a vision system for picking up simple objects.
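To make "extracting features" a bit more concrete, here is a minimal
sketch in Python. It is not Ripley's actual vision code. It assumes a
single object on a uniform background, and the function name
extract_object_features, the tiny hand-made image, and the
"fill_ratio" shape measure are all made up for illustration.

```python
def extract_object_features(image, background=(0, 0, 0)):
    """Return position, rough shape, and color of the non-background pixels.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Assumes at most one object and a perfectly uniform background.
    """
    pixels = [
        (row, col, rgb)
        for row, line in enumerate(image)
        for col, rgb in enumerate(line)
        if rgb != background
    ]
    if not pixels:
        return None  # nothing in view

    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    n = len(pixels)
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1

    return {
        # centroid of the object's pixels, as (row, col)
        "position": (sum(rows) / n, sum(cols) / n),
        # bounding box size, a crude stand-in for shape
        "bounding_box": (height, width),
        # fraction of the bounding box the object fills
        "fill_ratio": n / (height * width),
        # average color over the object's pixels
        "mean_color": tuple(sum(p[2][c] for p in pixels) / n for c in range(3)),
    }


# A 4x4 black image with a 2x2 red square in the middle.
BLACK, RED = (0, 0, 0), (255, 0, 0)
image = [
    [BLACK, BLACK, BLACK, BLACK],
    [BLACK, RED, RED, BLACK],
    [BLACK, RED, RED, BLACK],
    [BLACK, BLACK, BLACK, BLACK],
]
features = extract_object_features(image)
```

A real camera image would need proper segmentation before a step like
this; the sketch only shows what kind of features come out the other
end.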
Then there is the language system. From spontaneous speech input,
the system needs to extract features like the words in each utterance,
where they are relative to each other, and how those words relate to
other words it has heard before. Eventually, our hope is that the
system will be able to acquire grammar and language concepts over time
without supervision and without prior knowledge. This would enable it
to learn and understand multiple languages. Niloy and Peter have been
working on those parts.
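As a rough illustration of the kinds of features mentioned above,
here is a small Python sketch. It is not the group's language system.
It assumes the speech has already been turned into text, which is a
big assumption, and the class name UtteranceLog and its fields are
hypothetical.

```python
from collections import Counter


class UtteranceLog:
    """Toy running record of the words heard so far."""

    def __init__(self):
        self.word_counts = Counter()  # how often each word has been heard
        self.pair_counts = Counter()  # how often word b directly followed word a

    def observe(self, utterance):
        """Extract per-word features from one utterance, then update the log."""
        words = utterance.lower().split()

        features = [
            {
                "word": word,
                # where the word sits within this utterance
                "position": position,
                # how often it appeared in earlier utterances
                "times_heard_before": self.word_counts[word],
            }
            for position, word in enumerate(words)
        ]

        # Update the counts only after extracting features, so that
        # "times_heard_before" never includes the current utterance.
        self.word_counts.update(words)
        self.pair_counts.update(zip(words, words[1:]))
        return features


log = UtteranceLog()
first = log.observe("pick up the red ball")
second = log.observe("the red one")
```

After the second utterance, the log has seen "the" followed by "red"
twice, which is the sort of regularity an unsupervised learner could
build on.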
Finally, there is the top level, where the features extracted from
the visual input and the speech input are combined in order to find
patterns and learn the important associations. We've all started
thinking along those lines, and it's going to take a lot of thought.
I've been looking into human attention and goals, to see whether they
might fit into a robot system too.
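One very simple way to picture "learning the important associations"
is plain co-occurrence counting between words and visual features.
The Python sketch below is only that, a toy under stated assumptions,
and not a description of how Ripley works or will work. The function
name learn_associations, the feature strings like "color=red", and
the three example episodes are all invented for illustration.

```python
from collections import Counter, defaultdict


def learn_associations(episodes):
    """Score how often each visual feature was present when a word was heard.

    `episodes` is a list of (words, visual_features) pairs, one per scene.
    Returns scores[word][feature] = fraction of the word's episodes in
    which that feature was also present.
    """
    word_counts = Counter()
    joint_counts = defaultdict(Counter)

    for words, visual_features in episodes:
        for word in set(words):  # count each word once per episode
            word_counts[word] += 1
            for feature in set(visual_features):
                joint_counts[word][feature] += 1

    return {
        word: {
            feature: count / word_counts[word]
            for feature, count in joint_counts[word].items()
        }
        for word in word_counts
    }


# Made-up episodes: what was said, and what the vision system reported.
episodes = [
    (["red", "ball"], ["color=red", "shape=round"]),
    (["red", "block"], ["color=red", "shape=square"]),
    (["green", "ball"], ["color=green", "shape=round"]),
]
scores = learn_associations(episodes)
```

Here "red" always co-occurs with "color=red" but only half the time
with "shape=round", so the color feature is the better candidate for
what "red" means. Anything like attention or goals would have to sit
on top of something much richer than this.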
This all falls under the category of grounded language
acquisition. The term refers to the idea that human languages are not
just a bunch of words defined by other words; rather, our shared
experiences as a species have allowed us to form language
associations that relate to real-world concepts. That
way, when I say something, you understand me because what I say
relates in your mind to something that we've both experienced at some
point. This is called "grounding" because our language is "grounded"
in something in our real-world experience.
updated 2002/03/30 by eepness at media dot mit dot edu