Kai-yuh Hsiao

I'm a grad student at the Media Lab at the Massachusetts Institute of Technology, in the Cognitive Machines Group, under Professor Deb Roy.

Our group works on combining multimodal inputs in computer systems, especially as applied to language comprehension. Right now I'm mainly involved with our robot project. Here is our robot's head:

and here is the whole robot:

We call our robot "Ripley" because it looks like the alien in "Alien", and Ripley was the character who fought the aliens. Ripley has cameras, microphones, and other sensors so it can take input from the outside world, process it, and then learn to manipulate that world.

Eventually, Ripley will have four main parts. The first is motor control. The microcontroller in charge of the robot's motion is a PC104 running a Java virtual machine. It is linked via serial cable to an RT-Linux system that runs a GUI and a PVM interface, so the robot can be controlled smoothly and (hopefully) intelligently. I've been working on all of that. Here are some videos of what Ripley can do today: 1) Ripley copying some motions; and 2) Ripley deciding whether objects are heavy or light.

The next part is the vision system. The robot has to be able to process its visual input and extract features, such as the position, shape, and color of objects in its field of view. Niloy has been working on a vision system for picking up simple objects.
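To make "extracting features" a bit more concrete, here is a minimal sketch in Python. It is not Ripley's vision code. It assumes a toy image stored as a grid of RGB tuples on a plain black background, and the function name `extract_features` is made up for illustration. It reports the position (centroid), rough extent (bounding box), and average color of whatever is not background.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255

def extract_features(image: List[List[Pixel]],
                     background: Pixel = (0, 0, 0)) -> Dict[str, tuple]:
    """Hypothetical helper: summarize all non-background pixels as one object."""
    coords = [(r, c)
              for r, row in enumerate(image)
              for c, px in enumerate(row)
              if px != background]
    if not coords:
        return {}  # nothing but background in view
    n = len(coords)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    # Average each color channel over the object's pixels.
    mean_color = tuple(sum(image[r][c][ch] for r, c in coords) / n
                       for ch in range(3))
    return {
        "centroid": (sum(rows) / n, sum(cols) / n),                      # position
        "bounding_box": (min(rows), min(cols), max(rows), max(cols)),    # extent
        "mean_color": mean_color,                                        # color
    }

# Toy 4x4 image: a 2x2 red square in the middle of a black background.
RED, BLACK = (255, 0, 0), (0, 0, 0)
image = [
    [BLACK, BLACK, BLACK, BLACK],
    [BLACK, RED,   RED,   BLACK],
    [BLACK, RED,   RED,   BLACK],
    [BLACK, BLACK, BLACK, BLACK],
]
features = extract_features(image)
```

A real camera image would need segmentation into separate objects first; this sketch treats everything that is not background as a single object.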

Then there is the language system. From spontaneous speech input, the system needs to extract features like the words in each utterance, where they are relative to each other, and how those words relate to other words it has heard before. Eventually, our hope is that the system will be able to acquire grammar and language concepts over time without supervision and without prior knowledge. This would enable it to learn and understand multiple languages. Niloy and Peter have been working on those parts.
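As a rough illustration of the kind of bookkeeping this involves, here is a small Python sketch. It assumes the speech has already been turned into text transcripts (a big simplification), and the function name `observe` is hypothetical. It records which words occur in each utterance, which word directly follows which, and how often each has been heard so far.

```python
from collections import Counter
from typing import List

word_counts: Counter = Counter()    # how often each word has been heard
bigram_counts: Counter = Counter()  # how often word B directly followed word A

def observe(utterance: str) -> List[str]:
    """Hypothetical helper: update running counts from one transcribed utterance."""
    words = utterance.lower().split()
    word_counts.update(words)
    # Adjacent pairs capture where words sit relative to each other.
    bigram_counts.update(zip(words, words[1:]))
    return words

for line in ["the red ball", "the blue ball", "pick up the ball"]:
    observe(line)
```

Counts like these say nothing about meaning on their own; they are only the raw material that a learning system could look for patterns in.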

Finally, there is the top level, where the features extracted from the visual input and the speech input are combined in order to find patterns and learn the important associations. We've all started thinking along those lines, and it's going to take a lot of thought. I've been looking into human attention and goals, to see whether those ideas might fit into a robot system too.
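One simple way to picture "learning associations" is co-occurrence counting: if a word keeps showing up when a particular visual feature is present, and rarely otherwise, the two probably go together. The Python sketch below is only an illustration under that assumption, not the method our group uses. The function name `learn_associations` and the feature labels such as `color:red` are invented for the example.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Set, Tuple

def learn_associations(episodes: List[Tuple[str, Set[str]]]) -> Dict[str, str]:
    """Hypothetical helper: pair each word with its most strongly co-occurring feature."""
    word_counts: Counter = Counter()
    feature_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for utterance, features in episodes:
        words = set(utterance.lower().split())
        word_counts.update(words)
        feature_counts.update(features)
        pair_counts.update(product(words, features))
    n = len(episodes)
    best: Dict[str, str] = {}
    for word in word_counts:
        # Score = how much more often the pair occurs than chance would predict.
        scores = {
            f: pair_counts[(word, f)] * n / (word_counts[word] * feature_counts[f])
            for f in feature_counts
            if pair_counts[(word, f)]
        }
        if scores:
            best[word] = max(scores, key=scores.get)
    return best

# Each episode: what was said, and which visual features were seen at the time.
episodes = [
    ("red ball",  {"color:red",  "shape:round"}),
    ("red cube",  {"color:red",  "shape:square"}),
    ("blue ball", {"color:blue", "shape:round"}),
    ("blue cube", {"color:blue", "shape:square"}),
]
associations = learn_associations(episodes)
```

With these four toy episodes, each word ends up paired with the feature it consistently appears alongside. Real input is far noisier, which is part of why this level needs so much thought.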

This all falls under the category of grounded language acquisition. The term refers to the idea that human languages are not just a bunch of words defined by other words; rather, our shared experiences as a species have allowed us to form language associations that relate to real-world concepts. That way, when I say something, you understand me because what I say relates in your mind to something that we've both experienced at some point. This is called "grounding" because our language is "grounded" in our real-world experience.



updated 2002/03/30 by eepness at media dot mit dot edu