This page describes a project by Andrew Wilson and Stephen Intille for the MIT Media Lab Fall 1995 class Learning for Interactive Agents.
The goal of this project is to implement an interactive agent in the multi-player tank game Bolo that assists the user playing the game by recognizing the game player's actions and responding appropriately. This is achieved using a robot tank that follows the user's tank as the game is played. The robot watches what the user is doing and tries to recognize actions like "attacking nearest enemy tank" using features computed from the game world. Once an action is recognized, the robot tank changes its behavior to help the user. For instance, if the user is judged to be attacking a tank, the robot will join in the attack. We call our robot tank Inducting Indy.
The inspiration for teaching a the robot collaborative behavior by demonstration comes from how people teach collaborative agents in the real world: first you tell your new assistant exactly what you are doing, and then you demonstrate on a real example. Later, after the assistant has seen enough examples he can judge for himself when the time is appropriate for the same course of action, without you having to tell it so.
This document describes our current system and the issues that it raises.
Bolo was created by Stuart Cheshire (firstname.lastname@example.org) of Stanford University. It has evolved from a little-known tank game on the BBC Micro to an explosively successful Macintosh game that is played constantly on the Internet, as well as many smaller networks. The author summarizes his work:
"Bolo is a 16 player graphical networked real-time multi-player tank battle game. It has elements of arcade-style shoot-em-up action, but for the serious players who play 12 hour games with 16 players working in teams in different networked computer clusters around an office or university campus, it becomes more of a strategy game. You have to play it to understand."
(This description of Bolo is from The Bolo Home Page.)
The graphics and controls are pretty simple, as is the objective... conquer all the bases and pills.
Bolo brains are plug in pieces of code for Bolo that are passed the same (well, very similar) information as the human players get, and return what actions they wish the tank to perform. Brains are Bolo autopilots.
There are currently two main types of brains. The standard brain, known as a robot or 'bot, uses some form of AUI (Artificially Un-Intelligent) technique to play Bolo. The second type of brain, known as a cyborg or 'borg reads from the mac keyboard and essentially just helps a human play the game. These two types are merging.
In this work we modify an autonomous 'bot to create Inducting Indy.
This description is from Bolo Brains.
We chose Bolo as an environment in which to study learning algorithms for several reasons:
The only limitation we have discovered with using Bolo as a platform is that is is impossible to stop the game and allow a learning algorithm to "think." However, this could also be thought of as a "feature" that grounds the learning algorithm to a realistic domain and forces us to consider the issue of the user interface more carefully. Even simple learning algorithms running on fast machines require far more processing time than a tank receives each clocktick. We will describe how we have worked around this issue. We have found that source is available on request for three 'bot Brains:
We therefore started by making modifications to Indy 2.02 code. (See the Brain list for more information on the available Bolo Brains.) Indy 2.02 was written by John Lawrie (email@example.com) and Gersham Meharg (firstname.lastname@example.org).
A Bolo brain tank has access to the following types of features, any of which might be needed by a learning algorithm to recognize certain types of actions:
See Bolo Tidbits for a list of other information available to the robot that can potentially be used for feature computation.
Given the vast number of potential features, pruning the set to a subset that the learning algorithm can work with requires using some domain knowledge about the task and making guesses about which features truly provide the most information for the task at hand. For instance, we know that since the actions we want the robot to recognize are very local, we do not provide the learning algorithm with information about tanks out of the immediate area.
There is no direct way to access features from the game with temporal exent (e.g. derivatives of properties), although they could be computed with some effort. We have not used temporal features here.
The goal of the learning algorithm is to take as input a set of features and corresponding decisions and to produce a compact set of decision rules. The rules are used to evaluate incoming features and make decisions about what action is taking place. PRISM is an algorithm based on ID3 for computing disjunctive rules of features from a given input set of features and decisions. Features are supplied in a form such as ((shooting tank-near base-far ATTACK) (not-shooting tank-near base-near NOT-ATTACK)) and the resulting rules are of the form ((shooting & tank-near -> ATTACK) (base-far -> NOT-ATTACK)). For details on the algorithm, see the original article, [J. Cendrowska. "PRISM: An algorithm for Inducing Modular Rules." J. Man-Machine Studies (1987) 27, 349-370.]
PRISM will produce a correct rule to fit every possible combination of features. Consequently, it tends to over-fit data, producing a specialized rule for too many situations. INDUCT, described in Gaines89, is an extension to PRISM which tries to eliminate overly specialized rules at the expense of possibly generating some incorrect rules. Our data violates two assumptions made by PRISM: it is not complete and our examples are not always correct. Consequently, PRISM discovers some inconsistencies and also generates too many specialized rules. Therefore, we have applied INDUCT to the input. We are not guaranteed correctness and we don't avoid inconsistencies, but some over-fit rules are avoided.
A typical training session with our learning 'bot, Inducting Indy, begins by making sure that both the 'bot and the human player are allied to each other. In the course of the game, the human player shows Inducting Indy examples of an activity (e.g. "attack the pillbox") that the players wants the 'bot to perform. Just before attacking the pillbox himself, the player sends a message indicating what action he is performing. The inducting 'bot recognizes this message and accordingly tags the features of the world that it stashes to disk.
In Bolo, sending messages to other tanks in a timely fashion can be cumbersome when the player is trying to control his own tank. In our system the player can send one of a set of preconfigured messages with a single keystroke; we've also experimented successfully with Apple's PlainTalk speech recognition software and AppleScript to link spoken messages to the keystrokes that actually send the messages.
When the player stops the action that he is demonstrating, he then sends Inducting Indy a message indicating that the action is over for now.
After a fixed amount of time in the game (a few minutes) the action rules are recomputed by the INDUCT algorithm described above. The inducting 'bot then uses these new rules to classify the actions of the player. Later, if the player happens not to tell the 'bot what he's doing, and if a particular rule fits the set of features, our 'bot concludes that the player must be engaged in the action corresponding to the rule. He then adopts this action as his own.
During the course of the game, when the 'bot is comparing the state of the world from the player's point of view against its set of rules, it's also free to pursue its own goals (the ones built in by the creators of Indy) so long as it doesn't wander too far from the player that it can't see what he's doing, and none of the rules is firing.
The training doesn't stop when a single example has been shown to the 'bot, however. The player is free to show the 'bot more examples of the action. In fact, this must happen for INDUCT to derive rules that are sufficiently general to be useful. For example, the rules derived from a single example of "attack the pillbox" will probably count on the particular pillbox being attacked, the direction the tank is pointed, and a host of other features which in this case happen to be irrelevant. Since each time the rules are recomputed from the entire set of examples from the whole session, the rules continually evolve, becoming more general and therefore more useful.
This periodic computing of the rules permits a style of interactive "teaching" that should be useful in a game situation. After showing the 'bot a few examples of an action, the player can then see if the it has learned the appropriate concepts by seeing if it does the right thing in a new situation. If it does -- great. If it doesn't, the player can add a training example on the spot rather than declare a total failure of the 'bot's configuration.
The learning algorithm will sometimes receive one set of features with two different decisions. For example, temporal delays by the user in labeling may lead to a (shot-in-air ATTACK) input and a (shot-in-air EXPLORE) input. A shot may still be in flight when the user switches from attacking to exploring. This type of contradictory input generates unresolvable conflicts in the PRISM and INDUCT learning algorithms. Therefore, we have added a modification to the learning code that keeps track of the number of each type of example seen over the entire game. If 10 examples of (shot-in-air ATTACK) are received and only 2 examples of (shot-in-air EXPLORE) are received, then only the (shot-in-air EXPLORE) rule will not be used in training. However, if at a later time 9 more examples of (shot-in-air EXPLORE) are received, then there will be 1 more example of that case than (shot-in-air ATTACK), and (shot-in-air EXPLORE) will be the rule that is used. This method of keeping track of multiple examples and eliminating conflicts based on the number of examples received allows an action to be trained and then retained even if a few spurious inputs are received that directly contradict the data used to learn the rules.
Once the Inducting Indy 'bot has recognized some behavior, it tries to respond. Currently it can recognize attack-tank, attack-base, attack-pill, fortify-base, hiding, and exploring. It can respond by attacking the appropriate object (using different strategies depending upon the object), fortifying a base with mines, or hiding in the same region of forest as the player. Temporal smoothing of the 'bot's actions is achieved by extending each action for a few seconds in time. For example, if a HIDING decision is made, the tank will try to hide for the next few seconds unless it receives another action, like ATTACK-TANK, in which case it will stop hiding immediately and respond to the new behavior. When no actions are recognized, Inducting Indy goes about its normal business as it follows the player around.
The kind of interactive, incremental, learn-as-you-go method of training described in the previous section seems particularly suited to teach collaborative agents.
But it's also useful in teaching the teacher. Consider what Inducting Indy is actually doing: it's recognizing how you go about attacking a pillbox or whatever. Later, when it recognizes you attacking the pillbox, it proceeds to attack the pillbox in it's own way. The teacher stands to learn how in this case the original Indy code goes about attacking a pill, which in many cases (certainly in ours) may be more effective than our training examples.
When training all of the recognized actions simultaneously, rule sets of 100 to two hundred rules are created. Here's an example set of rules for a small test case where HIDING, FORTIFY-BASE, and ATTACK-PILL were quickly trained by playing the game for about 10 minutes:
dec-hiding <= moving-slowly terrain-forest nearest-obj-at-29-31
dec-hiding <= stationary terrain-forest
dec-hiding <= terrain-forest nearest-obj-at-30-31
dec-fort-base <= lgm-far nearest-obj-at-73-46 not-facing-nearest-obj terrain-refbase
dec-fort-base <= nearest-obj-at-63-43 stationary terrain-refbase
dec-fort-base <= is-shooting terrain-refbase
dec-fort-base <= lgm-close terrain-refbase
dec-att-pill <= stationary nearest-object-info-3 target-at-70-71
dec-att-pill <= refbase-close lgm-near nearest-object-info-3
dec-att-pill <= refbase-close moving-fast pillbox-close bman-nearest-obj
dec-att-pill <= terrain-crater
dec-att-pill <= target-at-68-57 nearest-object-info-3
dec-att-pill <= nearest-pillbox-info-3 moving-slowly bman-nearest-obj
dec-att-pill <= nearest-obj-at-68-57 stationary lgm-near target-at-70-71
dec-att-pill <= is-shooting target-at-70-71
dec-att-pill <= moving-slowly is-shooting
dec-att-pill <= shot-close
These rules were sufficient so that the player could attack an enemy pill and Inducting Indy would then start attacking the same pill. In addition, the player could hide in the forest and Inducting Indy would come hide nearby in the same forest. Finally, if the player started to fortify a base, Inducting Indy would come help out.
The rules are not perfect, but given more training they would cover more cases. In general, even this small set of rules led to interesting (and useful) performance by Inducting Indy.
One 'bot called Speilborg generates a quicktime segment of a bolo game (see Bolo Brains for more info). We have used Speilborg to generate a QuickTime movie of Inducting Indy and overlayed sound, so that we can describe the training as the game progresses.
This example segment shows a short sequence of training the Inducting Indy 'bot to attack a pill. Inducting Indy can then recognize an ATTACK-PILL action and respond by also attacking the same pill.
The system demonstrates that it is possible to adapt a machine learning technique like PRISM to an interactive, real-time situation for the purposes of teaching collaborative behavior. It works well ... better than we had expected.
Some possible extensions are:
On the Mac Indy 2.02 C code was ported to compile in MetroWerks CodeWarrior on a PowerPC. The Think C version had a code segment size limitation. Indy was modified in several ways:
The PRISM and INDUCT learning algorithms were implemented in Common LISP for flexibility and ease of programming and run under LispWorks on a fast HP735 workstation. They read rules from a file output by the mac Bolo robot, compute rules, and output the rules to a file read by the Bolo robot. Some effort was required to get the learning algorithm and the Bolo Brain communicating via files. Presently, the algorithm runs fast enough that it could probably be ported to MacLISP.
A significant amount of time was required to write feature computation routines and to tweak which features should be provided to the learner. Testing the algorithm required running two or three tanks simultaneously. Testing was especially difficult, given our ineptitude at actually playing the game! A new game map was created to make the training a bit easier.
The C and Lisp code for Inducting Indy is available to anyone who wants it. Send a request via email. The version here ran on a Unix workstation running Lisp and communicated with the Bolo C process via files. This was a kludge that requires fixing. If you want to use Inducting Indy you have two options: either convert the Lisp code to MacLisp and get the C and Lisp processes communicating properly, or rewrite the PRISM algorithm in C code.
Andrew Wilson and Stephen Intille were graduate students working on action recognition and computer vision in the MIT Media Lab Vision and Modeling Group at the time of this project. We'd appreciate any comments that you may have on this work or on possible extensions.
Last modified: 4/97