Programming a Bolo Robot:

Recognizing Actions By Example

This page describes a project by Andrew Wilson and Stephen Intille for the MIT Media Lab Fall 1995 class Learning for Interactive Agents.

The Goal

The goal of this project is to implement an interactive agent in the multi-player tank game Bolo that assists the user playing the game by recognizing the game player's actions and responding appropriately. This is achieved using a robot tank that follows the user's tank as the game is played. The robot watches what the user is doing and tries to recognize actions like "attacking nearest enemy tank" using features computed from the game world. Once an action is recognized, the robot tank changes its behavior to help the user. For instance, if the user is judged to be attacking a tank, the robot will join in the attack. We call our robot tank Inducting Indy.

The inspiration for teaching a the robot collaborative behavior by demonstration comes from how people teach collaborative agents in the real world: first you tell your new assistant exactly what you are doing, and then you demonstrate on a real example. Later, after the assistant has seen enough examples he can judge for himself when the time is appropriate for the same course of action, without you having to tell it so.

This document describes our current system and the issues that it raises.

What is Bolo?

Bolo was created by Stuart Cheshire (cheshire@cs.stanford.edu) of Stanford University. It has evolved from a little-known tank game on the BBC Micro to an explosively successful Macintosh game that is played constantly on the Internet, as well as many smaller networks. The author summarizes his work:

"Bolo is a 16 player graphical networked real-time multi-player tank battle game. It has elements of arcade-style shoot-em-up action, but for the serious players who play 12 hour games with 16 players working in teams in different networked computer clusters around an office or university campus, it becomes more of a strategy game. You have to play it to understand."

(This description of Bolo is from The Bolo Home Page.)

The graphics and controls are pretty simple, as is the objective... conquer all the bases and pills.

Bolo Brains

Bolo brains are plug in pieces of code for Bolo that are passed the same (well, very similar) information as the human players get, and return what actions they wish the tank to perform. Brains are Bolo autopilots.

There are currently two main types of brains. The standard brain, known as a robot or 'bot, uses some form of AUI (Artificially Un-Intelligent) technique to play Bolo. The second type of brain, known as a cyborg or 'borg reads from the mac keyboard and essentially just helps a human play the game. These two types are merging.

In this work we modify an autonomous 'bot to create Inducting Indy.

This description is from Bolo Brains.

Why Bolo?

We chose Bolo as an environment in which to study learning algorithms for several reasons:

Interactivity: The game makes it possible to test learning algorithms in a real-time, interactive environment where feedback to the user must be natural and fast.
Feature complexity: The key question for an interface is "which features matter?" Features in Bolo's world can be easily extracted from the game and studied by a learning algorithm to try and decide which ones are critical for recognizing various actions. However, the number of features can also be restricted to a smaller subset to make the problem easier to study.
Action complexity: Bolo is complex enough to contain many different types of actions that an agent might recognize and respond to, but the game is structurally simple enough to make using it as a research platform possible.
Game code: Source code to the Bolo game has not been released. However, the game is widely used and very stable. It is one of the few real-time, interactive games that takes place in a complex, simulated world with several different types of objects. The game is popular because it permits many different types of actions, but it is possible to modify (and simplify) the map of the game world when necessary to make experimentation easier.
Brain modules: Source code for plug-in brains is available. These are sophisticated robot tank algorithms that can be modified without too much trouble to compute and output game features to a learning algorithm and to react to another tank's actions.
Support: There is a large group of Bolo fans on the network who are willing to answer questions about Bolo and Bolo brains.
Research: There are a few other people using Bolo for series AI and psychological experimentation.

The only limitation we have discovered with using Bolo as a platform is that is is impossible to stop the game and allow a learning algorithm to "think." However, this could also be thought of as a "feature" that grounds the learning algorithm to a realistic domain and forces us to consider the issue of the user interface more carefully. Even simple learning algorithms running on fast machines require far more processing time than a tank receives each clocktick. We will describe how we have worked around this issue. We have found that source is available on request for three 'bot Brains:

Indy 2.02: One of the best robots with wonderfully clean C code source.
aIndy: An advanced Indy with better play but messier code.
Standard Autopilot: The Bolo author's original sample Brain. Not very intelligent and messy C source.

We therefore started by making modifications to Indy 2.02 code. (See the Brain list for more information on the available Bolo Brains.) Indy 2.02 was written by John Lawrie (jlawrie@sfu.ca) and Gersham Meharg (meharg@sfu.ca).

Computing Features

A Bolo brain tank has access to the following types of features, any of which might be needed by a learning algorithm to recognize certain types of actions:

The large terrain map, containing terrain of types building, river, swamp, crater, road, forest, rubble, grass, halfbuilding.
Positions and relative positions of objects such as buildings, parachutes, mines, boats, shots, tanks, refbases, pillboxes, and men.
How objects like tanks and shots move, and where they are positioned.
Orientation of objects like tanks, shots, and boats.
Condition of other objects like damage to pills and buildings.
Classifications of objects as hostile or neutral.
Messages passed from other players or tanks.
When events happen.
Combinations of the above.

See Bolo Tidbits for a list of other information available to the robot that can potentially be used for feature computation.

Given the vast number of potential features, pruning the set to a subset that the learning algorithm can work with requires using some domain knowledge about the task and making guesses about which features truly provide the most information for the task at hand. For instance, we know that since the actions we want the robot to recognize are very local, we do not provide the learning algorithm with information about tanks out of the immediate area.

There is no direct way to access features from the game with temporal exent (e.g. derivatives of properties), although they could be computed with some effort. We have not used temporal features here.

The Learning Algorithms

The goal of the learning algorithm is to take as input a set of features and corresponding decisions and to produce a compact set of decision rules. The rules are used to evaluate incoming features and make decisions about what action is taking place. PRISM is an algorithm based on ID3 for computing disjunctive rules of features from a given input set of features and decisions. Features are supplied in a form such as ((shooting tank-near base-far ATTACK) (not-shooting tank-near base-near NOT-ATTACK)) and the resulting rules are of the form ((shooting & tank-near -> ATTACK) (base-far -> NOT-ATTACK)). For details on the algorithm, see the original article, [J. Cendrowska. "PRISM: An algorithm for Inducing Modular Rules." J. Man-Machine Studies (1987) 27, 349-370.]

PRISM will produce a correct rule to fit every possible combination of features. Consequently, it tends to over-fit data, producing a specialized rule for too many situations. INDUCT, described in Gaines89, is an extension to PRISM which tries to eliminate overly specialized rules at the expense of possibly generating some incorrect rules. Our data violates two assumptions made by PRISM: it is not complete and our examples are not always correct. Consequently, PRISM discovers some inconsistencies and also generates too many specialized rules. Therefore, we have applied INDUCT to the input. We are not guaranteed correctness and we don't avoid inconsistencies, but some over-fit rules are avoided.

Training the 'Bot

A typical training session with our learning 'bot, Inducting Indy, begins by making sure that both the 'bot and the human player are allied to each other. In the course of the game, the human player shows Inducting Indy examples of an activity (e.g. "attack the pillbox") that the players wants the 'bot to perform. Just before attacking the pillbox himself, the player sends a message indicating what action he is performing. The inducting 'bot recognizes this message and accordingly tags the features of the world that it stashes to disk.

In Bolo, sending messages to other tanks in a timely fashion can be cumbersome when the player is trying to control his own tank. In our system the player can send one of a set of preconfigured messages with a single keystroke; we've also experimented successfully with Apple's PlainTalk speech recognition software and AppleScript to link spoken messages to the keystrokes that actually send the messages.

When the player stops the action that he is demonstrating, he then sends Inducting Indy a message indicating that the action is over for now.

After a fixed amount of time in the game (a few minutes) the action rules are recomputed by the INDUCT algorithm described above. The inducting 'bot then uses these new rules to classify the actions of the player. Later, if the player happens not to tell the 'bot what he's doing, and if a particular rule fits the set of features, our 'bot concludes that the player must be engaged in the action corresponding to the rule. He then adopts this action as his own.

During the course of the game, when the 'bot is comparing the state of the world from the player's point of view against its set of rules, it's also free to pursue its own goals (the ones built in by the creators of Indy) so long as it doesn't wander too far from the player that it can't see what he's doing, and none of the rules is firing.

The training doesn't stop when a single example has been shown to the 'bot, however. The player is free to show the 'bot more examples of the action. In fact, this must happen for INDUCT to derive rules that are sufficiently general to be useful. For example, the rules derived from a single example of "attack the pillbox" will probably count on the particular pillbox being attacked, the direction the tank is pointed, and a host of other features which in this case happen to be irrelevant. Since each time the rules are recomputed from the entire set of examples from the whole session, the rules continually evolve, becoming more general and therefore more useful.

This periodic computing of the rules permits a style of interactive "teaching" that should be useful in a game situation. After showing the 'bot a few examples of an action, the player can then see if the it has learned the appropriate concepts by seeing if it does the right thing in a new situation. If it does -- great. If it doesn't, the player can add a training example on the spot rather than declare a total failure of the 'bot's configuration.

The learning algorithm will sometimes receive one set of features with two different decisions. For example, temporal delays by the user in labeling may lead to a (shot-in-air ATTACK) input and a (shot-in-air EXPLORE) input. A shot may still be in flight when the user switches from attacking to exploring. This type of contradictory input generates unresolvable conflicts in the PRISM and INDUCT learning algorithms. Therefore, we have added a modification to the learning code that keeps track of the number of each type of example seen over the entire game. If 10 examples of (shot-in-air ATTACK) are received and only 2 examples of (shot-in-air EXPLORE) are received, then only the (shot-in-air EXPLORE) rule will not be used in training. However, if at a later time 9 more examples of (shot-in-air EXPLORE) are received, then there will be 1 more example of that case than (shot-in-air ATTACK), and (shot-in-air EXPLORE) will be the rule that is used. This method of keeping track of multiple examples and eliminating conflicts based on the number of examples received allows an action to be trained and then retained even if a few spurious inputs are received that directly contradict the data used to learn the rules.

Action

Once the Inducting Indy 'bot has recognized some behavior, it tries to respond. Currently it can recognize attack-tank, attack-base, attack-pill, fortify-base, hiding, and exploring. It can respond by attacking the appropriate object (using different strategies depending upon the object), fortifying a base with mines, or hiding in the same region of forest as the player. Temporal smoothing of the 'bot's actions is achieved by extending each action for a few seconds in time. For example, if a HIDING decision is made, the tank will try to hide for the next few seconds unless it receives another action, like ATTACK-TANK, in which case it will stop hiding immediately and respond to the new behavior. When no actions are recognized, Inducting Indy goes about its normal business as it follows the player around.

The kind of interactive, incremental, learn-as-you-go method of training described in the previous section seems particularly suited to teach collaborative agents.

But it's also useful in teaching the teacher. Consider what Inducting Indy is actually doing: it's recognizing how you go about attacking a pillbox or whatever. Later, when it recognizes you attacking the pillbox, it proceeds to attack the pillbox in it's own way. The teacher stands to learn how in this case the original Indy code goes about attacking a pill, which in many cases (certainly in ours) may be more effective than our training examples.

Examples

When training all of the recognized actions simultaneously, rule sets of 100 to two hundred rules are created. Here's an example set of rules for a small test case where HIDING, FORTIFY-BASE, and ATTACK-PILL were quickly trained by playing the game for about 10 minutes:

dec-hiding <= moving-slowly terrain-forest nearest-obj-at-29-31 dec-hiding <= stationary terrain-forest dec-hiding <= terrain-forest nearest-obj-at-30-31 dec-fort-base <= lgm-far nearest-obj-at-73-46 not-facing-nearest-obj terrain-refbase dec-fort-base <= nearest-obj-at-63-43 stationary terrain-refbase dec-fort-base <= is-shooting terrain-refbase dec-fort-base <= lgm-close terrain-refbase dec-att-pill <= stationary nearest-object-info-3 target-at-70-71 dec-att-pill <= refbase-close lgm-near nearest-object-info-3 dec-att-pill <= refbase-close moving-fast pillbox-close bman-nearest-obj dec-att-pill <= terrain-crater dec-att-pill <= target-at-68-57 nearest-object-info-3 dec-att-pill <= nearest-pillbox-info-3 moving-slowly bman-nearest-obj dec-att-pill <= nearest-obj-at-68-57 stationary lgm-near target-at-70-71 dec-att-pill <= is-shooting target-at-70-71 dec-att-pill <= moving-slowly is-shooting dec-att-pill <= shot-close

These rules were sufficient so that the player could attack an enemy pill and Inducting Indy would then start attacking the same pill. In addition, the player could hide in the forest and Inducting Indy would come hide nearby in the same forest. Finally, if the player started to fortify a base, Inducting Indy would come help out. The rules are not perfect, but given more training they would cover more cases. In general, even this small set of rules led to interesting (and useful) performance by Inducting Indy. One 'bot called Speilborg generates a quicktime segment of a bolo game (see Bolo Brains for more info). We have used Speilborg to generate a QuickTime movie of Inducting Indy and overlayed sound, so that we can describe the training as the game progresses. This example segment shows a short sequence of training the Inducting Indy 'bot to attack a pill. Inducting Indy can then recognize an ATTACK-PILL action and respond by also attacking the same pill. Conclusions and Possible Extensions The system demonstrates that it is possible to adapt a machine learning technique like PRISM to an interactive, real-time situation for the purposes of teaching collaborative behavior. It works well ... better than we had expected. Some possible extensions are: When the player gives the 'bot an example of an action, the player sometimes forgets to tell the 'bot that he has stopped performing the action (the training example has ended). This is not surprising, since when we demonstrate an action in real life, we rarely have to worry that the student might not realize that the demonstration is over when we leave the room to get the phone (for example). Our 'bot might replicate this behavior by watching how the world features change in the course of a training example. For example, if after a bit a number of the features change, it might conclude that the demonstration is over. Currently there is no way to incorporate the player's negative feedback into the 'bot's learning. The idea is to stop the 'bot when it wrongly decides to adopt some behavior. Partly this is a user-interface issue (should the player just send a message to it or should it ask for confirmation when it's unsure?) and a learning algorithm issue (should the offending rule and the examples that made it happen be thrown out?). The are other interesting PRISM-like learning algorithms that might be more suited to our task. For example, David Maulsby's Cima system, derived from PRISM, includes ways to assign saliency to features, factor in user's hints and all sorts of ways to compute rules and take advantage of the interactive nature of the learning at the same time. While we knew about Cima when we started the project, our view was that it was probably smart to start with a simple algorithm first. During training the player may be in a position to tell the 'bot when a particular feature is irrelevant (or particularly relevant). For example, while demonstrating a particular behavior like exploring, for example, the player may indicate that the direction the tank is facing is irrelevant, or that the fact that the tank is on the move is relevant. The problem here is that such a facility assumes that the player knows the set of features that the 'bot is calculating. One answer is to simply go from a spoken keyword to a particular feature or set of features. Right now, the 'bot recognizes only features collected from a single frame of the game; it does not recognize sequences of actions. One interesting extension involves permitting it to recognize (and execute) sequences of actions from features collected over multiple frames. Discrete Hidden Markov Modeling or Q-learning may be useful for recognizing sequences. Currently, once the 'bot is trained, it responds to what it thinks the player is doing. However, if the player is still training, sometimes the player will indicate one action and the 'bot will predict another. While the PRISM algorithm will eventually resolve these inconsistencies, the fact that a direct contradiction has been confirmed is not used in any special way, say to provide weights during rule construction. One of the last phases of teaching collaborative agents might be called "delegation". That is, eventually the teacher is so confident in his student that he no longer has to be around to watch him. Also, the only kind of collaborative behavior shown by our system is that of adopting the same behavior and the same object of that behavior. Another kind of collaborative behavior (sometimes called "divide and conquer") involves distributing tasks. In advanced phases of learning collaborative behavior, both these styles of behavior, delegation and divide-and-conquer, need to be developed. Anyone looking for a Ph.D. topic? Technical Details On the Mac Indy 2.02 C code was ported to compile in MetroWerks CodeWarrior on a PowerPC. The Think C version had a code segment size limitation. Indy was modified in several ways: A "follow closest ally tank" mode was added so the robot will tag along behind the player's tank. The code was modified so that the 'bot can be told to perform specific behaviors at specific times, such as attack base, attack pill, attack tank, hide, and fortify base. Code to compute and discretize features with respect to the tank being followed was added to Indy. As it follows, it writes the followed tank's features to a file a few times a second. The Indy code was modified so that it reads rules computed by the learning algorithm in LISP and compares to rules to incoming features and then reacts accordingly. For example, when the ally tank being followed starts attacking an enemy tank, the follow tank will attack until the followed tank stops. The PRISM and INDUCT learning algorithms were implemented in Common LISP for flexibility and ease of programming and run under LispWorks on a fast HP735 workstation. They read rules from a file output by the mac Bolo robot, compute rules, and output the rules to a file read by the Bolo robot. Some effort was required to get the learning algorithm and the Bolo Brain communicating via files. Presently, the algorithm runs fast enough that it could probably be ported to MacLISP. A significant amount of time was required to write feature computation routines and to tweak which features should be provided to the learner. Testing the algorithm required running two or three tanks simultaneously. Testing was especially difficult, given our ineptitude at actually playing the game! A new game map was created to make the training a bit easier. Using/Modifying/Fixing the Code The C and Lisp code for Inducting Indy is available to anyone who wants it. Send a request via email. The version here ran on a Unix workstation running Lisp and communicated with the Bolo C process via files. This was a kludge that requires fixing. If you want to use Inducting Indy you have two options: either convert the Lisp code to MacLisp and get the C and Lisp processes communicating properly, or rewrite the PRISM algorithm in C code. Andrew Wilson and Stephen Intille were graduate students working on action recognition and computer vision in the MIT Media Lab Vision and Modeling Group at the time of this project. We'd appreciate any comments that you may have on this work or on possible extensions. Last modified: 4/97