Reinforcement Learning applied to a Radar Tracking Task

Nitin Sawhney


May 14, 1999

MAS 963: Learning and Training Techniques for Dogs and Other Interactive Characters


Goal

To incorporate a form of delayed reward-based learning for a novel, circular target search task.

Scenario

Design a radar that quickly learns to track the position of moving "targets", while remaining agnostic to non-target objects within its range. In this scenario, a radar resolution is defined (i.e. size and number of cells in the circular grid) and a varying number of targets and non-targets are randomly placed on the grid. All objects move towards the center at a constant speed. Targets that are closer to the center provide proportionally greater reward, if found. However, non-targets trigger negative reward when located. Hence the system must try to maximize its total reward by locating optimal paths to the closest moving targets while reducing interaction with non-targets.

Learning Technique

The goal is for the system to explicitly explore its environment before converging to optimal paths towards maximal reward. Delayed reward suggests that the agent take into account the forthcoming state and immediate reward as the basis for selecting its current action. This set of techniques is known as temporal difference methods. Here the agent learns from delayed reinforcement, after a long sequence of actions gradually arriving at states that provide high reinforcement. The agent must be able to learn which of its actions in particular states were most desirable, based on rewards that can occur arbitrarily in the future. [Kaebling 96]

An action ’a’ is taken by selecting the maximum Q-value Q(s,a) for the current state ‘s’, based on the following Q-learning rule:

Q(s,a) := Q(s,a) + a ( r + g maxa` Q(s`, a`) - Q(s,a) )

Here Q(s`, a`) represents the value for the next state-action pair and a reward ‘r’ is received for that transition. The discount factor g , (a value between 0 and 1) controls the extent to which learning is concerned with long-term vs. short-term consequences of its actions (serves as an interest rate on learning). The a is the learning rate, which sets a tradeoff between exploration vs. exploitation. In the simulation, a is decreased slowly to ensure convergence.

The optimal policy p is a mapping from states to actions, which maximizes a long run measure of reinforcement. In this case it is based on maximizing the value of states assuming the best actions are taken.

p * (s) = argmaxa Q*(s, a)

Here Q*(s, a) is the expected discounted reinforcement of taking action a in state s and choosing future actions optimally.

In Q-learning, the system will eventually converge to optimal values regardless of how the system explores the state space, as long as all state-action pairs are tried often.

Simulation Design

The radar display is represented by a set of concentric rings and to create a grid of target location cells axis (in this example a 10x30 grid is created). The grid is scaleable, hence the radar’s resolution can be changed easily to simulate larger state spaces. The current state space consists of 300 cells, each storing its currently assigned value (all states initialized to zero). A separate vector (unknown to the learning system) maintains the actual dynamic locations of all targets shown on the radar display. The system is allowed only four actions to search its space: move inward/outward on any axis or move clockwise/anti-clockwise on the concentric rings. A search "probe" wanders through the grid, initially along each axis as it sweeps the grid (clockwise). When the probe encounters a potential target, the value of that location in the radar is updated with a positive reward (proportional to the closeness of the target to the center of the radar). Non-targets activate proportionally negative rewards in a similar manner, while all other locations simply provide a unit negative value.

For the simulation, a discount factor (g ) of 0.9 was used along with an initial learning rate (a ) of 0.5 which was slowly decreased by 0.99 of its value at each iteration. Over time, the values in each cell of the grid get continually updated (turning form light to dark blue) to reflect the maximal chance for reward in those locations. If the Q-value for a particular state remains above the maximum current Q value (for all states) for a specified number of iterations, it is detected as a likely target and eliminated from the display. Hence gradually the system explores the circular grid, recovers likely locations of maximal reward, creates optimal paths to them and eliminates the closest ones first. Eventually all targets are (usually) eliminated unless targets are moving far too rapidly (i.e. moving in within less than 400 iterations), which does not give the system sufficient time to reinforce their locations for maximal reward.

Try the Java Simulation of the Reinforcement Learning Radar. Vary some of the parameters and see for yourself how it affects it performance.

Note: Simulation works best on Netscape Java VMs rather than Internet Explorer

Future Work

To provide faster converge in fewer learning iterations, one should consider other methods such as Dyna [Sutton 91] and prioritized sweeping [Moore and Atkeson 93] (which were not explored in this project). However Dyna does a relatively undirected search and does not explore well the rewarding (in the long-term) parts of the state space. In prioritized sweeping. States remember their predecessors by employing transition probabilities between states under certain actions. The highest priorities are updated in a manner such that information form optimal goal states is propagated back to relevant prior states. Although it is a large improvement over Dyna, both these techniques require greater computation resources.

References

Kaelbling, L.P. and Littman, M.L. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, vol. 4, 1996, pp. 237-285.

Andrew W. Moore and Christopher G. Atkeson. An investigation of memory-based function approximators for learning control. Technical Report, MIT AI Lab, Cambridge, MA, 1992.

Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.