The P2 Processor Module

Processor Architecture
Local Processor
Memory Units
Crosspoint Hub
Nile Bus Interface
Peripherals
Stream Processors

Processor Architecture

The philosophy behind the design of the Cheops processor module is to abstract out a basic set of computationally intensive operations required for real-time performance of a variety of desired applications. These operations are then embodied in specialized hardware provided with a very high throughput memory interface and controlled by a general-purpose processor. Although image data tends to be large, it is accessed in a regular fashion, and the operations needed tend to work within a limited data window (often fewer than 1K samples). Rather than storing all the data locally, the specialized processors operate upon one or more high-speed streams of data. Stream processors are not required to input and output data simultaneously: some act only as destinations or as sources, while processors that require very large pipeline delays may do both, but in separate transfer phases. The general-purpose processor can read and write control registers and local memory on the stream processors, and can also read the results of calculations on the streams; the latter is particularly the case for destination-only processors such as block motion estimators. This access is via an 8-bit-wide control register datapath that supports transfer rates of 8Mbytes/sec.

Figure 1: Highly simplified diagram of the Cheops processor. Blocks labeled SP represent stream processors. Not shown is the control register datapath by which the 80960 processor configures and controls the stream processors. Note that the "bidirectional" datapaths to the crosspoint are actually separate 16-bit input and output paths; note also that a stream processor may in some cases accept data from two input streams or produce two output streams.

The processor module comprises eight memory units communicating through a full crosspoint switch with up to eight stream processing units. Because it was possible to build a full crosspoint switch capable of 40MHz operation with just four chips (LSI Logic L64270 devices), we chose to implement our interconnection network in this fashion rather than use a more compact arrangement of switching elements. Each memory unit is made up of dual-ported dynamic memory (VRAM) and a two-dimensional direct memory access (DMA) controller for transferring a stream of data through the crosspoint at up to 40Msample/sec. Specialized stream processors attached to the switch perform common mathematical tasks such as convolution, correlation, matrix algebra, block transforms, spatial remapping, or non-linear functions. These processing units are on removable sub-modules, allowing reconfigurability and easy upgrade.
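To put the quoted rates in perspective, the following back-of-the-envelope sketch compares a single stream with the control register datapath. It assumes that one sample occupies one 16-bit word on the crosspoint path (the path width given in the Figure 1 caption); the function names and the 512 x 512 array size are purely illustrative and do not come from the Cheops documentation.

```python
# Figures taken from the text: 40 Msample/sec per stream, 16-bit crosspoint
# paths, and an 8 Mbyte/sec control register datapath.
SAMPLES_PER_SEC = 40_000_000
CONTROL_BYTES_PER_SEC = 8_000_000
BYTES_PER_SAMPLE = 2  # assumption: one sample fills one 16-bit path word


def stream_bytes_per_sec():
    """Peak byte rate of a single stream through the crosspoint."""
    return SAMPLES_PER_SEC * BYTES_PER_SAMPLE


def transfer_seconds(width, height):
    """Best-case time to stream a width x height array, ignoring setup and stalls."""
    return (width * height) / SAMPLES_PER_SEC


if __name__ == "__main__":
    # Ratio of one stream's peak byte rate to the control register datapath.
    print(stream_bytes_per_sec() // CONTROL_BYTES_PER_SEC)
    # Milliseconds to stream a hypothetical 512 x 512 array at the peak rate.
    print(transfer_seconds(512, 512) * 1000)
```

Under these assumptions a single stream carries an order of magnitude more data per second than the control register path, which is consistent with using that path only for configuration and for reading back results.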

The Local Processor

A general-purpose central processing unit (CPU) -- an Intel 80960CA or CF, clocked at 32MHz -- is provided for sequencing and controlling the flow of data among the different functional units, implementing the portions of algorithms for which a specialized processor is not available, and performing higher-level tasks such as resource management and user interface.

"Smart" Memory Units

The DMA controllers on each VRAM bank, called "flood controllers," are capable of relatively agile handling of one- and two-dimensional (and to a lesser extent, three-dimensional) arrays. When acting as a source, a flood controller can replicate or zero-pad independently in two dimensions, and when acting as a destination it can decimate. Hence fractional sampling-rate conversion can be performed in a single transfer through a filter stream processor: the source zero-pads, the filter interpolates, and the destination decimates. The functionality of a flood controller, especially in conjunction with the transpose stream processors discussed below, is thus similar to the multidimensional stream functions of the intensional data-flow language Lucid. The Nile Bus interface on non-processor modules is also effectively a flood controller, and Nile transfers are handled in a similar fashion to transfers within a processor module.
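The fractional sampling-rate conversion described above can be sketched in one dimension as follows. This is a software illustration only, not the flood controller's actual behavior or programming interface: the function names, the causal FIR filter standing in for a filter stream processor, and the tap values are all assumptions made for the example.

```python
def zero_pad(samples, factor):
    """Source side: follow every sample with factor - 1 zeros."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0] * (factor - 1))
    return out


def replicate(samples, factor):
    """Source side: repeat every sample factor times."""
    return [s for s in samples for _ in range(factor)]


def fir_filter(samples, taps):
    """Stand-in for a filter stream processor: causal FIR convolution."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out


def decimate(samples, factor):
    """Destination side: keep every factor-th sample."""
    return samples[::factor]


def resample(samples, up, down, taps):
    """Change the sampling rate by up/down in a single pass."""
    return decimate(fir_filter(zero_pad(samples, up), taps), down)


if __name__ == "__main__":
    # 3/2 rate change with a hypothetical 3-tap box filter.
    print(resample([10, 20, 30, 40], up=3, down=2, taps=[1, 1, 1]))
```

A two-dimensional version would apply the same padding and decimation independently along rows and columns, as the text describes for the flood controllers.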

The Switching Hub

Because the switch is a full crosspoint, it is possible to connect a memory bank to another memory bank, to cascade processors, or to send one stream to several destinations simultaneously.
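The three routing patterns just mentioned (memory to memory, cascaded processors, and one stream fanned out to several destinations) can be modelled with a small routing table. The class, method, and port names below are hypothetical and only illustrate the idea of a full crosspoint; they do not reflect how the L64270 devices are actually programmed.

```python
class Crosspoint:
    """Toy model of a full crosspoint: any output may listen to any input."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.source_of = {}  # destination port -> source port

    def connect(self, source, destinations):
        """Route one source to one or more destinations."""
        destinations = list(destinations)
        # Validate everything first so a failed call leaves the table unchanged.
        if source not in self.ports:
            raise ValueError("unknown source: %s" % source)
        for dest in destinations:
            if dest not in self.ports:
                raise ValueError("unknown destination: %s" % dest)
            if dest in self.source_of:
                raise ValueError("destination already driven: %s" % dest)
        for dest in destinations:
            self.source_of[dest] = source

    def route(self, outputs):
        """Given the sample each source emits, return what each destination sees."""
        return {dest: outputs[src] for dest, src in self.source_of.items()}


if __name__ == "__main__":
    switch = Crosspoint(["mem0", "mem1", "mem2", "sp0"])
    switch.connect("mem0", ["sp0", "mem1"])  # one stream, two destinations
    switch.connect("sp0", ["mem2"])          # processor output back to memory
    print(switch.route({"mem0": 7, "sp0": 9}))
```

Each destination has exactly one source while a source may feed many destinations, which is the property that makes fan-out possible in this model.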

The actual data transfers through the crosspoint switch are synchronized by handshaking lines called "OK channels." Every flood controller and stream processor connects to every OK channel. In setting up a transfer, a process selects an unused OK channel and configures the source and destination flood controllers and the stream processors involved to use this channel. The destinations are also informed of the pipeline delay associated with the datapath and stream processors. Each OK channel is the logical AND of the OK outputs of all of the participants in its transfer. When all participants are ready to transfer data, the sources begin transferring. After an amount of time equal to the pipeline delay of the transfer, the destinations begin writing result data into memory. The current processor design has three OK channels, and thus at most three transfers may be taking place at one time on a processor module, though any number of memory banks and stream processors may be involved.
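The handshake can be illustrated with a cycle-by-cycle simulation. This is a sketch under several assumptions that the text does not confirm: that a participant dropping its OK output stalls the whole transfer for that cycle, that the pipeline delay is counted in cycles during which the channel is asserted, and that a single stream processor sits between one source and one destination. All names are hypothetical.

```python
def ok_channel(ok_outputs):
    """An OK channel is the logical AND of every participant's OK output."""
    return all(ok_outputs)


def transfer(samples, pipeline_delay, op, stall_cycles=()):
    """Simulate one transfer and return (data written, cycles elapsed).

    op models the stream processor; stall_cycles lists the cycles on which
    the processor holds its OK output low (an assumed stall mechanism).
    """
    pipeline = [None] * pipeline_delay  # delay line between source and destination
    written = []
    sent = 0      # samples the source has emitted so far
    clocked = 0   # cycles on which the channel was asserted
    cycle = 0
    while len(written) < len(samples):
        oks = {
            "source": True,
            "processor": cycle not in stall_cycles,
            "destination": True,
        }
        if ok_channel(oks.values()):
            if sent < len(samples):
                pipeline.append(op(samples[sent]))
                sent += 1
            else:
                pipeline.append(None)  # source exhausted, keep flushing the pipeline
            outgoing = pipeline.pop(0)
            clocked += 1
            # The destination ignores the first pipeline_delay outputs.
            if clocked > pipeline_delay:
                written.append(outgoing)
        cycle += 1
    return written, cycle


if __name__ == "__main__":
    print(transfer([1, 2, 3], pipeline_delay=2, op=lambda x: x * 10, stall_cycles={1}))
```

In this model the total cycle count is the number of samples plus the pipeline delay plus any stalled cycles, and the destination never writes data that has not yet emerged from the pipeline.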

The Nile Bus Interface

The Nile Bus is used to transfer large blocks of data to and from other modules in the system. A Cheops system has two Nile Buses (White and Blue), but the P2 processor module can access only one of them at a given time. The processor module interfaces to the Nile Buses via a two-to-one multiplexer (selecting one of two sets of source memory units). The pixel data being transferred is stored within the processor module in three separate memory units, one component channel per unit. As the pixel data is transferred to or from the Nile Bus, it passes through a color-space converter (a 3 x 3 matrix multiplier and lookup tables). This converter decouples the pixel color representation used for display and input from that used for processing, and may also be used to process data within a P2 module.
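A color-space converter built from a 3 x 3 matrix multiplier and lookup tables can be sketched as below. The ordering of the stages (per-channel input tables, then the matrix, then per-channel output tables), the 8-bit component range, and every name in the code are assumptions made for illustration; the text does not specify how the hardware arranges them.

```python
def convert_pixel(pixel, matrix, input_luts=None, output_luts=None):
    """Convert one three-component pixel.

    pixel: three integers in 0..255 (assumed component range).
    matrix: 3 x 3 list of rows of coefficients.
    input_luts / output_luts: optional lists of three 256-entry tables.
    """
    if input_luts is not None:
        pixel = [lut[c] for lut, c in zip(input_luts, pixel)]
    mixed = [sum(m * c for m, c in zip(row, pixel)) for row in matrix]
    # Round and clamp so the result can index a 256-entry table.
    result = [max(0, min(255, int(round(v)))) for v in mixed]
    if output_luts is not None:
        result = [lut[c] for lut, c in zip(output_luts, result)]
    return tuple(result)


def convert_streams(chan0, chan1, chan2, matrix, input_luts=None, output_luts=None):
    """Convert three component streams, one per memory unit, sample by sample."""
    out0, out1, out2 = [], [], []
    for pixel in zip(chan0, chan1, chan2):
        a, b, c = convert_pixel(pixel, matrix, input_luts, output_luts)
        out0.append(a)
        out1.append(b)
        out2.append(c)
    return out0, out1, out2


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    invert = [255 - i for i in range(256)]  # hypothetical table contents
    print(convert_pixel((10, 20, 30), identity, input_luts=[invert] * 3))
```

Keeping the three components in separate streams mirrors the storage arrangement described above, with one component channel per memory unit.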

Peripherals

The processor module communicates with the host computer using a SCSI (Small Computer Systems Interface) bus. Each processor module appears as a fixed disk device, so in most cases no special host device driver is needed; Cheops systems have proven to be compatible with existing UNIX disk device drivers on DECStations, Sun SPARCStations, IBM RS/6000 systems, and Silicon Graphics workstations. Two RS-232 serial ports are available, one for debugging and diagnostics, and the other for user interface devices such as serial knob boxes, mice, or touch-sensitive screens. If only a low-bandwidth host interface is desired, one of the serial ports may substitute for the SCSI connection.

Stream Processors

State Machine - Programmable hardware
Mojo - Integer Multiply-Accumulate array
Remap - Arbitrary mapping
DCT - Discrete Cosine Transform
Splotch - Arbitrary large basis vector compositer
Transpose - Matrix reordering

cheops-web@media.mit.edu