The philosophy behind the design of the Cheops processor module is to abstract out a basic set of computationally intensive operations required for real-time performance of a variety of desired applications. These operations are then embodied in specialized hardware provided with a very high throughput memory interface and controlled by a general-purpose processor. Although image data sets tend to be large, they are accessed in a regular fashion, and the operations needed tend to work within a limited data window (often fewer than 1K samples). Rather than storing all the data locally, the specialized processors operate upon one or more high-speed streams of data. Stream processors are not required to input and output data simultaneously; some act only as destinations or only as sources, while processors that require very large pipeline delays may do both, but in separate transfer phases. The general-purpose processor can read and write control registers and local memory on the stream processors, and can also read the results of calculations on the streams; the latter is particularly useful with destination-only processors such as block motion estimators. This access is via an 8-bit-wide control register datapath which supports transfer rates of 8 Mbytes/sec.
Figure 1: Highly simplified diagram of the Cheops processor. Blocks labeled SP represent stream processors. Not shown is the control register datapath by which the 80960 processor configures and controls the stream processors. Note that the "bidirectional" datapaths to the crosspoint are actually separate 16-bit input and output paths; note also that a stream processor may in some cases accept data from two input streams or produce two output streams.
The processor module comprises eight memory units communicating
through a full crosspoint switch with up to eight stream processing
units. As it was possible to build a full crosspoint switch capable of
40MHz operation with just four chips (LSI Logic L64270 devices), we
chose to implement our interconnection network in this fashion rather
than to use a more compact arrangement of switching elements. Each
memory unit is made up of dual ported dynamic memory (VRAM) and a
two-dimensional direct memory access (DMA) controller for transferring
a stream of data through the crosspoint at up to 40Msample/sec.
Specialized stream processors attached to the switch perform common
mathematical tasks such as convolution, correlation, matrix algebra,
block transforms, spatial remapping, or non-linear functions. These
processing units are on removable sub-modules, allowing
reconfigurability and easy upgrade.
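As a rough sanity check on these figures, the peak rate of a single stream and the best-case time to move one image plane can be worked out directly. This is only a back-of-the-envelope sketch: the 40 Msample/sec rate and 16-bit path width come from the text, while the 640 x 480 plane size and the function names are assumptions made for illustration, and setup overhead is ignored.

```python
# Figures taken from the text: 40 Msample/sec peak rate, 16-bit datapaths.
SAMPLE_RATE_HZ = 40_000_000
SAMPLE_BITS = 16

def stream_bandwidth_bytes_per_sec(sample_rate_hz=SAMPLE_RATE_HZ,
                                   sample_bits=SAMPLE_BITS):
    """Peak byte rate of one stream through the crosspoint."""
    return sample_rate_hz * sample_bits // 8

def transfer_time_ms(width, height, sample_rate_hz=SAMPLE_RATE_HZ):
    """Best-case time to stream one width x height plane, one sample per pixel."""
    return 1000.0 * width * height / sample_rate_hz

# Hypothetical 640 x 480 plane, chosen only for illustration.
print(stream_bandwidth_bytes_per_sec())   # bytes per second for one stream
print(transfer_time_ms(640, 480))         # milliseconds, ignoring setup overhead
```

Under these assumptions one stream peaks at 80 Mbytes/sec, and a 640 x 480 plane takes a little under 8 ms to pass through the switch.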
A general-purpose central processing unit (CPU) -- an Intel 80960CA or CF, clocked
at 32MHz -- is provided for sequencing and controlling the flow of
data among the different functional units, implementing the portions
of algorithms for which a specialized processor is not available, and
performing higher-level tasks such as resource management and user
interface.
The DMA controllers on each VRAM bank, called "flood controllers,"
are capable of relatively agile handling of one- and two-dimensional
(and to a lesser extent, three-dimensional) arrays. When acting as a
source, a flood controller can replicate or zero-pad independently in
two dimensions, and when acting as a destination it can decimate.
Hence fractional sampling-rate conversion can be performed in a single
transfer through a filter stream processor. The functionality of a
flood controller, especially in conjunction with the transpose stream
processors discussed below, is thus similar to the multidimensional
stream functions of the intensional data-flow language Lucid. The Nile Bus interface on
non-processor modules is effectively a flood controller also, and Nile
transfers are handled in a similar fashion to transfers within a
processor module.
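The single-transfer rate conversion described above can be sketched in one dimension. This is a software model only, not the flood controller's actual register interface: the function names are hypothetical, the filter is a plain causal FIR standing in for a filter stream processor, and the real hardware works independently in two dimensions.

```python
def zero_pad(samples, factor):
    """Source side: follow each sample with factor-1 zeros."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0] * (factor - 1))
    return out

def replicate(samples, factor):
    """Source side: repeat each sample factor times."""
    out = []
    for s in samples:
        out.extend([s] * factor)
    return out

def fir_filter(samples, taps):
    """Stand-in for a filter stream processor: causal FIR convolution."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out

def decimate(samples, factor):
    """Destination side: keep every factor-th sample."""
    return samples[::factor]

def resample(samples, up, down, taps):
    """Rate change by up/down: pad at the source, filter, decimate at the destination."""
    return decimate(fir_filter(zero_pad(samples, up), taps), down)

# Convert by a factor of 3/2 using a simple hold filter (illustrative taps).
print(resample([1, 2, 3, 4], up=3, down=2, taps=[1, 1, 1]))
```

The point of the sketch is that the padding and decimation happen at the two ends of the transfer, so only the filtering in the middle needs a stream processor.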
Because the switch is a full crosspoint, it is possible to connect
a memory bank to another memory bank, to cascade processors, or to
send one stream to several destinations simultaneously.
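A minimal model of these routing options is sketched below. It assumes, as a simplification, that each destination port listens to exactly one source while a source may feed any number of destinations; the class, method, and port names are hypothetical and do not reflect how the L64270 devices are actually programmed.

```python
class Crosspoint:
    """Toy full crosspoint: any source may drive any set of destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.source_of = {}          # destination port -> source port

    def connect(self, source, *destinations):
        for port in (source, *destinations):
            if port not in self.ports:
                raise KeyError(f"unknown port: {port}")
        for dest in destinations:
            if dest in self.source_of:
                raise ValueError(f"{dest} is already driven by {self.source_of[dest]}")
        for dest in destinations:
            self.source_of[dest] = source

    def fanout(self, source):
        """Destinations currently fed by the given source."""
        return sorted(d for d, s in self.source_of.items() if s == source)

# Eight memory units and eight stream processors, as in the text.
ports = [f"mem{i}" for i in range(8)] + [f"sp{i}" for i in range(8)]
switch = Crosspoint(ports)
switch.connect("mem0", "mem1")           # memory bank to memory bank
switch.connect("mem2", "sp0")            # memory into a stream processor
switch.connect("sp0", "sp1")             # cascaded processors
switch.connect("sp1", "mem3", "mem4")    # one stream to several destinations
print(switch.fanout("sp1"))
```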
The actual data transfers through the crosspoint switch are
synchronized by handshaking lines called "OK channels." Every flood
controller and stream processor connects to every OK channel. In
setting up a transfer, a process selects an unused OK channel and sets
the involved source and destination flood controllers and stream
processor to use this channel. The destinations are also informed of
the pipeline delay associated with the datapath and stream processors.
Each OK channel carries the logical AND of the OK outputs of all of the
participants in its transfer. When all participants are ready to
transfer data, the sources begin transferring. After an amount of
time equal to the pipeline delay of the transfer, the destinations
begin writing result data into memory. The current processor design
has three OK channels, and thus at most three transfers may be taking
place at one time on a processor module, though any number of memory
banks and stream processors may be involved.
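The handshake can be illustrated with a small clocked simulation. It is a sketch under simplifying assumptions: each participant asserts its OK output from some tick onward and never withdraws it, one sample moves per tick, and the names used here (ChannelPool, run_transfer, ready_tick) are hypothetical rather than part of any Cheops software.

```python
from collections import deque

NUM_OK_CHANNELS = 3   # the current processor design, per the text

class ChannelPool:
    """Hands out unused OK channels; at most three transfers at once."""

    def __init__(self, count=NUM_OK_CHANNELS):
        self.free = list(range(count))

    def acquire(self):
        if not self.free:
            raise RuntimeError("all OK channels are in use")
        return self.free.pop(0)

    def release(self, channel):
        self.free.append(channel)

def ok_level(ok_outputs):
    """An OK channel is the logical AND of every participant's OK output."""
    return all(ok_outputs)

def run_transfer(samples, pipeline_delay, ready_tick):
    """Simulate one transfer.

    ready_tick maps each participant to the first tick at which it asserts OK.
    Returns the samples written by the destination and the ticks elapsed.
    """
    pipeline = deque([None] * pipeline_delay)   # models the datapath latency
    pending = deque(samples)
    written = []
    tick = 0
    active_ticks = 0
    while len(written) < len(samples):
        if ok_level(tick >= t for t in ready_tick.values()):
            pipeline.append(pending.popleft() if pending else None)
            result = pipeline.popleft()
            # The destination starts writing only after the pipeline delay.
            if active_ticks >= pipeline_delay:
                written.append(result)
            active_ticks += 1
        tick += 1
    return written, tick

pool = ChannelPool()
channel = pool.acquire()
written, ticks = run_transfer([10, 20, 30], pipeline_delay=2,
                              ready_tick={"source": 0, "sp": 1, "dest": 3})
pool.release(channel)
print(written, ticks)
```

In this run nothing moves until the slowest participant is ready at tick 3, and the destination then discards the first two outputs, which correspond to the pipeline delay it was told about.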
The Nile Buses are used to transfer large blocks of data to and from
other modules in the system. A Cheops system has two of them (White
and Blue), but the P2 processor module can access only one bus at a
given time. The processor module interfaces to the Nile Buses via a
two-to-one multiplexer (selecting one of two sets of source memory
units). The pixel data being transferred is stored within the
processor module in three separate memory units, one component channel
per unit. As the pixel data is transferred to/from the Nile Bus, it
passes through a color-space converter (a 3 x 3 matrix multiplier and
lookup tables). This processor decouples the pixel color
representation used for display/input from that used for processing,
and may also be used to process data within a P2 module.
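The color-space conversion step can be sketched as follows. The text states only that the converter consists of a 3 x 3 matrix multiplier and lookup tables; the ordering used here (per-channel input tables, then the matrix, then clamping to 8 bits), the 8-bit sample range, and all names are assumptions made for illustration. The three input lists stand for the three memory units holding one component each.

```python
def clamp8(value):
    """Round and clamp to the assumed 8-bit sample range."""
    return max(0, min(255, int(round(value))))

def convert_planes(planes, in_luts, matrix):
    """Convert three component planes, one pixel at a time.

    planes  : three equal-length lists, one component channel per memory unit
    in_luts : three 256-entry lookup tables applied before the matrix (assumed order)
    matrix  : 3 x 3 list of coefficients
    """
    out = ([], [], [])
    for pixel in zip(*planes):
        looked_up = [lut[c] for lut, c in zip(in_luts, pixel)]
        for channel, row in enumerate(matrix):
            total = sum(coeff * v for coeff, v in zip(row, looked_up))
            out[channel].append(clamp8(total))
    return out

# Illustrative tables and matrix only; not the coefficients Cheops uses.
identity_lut = list(range(256))
invert_lut = [255 - i for i in range(256)]
swap_first_and_last = [[0, 0, 1],
                       [0, 1, 0],
                       [1, 0, 0]]

planes = ([255, 0], [128, 64], [0, 10])
print(convert_planes(planes, [identity_lut] * 3, swap_first_and_last))
print(convert_planes(planes, [invert_lut] * 3, swap_first_and_last))
```

Because the tables and matrix are just parameters, the same pass can serve display, input, or internal processing by loading different coefficients.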
The processor module communicates with the host computer using a SCSI
(Small Computer System Interface) bus. Each processor module appears
as a fixed disk device, so in most cases no special host device
driver is needed; Cheops systems have proven to be compatible with
existing UNIX disk device drivers on DECstations, Sun SPARCstations,
IBM RS/6000 systems, and Silicon Graphics workstations. Two RS-232
serial ports are available, one for debugging and diagnostics, and the
other for user interface devices such as serial knob boxes, mice, or
touch-sensitive screens. If only a low-bandwidth host interface is
desired, one of the ports may substitute for the SCSI connection.
The Local Processor
"Smart" Memory Units
The Switching Hub
The Nile Bus Interface
Peripherals
Stream Processors