Many decades ago, an interface - the “instruction set” - was invented, and it allowed software and hardware to develop independently.  The typical instruction set was a mathematized idea of what a computer should be - precisely describable, reliable, and perfect in its execution, divorced from the messy behavior of the physical world.  (The x86 instruction set is perhaps the most famous and commercially important such interface.)  With the abstraction of “a computer” fairly well standardized, Computer Science was able to split off from Electrical Engineering.  Programming languages were invented, algorithms were developed, and millions of programmers were trained.


As decades passed, hardware engineers worked miracles, providing implementations of the interface (i.e., computers) that ran existing software thousands of times faster than in earlier days.  That miracle was, in turn, built substantially on the miracle worked by physicists and materials scientists, who have brought us to the age of the ten-billion-transistor silicon chip.


However, using billions of transistors to build traditional computers, which originally were implemented using thousands of transistors, is an exercise in diminishing returns.  The fabulous benefit to programmers of being able to run old software, and to think in idealized mathematical ways, incurs shocking costs below the software/hardware interface (i.e., in the chips).  Individual transistors can perform complex and useful operations, such as exponentiation and logarithms, to a fair degree of accuracy, but we have asked hardware engineers to hide the incredible richness (and complexity) of a billion such devices.  On most clock cycles, our software performs just a single increment, or a register load, or occasionally a multiply, using the collective effort of the fantastically powerful physical machinery underneath.


These observations are well known, especially to those on the EE side of the gulf.  Carver Mead, for instance, noted them decades ago and developed his ideas of neuromorphic computing, a famous and fascinating line of research.  Despite the advances that have followed, the impact on the real world has not achieved the potential the ideas seem to suggest.  I believe one of the reasons is that practice of the ideas is limited to a very small set of people: those select electrical engineers who do chip design in the analog domain, primarily using sub-threshold CMOS circuits.


The work described here is, to my knowledge, the first to explore the possibility of retaining the software/hardware split while developing a new interface - one that exposes the key realities of silicon to software and asks software to deal with those unpleasant realities, in exchange for a possible million-fold improvement in speed or energy consumption.


To the extent this can be done, and to the extent the performance improvements sufficiently dwarf the pain of programming, a new kind of computing engine might be added to the CPUs, and increasingly the GPUs, that make up the infrastructure on which application software is built.  There is a chance that the impact on computational science, government needs, medicine, consumer applications, and other domains will be profound.


The technical approach


The approach I’m exploring is technically simple.  We want a machine design that does not force hardware engineers (or allow software engineers) to deny the existence of fundamental physical reality - that chips (today) are essentially two-dimensional; that communication at a distance costs time or energy; that physical machines (at current scales) naturally behave in approximate ways; that the amount of matter needed to store some information is roughly the same as the amount of matter needed to compute something.  These constraints do not fully determine a machine design, but they are suggestive of how one might proceed.


My approach is a particularly simple variant of what these constraints suggest.  We imagine a large grid of computing elements.  They are laid out in two dimensions, because they must be.  They are connected only locally, or nearly only locally, because they must be.  They each have a modest amount of storage and are able to perform approximate arithmetic (I have explored high dynamic range arithmetic where the results of each operation are accurate to a few percent - like floating point with very short mantissas).  We drive these processing elements (PEs) from a single control unit, running a single sequential program that consists essentially of array-manipulating instructions.
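To make the arithmetic concrete, here is a small model of it: keep a wide exponent, but round every result to a short mantissa.  The particular mantissa width below is an illustrative choice of mine, not a fixed design parameter.

    # Model of the approximate arithmetic: wide exponent, short mantissa.
    # Five mantissa bits gives at most roughly 3% error per operation - "a few percent."
    import numpy as np

    MANTISSA_BITS = 5

    def approx(x):
        m, e = np.frexp(x)                       # x = m * 2**e, with 0.5 <= |m| < 1
        m = np.round(m * 2**MANTISSA_BITS) / 2**MANTISSA_BITS
        return np.ldexp(m, e)

    # Every operation is performed and then rounded, for example:
    a, b = 3.14159, 2.71828
    print(approx(a * b), "vs exact", a * b)      # agrees to within a few percent

Because only the mantissa is short, the dynamic range stays large; it is the precision of each individual result that is sacrificed.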


Thus, the machine is a particularly simple SIMD (single instruction stream, multiple data stream) architecture, such as were studied and commercialized in decades past (MasPar, Goodyear MPP, Connection Machine, DAP, and others).  The main differences are that each PE can do a kind of floating point arithmetic in about one cycle per operation, and that the weak requirement on arithmetic accuracy allows hardware designers to make each PE so small in transistor count that hundreds of thousands can fit on a modern chip.  The result is a single chip that can perform most of a petaflop, running on about 100 watts, and suitable as a desktop accelerator.  A few such chips would give a grad student the power of the world’s fastest supercomputer.  A reasonably priced data center could be built around such chips with computing power exceeding that of the rest of the planet.  An alternative implementation is a smaller chip that brings the computing power of a desktop GPU (a teraflop) to sub-watt mobile devices.
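Those throughput numbers are consistent with a simple back-of-envelope estimate.  The specific PE count, clock rate, and operations per cycle below are illustrative assumptions of mine, not measured figures for any particular chip.

    # Back-of-envelope check of the single-chip throughput claim (all figures assumed).
    pes       = 400_000      # processing elements per chip ("hundreds of thousands")
    clock_hz  = 1.0e9        # roughly 1 GHz (assumed)
    ops_cycle = 2            # counting a multiply-add as two operations (assumed)
    watts     = 100

    flops = pes * clock_hz * ops_cycle           # about 8e14, i.e., most of a petaflop
    print(flops / 1e15, "petaflops,", flops / watts / 1e12, "Tflops per watt")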


The question about such hardware is the extent to which useful software can be built for it.  Traditional SIMD machines of various architectures were shown to be fairly useful when they were intensively studied in the 1980s.  There are important limitations (off-chip bandwidth, limited local memory, and others), but many tasks can be programmed with sufficient efficiency to be exceptionally fast or low power compared to alternative approaches.


While there are more flexible parallel architectures that one might wish to use, an argument can be made that, since the present machine attempts to expose to software the true nature of our hardware fabrication technology, the limitations it embodies can’t be avoided and must be faced by algorithm designers - or else ignored by them and dealt with by hardware engineers in ways that incur precisely the terrible overhead we are seeking to avoid.


The most unusual aspect of the machine is the approximate arithmetic.  Numerical analysts have studied the effect of machine precision on various algorithms, but those tend to be numerical algorithms.  Whether AI, robotics, perception, learning, and related tasks can be implemented using approximate arithmetic is not immediately obvious, at least to me.  One’s intuition might suggest that algorithms for these tasks already need to be robust to noisy data, inaccurate models, the complexity of the world, and other sources of error, and that approximate arithmetic might not cause much additional difficulty.  Experiments done by me, and by others at MIT and elsewhere, suggest that this is often the case.
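One simple way to probe this on a conventional machine is to reroute an algorithm’s arithmetic through a rounding function and see whether its decisions change.  The toy test below is my own illustration, not one of the experiments referred to above: it asks whether a nearest-neighbor classification decision survives few-percent arithmetic.

    # Does a nearest-neighbor decision survive few-percent arithmetic?  (Toy probe.)
    import numpy as np

    BITS = 5                                     # each result accurate to a few percent
    def approx(x):
        m, e = np.frexp(x)
        return np.ldexp(np.round(m * 2**BITS) / 2**BITS, e)

    def sq_dist_approx(a, b):
        d = approx(a - b)
        return approx(np.sum(approx(d * d)))     # final sum rounded once, for brevity

    rng = np.random.default_rng(1)
    prototypes = rng.normal(size=(100, 16))      # e.g., 100 class templates
    query = prototypes[37] + 0.2 * rng.normal(size=16)   # a noisy observation

    exact = np.argmin([np.sum((p - query) ** 2) for p in prototypes])
    rough = np.argmin([sq_dist_approx(p, query) for p in prototypes])
    print("exact choice:", exact, " approximate choice:", rough)

Decisions of this kind typically depend on large ratios between competing quantities, so a few percent of arithmetic error rarely changes the outcome; long, delicate accumulations are the sort of computation that tends to need restructuring.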


We’ve explored tasks in a variety of domains and have found that known algorithms can often be modified to run on a machine of the sort proposed while still producing useful results.  The domains we’ve explored, in varying degrees, include image processing, tomography, machine learning, pattern recognition, molecular dynamics, genomics, and speech understanding.  At the Media Lab we’re exploring video processing with a focus on tracking humans in video.
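To give a flavor of how such algorithms map onto the machine, here is a sketch of mine of a 3x3 box blur written entirely as whole-array operations and nearest-neighbor shifts, with every result rounded as above.  NumPy merely stands in for the grid; this is not the Media Lab video code, just an illustration of the programming style.

    # A 3x3 box blur expressed as whole-array adds and nearest-neighbor shifts,
    # the way it would map onto the proposed grid.  Boundaries wrap for simplicity.
    import numpy as np

    BITS = 5
    def approx(x):
        m, e = np.frexp(x)
        return np.ldexp(np.round(m * 2**BITS) / 2**BITS, e)

    def shift(img, dy, dx):                      # one neighbor-communication step
        return np.roll(img, (dy, dx), axis=(0, 1))

    def box_blur(img):
        # Separable: sum the three vertical neighbors, then the three horizontal ones.
        rows = approx(approx(shift(img, -1, 0) + img) + shift(img, 1, 0))
        both = approx(approx(shift(rows, 0, -1) + rows) + shift(rows, 0, 1))
        return approx(both / 9.0)

    frame = np.random.rand(480, 640)             # stand-in for a video frame
    smoothed = box_blur(frame)

Each output pixel is touched by only a handful of rounded operations, so the error remains on the order of the per-operation error rather than accumulating over long chains.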


The basic technology is rather well developed.  I believe the next step is finding specific domains of sufficient proven economic value to justify the costs of hardware scale-up and deployment.  I’d be happy to talk with anyone around the lab (or elsewhere) about such possible applications.


Here is the PDF of a talk I’ve given that describes the work in somewhat more detail.