Distributed Objects case study
Distributed objects can roam freely across the network, between RAM and
disk drives, preserving their identities and functionality. They provide
an easy route to making object-oriented programs distributed. However,
keep in mind that the developer must still decide how objects should be
distributed, which can have a significant impact on system performance.
The developer must also provide synchronization, since distributed systems
are inherently parallel.
Distributed objects are also a key enabling technology for component
software. Naturally, just because your objects are distributed doesn't
mean that they can be combined with someone else's objects to form an
application. Component software requires a library on top, for linking,
embedding, data transfer, and automation. We will discuss such libraries,
e.g. OpenDoc and ActiveX, in the next case study.
Distributed Shared Memory
Distributed shared memory (DSM) makes another machine's memory appear to be
an extension of your own virtual memory. DSM is the core idea behind
distributed objects, as well as distributed file systems and cache-coherent
multiprocessors. Many of the issues in implementing distributed systems
are also DSM issues.
DSM provides three key services:
- Persistent, distributed naming
-
Conventional languages only support naming in the form of pointers, which only
make sense in the context of a particular process on a particular machine.
DSM lets a name apply to any data throughout the network, even if the data
migrates. You can think of a persistent name as being a really big pointer
(e.g. 128 bits) that never goes out of date until the object, not
a particular process or machine, dies. The Domain Name System (DNS)
is an example implementation.
- Global invocation
-
Conventional languages only allow invoking an object which is in the same
process. Object-based DSM lets you call methods on any object that you
can name. It is the object-oriented version of Remote Procedure Call (RPC).
- Migration
-
Conventional languages only support movement through copying. This doesn't
work with a bank account, which must have exactly one instantiation at any
time. Imagine what would happen if I simply copied my account to another
machine. The withdrawals I make on one machine would be invisible to the
other machine, so suddenly I've doubled my money! Migration allows
my account to move from machine to machine, based on who wants to use it,
while remaining consistent.
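The three services above can be sketched with a toy name registry. This is a
minimal illustration, not any real DSM's API: the 128-bit names come from
`uuid4`, and `Registry`, `invoke`, and `migrate` are hypothetical names chosen
here for clarity.

```python
import uuid

class Registry:
    """Toy DSM name service on one machine: persistent names -> live objects."""
    def __init__(self):
        self.objects = {}

    def register(self, obj):
        name = uuid.uuid4()           # a "really big pointer": 128 bits
        self.objects[name] = obj
        return name

    def invoke(self, name, method, *args):
        # Global invocation: call a method on any object we can name.
        return getattr(self.objects[name], method)(*args)

    def migrate(self, name, other):
        # Migration: the name survives; only the location changes.
        other.objects[name] = self.objects.pop(name)

class Account:
    def __init__(self, balance):
        self.balance = balance
    def withdraw(self, amount):
        self.balance -= amount
        return self.balance

machine_a, machine_b = Registry(), Registry()
name = machine_a.register(Account(100))
machine_a.invoke(name, "withdraw", 30)         # balance becomes 70
machine_a.migrate(name, machine_b)             # same name, new machine
print(machine_b.invoke(name, "withdraw", 20))  # 50: one consistent account
```

Because there is exactly one instantiation of the account, the withdrawals on
both machines are visible to each other, unlike the copying scenario above.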
What follows is a brief tutorial on DSM. You can learn more about DSM from Computer
Architecture: A Quantitative Approach (2nd ed) or from the many MIT
courses on computer architecture and systems. For pioneering research
papers, see the Orca home page.
Design issues
There are four orthogonal issues in designing a DSM.
The first issue in DSM is the unit of distribution.
Here are three possibilities:
- Page-based DSM
-
Distribute the virtual memory pages.
A disadvantage is that objects may span page boundaries, or may happen to
live together in a single page, causing false sharing (two processors
needing different parts of a page).
- Shared variable DSM
-
Distribute individual memory locations. This is essentially page-based DSM
at a finer granularity. False sharing is nonexistent. Advantageous when
the number of shared locations is small, because of the overhead for each
location.
- Object-based DSM
-
Distribute object state. By using medium-sized units
that are meaningful to the application, this approach simultaneously enjoys
low overhead and little false sharing.
Object-based distribution is the basis of distributed objects.
Compared to the others, it has the additional complications of:
- Embedded links. Do embedded object references have to be global?
If not, what happens when an object moves?
- Indirection overhead. Private object state can only be accessed
through methods.
- Inheritance. A type lattice may need to be traversed in order to
find the correct method implementation.
- Unknown method behavior. Operations on objects, unlike pages or
variables, can have arbitrary side-effects, like reading and writing other
parts of the memory, and they can have arbitrary duration.
For objects, a "read" means calling a method which does not change the
object state. A read can, however, change another object's state, which is
considered a write to that other object.
The second issue in DSM is migration, which can be automatic or
manual. Automatic migration seeks to place data at the weighted center of
its users, in order to minimize network traffic. One technique is to
periodically look at the origins of the last k requests and move
to their center. This should not happen too often or thrashing may occur.
Manual migration is when the application controls data movement. This is
needed for things like autonomous agents or archival.
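The automatic policy described above can be sketched as follows. This is my
own toy version: it uses the most frequent origin of the last k requests as a
stand-in for the weighted center, and the names `MigratoryUnit`, `k`, and
`threshold` are illustrative, not from any real system.

```python
from collections import Counter, deque

class MigratoryUnit:
    """Move a unit toward the source of its recent requests, with hysteresis."""
    def __init__(self, home, k=5, threshold=0.6):
        self.home = home
        self.recent = deque(maxlen=k)   # origins of the last k requests
        self.threshold = threshold      # demand a clear majority, not a tie

    def request(self, origin):
        self.recent.append(origin)
        if len(self.recent) < self.recent.maxlen:
            return                      # not enough history yet: avoid thrashing
        node, count = Counter(self.recent).most_common(1)[0]
        if node != self.home and count / len(self.recent) >= self.threshold:
            self.home = node            # migrate toward the heavy user

unit = MigratoryUnit(home="A")
for _ in range(5):
    unit.request("B")
print(unit.home)  # "B"
```

Requiring a full window and a clear majority before moving is one simple way
to keep the unit from thrashing between two equally active nodes.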
The third issue in DSM is replication, which can be static or
dynamic. Static replication uses a fixed mapping for the replicas of an
object. It is usually used to eliminate network bottlenecks and to provide
fault-tolerance. Dynamic replication uses replicas to reduce network
traffic. For example, read replication uses local caching for
reads. However, writes become more expensive, in order to keep the caches
consistent. All writes must go to the primary copy, whose location may or
may not change, depending on whether the DSM uses migration. As usual with
the Observer pattern, write notification can
use an invalidate or update protocol. Also, someone must
keep track of the replicas.
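A minimal sketch of read replication with the two notification protocols,
under my own naming (`Primary`, `Replica`, `protocol` are hypothetical):

```python
class Primary:
    """Primary copy of an object, with read replicas acting as caches."""
    def __init__(self, state, protocol="update"):
        self.state = state
        self.replicas = []             # someone must keep track of the replicas
        self.protocol = protocol

    def attach(self, replica):
        self.replicas.append(replica)
        replica.cache = self.state     # read replication: cache locally

    def write(self, state):
        # All writes go to the primary copy.
        self.state = state
        for r in self.replicas:        # Observer-style write notification
            if self.protocol == "update":
                r.cache = state        # update: push the new value
            else:
                r.cache = None         # invalidate: replica re-fetches later

class Replica:
    def __init__(self, primary):
        self.primary = primary
        primary.attach(self)
    def read(self):
        if self.cache is None:         # miss after an invalidation
            self.cache = self.primary.state
        return self.cache              # otherwise a cheap local read

p = Primary(1, protocol="invalidate")
r = Replica(p)
p.write(2)
print(r.read())  # 2: the replica re-fetches after invalidation
```

The trade-off in the text is visible here: reads are local and cheap, but
every write must contact the primary and notify every replica.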
The fourth issue is the memory consistency model. So far, we have
tacitly assumed sequential consistency, where all reads and writes
appear as though they were executed on a single memory location by a single
processor, i.e. they are totally ordered. However, this requirement is not
necessary for distributed applications which are well-behaved, i.e. have no
race conditions. Thus there are several so-called relaxed consistency
models, which work fine for well-behaved programs and allow more aggressive
DSM algorithms, e.g. caching writes. See the above references for more
details.
Implementation
The simplest kind of DSM is the central server algorithm: objects
don't move after creation and are not cached by clients. All requests to a
remotely-created object entail an RPC. JavaBeans uses
this algorithm, probably because it is significantly simpler than ones
based on migration or replication. As usual, implementation is simple
until it has to run fast.
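The central server algorithm is simple enough to sketch in full. This is a
toy stand-in, not JavaBeans: the RPC is faked by serializing requests and
replies with `pickle`, and the names `CentralServer` and `call` are mine.

```python
import pickle

class CentralServer:
    """Central server DSM: objects stay put; every access is a remote call."""
    def __init__(self):
        self.objects = {}
        self.next_id = 0

    def create(self, obj):
        self.next_id += 1
        self.objects[self.next_id] = obj
        return self.next_id

    def call(self, request):
        # Stand-in for an RPC: request and reply cross the "wire" serialized.
        oid, method, args = pickle.loads(request)
        return pickle.dumps(getattr(self.objects[oid], method)(*args))

server = CentralServer()
oid = server.create([])               # a shared list, held at the server
server.call(pickle.dumps((oid, "append", (42,))))
result = pickle.loads(server.call(pickle.dumps((oid, "__len__", ()))))
print(result)  # 1
```

No migration, no caching, no consistency protocol: every access pays the
round-trip, which is exactly why this scheme is simple but slow.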
For scalable, high-performance DSM with migration and/or replication, the
most common implementation is to use a directory. For each memory
unit, the directory stores
- who "owns" the unit, i.e. has the primary copy
- who has copies of the unit
This technique scales because the directory can itself be distributed,
migratory, and replicated. This is analogous to virtual memory systems that
page their own page tables.
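The directory bookkeeping above can be sketched as a single table. This is an
illustrative fragment (a real directory would itself be distributed, and the
names `Directory`, `read`, `write` are my own):

```python
class Directory:
    """Directory-based DSM bookkeeping: owner and copy set per memory unit."""
    def __init__(self):
        self.entries = {}   # unit -> {"owner": node, "copies": set of nodes}

    def create(self, unit, owner):
        self.entries[unit] = {"owner": owner, "copies": set()}

    def read(self, unit, node):
        # A read adds the node to the copy set: it now caches the unit.
        self.entries[unit]["copies"].add(node)
        return self.entries[unit]["owner"]    # fetch from the primary copy

    def write(self, unit, node):
        # A write invalidates all other copies and transfers ownership.
        invalidated = self.entries[unit]["copies"] - {node}
        self.entries[unit] = {"owner": node, "copies": set()}
        return invalidated                    # nodes needing invalidations

d = Directory()
d.create("page7", owner="A")
d.read("page7", "B")
print(sorted(d.write("page7", "C")))  # ['B']: B's cached copy is stale
```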
DSM applications must normally use an extended naming scheme for all data
pointers. However, if the DSM is object-based, the Proxy pattern can be used to avoid this. All
objects appear local; the truly remote ones have Proxies which forward
requests to them. Objects can easily migrate via the hot swap
technique. When they migrate, embedded references become Proxies.
Identical Proxies in the same address space can be merged.
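The Proxy technique above can be sketched like this, assuming a hypothetical
`fetch` callback standing in for the real remote-access machinery:

```python
class Proxy:
    """All objects appear local; a Proxy forwards requests to a remote one."""
    def __init__(self, fetch):
        self._fetch = fetch           # how to reach the remote object
        self._local = None            # filled in if the object migrates here

    def __getattr__(self, name):
        target = self._local if self._local is not None else self._fetch()
        return getattr(target, name)  # forward the request transparently

    def hot_swap(self, obj):
        # Migration: the real object arrives; the Proxy now holds it directly.
        self._local = obj

class Counter:
    def __init__(self): self.n = 0
    def incr(self): self.n += 1; return self.n

remote = Counter()                    # lives "elsewhere"
p = Proxy(fetch=lambda: remote)
p.incr()                              # forwarded to the remote object
p.hot_swap(remote)                    # the object migrates into this space
print(p.incr())  # 2
```

The caller never changes: before and after migration it holds the same Proxy,
which is what lets objects move without extended pointers in application code.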
Object-Oriented Database
A major disadvantage of conventional DSM is that it is based on virtual
memory, which is a proprietary format of the operating system. As soon as
you take an object out of virtual memory, e.g. to store it, it leaves the
global namespace.
Object-oriented databases extend DSM to include disk archives. In other
words, migration is not only from process to process but also to disk.
Persistent names still apply when an object has migrated to disk.
(Embedded references become persistent names when an object is stored.)
Invocation also works: the database will "resurrect" a receiver in storage,
invoke it, then save it again.
The Proxy pattern can still be used to make
remote or archived objects appear local. Proxies can selectively read in
parts of the object, rather than resurrecting the whole thing at once.
Object Request Broker
The last section was motivated by saying that conventional DSM is too
limited in scope. However, databases have simply widened the scope
slightly to include disk archives. They still require objects to be in a
proprietary format, which doesn't work for existing applications.
An Object Request Broker (ORB) is a DSM or database which is designed to be
open and interoperable. It can use data from legacy code or from other
ORBs. The openness is based on either a description language or a binary
standard. CORBA and COM are ORBs. CORBA takes the description language
approach while COM uses a binary standard. Neither provides all of the
services of a database, so they primarily function as shared memories.
If interoperability is based on a binary standard, the Adaptor pattern can be used to introduce objects
into the namespace. Otherwise, the object must be describable in the ORB's
description language. This allows the ORB to automatically create an
Adaptor.
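The idea of generating an Adaptor from a description can be sketched as
follows. The "description language" here is just a Python dict, a deliberately
tiny stand-in for an IDL, and all the names (`LegacyAccount`, `make_adaptor`)
are hypothetical:

```python
class LegacyAccount:
    """Existing code with its own interface, not written for the ORB."""
    def __init__(self): self.amount = 0
    def add_funds(self, n): self.amount += n

# A toy "description": map the interface the ORB expects onto the legacy one.
description = {"deposit": "add_funds", "balance": "amount"}

def make_adaptor(obj, description):
    """Generate an Adaptor from the description, in the spirit of CORBA IDL."""
    class Adaptor:
        def __getattr__(self, name):
            # Translate the requested name to the legacy object's own name.
            return getattr(obj, description[name])
    return Adaptor()

account = make_adaptor(LegacyAccount(), description)
account.deposit(50)                   # translated to add_funds(50)
print(account.balance)  # 50
```

The legacy class is never rewritten, only described, which is the main selling
point of the description-language approach.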
A remote method call therefore travels from the client through a Proxy to the
ORB, which forwards it through an Adaptor to the target object.
For further reading, see the Broker pattern, or any of the numerous sites on
CORBA and COM.
Broker Example
This example demonstrates an ORB with read replication, notification via
update, and no migration. The event timeline is:
-
Server creates Object.
-
Server registers Object with the ORB.
ORB assigns a fresh, persistent name.
-
Client receives a Proxy containing Object's name, for example as the
return value of a method call on another object.
While physically different from Object, Proxy does not
have a persistent identity of its own.
-
Client sends a read message to Proxy.
-
Proxy is lightweight; it contains no data at first. Therefore, Proxy
contacts ORB, asking for Object's state.
-
ORB looks up Object's address and connects via Adaptor.
ORB serializes Object's state and sends back to Proxy.
Embedded objects are sent as additional lightweight Proxies.
-
ORB records Proxy as a replica of Object.
-
Proxy receives the state of Object, then executes the read request.
This may invoke other objects.
-
Client sends a write message to Proxy.
-
All writes must go to the primary copy. Therefore, Proxy gives the request
to ORB.
-
ORB gives the request to Object.
-
Object notifies ORB of a change in its state.
-
ORB broadcasts the new state of Object to all replicas, including Proxy.
-
Proxy receives the new state and is now finished with the write operation.
-
Client sends a destroy message to Proxy.
-
Proxy unsubscribes as a replica of Object and dies.
-
Server sends a destroy message to Object.
-
Object removes its name from the ORB and dies.
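The timeline above can be condensed into a runnable sketch. This is a toy
broker written for this example, not CORBA or COM; `ORB`, `AccountProxy`, and
their methods are all invented names, and "serialization" is just copying the
object's attribute dict.

```python
class ORB:
    """Toy broker: read replication, update notification, no migration."""
    def __init__(self):
        self.objects, self.replicas = {}, {}
        self.next_name = 0

    def register(self, obj):                  # steps 1-2: fresh persistent name
        self.next_name += 1
        self.objects[self.next_name] = obj
        self.replicas[self.next_name] = []
        return self.next_name

    def fetch(self, name, proxy):             # steps 5-7
        self.replicas[name].append(proxy)     # record proxy as a replica
        return dict(self.objects[name].__dict__)  # "serialized" state

    def write(self, name, method, *args):     # steps 10-13
        obj = self.objects[name]
        getattr(obj, method)(*args)           # all writes go to the primary
        for p in self.replicas[name]:         # broadcast new state (update)
            p.state = dict(obj.__dict__)

    def unregister_replica(self, name, proxy):  # steps 14-15
        self.replicas[name].remove(proxy)

class Account:
    def __init__(self, balance): self.balance = balance
    def deposit(self, n): self.balance += n

class AccountProxy:
    def __init__(self, orb, name):
        self.orb, self.name, self.state = orb, name, None  # lightweight
    def read_balance(self):                   # steps 4, 8: reads run locally
        if self.state is None:
            self.state = self.orb.fetch(self.name, self)
        return self.state["balance"]
    def deposit(self, n):                     # step 9: writes go via the ORB
        self.orb.write(self.name, "deposit", n)

orb = ORB()
name = orb.register(Account(100))             # Server creates and registers
proxy = AccountProxy(orb, name)               # Client receives a Proxy
proxy.read_balance()                          # Proxy fetches and caches state
proxy.deposit(25)                             # write goes to the primary copy
print(proxy.read_balance())  # 125: the update was broadcast to the replica
```

Note that the second read never contacts the ORB: the update protocol already
pushed the new state to the replica.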
Description Languages
The example illustrates how much the ORB must know about your object.
It must be able to create a proxy, create an adaptor, serialize the object,
and anticipate side-effects. Where does it get this information?
Here are four possibilities:
- Do it yourself
-
Each class provides its own proxy, adaptor, serialization routine, and
list of "reader" and "writer" methods. This provides maximum control.
Unfortunately, it is a lot of work for old and new classes and requires
exposing some of the internal operations of the ORB.
- Use built-ins
-
Implement your classes in terms of a fixed set of built-in classes. For
example, Python, Smalltalk, and other dynamically-typed languages work this
way. All of the method calls are standard and can be efficiently
implemented in advance by the ORB. Unfortunately, this entails using a
particular programming language, which is one of the things ORBs are intended
to avoid.
- Deduce it
-
Deduce the information from the normal class declaration and
implementation. This would work except for the fact that most
programming languages were designed without distribution in mind. They
often leave key details out of class declarations that are only visible in
the details of the implementation. You could switch to a better language,
but that again is a closed solution.
- Describe it
-
Describe the object in a special language, and generate code from the
description. This is good for foreign objects since they just need to be
described, not rewritten. You need to describe the object's interface,
state, and side-effects. (You must use another language for the
implementation, but a simple conversion tool can give you a head start with
the declarations.)
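What such a description might contain can be sketched as follows. The format
is invented for this example (a dict rather than a real IDL), but it records
exactly the three things the text says an ORB must know: interface, state, and
side-effects.

```python
# A hypothetical class description, in the spirit of the "Describe it" option.
account_description = {
    "state":   ["balance"],
    "readers": ["get_balance"],   # methods that do not change object state
    "writers": ["deposit"],       # methods that do
}

def classify(description, method):
    """What an ORB needs to decide before forwarding a call."""
    if method in description["readers"]:
        return "read"    # safe to run against a local replica
    if method in description["writers"]:
        return "write"   # must be sent to the primary copy
    raise KeyError(method)

print(classify(account_description, "deposit"))  # write
```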
These can be used simultaneously. For example, a class could have its
proxy and adaptor generated from a description language, have its
side-effects deduced automatically from the implementation, and provide its
own serialization routine.
Both CORBA and COM use Do-it-yourself serialization and Describe-it
for generating proxies and adaptors. As in RPC, the special language is
called the Interface Definition Language (IDL). Since objects in
every language must be mapped onto it, IDL tends to have lots of modern
features like class types, triggers, and multiple interfaces. IDLs may
become the new playing field for type research, especially since you don't
have the overhead of an entire language.
Thomas Minka
Last modified: Fri Sep 02 17:23:40 GMT 2005