MAS 964 (H): Special Topics in Media Technology:
Camera Culture
Spring 2008

Class 2: How can we augment the camera to best support 'image search'?

Or more precisely, object matching across images.

(For example, if we want to find a specific face image, we need a procedure to segment and identify (detect) the pixels likely to belong to a face, then recognize the candidate face by transforming it into a representation where we can match it with that specific face image. Currently, this is all performed in software using traditional cameras. Typically, the algorithms try to reduce the image to lower-dimensional 'features' and do the matching in this feature space. Unlike text search, where the search pipeline is simple thanks to an easy matching process, object matching in images is quite difficult. What additional data can we capture while recording pixels, and what new algorithms can exploit this augmented photo?)

How can we make the scene ingredients machine readable so that we can easily perform the 'search'? Is this the key problem? 3D
reconstruction (so that it is view independent)? Hardware and software solutions? Crowdsourcing (let people do
marking/sorting/indexing for others)? Metadata tagging (tag high-level text labels rather
than pixel-level tagging)?

Do we need to capture a material index (where is all the wood in this image)? Segmentation
boundaries (shape versus reflectance edges)? Repeatable view and illumination invariance (be able to recreate an image from
a given view so it can be compared with another image, or create images that look the same
independent of time of day)?

Some ideas: (i) to locate all 'images' with faces, record the iris biometric, which validates
that a photo includes a human eye, and then we can search all images across an album with
that face/eye/iris; (ii) embed an RFID tag (electronic bar code) in every object and record the binary index with an RFID reader.

Contributors so far:
Tom Yeh

(scroll to individual responses)

Tom Yeh

This article proposes five ways to augment the camera to support **best**
image search:

1.  Interact with the People and the World

   Computer vision systems used to drive image search are often
handicapped because they are forced to work with whatever images
are handed to them. They may get confused by ambiguities in an image.
They may wish this confusing image could be taken again by the person
who took it, hoping that the second attempt can clear things up.
However, this is only wishful thinking, because there is no way the
photographer would revisit the place and take the image again.

   One way to make the dream of every computer vision system come
true is to put it right on the camera and let it work its magic
whenever the camera is capturing an image. This setup gives the vision
system an opportunity to directly interact with the photographer or
with the world. In low-level vision tasks, for example, the vision
system may think that the image being captured is too dark to provide
enough details for searching. To alleviate this problem, the system
can *kindly* ask the photographer to take an image again in a more
favorable lighting condition. The photographer should oblige unless he
does not wish to make the image searchable. In high-level vision
tasks, for example, the vision system may be unsure whether the object
in the image is a dog or a cat. It can simply ask the photographer to
help clarify the matter when the photographer is still pointing the
camera at the animal.
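
   As a minimal sketch of the low-level check described above, assume
the camera exposes a grayscale preview as rows of 0-255 values. The
function names and the threshold of 40 are illustrative assumptions,
not part of any real camera API or of the original proposal.

```python
def mean_luminance(pixels):
    """Average brightness of a grayscale image given as rows of 0-255 values."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def capture_advice(pixels, dark_threshold=40):
    """Return a prompt for the photographer, or None if the image looks usable.

    dark_threshold is an illustrative value, not a calibrated one.
    """
    if mean_luminance(pixels) < dark_threshold:
        return "Image looks too dark for search; please retake with more light."
    return None


# Tiny hypothetical previews: one dim, one well lit.
dark = [[10, 12], [8, 15]]
bright = [[120, 130], [110, 140]]

print(capture_advice(dark))    # a retake prompt
print(capture_advice(bright))  # None, nothing to ask
```

   A real system would likely look at more than the mean (for example
the histogram or local contrast), but the interaction pattern is the
same: evaluate at capture time, then ask while the photographer is
still there.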

   Sometimes the photographer may not be kind enough to help. Unable
to depend on humans, the vision system needs to interact with the
environment by itself. Interaction with the environment can be carried
out in the following ways:

   *  The vision system can interact with the environment with a
flash, for instance, forcing the flash to fire if it feels the image
will be too dark for image search.

   *  The vision system can interact with the environment with a
sonar, for instance, to measure the distance of each point in the
scene.

   *  The vision system can interact with the environment with a
projector. For example, the vision system may be unsure whether there
are two balls of the same size at the same distance or a
large ball farther away and a small ball closer to the camera. To
resolve this puzzle, the vision system can simply hypothesize
that the two balls are of the same size, command the projector to
project two circles as large as the balls back onto
the scene, and check whether the projected circles appear equally
large. If so, we have reason to believe these balls are indeed
equally large and distant.
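
   The two-ball puzzle can be made concrete with a pinhole camera
model, where apparent size is focal length times true size divided by
distance. The sketch below only illustrates why a single image is
ambiguous and how one extra measurement (for instance a sonar range)
resolves it; it does not implement the projector procedure itself. The
focal length and ball sizes are made-up numbers.

```python
def image_size(focal_px, true_size_m, distance_m):
    """Apparent diameter in pixels under a pinhole model: s = f * S / Z."""
    return focal_px * true_size_m / distance_m


def recover_true_size(focal_px, apparent_px, distance_m):
    """Invert the pinhole relation once the distance is known: S = s * Z / f."""
    return apparent_px * distance_m / focal_px


F = 800.0  # hypothetical focal length in pixels

# A small ball nearby and a large ball twice as far away.
small_near = image_size(F, 0.25, 2.0)
large_far = image_size(F, 0.5, 4.0)

# Both produce the same apparent diameter, so the image alone cannot
# tell the two hypotheses apart.
print(small_near == large_far)

# With a measured distance for each ball, the true sizes separate again.
print(recover_true_size(F, small_near, 2.0))
print(recover_true_size(F, large_far, 4.0))
```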

2.  Go Beyond the Visible Spectrum

   The original purpose of photography is to obtain images that can
recreate the visual experiences of seeing something when we look at
these images. Since seeing is our only concern, there is no need for
images to store anything other than what can be seen by us. Even
though a camera can see more than what we can see, everything outside
of our visible spectrum is not needed for seeing and is thrown away by
the camera.

   However, as the digitization of photography renders film
virtually free, the number of images an average person takes has
exploded. Search becomes important. The invisible spectrum contains
useful patterns about the visible world we observe, even though we
cannot see them. Our need to search images gives cameras a reason to pay
attention to the whole spectrum instead of only the visible spectrum.
For example, an infrared sensor can capture the temperature patterns
in a scene, allowing us to search for images containing warm objects
or cold objects. An X-ray sensor can capture patterns underneath
certain surfaces, allowing an image search engine to distinguish
between a real human being and a fake mannequin.

   Moreover, it is possible to physically augment objects in the
real world with ink patterns invisible to human eyes but visible to
cameras. These patterns can provide non-visual information about the
object, which can be captured by the camera when the user takes a
picture of the object. Because these patterns are invisible, they
affect neither the aesthetics of the real object nor the content of the
image of the object. Yet, information encoded in the patterns can be
used to improve image search. For example, a museum curator can write
messages using invisible ink on each painting in a gallery. Each time
a visitor uses a camera to take a picture of a painting, the camera
does its usual job of capturing the visual appearance of the painting.
Meanwhile, the camera also does a special job of deciphering the
message embedded in the invisible ink pattern. The message may contain
useful information about the painting, such as the title and the name
of the artist, which can be saved in the meta-data of the image and be
used to improve image search.

3.  Remember the Context

   Although a picture is worth a thousand words, these words only
describe what can be seen. However, the context in which a picture is
being taken often contains a lot of important information that cannot
be seen by the camera and thus cannot be captured in the picture.
Adding such contextual information to a picture as its meta-data can
potentially improve the capability of image search.

    * Location - A location sensor such as GPS can be used to detect
the location of a camera whenever it takes a picture. If the format of
the location tag is consistent for all the cameras  in the world,
combined with the time-stamp, it is possible to automatically group
together all the images taken at the same location around the same
time. If one of the images is tagged with a keyword, this keyword
is very likely to be useful for other images in the same group, saving
us the effort of tagging them.

    * Attention - The location of the camera only tells us where the
camera is, but it does not tell us what the camera is paying attention
to. If we assume every location has only one object worth taking
pictures of, this is not a problem. However, when a location contains
multiple points of interest, as is often the case, we need to know
which way the camera is actually facing so that we can infer which
point of interest the image to be captured is expected to contain.
Then, an image search engine can compare two images by checking if the
cameras used to take them were in a similar location and were
facing in a similar direction. If so, it is very likely that these two images
share similar visual content.

    * Sound - Sounds that occur in the surroundings when an image is
taken may provide valuable cues that allow us to infer the content of
the image. For example, if a camera heard **cheese** when its
shutter button was pressed, it is quite likely that the resulting picture
contains human faces showing white teeth. The level of
background noise often provides hints as to whether an image was taken
indoors or outdoors. Bark, Meow, Oink, and Moo are all strong indicators
of the presence of animals as well as their identities.
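
   The location and attention cues above can be combined into a
simple grouping rule. The sketch below assumes each photo record
carries a latitude, longitude, timestamp in seconds, and compass
heading in degrees. The field names and all three thresholds are
hypothetical choices for illustration, not values from any standard.
The sound cue is not modeled here.

```python
import math


def distance_m(a, b):
    """Approximate ground distance in meters (equirectangular; fine for short ranges)."""
    earth_radius = 6371000.0
    lat1, lat2 = math.radians(a["lat"]), math.radians(b["lat"])
    dlat = lat2 - lat1
    dlon = math.radians(b["lon"] - a["lon"])
    x = dlon * math.cos((lat1 + lat2) / 2)
    return earth_radius * math.hypot(x, dlat)


def heading_diff(h1, h2):
    """Smallest angle between two compass headings, in degrees (0 to 180)."""
    d = abs(h1 - h2) % 360
    return min(d, 360 - d)


def same_context(a, b, max_m=50, max_s=3600, max_deg=30):
    """True if two photos were taken nearby, close in time, facing a similar way."""
    return (distance_m(a, b) <= max_m
            and abs(a["time"] - b["time"]) <= max_s
            and heading_diff(a["heading"], b["heading"]) <= max_deg)


def propagate_tags(photos):
    """Suggest tags for each photo from tagged photos taken in the same context."""
    suggestions = {p["id"]: set() for p in photos}
    for src in photos:
        for dst in photos:
            if src is not dst and src["tags"] and same_context(src, dst):
                suggestions[dst["id"]] |= src["tags"] - dst["tags"]
    return suggestions


# Made-up records: "b" faces the same way as "a"; "c" faces the opposite way.
photos = [
    {"id": "a", "lat": 42.3601, "lon": -71.0942, "time": 0, "heading": 90, "tags": {"dome"}},
    {"id": "b", "lat": 42.3602, "lon": -71.0943, "time": 600, "heading": 100, "tags": set()},
    {"id": "c", "lat": 42.3602, "lon": -71.0943, "time": 600, "heading": 270, "tags": set()},
]
suggestions = propagate_tags(photos)
print(suggestions["b"], suggestions["c"])
```

   Photo "b" inherits the tag from "a", while "c" does not, because
its heading suggests the camera was pointed at a different point of
interest from the same spot.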

4.  Suggest Tags

   If we could take our time adding informative tags every time we
take a picture, image search would be trivial. The camera interface
could be built to force people to add meaningful tags when they take
pictures. But such an interface will annoy people to death. It is
time-consuming and cognitively burdensome to come up with just the right
tags for every image.

   Instead of posing a difficult open-ended question of "what tags do
you want to add to this image?", we can use an AI system to suggest a
short list of tags based on the content of the image. The goal of the list
is to convert the difficult question of *what* into a much simpler
form---a yes/no question such as "is the tag *dog* appropriate for
this image?" To suggest relevant tags, the AI system will need to use
computer vision to analyze and understand the content of the images.
Another way the AI system can come up with tags is through
collaborative tagging. It can check the history and look up tags other
users have previously given for images with similar visual and
contextual cues. Then, it can suggest these tags to users and see if
they will accept or reject these tags.
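
   A minimal sketch of the collaborative-tagging idea, assuming some
earlier step has already retrieved the tag lists of visually and
contextually similar images (that retrieval step is not shown). The
function names and the sample history are hypothetical.

```python
from collections import Counter


def suggest_tags(similar_images_tags, max_suggestions=3):
    """Rank tags by how many similar images carry them; ties break alphabetically."""
    counts = Counter(tag for tags in similar_images_tags for tag in set(tags))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:max_suggestions]]


def confirm_tags(suggestions, answer):
    """Turn the open question into yes/no questions.

    `answer` is any callable mapping a tag to True (accept) or False (reject),
    standing in for the user's response on the camera interface.
    """
    return [tag for tag in suggestions if answer(tag)]


# Tags other users gave to similar images (made-up data).
history = [
    ["dog", "grass"],
    ["dog", "park"],
    ["dog", "grass", "ball"],
]

suggested = suggest_tags(history)
accepted = confirm_tags(suggested, lambda tag: tag != "ball")
print(suggested)
print(accepted)
```

   Keeping the suggestion list short matters here: each extra
suggestion is another yes/no question the photographer has to answer.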

5.  Guess Who?

   One of the primary uses of image search is to find images
containing a specific person from a large collection of images. This
kind of search is very difficult unless someone has painstakingly
tagged the identity of every person in every image in the collection.
However, manual tagging is undesirable because it is time-consuming
and tedious. Here we describe two approaches to automatic tagging of
the identities of the participants in group photo-taking scenarios.

   * Photographer identification. We can install a biometric sensor
on the camera to detect the identity of the person who is snapping the
photo. This can be done by putting a fingerprint sensor on the shutter
button, or an iris scanner at the viewfinder. Although these sensors
cannot tell us who is in the picture, they can tell us who is
*not* in the picture. In situations when two people are traveling
together and taking turns snapping pictures of one another, knowing
the absence of one person automatically tells us the presence of the
other person.

   * Photographee identification. We can install a microphone array
to detect multiple speech inputs. Then, instead of asking everyone to
say **cheese**, we can encourage the new practice of shouting one's own
name whenever a group picture is taken. The camera can use a speech
recognizer to recognize the names being uttered and tag the picture
with these names. If the social norm of saying **cheese** is difficult
to change, the camera can try to recognize each person's unique voice
pattern of the word **cheese** and tag the picture with that person's