Awareness in Everyday Scenes: Perception and Representation

[An unfinished essay]

Nitin Sawhney
June 2nd, 2000

Early Representational Approaches:

Vision - Gibson, Marr

New approaches for Representation

Edelman, Poggio & Vapnik

Audition - Bregman

Attention Psychology - Pashler

Recent Computational Approaches for Audio/Visual Scene Analysis:

What's missing in current approaches? E.g. Robust understanding of everyday scenes!

Salient activity in Visual Scenes - Nuria, Schiele, Clarkson, Grimson, Hogg

Segregation in Auditory Scenes - Dan Ellis (CASA), Slaney [correlograms, affect], Clarkson

 

Early Theories of Representation (50's - 70's)

Early Vision (& Audition) Approaches: These were largely ad-hoc, e.g. trying out different operators on images, or they constrained experiments to toy problems. The techniques (thresholding, filtering, edge-detection) worked reasonably well on toy problems, but did not generalize to real ones.

Early attempts at object recognition and scene understanding (1970's) all relied on the extraction of line-like primitives ("edges") from intensity images. These were subsequently combined into more complex constructs using explicit rules. This turned out to be unreliable. Studies in biological vision in the 70's by Hubel and Wiesel even tried to characterize mammalian vision in terms of orientation-selective cells responding preferentially to short line segments. Their findings were compared to the computer vision approaches of the time; this was unfortunate because alternative explanations existed.

Current successful approaches are data-driven and statistical in nature. However they are usually highly specialized and also don't generalize well in natural everyday scenes. We will consider two examples here:

Data-driven scene-understanding in vision and audition - brief case-studies:

Bobick and Davis '97 - MEI/MHI (motion-energy and motion-history images) for behavior in visual scenes

- data-driven, not biologically or psychologically motivated, constrained for recognition in only a specialized class of emphatic actions in visual scenes.

Nicolas Saint-Arnaud '95 & Sawhney '97 - audio textures & environmental audio categorization

- data-driven clustering, perceptual features (Mel-cepstral coeffs), but unsatisfactory performance

Both used simple distance metrics like Mahalanobis distance or Nearest Neighbor techniques. More sophisticated machine learning (like SVM) may improve performance, but I believe there is a fundamental problem with the representations used, that inhibits better understanding of the scenes.
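To make the kind of classifier discussed above concrete, here is a minimal sketch of nearest-class classification with the Mahalanobis distance. It is not the code used in either case study: the 2-D feature vectors, the class labels and the names `training`, `mean_and_cov`, `mahalanobis` and `classify` are all hypothetical, and real systems would use many more dimensions (e.g. Mel-cepstral coefficients).

```python
import math

# Hypothetical 2-D feature vectors for two made-up sound classes.
training = {
    "traffic": [(1.0, 2.0), (1.2, 2.1), (0.8, 1.7), (1.1, 2.4)],
    "speech": [(3.0, 0.5), (3.3, 0.9), (2.7, 0.4), (3.1, 0.2)],
}

def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(x, mean, cov):
    """Distance of x from a class, scaled by that class's covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix (assumes det != 0).
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

def classify(x, classes):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    models = {label: mean_and_cov(pts) for label, pts in classes.items()}
    return min(models, key=lambda label: mahalanobis(x, *models[label]))
```

Calling `classify((1.0, 2.0), training)` picks the class whose statistics best explain the point. Note that the decision depends entirely on which features were chosen, which is the representational concern raised above.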

Tony mentioned that today we have standard vision/audio techniques (hacks) to build perceptual learning machines that can get 70% of the job done, but the last 30% is exponentially complex and often intractable.

Goal: Are there better representations to guide "high-level" analysis of audio/visual scenes?

Why should we focus on representation?

Let me try to motivate this by considering questions posed by Gibson and Marr about our everyday scenes.

J.J. Gibson (1966) - How does one obtain constant perceptions in everyday life on the basis of continually changing perceptions?

Gibson used an "oversimplified" view. He considered higher-order variables (stimulus energy, ratios etc.) as "invariants" (under movement of the observer or changes in stimulation intensity) corresponding to permanent properties of the environment. This led him to believe that the function of the brain was to "detect invariants" despite changes in the environment (light, sound, pressure), rather than to interpret, organize or process sensory data.

A methodological framework emerged from the joint work of Marr and Poggio in the mid-70s. They insisted on understanding the goal of vision before trying to characterize its details.

David Marr treats vision (or perception in general) as an information-processing task, but stresses the role of representation [Marr - Vision '82]. If we are to make any sense of our perceptual world, we must have some way of representing it internally as a basis for decisions about our thoughts and actions.

What is the nature of our internal representation?

This is not so straightforward. Let's consider an example from evidence in neurophysiology:

"Reductionist" Approach (Barlow 1953-72)

Evidence on level of neural processing from neurophysiological expts:

Barlow's (1953) study of ganglion cells in the frog retina suggests that the retinal neurons are selective and act as "bug detectors" - a primitive but vital form of recognition. That is, when shown a key movement pattern ("trigger features"), the cells exhibit a vigorous discharge, regardless of the level of illumination. Hence a large part of the sensory machinery involved in the frog's feeding responses lies in the retina rather than in some "mysterious centers". Barlow (1972) then summarizes that "each single neuron can perform a much more complex and subtle task than previously thought … Activities of neurons are quite simply thought processes."

Key problem: Despite the excitement of these discoveries, Marr and others found that this Reductionist approach could not be taken all the way. Neurophysiology and psychophysics described the behavior of cells, but did not explain it. We need evidence from both within the framework of computational theories.

Marr states that we need different types of explanations at different levels (neural & info-processing). Need for clear statements of what is to be computed and how, physical assumptions, and analysis of algorithms that are capable of carrying it out. Is it optimal?

Representation - a definition:

'Formal scheme' for making explicit certain entities or types of information, along with a specification of how the system does it. E.g. a musical score provides a way of representing a symphony (symbols with rules for putting them together). Each representation has different affordances (binary vs. decimal) i.e. makes certain information explicit or easier to recover at the expense of hiding other less relevant aspects.
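The binary vs. decimal point can be illustrated with a small sketch. The function names `is_power_of_two` and `is_power_of_ten` are made up for this example, and inputs are assumed to be positive integers. The same number is written in two representations, and each one makes a different property readable directly off the digits.

```python
def is_power_of_two(n: int) -> bool:
    """In binary, a power of two is a single 1 followed only by zeros."""
    digits = bin(n)[2:]  # e.g. 8 -> "1000"
    return digits.count("1") == 1

def is_power_of_ten(n: int) -> bool:
    """In decimal, a power of ten is a single 1 followed only by zeros."""
    digits = str(n)  # e.g. 1000 -> "1000"
    return digits[0] == "1" and set(digits[1:]) <= {"0"}
```

For 1024 the binary digits answer the power-of-two question at a glance, while the decimal digits "1024" hide it. For 1000 the situation is reversed: the decimal string is trivially a power of ten, but its binary form shows no obvious pattern.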

Consider 3 levels of information processing:

Computational theory: 1. Separate arguments for what is computed and why? 2. Constraints should uniquely define the operation of processes.

Representation & Algorithm: 1. Need to choose a representation for process I/O and 2. an algorithm for doing the transformation. e.g. for addition input & output representations (numbers) can be the same. But in Fourier transform input is in the time domain whereas the output is in the frequency domain.

Physical Implementation: There is a wide choice of representations and several possible algorithms for the same process. The choice of algorithm depends critically on the representation employed and the machinery within which it is physically embodied! Animals use vision for different purposes, hence it is inconceivable that all use the same representations.
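The Fourier transform example under "Representation & Algorithm" can be sketched as follows. This is a direct, unoptimized discrete Fourier transform written only for illustration. The function name `dft` and the 8-sample test signal are assumptions of this sketch, and a fast algorithm (FFT) would compute the same output representation by a different algorithm, which is exactly the distinction between the levels.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a list of samples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Input representation: 8 time-domain samples of a cosine with 2 cycles.
n = 8
signal = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]

# Output representation: magnitude at each frequency bin.
magnitudes = [abs(c) for c in dft(signal)]
```

In the time domain the periodicity is spread over all eight samples. In the frequency domain it is explicit: the energy sits in bin 2 and its mirror bin 6, and the other bins are numerically zero.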

Importance of Computational theory - algorithm better understood by understanding nature of problem being solved rather than physical embodiment. E.g. studying bird flight by examining feathers only.

Neuroanatomy and Neurophysiology are most closely related to the physical realization of computing, but one has to be careful about making inferences from them regarding the representation and algorithms used. Psychophysics, in contrast, is most directly related to both, and hence can help determine the nature of representation. Hence we can look at evidence from psychoacoustics [Bregman90] and the psychology of attention that informs computational approaches like CASA [Ellis96] and techniques for visual scene understanding [Clarkson99, Schiele96, Oliver99].

New Theories of Representation (90's):

An Alternative Approach - Representation without Reconstruction:

Let's consider the "Reconstructionist" Approach (Marr 82)

Representation is considered an internal library of geometric models (primal and 2.5-D sketch). Here the representation must be able to reconstruct the scene in its fullest possible geometric detail.

Representation by Feature Hierarchy [first proposed by Lettvin'59]

Based on "bug detectors" in the frog retina. The representational power of an ensemble of feature detectors may far exceed that of its constituents alone. Marr had considered the world too complex to yield to the kind of analysis suggested by feature detectors. Poggio (92) found evidence regarding hyperacuity - the activity pattern of a set of overlapping receptive fields represents all the information needed to determine the direction of an offset, without allowing its reconstruction. Similarly, in the perception of coherent motion, MT cells in monkeys are tuned to coherent motion in particular directions. The ensemble of activities of these cells can represent the motion of the visual field seen by the animal, yet this visual motion cannot be considered reconstructed in the activity of any single MT cell. Even in stereopsis, where 2 disparate images are needed, the representation encodes depth information but does not have to be reconstructable. This suggests a re-evaluation of the assumptions behind the abstract (Marr) nature of the tasks that the visual system confronts.
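The claim that an ensemble of overlapping receptive fields carries more information than any single detector can be illustrated with a toy population code. This is only a sketch under strong assumptions: a 1-D stimulus, Gaussian tuning curves, and a simple weighted-average read-out. It is not the hyperacuity model discussed by Poggio, and `CENTERS`, `SIGMA`, `responses` and `decode_position` are hypothetical names.

```python
import math

# Receptive-field centers spaced 1.0 apart; each field overlaps its neighbors.
CENTERS = [float(c) for c in range(-5, 6)]
SIGMA = 1.0  # tuning width, comparable to the spacing between centers

def responses(stimulus):
    """Activity of every unit for a stimulus at the given 1-D position."""
    return [math.exp(-((stimulus - c) ** 2) / (2 * SIGMA ** 2)) for c in CENTERS]

def decode_position(activity):
    """Read the position out of the ensemble as an activity-weighted mean."""
    total = sum(activity)
    return sum(a * c for a, c in zip(activity, CENTERS)) / total
```

No single unit can tell a stimulus at 0.13 from one at -0.13, since the unit centered at 0 responds identically to both. The pattern across units, however, locates the offset and its direction far more finely than the spacing between receptive fields, and it does so without storing anything that resembles the stimulus.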

Edelman cites Shepard (68) to propose 2 kinds of representations:

1st order Isomorphism - structural or metric info. stored in brain reflects corresponding properties of shapes in real world.

2nd order Isomorphism - need not resemble object but system represents relations among distal objects.

Holland (86) proposes representing state transitions instead of similarity relations for prediction & learning.

Current state of art computational theories of recognition:

Structural Decomposition - Shapes of objects are described in terms of a few generic parts, e.g. Recognition by Components (RBC) - 30 or so primitive shapes (geons). Suffers from unreliable detection and instability of description.

Geometric Constraints - posits lists of coordinates of prominent features associated with objects as their representations. Storing a few views obviates need for maintaining full 3D models. Need for feature correspondence, cannot abstract category, comparing 2 objects at a time vs. statistical variation in all.

Multidimensional Feature Spaces - hybrid representation: some dimensions can be discrete (structural) while others continuous. Comparison by simple distance metrics - nearest neighbor or clustering in feature space. E.g. Multidimensional histograms - color [Mel97] and intensity [Schiele96] work well for object recognition. It tolerates variation in object details - groups similar objects together.
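As a rough sketch of the histogram idea in the last item, the following compares coarse intensity histograms and picks the most similar stored object. The similarity measure used here is histogram intersection, which is an assumption of this example (the text only says "simple distance metrics"). The pixel values, object names and the functions `intensity_histogram`, `histogram_intersection` and `nearest_object` are hypothetical.

```python
def intensity_histogram(pixels, bins=4, max_value=256):
    """Normalized histogram of 8-bit intensity values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = len(pixels)
    return [c / total for c in counts]

def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms: 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def nearest_object(query_pixels, database):
    """Name of the stored object whose histogram best overlaps the query's."""
    q = intensity_histogram(query_pixels)
    return max(
        database,
        key=lambda name: histogram_intersection(q, intensity_histogram(database[name])),
    )

# Hypothetical stored views, each reduced to a short list of pixel intensities.
database = {
    "dark_mug": [10, 20, 30, 40, 200, 15],
    "bright_cup": [250, 240, 230, 60, 220, 245],
}
query = [12, 25, 35, 190, 18, 45]
```

Because the histogram discards pixel positions, a shifted or slightly altered view of the same object lands close to the stored one. That is the tolerance to variation in object details noted above, bought at the cost of ignoring structure.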

Computational Problems with Feature Spaces:

Support Vector Machines [Vapnik] use discrimination in a high-dimensional space of transformed features.

Edelman proposes - low-dimensional proximal representation of shape space.

Edelman's Isomorphic Representation for Shape Recognition

Shimon Edelman (1998) claims "computational theories of representation which forgo reconstruction lead to simpler and effective recognition and more credible working models of human recognition performance."

Edelman cites Locke's Essay Concerning Human Understanding - "an idea represents a thing in the world if it is naturally or predictably evoked by that thing, and not necessarily … resembles the thing in any sense."

Edelman applies this approach to the visual recognition of shapes. Objects with comparable shapes are mapped into the same neighborhood of an internal shape space. [Figure 1.1] He develops computational mechanisms to support recognition-related tasks based on a navigation metaphor: objects are treated as points in a shape-space "landscape". This allows categorization and identification (locating the stimulus) to be approached as navigation in real terrain with respect to landmarks. The suggested mechanism for implementing shape-space localization is a tuned unit that responds optimally to some shape and progressively less to dissimilar shapes. It was first implemented as a radial basis function approximation network: given several views of a shape, the network was trained to produce a roughly constant response to other views of the same object. It is capable of categorizing novel objects and estimating their orientation from only a single "training" view of the object. Edelman cites recent neurobiological findings suggesting that cells in inferotemporal (IT) cortex do indeed respond to a wide variety of shapes with varying efficacies rather than being highly selective (narrowly specialized to only specific objects). In Edelman's Chorus model, the shape space consists of the outputs of functional modules tuned to a range of views of an object.
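The tuned-unit idea can be sketched in a few lines. This is a deliberately simplified illustration and not Edelman's network: each unit here is a single Gaussian (radial basis) function over a made-up 2-D feature vector, with no training over multiple views. The prototypes and the names `tuned_unit`, `PROTOTYPES`, `UNITS`, `shape_space_coordinates` and `categorize` are hypothetical.

```python
import math

def tuned_unit(prototype, width=1.0):
    """Unit that responds maximally to its prototype and less to dissimilar shapes."""
    def respond(shape):
        d2 = sum((s - p) ** 2 for s, p in zip(shape, prototype))
        return math.exp(-d2 / (2 * width ** 2))
    return respond

# Hypothetical landmark shapes, each described by two made-up features.
PROTOTYPES = {"cat": (0.0, 0.0), "dog": (1.0, 0.0), "car": (5.0, 5.0)}
UNITS = {name: tuned_unit(p) for name, p in PROTOTYPES.items()}

def shape_space_coordinates(shape):
    """Represent a shape by the responses it evokes in all tuned units."""
    return {name: unit(shape) for name, unit in UNITS.items()}

def categorize(shape):
    """Categorize a shape by its strongest-responding landmark."""
    coords = shape_space_coordinates(shape)
    return max(coords, key=coords.get)
```

A novel shape such as `(0.2, 0.1)` is never reconstructed. It is simply located relative to the landmarks, with a strong "cat" response, a weaker "dog" response and almost none for "car", and that vector of responses is the representation.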

What makes a good Representation?

Representations that allow the resolution of the matching process to be tuned are satisfying.

E.g. hierarchical features and wavelets?

Computational Approaches for Audio/Visual Scene Analysis

Audio - Ellis96+98, Slaney [Correlograms]

Vision - Schiele98 + Oliver99 + Grimson & Stauffer'98

Audio+Vision - Clarkson99, Roy99