Modelling foveated vision in Matlab

This talk is about some functions and scripts in the programming language Matlab (or GNU Octave) which I have prepared over the past few years to study algorithms for an active vision system. At any time, our eyes can see only a small part of the viewing field in high detail, namely the 5 degrees mapped onto the fovea. We change our fixation target about three times per second, mostly without noticing. These eye movements are called saccades. Subjectively, the part of the world in which we get a detailed and veridical view seems a lot larger. When we try to imagine a concrete algorithm for the task of accumulating details from foveate fixations, called saccadic integration, things become really hard.

Here we have a simple example of a fovea on a hexagonal grid, with an inner, almost uniformly sampled hexagon of side length 6, and an outer grid in log-polar form with 6 circles of 36 positions each. For the given numbers, this is the grid with the least number of disruptions, namely the 6 corners of the inner hexagon; all other inner points have exactly 6 neighbours, and each inner position is the mean value of its adjacent positions. The examples I present later use a side length of 20 and 64 outer circles, intended to map a viewing field of about 10 degrees, and Cartesian images of size 512x512.

Some tasks in processing foveate images are hard even before we try to put fixations together. Finding contours and lines in an image is solved by an approximation of the Laplace operator, which gives us, in David Marr's terms, a primal sketch of the image content. The detection of common geometric primitives such as straight lines and circles is not possible without knowing how the image content transforms under the movements that define these primitives in the Euclidean geometry of the physical world.
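The log-polar part of such a grid can be written down directly with complex positions. Here is a minimal NumPy sketch of the idea; the talk's own code is Matlab, and the function name and parameters below are mine, not those of the actual scripts:

```python
import numpy as np

# Hypothetical sketch: complex positions of a log-polar outer grid with
# 6 rings of 36 positions each. The ring radii grow geometrically, so the
# grid looks the same at every scale (the growth factor 1.15 is my choice).
def logpolar_positions(n_rings=6, n_angles=36, r0=6.0, growth=1.15):
    rings = r0 * growth ** np.arange(n_rings)                 # geometric radii
    angles = np.exp(2j * np.pi * np.arange(n_angles) / n_angles)
    return np.outer(rings, angles).ravel()                    # complex positions

z = logpolar_positions()
```

Every position is a single complex number, so a ring is just a real radius times the unit-circle samples.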
If we had a Fourier transform available, these features would be detectable as patterns in the Fourier coefficients at fixed places. Similarly, length comparison comes down to comparing wavelengths in the frequency domain. And lastly, though this is not implemented yet, the algorithms should be extensible to images containing motion.

The subjective perception space should contain many more positions than the eye space, so that there is room to put together the details from several fixations. This perception space is probably sited in a region of the brain cortex called V1, which is organised almost log-polarly, like the eye space. The fact that it is split between the brain halves helps to map the two halves of the perception space onto the V1 surface in an area-preserving manner. On the perception space, in addition to luminance and colour, the brain also maintains a distance variable, making the image surface a relief; David Marr termed this the 2-and-a-half-dimensional sketch.

I use complex numbers for the positions, because the additional operations available in the complex number field have useful interpretations in image processing. If we consider a complex number as a vector in the plane, with a magnitude and an angle to the real axis, then multiplication of positions becomes scaling by that magnitude and rotation by that angle. In the examples, for simplicity, we do not distinguish between the perception space and the original Cartesian image.

The foveate grid is constructed by a custom function, which I named "hexnet". It returns a vector of complex numbers for the positions, and a sparse matrix encoding the adjacency relations of the foveate graph. Using sparse matrices wherever possible is essential for implementing these algorithms on a computer. But sparseness is also present in physiology, because any cell can interact directly with only a few other cells. First of all, operators are needed to interpolate and downsample images between the three spaces.
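The second return value of hexnet, the sparse adjacency matrix, can be illustrated with a toy graph. The following is my own miniature example in NumPy/SciPy, not the hexnet code itself: a single log-polar ring of 36 positions, each linked to its two angular neighbours and to one position on the next ring.

```python
import numpy as np
from scipy import sparse

# My own toy (not the talk's hexnet): one log-polar ring of 36 positions.
# Each position gets one edge to the next position on the ring and one
# outward edge to the corresponding position on the next ring.
n = 36
i = np.arange(n)
rows = np.concatenate([i, i])
cols = np.concatenate([(i + 1) % n, i + n])     # ring edge, outward edge
A = sparse.coo_matrix((np.ones(2 * n), (rows, cols)), shape=(2 * n, 2 * n))
A = (A + A.T).tocsr()                           # make the relation symmetric

deg = np.asarray(A.sum(axis=1)).ravel()         # vertex degrees
```

The matrix stores only the few existing edges, which is the point of the sparse representation: each cell interacts with a handful of neighbours, not with the whole graph.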
Matlab has built-in functions for this purpose, but they have a deficiency, which is seen when we look at a simple interpolation situation: 1-dimensional uniform samples. For this situation, a frequency-band-limited interpolation of the whole real line is known explicitly from the Nyquist-Shannon sampling theorem: the interpolation is a sum of weighted, translated copies of the cardinal sine (sinc) function, here plotted in red for positive arguments. This solution has properties not present in simple interpolation strategies: interpolated values may fall outside the input range, either through the negative lobes or through the addition of distant values. With high-frequency input, when the input samples change sign often, we get effects reaching over a long distance, like those we find in the Hermann grid, if we consider the straightness of the lanes as a sort of high frequency.

For 2-dimensional images, a circularly symmetric interpolation function is wanted. Its radial part is given by the Fourier transform of the unit disk, the quotient of the Bessel function J1 and its argument. I named this function jinc, and it is plotted in blue. My interpolation operators are based on this function.

Next I show an example of the foveation of the Lena test image. The effect of foveation is better seen in the right image, which shows its Laplacian; we see how this operator enhances contours. The Laplace operator itself is an infinitesimal operator, so we have to use a discrete approximation on our spaces, which is mostly a low-frequency approximation of the real Laplacian. In each space I use the difference between a suitable multiple of the adjacency matrix and the identity matrix, scaled so that the value on quadratic forms is the same in every space. Remember that the Laplacian is zero on constant and linear functions.

How does foveation act on the Fourier transform?
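The jinc function is small enough to state directly. Here is a sketch in NumPy/SciPy (the talk's implementation is in Matlab); I take the plain quotient J1(r)/r as normalisation, which tends to 1/2 at the origin because J1 behaves like r/2 there:

```python
import numpy as np
from scipy.special import j1

def jinc(r):
    """Radial profile of the Fourier transform of the unit disk: J1(r)/r.

    Since J1(r) ~ r/2 near zero, the limiting value at the origin is 1/2.
    """
    r = np.asarray(r, dtype=float)
    out = np.full(r.shape, 0.5)          # value at r = 0
    nz = r != 0
    out[nz] = j1(r[nz]) / r[nz]
    return out
```

Like sinc in one dimension, jinc oscillates with negative lobes, so jinc-based interpolation can also overshoot the input range.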
Comparing the transforms of the original image and of its foveation, we see that the Fourier transform is foveate itself: only in the centre, where the low frequencies reside, are rapid changes present. In the computer, we can do the Fourier transform in a foveate space by interpolating to the Cartesian space, using the 2-dimensional FFT there, carrying out any operation - like the multiplication with phase factors for a translation - and transforming back again. In a physiological system this is not a feasible operation, and neither is the direct application of a Fourier matrix, which is possible only for the low dimension of the foveate space. I come back to this point later.

We are now ready to give a concrete algorithm for saccadic integration. In this pair of Laplacian images, the fixation point is moved from one eye of Lena to a point near the other. First, the new state is predicted by anticipating the translation caused by the saccade in perception space. In case of motion in the images, the changes caused by the motion would have to be predicted as well. Then the prediction is foveated down to eye space, where it is subtracted from the new fixation input. When the difference is interpolated back to perception space, it can be added there, and the details from the previous fixations are preserved for as long as the perception space can store them.

Now I come to the reason why complex numbers are a perfect tool for dealing with the eye data. The central perspective map from the world to a plane resembles the construction of the Riemann sphere for the complex projective line (which is 2-dimensional when viewed as a real manifold). Any transformation of the image caused by a Euclidean transformation of the plane relative to the Riemann sphere can be described by a simple formula in complex numbers, a Möbius transformation. Jacek Turski has shown, in a series of papers on the "conformal camera", that Fourier transforms are possible in this setting, and he has published meaningful results for several situations in active vision.
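The predict-subtract-add cycle of saccadic integration can be written down abstractly. In this NumPy sketch, all operator names and the toy spaces are mine, not the talk's code: foveation is modelled as keeping every second sample, interpolation as repeating each sample, and the saccade as a cyclic shift - just enough structure to show the bookkeeping of the algorithm.

```python
import numpy as np

# Toy model (my own, not the talk's code): perception space is an array of
# 8 samples, eye space has 4. foveate() maps perception -> eye space,
# interp() goes back, translate() anticipates the saccade.
def foveate(p):
    return p[::2]

def interp(e):
    return np.repeat(e, 2)

def translate(p, d):
    return np.roll(p, d)

def saccade_update(perception, eye_input, d):
    pred = translate(perception, d)        # 1. predict the post-saccade view
    residual = eye_input - foveate(pred)   # 2. subtract the prediction in eye space
    return pred + interp(residual)         # 3. add the interpolated difference back
```

With these toy operators, foveating the interpolation gives the identity, so at the fovea samples the update reproduces the new eye input exactly, while elsewhere the predicted detail from earlier fixations is retained.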
Now for my last topic: the implementation of a Fourier transform. We take a look into quantum mechanics, where the Schrödinger equation for a simple system - the harmonic oscillator - has the property that the initial wave Psi is transformed into its Fourier transform after a time of pi/2, and after another pi/2 it becomes the original again, but reversed in space. On the right side of the equation we see only local operators, representable by sparse matrices. Therefore the equation can be solved approximately in small time steps - about 5000 of them for 128x128 images. The fact that the original image reappears after the same total time can be used for iterative refinement of the first solution.

Now my conclusions:
1. A Fourier transform is needed for active vision.
2. A Fourier transform is feasible under the locality restriction.
3. There are indeed signs that nonlocal (FT-like) effects are present in low-order vision (the scintillating version of the Hermann grid).
4. Movements are the link between the groups of the physical world and the transformations in the image plane - especially the small ones: microsaccades.
5. The functions and scripts are available on my Dokuwiki web site.
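As a small appendix to the last topic: the time stepping can be sketched with sparse matrices in a 1-dimensional toy. This is my own discretisation (Crank-Nicolson, which keeps every step unitary), not the talk's Matlab code; it checks only norm conservation and the fact that the Gaussian ground state, which is its own Fourier transform, keeps its shape under the evolution.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

# My own 1-D sketch (not the talk's code): Crank-Nicolson stepping of
#   i dPsi/dt = 1/2 (-d^2/dx^2 + x^2) Psi
# using only local (sparse) operators. After a total time of pi/2 the
# state approximates the Fourier transform of the initial wave.
N, L = 128, 10.0
x = np.linspace(-L, L, N)
dx = x[1] - x[0]
lap = sparse.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N)) / dx**2
H = 0.5 * (-lap + sparse.diags(x**2))        # harmonic-oscillator Hamiltonian

steps = 500                                  # the talk uses ~5000 for 128x128 images
dt = (np.pi / 2) / steps
I = sparse.identity(N, format="csc")
lu = splu((I + 0.5j * dt * H).tocsc())       # factor once, reuse every step
rhs = (I - 0.5j * dt * H).tocsc()

psi = np.exp(-x**2 / 2).astype(complex)      # Gaussian: its own Fourier transform
psi /= np.linalg.norm(psi)
psi0 = psi.copy()
for _ in range(steps):
    psi = lu.solve(rhs @ psi)                # one unitary Crank-Nicolson step
```

Each step involves only the sparse tridiagonal-plus-diagonal operator, so the locality restriction of the talk is respected; the global Fourier transform emerges from many local steps.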