Goals for transforming and compressing the results of visual perception

The primary goal is evolutionary success. More specific goals should serve survival (fleeing and feeding) and propagation.

The visual output should help to select an action and adapt it to present circumstances.

The action must have options with conditions, so that a given action-variant or option is chosen if a given situation is perceived. In other words, we need rules of the form "if you see this then do that".

An action-variant may have multiple conditions and multiple steps, such as "if you see this and that then do this followed by that".
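
A minimal sketch of such rules, purely as an illustration. The names `Rule`, `select_actions`, and the feature keys `kind` and `distance` are hypothetical, and modelling a percept as a dictionary of named features is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumption: a percept is a plain dict of named features.
Percept = Dict[str, object]


@dataclass
class Rule:
    # All conditions must hold ("if you see this and that").
    conditions: List[Callable[[Percept], bool]]
    # Actions are carried out in order ("do this followed by that").
    actions: List[str]


def select_actions(rules: List[Rule], percept: Percept) -> Optional[List[str]]:
    """Return the actions of the first rule whose conditions all hold, else None."""
    for rule in rules:
        if all(cond(percept) for cond in rule.conditions):
            return rule.actions
    return None


# Hypothetical example rules.
rules = [
    Rule(
        conditions=[
            lambda p: p.get("kind") == "predator",
            lambda p: p.get("distance") == "close",
        ],
        actions=["turn_away", "run"],
    ),
    Rule(conditions=[lambda p: p.get("kind") == "prey"], actions=["chase"]),
]

chosen = select_actions(rules, {"kind": "predator", "distance": "close"})
```

Here the first matching rule wins. A real system might instead weigh several partially matching rules against each other.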

The condition must store enough information about the visual perception so that it can be compared to the present visual perception.

The comparison might result in a likelihood estimate that the current perception matches the stored pattern.
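
One way such an estimate might look, as a sketch only. It assumes that both the current perception and the stored pattern have already been reduced to feature vectors of equal length. The function `match_score` and its `scale` parameter are hypothetical, and the result is a similarity score between 0 and 1, not a calibrated probability.

```python
import math
from typing import Sequence


def match_score(current: Sequence[float], stored: Sequence[float],
                scale: float = 1.0) -> float:
    """Map the Euclidean distance between two feature vectors to a score in [0, 1].

    Identical vectors score 1.0; the score falls towards 0 as they diverge.
    `scale` (assumed positive) sets how quickly the score falls off.
    """
    if len(current) != len(stored):
        raise ValueError("feature vectors must have the same length")
    dist = math.sqrt(sum((c - s) ** 2 for c, s in zip(current, stored)))
    return math.exp(-((dist / scale) ** 2))


stored_pattern = [1.0, 2.0, 3.0]
near = match_score([1.1, 2.0, 2.9], stored_pattern)
far = match_score([4.0, 0.0, 7.0], stored_pattern)
```

The closer perception gets the higher score, so a rule could fire when the score exceeds some threshold.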

Actual images depend on distance, incident light, tilt of the head, and many other factors.

Both the output from the current perception and the stored image-pattern must therefore have non-essential image components stripped away, and the rest somehow 'standardized' with suitable transformations, so that a meaningful comparison is possible.

The representation of the visual information should be relatively invariant over time and over variations in the environment.
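
As a small illustration of one such standardizing transformation, the sketch below removes the mean brightness of a patch and divides by its contrast. It assumes that a change in incident light acts roughly as a uniform gain and offset on brightness values. The function name `standardize` and the sample values are hypothetical.

```python
import math
from typing import List


def standardize(pixels: List[float]) -> List[float]:
    """Subtract the mean brightness and divide by the standard deviation."""
    if not pixels:
        raise ValueError("need at least one pixel")
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    if std == 0:
        return [0.0] * n  # a perfectly flat patch carries no pattern
    return [(p - mean) / std for p in pixels]


dim = [10.0, 20.0, 30.0, 20.0]
bright = [2 * p + 50 for p in dim]  # same scene under stronger light (assumed model)

same = all(abs(a - b) < 1e-9
           for a, b in zip(standardize(dim), standardize(bright)))
```

After standardization the two patches coincide, so a stored pattern can be compared to the current one regardless of this kind of lighting change. Invariance to distance or head tilt would need other transformations.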

We might make several transformations in sequence.

The first might extract features such as edges. An analogy from computing might see this as a change from a pixellated (raster) representation to a geometric representation with vectors, motion, etc.
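
A toy version of edge extraction, offered only as a sketch. It marks positions where brightness changes sharply towards the right-hand or lower neighbour. The function `edge_points`, the threshold, and the tiny sample image are hypothetical, and real edge detectors are considerably more elaborate.

```python
from typing import List, Tuple

Image = List[List[float]]  # rows of brightness values


def edge_points(image: Image, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) positions whose brightness gradient exceeds threshold."""
    points = []
    rows, cols = len(image), len(image[0])
    for r in range(rows - 1):
        for c in range(cols - 1):
            dx = image[r][c + 1] - image[r][c]  # change towards the right
            dy = image[r + 1][c] - image[r][c]  # change downwards
            if (dx * dx + dy * dy) ** 0.5 > threshold:
                points.append((r, c))
    return points


# A dark region on the left and a bright region on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

edges = edge_points(image, threshold=5)
```

The result is a short list of positions along the vertical boundary instead of a full grid of pixels, which is the kind of compression the first transformation is meant to achieve.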

The final transformation might identify persistent and known objects.

There might be further comparison to visual templates for known objects.

We might also detect the size, relative motion, and other attributes of these known objects.

This characterization should help with action selection: predator or prey, close or far, etc.

We also need help with the specific action, such as which way to turn to flee or to give chase.
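
A sketch of how the characterization might feed the specific action. It assumes that perception delivers the kind of object and its bearing relative to the current heading, in degrees. The function `turn_angle` and the labels "predator" and "prey" are hypothetical.

```python
def turn_angle(kind: str, bearing_deg: float) -> float:
    """Return the turn, in degrees within (-180, 180], to face prey or face away from a predator."""
    if kind == "prey":
        target = bearing_deg            # give chase: turn towards it
    elif kind == "predator":
        target = bearing_deg + 180.0    # flee: turn directly away from it
    else:
        raise ValueError("unknown kind: " + kind)
    wrapped = (target + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    return 180.0 if wrapped == -180.0 else wrapped


chase = turn_angle("prey", 30.0)
flee = turn_angle("predator", 30.0)
```

Distance could be added as a further input, for example to ignore a predator that is far away.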

Next, let us assume that we are watching a parent or other mentor, so that we can copy their behaviour.

We need to characterize the perceived action sequence.

We need to translate this characterization of visual information into information about action.

The perceived action sequence has to be 'translated' from the parent's body and viewpoint to the child's own.

One step by the parent may require several steps by the child.
If the parent looks in a certain direction, the child has to extrapolate and correct for parallax to have approximately the same perception.

For instance, if the parent looks towards the child but past it or to one side of it, then the child has to turn and figure out where the parent might be looking.
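
The parallax correction can be sketched with plane geometry. The example assumes a flat two-dimensional world, known positions for parent and child, and a guessed distance from the parent to whatever it is looking at. The function `child_gaze_direction` and all of its parameters are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def child_gaze_direction(parent: Point, parent_gaze_deg: float,
                         child: Point, target_distance: float) -> float:
    """Direction (degrees) from the child to a point on the parent's line of sight.

    The point lies target_distance away from the parent along the parent's gaze.
    """
    theta = math.radians(parent_gaze_deg)
    tx = parent[0] + target_distance * math.cos(theta)
    ty = parent[1] + target_distance * math.sin(theta)
    return math.degrees(math.atan2(ty - child[1], tx - child[0]))


parent = (0.0, 0.0)
child = (2.0, -1.0)  # the parent looks along 0 degrees, past the child's side

near_guess = child_gaze_direction(parent, 0.0, child, target_distance=2.0)
far_guess = child_gaze_direction(parent, 0.0, child, target_distance=1e6)
```

Since the child does not know the distance, it might try several guesses, in effect scanning along the parent's line of sight. For a very distant target the child's direction approaches the parent's own gaze direction, and the correction matters most for nearby targets.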