The Vision Module is the most complicated of the RPM Modules. As one might expect, the Vision Module is used to determine what ACT-R "sees." How this is managed by the Vision Module is depicted in Figure 1.
The Vision Module takes a window and parses the objects in that window. Each object will be represented by one or more features in the Vision Module's icon. These features are the basic units with which ACT-R interacts. Each feature contains information about what kind of screen object it represents, where it is, and so on.
The mapping from screen objects to icon features is not necessarily one-to-one. A screen object can create multiple features, depending on what that screen object is and how the Vision Module is configured. Most screen objects will generate one feature, the most common exception being anything containing text.
Text will get parsed into multiple features. If :OPTIMIZE-VISUAL is true (the default), then each word in the text will be parsed into one feature. Thus, long sentences will be parsed into many features. If :OPTIMIZE-VISUAL is false, then each letter is parsed into multiple features. The features consist of an LED-style representation of the characters of the text. The following shows a graphical representation of the features, the number values associated with each feature, the letter E made from the features, and an abstract-letter chunk representing the letter E.
 - -       1 2      - -      (letter-e
|\|/|     34567    |            isa abstract-letter
 - -       8 9      - -         value "E"
|/|\|     01234    |            line-pos (1 2 3 8 9 10 15 16))
 - -       5 6      - -

(For two-digit feature numbers, only the final digit is shown; "01234" denotes features 10 through 14, and "5 6" on the bottom row denotes 15 and 16.)
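If you need letter-level features, the parameter has to be changed from its default. As a hedged sketch, assuming your RPM release exposes :OPTIMIZE-VISUAL through the PM-SET-PARAMS command (check the parameters documentation for your version), the switch would look like this:

;; Hedged sketch: parse text letter-by-letter into LED features
;; rather than one feature per word. Assumes :OPTIMIZE-VISUAL is
;; settable via PM-SET-PARAMS in your RPM release.
(pm-set-params :optimize-visual nil)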
From these features the Vision Module creates chunks, which provide declarative memory representations of the visual scene that productions can then match. Generally, each icon feature maps to one chunk in declarative memory. The notable exception is again text. If optimizing is on, the basic unit is words, though the Vision Module can be told to look for phrases. If optimizing is off, the Vision Module can be told to look for letters, words, or phrases. Letters are synthesized from clusters of LED-style features by a Bayesian categorization algorithm that determines the most likely letter given a set of LED features.
For the Vision Module to create a chunk for an object, visual attention must first be directed to the location of that object. To do that, the Vision Module provides a command to move attention. The attention movement command requires a specification of a location to which to move. These locations exist as "virtual chunks" of type VISUAL-LOCATION. Whenever ACT-R requests a match of a chunk of type VISUAL-LOCATION and the "time now" flag is included in that test, ACT-R does not actually attempt to retrieve a chunk from declarative memory but instead calls the Vision Module. The Vision Module then checks the icon for a feature that matches the chunk requested by ACT-R. If one exists, a chunk representing that visual location is created and returned to ACT-R in zero time. If none exists, the left-hand side of the production requesting the match will fail.
The chunk representing the visual location can then be passed back to the Vision Module along with a move-attention command, which causes the Vision Module to shift attention to that location. If there is still anything in the icon at that location, a chunk will be created which represents that object, and that chunk is considered the focus of attention (making it an activation source). If there is more than one object at that location, one of them will be picked at random to be the focus of attention. These chunks are VISUAL-OBJECT chunks (or a subtype of VISUAL-OBJECT). The basic assumption behind the Vision Module is that visual-object chunks are episodic representations of the objects in the visual scene. Thus, a visual-object chunk with the value "3" represents a memory of the character "3" available via the eyes, not the semantic THREE used in arithmetic--some retrieval would be necessary to make that mapping. The same goes for words and such. Note also that there is no "top-down" influence on the creation of these chunks; top-down effects are assumed to be a result of ACT's processing of these basic visual chunks, not anything that ACT does to the Vision Module.
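To illustrate the episodic/semantic distinction, a production along the following lines could make that mapping by retrieval. This is only a sketch: the chunk types read-digit and number-fact, and their slots, are invented for the example and are not part of RPM.

(p map-digit-to-number
   =goal>
      isa read-digit          ; hypothetical goal type
      number nil
   =obj>
      isa visual-object
      time now
      value "3"               ; episodic: the character being seen
   =map>
      isa number-fact         ; hypothetical mapping chunk
      string "3"
      meaning three           ; semantic: the THREE used in arithmetic
==>
   =goal>
      number three
)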
Because the currently attended object chunk is an activation source, it should always be retrievable. However, there is no guarantee that an attempt to retrieve a visual-object chunk will return the current focus--there are probably lots of other visual-object chunks floating around in declarative memory. The mechanism that allows ACT to discriminate between what it currently sees and memories of things it has seen is again the "time now" flag. Including "time now" in a LHS clause matching a visual-object (or a subtype) will actually be translated into a call to the Vision Module rather than a true test of declarative memory. The Vision Module returns the chunk representing the current visual object, and the ACT-R pattern matcher decides whether that chunk matches the specified pattern.
There is one more issue to consider: change, which raises two questions. First, what happens when the screen changes at the currently attended location? If the currently attended object changes, the system will mark the Vision Module as "busy" and will deliver a new chunk for what is there after the standard attention movement delay. Second, what happens when objects move? In general, nothing. However, the Vision Module can be told to track an object. When this happens, the currently attended location moves with the object, and the object's SCREEN-POS slot will be updated as the object moves.
There are two primary chunk types used by the Vision Module: VISUAL-LOCATION and VISUAL-OBJECT. VISUAL-LOCATION chunks have the following slots:
SCREEN-X
SCREEN-Y
ATTENDED
OBJECTS
KIND
COLOR
SIZE
SCREEN-X and SCREEN-Y specify where the location is (integer values). ATTENDED notes whether or not that location has been attended to (T or NIL). OBJECTS will contain all the VISUAL-OBJECTs at that location. KIND specifies what kind of feature was found there, and COLOR specifies what color that feature is. Finally, SIZE is the area of the feature in square degrees of visual angle.
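For instance, a VISUAL-LOCATION chunk produced by the Vision Module might look something like this (the particular values are illustrative, not output from an actual run):

(visual-location-1
   isa VISUAL-LOCATION
   screen-x 40
   screen-y 40
   attended NIL
   objects NIL
   kind TEXT
   color BLACK
   size 1.2)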
When using the "time now" test for a VISUAL-LOCATION chunk, there are some special specifiers for the SCREEN-X and SCREEN-Y slots which the Vision Module understands:
(GREATER-THAN value) will match only if the value of the slot is greater than the supplied value.
(LESS-THAN value) will match only if the value of the slot is less than the supplied value.
(WITHIN min max) will match only if the value of the slot is in the range min to max, inclusive.
LOWEST will find the feature that matches all the other criteria but has the lowest value for this slot.
HIGHEST will find the feature that matches all the other criteria but has the highest value for this slot.
The SIZE attribute will also work with GREATER-THAN, LESS-THAN, and WITHIN.
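Putting these together, a "time now" test like the following sketch would find the unattended feature with the smallest y-coordinate whose x-coordinate lies between 50 and 250 (the numbers are illustrative):

=loc>
   isa VISUAL-LOCATION
   time now
   screen-x (WITHIN 50 250)
   screen-y LOWEST
   attended NIL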
VISUAL-OBJECT chunks have these slots:
SCREEN-POS
VALUE
COLOR
STATUS
SCREEN-POS is a pointer to the VISUAL-LOCATION chunk specifying where on the screen the object was seen. VALUE is a content slot that will be filled in by the Vision Module. COLOR represents the color of the object. STATUS is a slot that initially contains NIL and can be used by ACT-R productions to annotate the chunk.
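For example, after attending the word "foo" somewhere on the screen, the resulting chunk might look roughly like this (illustrative values, with visual-location-1 standing in for the relevant location chunk):

(text-object-1
   isa VISUAL-OBJECT
   screen-pos visual-location-1
   value "foo"
   color BLACK
   status NIL)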
The most common task with the Vision Module is probably locating objects on the screen and attending to them. This simple production will cause the Vision Module to attend to the not-yet-attended object on the screen with the lowest x-coordinate:
(p attend-lowest-x
   =goal>
      isa some-goal
   =loc>
      isa visual-location
      time now
      screen-x LOWEST
      attended nil
==>
   !send-command! :VISION move-attention :location =loc
)
The move-attention command tells the Vision Module to move attention to the specified location, which is =loc. The test for a VISUAL-LOCATION with time now means that the current icon will be checked for a matching feature (rather than a chunk being retrieved). If such a feature exists, a chunk of type VISUAL-LOCATION will be created and bound to =loc. This chunk is then passed back to the Vision Module and attention is moved to that location.
After the Vision Module has shifted attention, if there is anything at that location, a chunk will be created that represents it, and the current visual focus will be on that chunk. If you wanted to know whether the current thing being looked at is the word "foo", the test would look like:
=obj>
   isa VISUAL-OBJECT
   time now
   value "foo"
For text, the string the text represents is passed back in the VALUE slot. The time now is necessary to discriminate this chunk--the one currently being looked at--from some memory of another "foo".
Sometimes it's important to know whether something is in the same location as something else; for instance, whether the cursor is over a button. This can be accomplished by taking advantage of the fact that the focus won't change if a new object appears at the location of the object in the focus. Thus, if you're already focused on the button (which is represented as an oval), this should work:
(p button-and-cursor-co-located
   =goal>
      isa some-goal
   =btn>
      isa VISUAL-OBJECT
      time now
      value OVAL
      screen-pos =loc
   =loc>
      isa VISUAL-LOCATION
      time now
      kind CURSOR
==>
   ...
)
This production should fire only when there is an object in the current focus of attention, that object is an oval (button), and there is a cursor feature on the screen whose location matches the location of the button.
There are several ways to tell if there is nothing at the currently attended location. First, make sure that location has been attended (check the modality state of the Vision Module; it should be FREE). If you know in advance that you'll need to know whether there is nothing at a location, this is easy. When you move attention, store the location in the goal:
(p attend-lowest-x
   =goal>
      isa some-goal
   =loc>
      isa visual-location
      time now
      screen-x LOWEST
      attended nil
==>
   !send-command! :VISION move-attention :location =loc
   =goal>
      lastloc =loc
)

(p nothing-there
   =goal>
      isa some-goal
      lastloc =loc
   =loc>
      isa visual-location
      objects nil             ; this is the key
   =state>
      isa module-state
      module :VISION
      modality free
      last-command move-attention
==>
   ...
)
Notice that this does not have a time now in the VISUAL-LOCATION test--if there's nothing at that location, then a test using time now will fail.
Alternately, you can do this without using anything in the goal by setting up two productions, one of which does something when an object is present and one of which acts as a default. If you're concerned about the currently attended location, whatever that may be, try this one:
(p something-present-at-current
   =goal>
      isa some-goal
   =obj>
      isa visual-object
      time now
   =state>
      isa module-state
      module :VISION
      modality free
      last-command move-attention
==>
   ...
)
which will fire if there's anything at the currently attended location. However, if you care about a specific location, whether it's currently attended or not, then this production:
(p something-present-at-40-40
   =goal>
      isa some-goal
   =loc>
      isa visual-location
      time now
      screen-x 40
      screen-y 40
   =state>
      isa module-state
      module :VISION
      modality free
      last-command move-attention
==>
   ...
)
will fire if there is anything at (40, 40) and fail if there isn't. Whichever method you use, have a second production act as a default:
(p nothing-at-location
   =goal>
      isa some-goal
   =state>
      isa module-state
      module :VISION
      modality free
      last-command move-attention
==>
   ...
)
This production will always fire, of course, but it is possible to get the desired behavior by making the relevant something-present- production preferred in conflict resolution.
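One way to arrange that preference, assuming ACT-R's expected-gain conflict resolution and the spp command for setting production parameters, is to give the default production a higher cost:

;; Hedged sketch: raise the default production's cost (:c, in
;; seconds) so its expected gain falls below that of any matching
;; something-present- production.
(spp nothing-at-location :c 1.0)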
Creating icon features is handled by the device, so please see the section on the Device Interface.
If you define your own class for visual features with your device, you may also want some way of translating those features into chunks. There are at least two approaches to handling this:
[1] Create your subclass with visual-object as the default for the kind slot and whatever you want for the value slot. RPM will use the default feat-to-dmo method on your features to translate them into chunks of type visual-object. This approach is fairly limited, but it is simple (a sketch follows below).
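A minimal sketch of approach [1] for the arrow example might look like the following; it assumes icon-feature accepts :kind and :value initargs, so check the class definition in your RPM release:

;; Hedged sketch for approach [1]: a feature class whose kind
;; defaults to VISUAL-OBJECT, so the default feat-to-dmo method
;; produces plain visual-object chunks with value ARROW.
(defclass arrow-feature (icon-feature)
  ()
  (:default-initargs
   :kind 'visual-object
   :value 'arrow))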
[2] Along with your class of features, define a chunk type to represent objects of this feature class and :include the visual-object chunk type in that definition. Be sure the kind slot in your feature objects matches the name of the chunk type you defined. You will then need to write a feat-to-dmo method for your feature class which translates feature objects into declarative memory objects. (Declarative memory objects--or DMOs--are how RPM understands chunks. Creating DMOs also creates ACT-R chunks.) Probably the best way to do this is to use call-next-method to get the default slots (e.g. location, kind) set and then use set-attribute on the result to set the remaining slots. To go back to the arrow example, you'll need a chunk type which encodes an arrow:
(chunk-type (arrow (:include visual-object)) direction)
There needs to be a method for going from a feature to a chunk (this happens when move-attention is called on the location at which the feature resides), which is a feat-to-dmo method:
(defmethod feat-to-dmo :around ((self arrow-feature))
  (let ((the-chunk (call-next-method)))
    (set-attribute the-chunk `(direction ,(direction self)))
    the-chunk))
This is a little tricky, since it's an :around method that relies on what the default feat-to-dmo method does: note that the feature has been attended, generate a DMO representing the object, and return that DMO. The :around method just takes the returned DMO and modifies it. The base method is this:
(defmethod feat-to-dmo ((feat icon-feature))
  "Build a DMO for an icon feature"
  (setf (attended-p feat) t)
  (make-dme (dmo-id feat) (isa feat)
            `(screen-pos ,(id (xy-to-dmo (xy-loc feat) t))
              value ,(value feat)
              color ,(color feat))
            :obj (screen-obj feat)
            :where :external))
If you don't want to mess with :around methods, you could just write it like this:
(defmethod feat-to-dmo ((feat arrow-feature))
  "Build a DMO for an arrow icon feature"
  (setf (attended-p feat) t)
  (make-dme (dmo-id feat) (kind feat)
            `(screen-pos ,(id (xy-to-dmo (xy-loc feat) t))
              value ,(value feat)
              color ,(color feat)
              direction ,(direction feat))
            :obj (screen-obj feat)
            :where :external))
I don't do it that way simply because it duplicates a lot of code, but you can do it however you like.
Your methods can also be more complicated, and construct DMO's out of collections of features. This is how RPM handles text, for instance. If you need help doing this, please contact me (Mike) directly.