Chapter 3: Recognizing Objects

Tuesday, Jan 22, 2013

Visual Perception
- vision is the dominant sense, which is reflected in how much brain area is devoted to vision
- Form Perception - the process through which you manage to see what the basic shape and size of an object are
- Object Recognition - the process through which you identify what the object is

Why Is Object Recognition Crucial?
- virtually all knowledge and all use of knowledge depend on form perception and object recognition
- object recognition is also crucial for learning
- in virtually all learning, one must combine new information with information learned previously
- to do this you must categorize things properly

Beyond the Information Given
- object recognition begins with the detection of simple visual features
- Gestalt Psychology - the perceptual whole is often different from the sum of its parts
- Jerome Bruner voiced similar claims and coined the phrase “beyond the information given” to describe some of the ways that our perception of a stimulus differs from (and goes beyond) the stimulus itself
- e.g. the Necker Cube is an ambiguous figure because there is more than one way to perceive it

- your perception of the cube is not neutral: you perceive it as having one configuration or the other (this goes beyond the information given)
- e.g. the vase and 2 faces figure is also neutral with regard to perceptual organization
- it is neutral with regard to figure/ground organization, the determination of what is the figure and what is the ground
- your perception is not neutral: it specifies whether you are looking at 2 faces or a vase

Organization and “Features”
- one might suppose our perception of the world proceeds in 2 broad steps: (1) we collect information about the stimulus, so that we know what corners or angles are contained in the input, (2) we interpret the information, and presumably it would be in this second step that we “go beyond the information given”, deciding how the form is laid out in depth
- this is wrong: we do not just “pick up” the information, and our organization of the input happens before we start cataloguing the input’s basic features
- e.g. these black shapes initially seem to have no meaning, but after a moment, most people discover a word hidden in the figure

- this means that at the start, the form seems not to contain the features needed to identify the letters; once it is recognized, it does contain those features and the letters are immediately recognized
- the features are as much “in the eye of the beholder” as they are in the figure itself
- e.g. most of the features needed for this recognition are absent from the figure but you easily provide the missing features

- these examples create a potential paradox: on one side, our perception of any form must start with the stimulus and must be governed by what is in that stimulus
- on the other side, these last 2 examples suggest the opposite is also the case: that the features one finds in an input depend on how the figure is interpreted
- on this second view, it is the interpretation, not the features, that must come first

- the solution hinges on the fact that the brain relies on parallel processing (analyzing a pattern’s basic features at the same time as analyzing the configuration)

The Logic of Perception
These steps of perceptual organization matter for us in many ways:
- the organization determines our most immediate impression of the stimulus
- it decides whether a form will be recognized as familiar or not
- what matters for familiarity is the figure as perceived - the figure plus the specifications added by the perceiver
In what sense is perception “logical”?
- the interpretation achieved by the perceptual system (a hypothesis) must fit with all the incoming stimulus information
- the perceptual system acts just like a scientist, avoiding overly elaborate explanations of the “data” if a simpler explanation will do
- the perceptual system seems to avoid interpretations that involve coincidences
- e.g. this is initially perceived as 2 lines crossing but could also be perceived as 2 V shapes (which would rely on a coincidence: that the forms just happen to be in exactly the right position)

Object Recognition
- with the form organized, we are ready to take the next steps toward identifying what the form actually is
- start with examining how we recognize letters and words

Recognition: Some Early Considerations
- you can recognize a huge number of patterns, various actions and different sorts of situations
- you can also recognize objects even when your information is partial, e.g. you still recognize a cat if only its head and paw are visible behind a tree
- you can recognize tens of thousands of different words as well
- recognition of objects is influenced in important ways by the context
- e.g. you read this as THE CAT not TAE CHT

Features
- one plausible suggestion is that many objects are recognized by virtue of their parts or features (vertical lines, curves)
- the features that are relevant here are not the features of the raw input; we use the ones in our organized perception of the input
- e.g. we recognize all the forms in this as triangles even though the essential features are present in only one of them

- the features are, however, present in our organized perception of the forms, and this returns us to our proposal: we recognize objects by detecting the presence of the relevant features
Advantages of a feature-based system:
1. features could serve as general-purpose building blocks
2. people can recognize many variations on the objects they encounter
3. several lines of data indicate that features do have priority in our perception of the world
   - e.g. in a visual search task, participants have to indicate whether a certain target is or is not present in a display
   - if the target is a single prominent feature, the task is absurdly easy
   - the visual system does not inspect each of the figures; the difference jumps out immediately
4. other results suggest that the detection of features is a separate (and presumably early) step in object recognition

- evidence comes from people who suffer from Integrative Agnosia, caused by damage to the parietal cortex: they can recognize features but cannot recognize how the features are bound together to form complex objects
- studies in which transcranial magnetic stimulation (TMS) was used to disrupt portions of the brain in healthy individuals found that disruption of the parietal lobe had no impact on the processing of single features but did disrupt the processing of conjunctions of features

Word Recognition

Factors Influencing Recognition
- in many studies, participants have been shown stimuli for just a brief duration (20 or 30 ms)
- older research used a Tachistoscope, a device designed to present stimuli for precisely controlled amounts of time; modern research uses computers
- each stimulus is followed by a post-stimulus Mask - often just a random jumble of letters, which serves to interrupt any continued processing that participants might try to do for the stimulus
- whether people can recognize these stimuli depends on how familiar the stimulus is and how recently they have seen it
- the first exposure Primes the participant for the second exposure; this is a case of Repetition Priming

The Word-Superiority Effect
- words themselves are easier to perceive, as compared to isolated letters
- this effect is usually demonstrated with a “two-alternative, forced-choice” procedure, e.g. showing a K or DARK and asking whether there was an E or a K
- the accuracy rates are higher in the word condition, so apparently recognizing words is easier than recognizing isolated letters

Degrees of Well-Formedness
- it is easier to recognize an E if it appears within a string (FIKE) than it is if the E appears in isolation
- having the context is helpful even if the context is neither familiar nor meaningful
- all contexts provide an advantage
- a letter string like “GLAKE” is far easier to recognize than “JPSRW” even though it is no more familiar

- what matters is whether the string, familiar or not, is well formed according to the rules of the language
- if the string is well formed, then it will be easier to recognize and will produce a word-superiority effect
- following the rules is a matter of degree, and so a letter string that follows the rules closely will produce a stronger word-superiority effect
How do we assess “resemblance to English”?
- one way is through pronounceability
- another way is in statistical terms (how often the letter P follows the letter J): which letter combinations are likely and which are rare

Making Errors
- the influence of spelling patterns emerges in the mistakes we make
- errors are systematic: there is a strong tendency to misread less-common letter sequences as if they were more-common patterns; irregular patterns are misread as if they were regular patterns
- e.g. “TPUM” is likely to be misread as “TRUM” or “DRUM”
- these errors are sometimes quite large
- misspelled words, partial words or nonwords are read in a way that brings them into line with normal spelling
- these errors are referred to as Over-Regularization Errors

Feature Nets and Word Recognition

The Design of a Feature Net
- imagine we want a system that will recognize the word “CLOCK”
- how might a “CLOCK” detector work?
- perhaps each individual letter has its own detector, but how do the letter detectors work?
- perhaps the L-detector is wired to a horizontal-line detector, a vertical-line detector and maybe a corner detector
- the idea is that there is a network of detectors, organized in layers, with each subsequent layer concerned with more complex, larger-scale objects
- the bottom layer is concerned with features, and this is why networks of this sort are often referred to as Feature Nets

- at any point, each detector has a particular Activation Level, which reflects how activated the detector is at just that moment
- when the activation level reaches the detector’s Response Threshold, the detector will Fire (send its signal to the other detectors it’s connected to)
- no one is suggesting that detectors are neurons, or even large groups of neurons
- detectors would presumably involve much more complex assemblies of neural tissue
- within the net, some detectors will be easier to activate than others
- this readiness to fire is determined in part by how activated the detector was to begin with
- what determines a detector’s starting activation level?
- detectors that have fired recently will have a higher activation level (warm-up effect)
- detectors that have fired frequently in the past will also have a higher activation level (exercise effect)
- frequent words have appeared often in the things you read, so the detectors needed for recognizing these have been frequently used and have a high activation level
- thus, even a weak signal (brief presentation) will bring these detectors to their response threshold and will be enough to make the detectors fire
- repetition priming - presenting a word once will cause the relevant detectors to fire; once they have fired, their activation levels will be temporarily lifted, so only a weak signal will be needed to make the detectors fire again

The Feature Net and Well-Formedness
- the net we’ve described so far cannot explain all of the data
- the effects of well-formedness can’t be explained in terms of letter detectors, e.g. the same letters are used in “PIRT” and “ITPR” yet one is much easier to recognize than the other
- the difference also can’t be explained in terms of word detectors: none of these letter sequences is a word

- one option is to add another layer to the net, a layer filled with detectors for letter combinations
- we have added a layer of Bigram Detectors - detectors of letter pairs
- these detectors, like all the rest, will be triggered by lower-level detectors and send their output to higher-level detectors
- like other detectors, bigram detectors will start out with a certain activation level

- this turns out to be all the theory we need
- for “RSFK”, none of the letter combinations is familiar, so the string will receive no benefit from priming; a strong input will be needed to bring the relevant detectors to threshold, and so the string will be recognized only with difficulty

Recovery from Confusion
- imagine presenting the word “CORN” for 20 ms
- the quantity of incoming information is small and so some of the word’s features may not be detected
- suppose the fast presentation of the O wasn’t enough to trigger all of its feature detectors: only the bottom-curve detector is firing, and the O detector will only be weakly activated
- the same is true for U, Q, and S: their detectors are also responsive to the bottom-curve feature detector, so they will be weakly activated too
- at the next level, the C detector is strongly activated because the C was clearly visible
- the detectors for O, U, Q, and S are all weakly activated, so the bigram detectors for CO, CU, CQ, and CS are all receiving the same input: a strong signal from one of their letters and a weak signal from the second letter
- the CO detector is well primed (frequent pattern) and so the input probably will be enough to fire this detector
- the CU, CQ and CS detectors, if they even exist, are not primed at all

- at the letter level, there was confusion about the input letter’s identity: several detectors were firing equally
- at the bigram level, only one detector responds, so the confusion has been sorted out
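
The recovery-from-confusion example above can be sketched in a few lines of Python. This is only an illustration, not a model taken from the chapter: the `Detector` class, the numeric activation levels, the threshold of 1.0, and the rule that a bigram's input is the mean of its two letter signals are all assumptions chosen to make the logic concrete.

```python
from dataclasses import dataclass


@dataclass
class Detector:
    """Hypothetical detector: fires when resting activation plus input reaches threshold."""
    name: str
    resting: float = 0.0    # starting activation, raised by frequent or recent firing
    threshold: float = 1.0  # response threshold (assumed value)

    def fires(self, signal: float) -> bool:
        return self.resting + signal >= self.threshold


def bigram_responses(letter_signals, bigram_detectors):
    """Return the names of the bigram detectors that fire for the given letter signals."""
    fired = []
    for det in bigram_detectors:
        first, second = det.name
        # Assumed rule: a bigram's input is the mean of its two letters' signals.
        signal = (letter_signals.get(first, 0.0) + letter_signals.get(second, 0.0)) / 2
        if det.fires(signal):
            fired.append(det.name)
    return fired


# Brief presentation of "CORN": C is clearly seen; only the bottom curve of the
# second letter registers, so O, U, Q and S are all weakly and equally activated.
letter_signals = {"C": 1.0, "O": 0.2, "U": 0.2, "Q": 0.2, "S": 0.2}

# CO is a frequent pattern, so its detector starts well primed; the others do not.
bigram_detectors = [
    Detector("CO", resting=0.6),
    Detector("CU"),
    Detector("CQ"),
    Detector("CS"),
]

print(bigram_responses(letter_signals, bigram_detectors))  # ['CO']
```

All four bigram detectors receive exactly the same input, so the outcome is decided entirely by their resting activation levels. Under the same assumptions, a brief presentation of “CQRN” would produce the same weak letter signals and therefore the same CO response.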

Ambiguous Inputs
- the mechanism just described also helps in explaining some other evidence
- in this example, when the ambiguous character is in view, it will trigger some of the features normally associated with an A and some normally associated with an H
- the A and H detectors will fire weakly
- the T detector will fire well, and a moderate signal will be sent to both the TH and TA detectors and likewise to the THE and TAE detectors
- the TH detector is enormously well primed and so is the THE detector
- a similar explanation will handle the word-superiority effect
- imagine we present the letter A in the context AT
- if the presentation is brief enough, participants may see very little of the A, maybe just the horizontal crossbar, and so can’t distinguish between A, F or H
- all 3 detectors would fire weakly

- now imagine that the participants did perceive the second letter
- the AT bigram is far better primed than the FT or HT bigrams
- context allows you to make better use of what you see

Recognition Errors
- there is a downside to all of this
- e.g. if you present “CQRN” and the presentation is brief enough, participants will register only a subset of the string’s features
- the second letter will only register as a bottom curve, activating the Q, U and O detectors
- due to priming, the person will think it says “CORN”

- this demonstrates how over-regularization errors come about
- the network is biased, favoring frequent letter combinations over infrequent ones

Distributed Knowledge
- we’ve considered several indications that knowledge of spelling patterns is “built into” the network
- the sense in which the net “knows” these facts about spelling, or the sense in which it “expects” things or “makes inferences”, is worth emphasizing
- a bit of knowledge such as “CO is a more common pattern than CF” is not stored anywhere
- this memory is manifest only in the fact that the CO detector happens to be more primed than the CF detector
- the CO detector doesn’t “know” anything about this advantage

- each simply does its job, and occasions arise that involve a “competition” between these detectors, which will be “decided” in a straightforward way, by activation levels
- expectations and inferences emerge as a direct consequence of the activation levels
- the network’s “knowledge” is not Locally Represented anywhere
- to find it, we need to look at the relationship between the detectors’ levels of priming and at how this relationship leads to one detector being more influential than the other
- the knowledge about bigram frequencies is Distributed Knowledge - represented in a fashion that’s distributed across the network, and detectable only if we consider how the entire network functions
- what is remarkable about the feature net lies in how much can be accomplished with a distributed representation and thus with simple, mechanical elements correctly connected to one another
- the net appears to make inferences and to know the rules of English spelling
- the actual mechanics involve neither inferences nor knowledge
- the activity of each detector is locally determined - influenced by just those detectors feeding into it

Efficiency Versus Accuracy
- the network does make mistakes, misreading some inputs and misinterpreting some patterns
- these mistakes are produced by exactly the same mechanisms that are responsible for the network’s main advantages
- perhaps we should view the errors as simply the price one pays in order to gain the benefits
- do we really need to pay this price?
- outside of the lab we see stimuli for long periods of time, so we can rely on fewer inferences and gain a higher level of accuracy
- to maximize accuracy, we could scrutinize every character on the page
- the cost would be insufferable: reading would be unspeakably slow
- one can read a page with remarkable speed by reading only some of the letters and making inferences about the rest
- one doesn’t need every lettr to identify what a wrd is, oftn the missng letter is perfctly predctable from the contxt, virtually guaranteeing that inferences will be correct
- the efficient reader is not being careless: given the redundancy of text and the slowness of letter-by-letter reading, the inferential strategy is the only strategy that makes sense
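
The point about redundancy can be illustrated with a small sketch: given a word with a letter missing, a reader armed with a vocabulary and some sense of word frequency can usually recover the intended word. Everything here is an assumption made for illustration (the `restore` helper, the tiny lexicon, and the made-up frequency counts); it is not a claim about how readers actually perform the inference.

```python
def is_subsequence(fragment: str, word: str) -> bool:
    """True if the letters of `fragment` appear in `word` in the same order."""
    remaining = iter(word)
    return all(letter in remaining for letter in fragment)


def restore(fragment: str, lexicon: dict, max_missing: int = 1) -> str:
    """Guess the intended word for a fragment with up to `max_missing` letters dropped."""
    candidates = [
        word for word in lexicon
        if 0 <= len(word) - len(fragment) <= max_missing
        and is_subsequence(fragment, word)
    ]
    # Prefer the most frequent candidate; fall back to the fragment itself.
    return max(candidates, key=lexicon.get) if candidates else fragment


# Hypothetical word frequencies (made-up counts, for illustration only).
lexicon = {"word": 50, "ward": 5, "letter": 30, "often": 40, "context": 20}

print([restore(f, lexicon) for f in ["wrd", "lettr", "oftn", "contxt"]])
# ['word', 'letter', 'often', 'context']
```

Note that “wrd” is compatible with both “word” and “ward”; the sketch settles the tie by frequency, which parallels the way the network favors frequent patterns over infrequent ones.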

Descendants of the Feature Net
- variations on the network idea

The McClelland and Rumelhart Model
- there is a feature base and a network of connections, as various detectors serve to influence one another
- this net is better able to identify well-formed strings than irregular strings, and is also more efficient in identifying characters in context as opposed to characters in isolation
- several attributes of the net allow it to do this without bigram detectors
- activation of one detector can serve to activate other detectors - excitatory connections
- detectors can also inhibit one another - inhibitory connections
- the model allows for more complicated signaling: higher-level detectors (word detectors) can influence the lower-level detectors, and detectors at any level can also influence the other detectors at the same level
- e.g. TRIP is briefly shown and only the R, I, and P are identified
- these letter detectors will fire, activating the detector for TRIP and inhibiting the firing of all the other word detectors, e.g. TRAP and TAKE
- activation of the TRIP detector will excite the detectors for its component letters (detectors for T, R, I and P)
- the R, I and P detectors were already firing, so this extra activation “from above” has little impact
- the T detector was not firing before: the weak presentation was not enough to trigger an unprimed detector
- in effect, the activation of the word detector for TRIP implies that this is a context in which a T is quite likely, so the network “prepares” itself for T
- detection of a letter sequence makes the network more sensitive to elements that are likely to occur within that sequence
- there is evidence that higher-level detectors can activate lower-level detectors
- biological evidence has been found for this sort of two-way communication in the nervous system
- visual processing is not a one-way process
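
The TRIP example above can be sketched as one cycle of bottom-up excitation, within-level inhibition, and top-down feedback. This is a loose illustration and not the McClelland and Rumelhart model itself: the function names, the update rule, and the `inhibition` and `feedback` values are assumptions, and no response thresholds are modelled.

```python
def word_support(evidence, word):
    """Bottom-up excitation: sum the letter evidence at each position of the word."""
    return sum(evidence[pos].get(letter, 0.0) for pos, letter in enumerate(word))


def interactive_step(evidence, words, inhibition=1.0, feedback=0.5):
    """One hypothetical cycle of excitation, inhibition, and top-down feedback."""
    support = {w: word_support(evidence, w) for w in words}
    total = sum(support.values())
    # Inhibitory connections: each word detector is suppressed by its competitors.
    activation = {w: max(0.0, s - inhibition * (total - s)) for w, s in support.items()}
    # Excitatory feedback: active word detectors boost their component letters.
    updated = [dict(position) for position in evidence]
    for w, act in activation.items():
        for pos, letter in enumerate(w):
            updated[pos][letter] = updated[pos].get(letter, 0.0) + feedback * act
    return activation, updated


# "TRIP" shown briefly: R, I and P are identified, T is barely registered.
evidence = [{"T": 0.1}, {"R": 1.0}, {"I": 1.0}, {"P": 1.0}]
activation, updated = interactive_step(evidence, ["TRIP", "TRAP", "TAKE"])

print(activation)       # TRIP stays active; TRAP and TAKE are driven to 0.0
print(updated[0]["T"])  # higher than the original 0.1
```

With these assumed values, the TRIP detector wins the competition and its feedback raises the activation of the T detector, which is the sense in which the network “prepares” itself for a T.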

Recognition by Components
- what about recognizing objects other than printed language? do they use a feature network?
- Recognition by Components (RBC) Model - includes an intermediate level of detectors, sensitive to Geons (geometric ions)
- geons serve as the basic building blocks of all the objects we recognize; they are simple shapes, such as cylinders, cones and blocks
- we need at most 3 dozen different geons to describe every object in the world
- the network uses a hierarchy of detectors
- the lowest-level detectors are feature detectors, which respond to edges, curves, etc.; these activate geon detectors
- higher levels of detectors are sensitive to combinations of geons
- geons are assembled into more complex arrangements called “geon assemblies”
- these assemblies activate the Object Model, a representation of the complete, recognized object

There are advantages of this method:
- geons can be identified from virtually any angle of view, so recognition based on geons is Viewpoint-Independent
- most objects can be recognized from just a few geons
- recognition of objects is relatively easy if the geons are easy to discern, more challenging if they are ambiguous
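
The idea that an object can be recognized from just a few of its geons can be sketched as a simple matching step. This is a deliberately crude illustration: the object models and geon labels below are invented, and the sketch ignores how the geons are arranged relative to one another, which the RBC model's geon assemblies do take into account.

```python
# Hypothetical object models, each described only by the geons it is built from.
OBJECT_MODELS = {
    "lamp": {"cone", "cylinder", "block"},
    "mug": {"cylinder", "curved tube"},
    "traffic cone": {"cone", "block"},
}


def recognize(visible_geons, models=OBJECT_MODELS):
    """Return the object whose geon set is best covered by the visible geons."""
    def coverage(name):
        geons = models[name]
        return len(geons & visible_geons) / len(geons)
    return max(models, key=coverage)


# Only two of the lamp's three geons are visible, yet it is still the best match.
print(recognize({"cone", "cylinder"}))  # lamp
```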

Recognition via Multiple Views
- a number of researchers have offered a different approach to object recognition
- they propose that people have stored in memory a number of different views of each object they can recognize

- to recognize the object, one must match the current view of the object with one of these views in memory
- the number of views in memory is limited, so in many cases the current view won’t line up with any of the available images
- one then needs a time-consuming process of “rotating” the remembered view into alignment with the current view
- recognition will be Viewpoint-Dependent - our recognition is faster from some angles than others
- how does recognition proceed?
- one proposal involves processes very much like those in the networks we’ve been discussing: a hierarchy of layers, with detectors tuned to what the object looks like from a particular vantage point
- the view-tuned representations are probably supported by tissue in the inferotemporal cortex, near the terminus of the “what” pathway
- there continues to be debate between advocates of the RBC approach and the multiple-views approach

Different Objects, Different Recognition Systems?
- can other sorts of recognition (sounds, faces, smells) be approached the same way, with a network?
- the evidence is mixed
- many domains do show similar mechanisms
- one category of input seems to require a different sort of recognition system: faces

Faces
- Prosopagnosia - people lose their ability to recognize faces
- this implies the existence of a special neural structure involved almost exclusively in the recognition and discrimination of faces
- face recognition is distinctive in its very strong dependence on orientation
- the perception of faces is strikingly different when we view faces upside down
- people with prosopagnosia can also lose the ability to recognize other things, e.g. a birdwatcher may lose the ability to tell types of birds apart
- the fusiform face area (FFA) is specifically responsive to faces
- tasks requiring subtle distinctions among birds also produce high levels of activation in this area

- the neural tissue “specialized” for faces isn’t used only for faces
- the system seems crucial whenever a task has 2 characteristics: (1) the task has to involve recognizing specific individuals within a category and (2) the category has to be extremely familiar
- some authors suggest that humans have one object-recognition system specialized for the recognition of a pattern’s parts and the assembly of those parts into larger wholes
- a second system is specialized for the recognition of Configurations; it is less able to analyze patterns into their parts and less able to recognize these parts, but it is more sensitive to large-scale configurations
- damage to the second system is presumably what underlies prosopagnosia
- damage to the first system disrupts the ability to recognize words, objects or any other target that is identified by its inventory of parts

Top-Down Influences on Object Recognition
- we need at least 2 recognition systems: the part-based system and the configuration system
- even for the targets for which the feature net is useful (print, common objects), it turns out that the feature net must be supplemented with additional mechanisms
- the net is still plainly needed as part of our theoretical account

The Benefits of Larger Contexts
- e.g. tell a person that you are about to show them a word naming something you can eat, then tachistoscopically show the word “CELERY”: the result is a large priming effect
- first, the person needs to understand each of the words in the instruction
- second, the person must understand the syntax of the instruction
- third, the person has to know some facts about the world
- this instance of priming relies on a broad range of knowledge

Interactive Models
- there are 2 types of priming and, with that, 2 types of processing
- the focus so far has largely been on Data-Driven or Bottom-Up Processing - incoming information triggers a response by feature detectors, which in turn triggers a response by letter detectors, etc.
- object recognition also involves Concept-Driven or Top-Down Processing - processing that is driven by a broad pattern of knowledge and expectations, including knowledge that cannot easily be understood as an echo of frequent or recent experiences

- most contemporary models of object recognition involve both top-down and bottom-up components and are said to be Interactive Models (like the word-recognition model)
- concept-driven priming relies on knowledge that is separate from one’s knowledge about letters, words and letter combinations
- we cannot view object recognition as a self-contained process
- knowledge that is external to object recognition is imported into and clearly influences the process