IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 38, NO. 4, JULY 2008
Human-Automated Judge Learning: A Methodology for Examining Human Interaction With Information Analysis Automation

Ellen J. Bass and Amy R. Pritchett
Abstract—Human-automated judge learning (HAJL) is a methodology providing a three-phase process, quantitative measures, and analytical methods to support design of information analysis automation. HAJL's measures capture the human and automation's judgment processes, relevant features of the environment, and the relationships between each. Specific measures include achievement of the human and the automation, conflict between them, compromise and adaptation by the human toward the automation, and the human's ability to predict the automation. HAJL's utility is demonstrated herein using a simplified air traffic conflict prediction task. HAJL was able to capture patterns of behavior within and across the three phases with measures of individual judgments and human–automation interaction. Its measures were also used for statistical tests of aggregate effects across human judges. Two between-subject manipulations were crossed to investigate HAJL's sensitivity to interventions in the human's training (sensor noise during training) and in display design (information from the automation about its judgment strategy). HAJL identified that the design intervention impacted conflict and compromise with the automation, that participants learned from the automation over time, and that those with higher individual judgment achievement were also better able to predict the automation.

Index Terms—Decision-making, human–automation interaction, information analysis automation, judgment analysis, lens model, training, uncertainty, user modeling.
Manuscript received August 1, 2004. This work was supported in part by the Naval Air Warfare Center Training Systems Division under Contract N61133999-C-0105. This paper was recommended by Associate Editor R. Hess. E. J. Bass is with the Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904 USA (e-mail: ejb4n@virginia.edu). A. R. Pritchett is with the Schools of Aerospace Engineering and Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCA.2008.923068

I. INTRODUCTION

A. Human Interaction With Information Analysis Automation

Judgment, defined as the assessment of an attribute of the environment through consideration of available information, is a critical component of many human activities. In a judgment task, the goal is to assess the true state of the situation (i.e., the environmental criterion) from the information available. Making judgments involves acquiring information from the environment and transforming it into an assessment of the criterion. Information analysis automation [1] can assist humans in making judgments by processing retrieved environmental information before a decision is made. It may fulfill all or any of the following:

1) Convert raw data from one or more sensors into a form that has an easier-to-understand relationship to the criterion.
2) Compare current sensor input to databases or models of system behavior in order to make an assessment of performance, hazard, or abnormality (e.g., [2]).
3) Use statistical and pattern recognition techniques to detect or highlight trends, patterns, or conditions (e.g., [3]).
4) Assemble multiple sources of information into a single unified assessment (e.g., [4]).

Information analysis automation can be a component of decision support systems that provide recommendations. That is, decision support systems may be seen as the combination of judgment functions with mechanisms for selecting between alternatives. Information analysis automation is also an important component of alerting systems that integrate multiple sources of information, monitor for trends, and use computationally intensive algorithms and large databases to make an assessment of the potential hazard [5]–[8].

As information analysis automation is incapable of acting upon the environment directly, it exists fundamentally to interact with humans. In the simplest case, the automated system would make perfect judgments, allowing the human to rely on it in all cases. While people perform better when supported by more accurate information analysis automation and may perform even worse with inaccurate automation than if completely unaided [9], no automated system can make assessments perfectly in all task contexts and environmental conditions. Thus, the human is making a judgment in parallel with the automation and is expected to discriminate between the automation's good and bad judgments. This discrimination may take several forms, ranging from complete reliance on the automation's judgment to making a separate judgment concerning the correctness of the automation's judgment and to ignoring the automation completely.

Understanding human–automation interaction is complicated by the impact of the automation's judgments on the human's information seeking, cue utilization, and judgment policy [7], [10]–[16]. For example, many problematic interactions are hypothesized to be the result of "automation bias" [12] (i.e., the use of automation as a heuristic replacement for vigilant information seeking and processing). Overreliance can occur when the human decides to shed the judgment task and rely on the automation. Underreliance (i.e., less use of the automation
than intended by its designer) is also an issue. For example, nonconformance to automatically generated alerts is commonly cited [12], [16].

Over time, people learn through experience in the task environment as well as from the automation. They can thereby tailor if/how the automation's outputs are considered [15], [17], [18]. Performance can be affected by many other factors such as perception of risk, self-confidence, and trust [19]–[23], each of which can be dynamically changing. Many of these factors will evolve with time and with increased experience with the system, particularly as the human judge learns about and adapts to the automation.

One aspect of the larger environment that will impact these factors is the information available to the human judge (in terms of both content and representation). The display of the automation's judgments may not represent the rationale, criteria, and determining factors used in forming its judgment, thus impacting human-judgment performance (e.g., [9], [24]–[26]). The complexity of the automation's algorithms, the context in which they are employed, and how well their judgment strategies correspond with the strategies of the human judge can affect how/whether the human judge uses the automation's output (e.g., [13], [27]). Highly complex systems may be ignored if they utilize different judgment strategies than their operators or are difficult for the operators to understand [13], [28], [29]. Strategies that are too simplistic may be considered nuisances [7]. The uncertainty considered by the automation's algorithms and how that uncertainty is represented impact human-judgment performance (e.g., [30]). While researchers are working to develop modeling methodologies that use modal structures to explicitly represent algorithm design [31], [32] and environmental uncertainty [2], these solutions do not fully consider the interaction between the design and the human's judgment behavior. For example, even automation that does not consistently apply its judgment strategy and is not consistently accurate can, if it displays its strategy, enhance human-judgment performance over another unreliable automation that does not display its strategy [9]. Other design strategies use automation to help identify environmental constraints so that the judgment becomes clear (e.g., [33]–[35]).
B. Measuring the Interaction Between the Human Judge and Information Analysis Automation

Improved theoretical constructs and quantitative methodologies are necessary to understand and evaluate human–automation interaction. Measures of human–automation interaction with information analysis automation should be able to examine human judgment with and without automation across a variety of conditions. They should be generalizable for each individual across a range of conditions within the task environment and able to guide individual-specific interventions such as personalized training. Once these measures have been collected from a sufficient number of individuals, they can be part of a systematic experimental design examining aggregate effects on performance and behavior. This approach is suitable for demonstrating the impact of interventions such as changes in the automation and the information displayed to the human judges. For validity, the testing method should collect these measures in a realistic task environment. For practicality, the
testing should not require a prohibitive amount of data, additional expensive apparatus, or specialized testing personnel. The measures should decompose less-than-perfect judgment performance to account for the judgment processes of each judge (both the human and the automation), the impact of uncertainty in the environment, and the interactions between them.

Based on the human–automation interaction [13], [36]–[38], judgment analysis [39], and interpersonal learning (IPL) [40]–[42] literature (where the latter two are discussed in more detail in the next section), measures should, at a minimum, account for the following aspects of human interaction with information analysis automation:

1) cues used by each judge (i.e., human and automation) in rendering judgments [9], [43], [44];
2) the uncertainty in the relationship between the cues and the task environment [9], [30], [36], [37], [43];
3) judgment strategies by which the judges combine cues [9], [36]–[38], [43], [44];
4) the consistency with which these judgment strategies are applied [9], [36]–[38], [43];
5) the amount of conflict between the judges [37], [38];
6) whether the human judge compromises with the automation and, if so, how [38];
7) whether the human judge adapts to the automation and, if so, how [38];
8) the ability of the human judge to predict the automation's judgments and the human's perceived similarity between their judgments [38], [45].

Bisantz and Pritchett [37] used judgment analysis techniques to examine human and automated judgments. They demonstrated the ability of such techniques to assess conflict between a human judge and what the automation would have assessed. They also showed how these techniques could be used to decompose the contributions of judgment knowledge and application consistency to the application of either's judgment strategy. The results provided insight into the first four measures listed previously. However, that paper examined only the human and automation's independent judgments, not their interaction.

Seong and Bisantz [9] combined judgment analysis techniques with subjective estimates to model human–automation interaction. They manipulated the quality of the automation's judgments and the display of its feedback. Results showed that performance and assessments of trust were impacted by the automation's quality and that participants used feedback effectively to compensate for a poor-performing aid. While they explicitly measured trust and manipulated feedback about the automation, they did not measure the human judge's ability to predict the aid. Such a measurement can evaluate trust in and reliance upon the automation (see [46] and [47] for discussions about the relationship between trust and one's knowledge of the automation).

To address the desired measures (previously discussed), this paper presents a methodology for investigating human interaction with information analysis automation: human-automated judge learning (HAJL), as previously published, in part, in [38], [48], and [49]. After describing HAJL, this paper introduces a task and experimental design used as a test case for investigating HAJL's utility. Then, idiographic results (representative of the insights that HAJL can enable) and a nomothetic analysis of
Fig. 1. Double-system lens model. (a) Relationship between judgment, the criterion, and the cues. (b) Double-system lens model parameters.
the experimental manipulations are presented. This paper ends with conclusions surrounding HAJL's potential utility.

II. HAJL

HAJL builds on research from judgment analysis [39] and IPL [40]–[42] and applies these concepts from the social sciences to studying human–automation interaction, a practice becoming more common among researchers developing sociotechnical system frameworks (e.g., [20], [21], and [50]).

A. Judgment Analysis and the Lens Model

Judgment analysis has proven successful in modeling human judgment in several domains including social policy making, medicine, weather forecasting, sonar assessments in a simulated tactical submarine environment, and education [36], [43], [51]–[58]. It is based on Brunswik's probabilistic functionalism, which designates the organism–environment interaction as the primary unit of study [59], [60] [Fig. 1(a)]. Based on ecological theory, it considers both internal (cognitive) and external aspects of judgment. By explicitly modeling judgment behavior across a range of conditions in the task environment,
judgment analysis is also able to predict the impact of changes in the environment. 1) Double-System Lens Model: Cues may have associated uncertainty which limits the predictability of the criterion and, therefore, judgment. Thus, investigating a judgment’s achievement (i.e., its correspondence to the criterion) requires consideration of the relationship between the criterion and the cues (i.e., ecological validity). Aspects of the judge’s policy, including the relationship between the cues and the judgment (i.e., cue utilization) and the judge’s consistency in applying the policy, should also be considered. Using a double-system lens model, a decomposition of judgment achievement can be established by fitting the judgments and the criterion to symmetric linear models using the cues [Fig. 1(b)]. While the internal cognitive processes used by humans in forming judgments may be complex and unobservable, linear models have been found to provide powerful descriptors of them (however, see [61]–[63] for other models). For example, Goldberg [64] described how linear models reliably predicted clinical judgments such as psychological diagnoses, even when judges reported using more complex nonlinear strategies. Dawes [65] described how even nonoptimal linear models (e.g., with nonoptimal weights) can successfully predict judgments. Einhorn et al. [66]
found that linear models can replicate seemingly more cognitively representative process tracing or rule-based models of judgment.

The linear model on the left-hand side of Fig. 1(b) establishes a best fit between the environmental criterion (Y_e, where e stands for the environment) and the cues. The criterion is related to each cue i by statistical correlations r_ie's (the ecological validities). The ecological validity of a cue measures how well it specifies the true state of the criterion. Output of the environmental model, Ŷ_e, is estimated by applying the regression coefficients β_ie's. The linear model on the right-hand side of Fig. 1(b) establishes a best fit between the judgment (Y_s, where s stands for the human subject) and the cues. The judgments are related to each cue i by statistical correlations r_is's (the cue utilizations). The pattern of cue utilizations exhibited by a judge determines his judgment policy. Output of the cognitive model, Ŷ_s, is estimated by applying the regression coefficients β_is's as cue weights. As cues may be correlated, relationships among the cues, r_ij, are also a part of the description of the environment.

Achievement r_a (i.e., correspondence between judgments and the environmental criterion) is maximized when the pattern of cue utilizations mimics the pattern of ecological validities. The lens model equation (LME) [67], [68] partitions judgment achievement into lower level correspondences accounting for the contributions of the environment and judge (Table I)

r_a = G R_e R_s + C √(1 − R_e²) √(1 − R_s²).   (1)

TABLE I
SUMMARY OF LME MEASURES

Linear knowledge G measures how well the predictions of the linear model of the judge match predictions of the linear model of the environment. Thus, it measures how well a modeled judgment policy captures the linear structure of the environment. G is computed as the correlation between the outputs of the environment and judgment policy models

G = ρ(Ŷ_e, Ŷ_s).   (2)

Limitations in knowledge are associated with a failure to detect task properties and to correctly understand the reliabilities of the cues (for a review, see [69]). Even though a judge might have perfect task knowledge, performance can be limited by the judge's inability to apply that knowledge in a controlled and consistent fashion over time or between cases [70]. Cognitive control R_s measures the degree to which a judgment is predicted by the linear model of the judge. It is calculated by regressing human judgments on the cue values. R_s is the coefficient of multiple correlation obtained as a result of this regression analysis. As such, it is the similarity between judgments and predictions of the judgments based on a particular judgment policy model. Environmental predictability R_e measures the degree to which the criterion is predicted by the linear model of the environment. R_e is calculated as the coefficient of multiple correlation of the environmental linear regression model (regressing the environmental criterion on the cue values). Finally, unmodeled knowledge C measures the extent to which the judgment and the environment share the same nonlinear components. It is computed as the correlation between the residuals of the environment and the judgment policy models

C = ρ(Y_e − Ŷ_e, Y_s − Ŷ_s).   (3)
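To make the decomposition concrete, the following sketch (Python with numpy; the function and variable names are illustrative assumptions, not code from the original study) estimates the LME terms in (1)–(3) by fitting the two least-squares models to a judge's trials:

```python
import numpy as np

def lme_parameters(cues, criterion, judgments):
    """Estimate the lens model equation terms from one judge's trials.

    cues:       (n_trials, n_cues) array of cue values
    criterion:  (n_trials,) environmental criterion values Y_e
    judgments:  (n_trials,) judge's judgments Y_s
    """
    X = np.column_stack([np.ones(len(criterion)), cues])   # add intercept

    # Fit the two symmetric least-squares models of Fig. 1(b).
    beta_e, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    beta_s, *_ = np.linalg.lstsq(X, judgments, rcond=None)
    Ye_hat = X @ beta_e      # output of the environmental model
    Ys_hat = X @ beta_s      # output of the cognitive (judge) model

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    Re = corr(criterion, Ye_hat)                        # environmental predictability
    Rs = corr(judgments, Ys_hat)                        # cognitive control
    G = corr(Ye_hat, Ys_hat)                            # linear knowledge, (2)
    C = corr(criterion - Ye_hat, judgments - Ys_hat)    # unmodeled knowledge, (3)
    ra = corr(judgments, criterion)                     # achievement

    # With least-squares fits, the LME identity (1) reproduces ra up to rounding.
    ra_from_lme = G * Re * Rs + C * np.sqrt(1 - Re**2) * np.sqrt(1 - Rs**2)
    return {"ra": ra, "ra_from_lme": ra_from_lme, "G": G, "Re": Re, "Rs": Rs, "C": C}
```

For the triple-system model discussed below, the same function can be applied to each judge separately, and the conflict between two judges is simply the correlation between their judgment series.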
A high value for C may reflect nonlinearity in the actual ecological validity and cue utilization relationships, missing cues in the linear models of judgment and environment, and/or chance agreement between the residuals. Should the linear models perfectly fit to the actual criterion and judgments, then C will be equal to zero, and r_a will simply be the product of G, R_e, and R_s. Thus, the LME measures explicitly capture several important factors impacting judgment such as inaccuracies in assessments stemming from the stochastic nature of the environment, the human judge's knowledge of the task, and his ability to apply that knowledge.

Judgment studies show that experts may not be similarly calibrated with respect to how they assess the same situation, even when they work in similar environments (e.g., [71]). Judgment analysis studies show that characteristics such as cue validity [72], cue redundancy [73], cue–criterion function forms [74], and task predictability [74] can affect judgment policies and their application.

2) N-System Lens Model: As an extension to the double system, the n-system lens model incorporates more judges. Fig. 2(a) shows a triple-system lens model in which two judges are assumed to be acting independently with reference to the same cues. The achievement of each judge relative to the criterion can be determined. Also, a low correlation between the two judges' judgments will provide an indication of conflict between the two judges. A value of 1.0 indicates that the two judges provide identical judgments (i.e., a complete lack of conflict). In quantifying this representation, the judgments of each judge can be fit to a linear model, and the detailed properties of their judgments can be determined as described in the previous section. In addition, the linear models of the two judges can be compared [Fig. 2(b)]. The policy similarity of the two judges can be identified by comparing the output of their linear models. Likewise, the unmodeled knowledge shared by each model can be calculated between the two. Combined with the cognitive control of each judge, these parameters will together decompose the conflict measure.

B. HAJL Methodology

HAJL employs a triple-system lens model and adapts concepts and constructs from the IPL literature. IPL's approach involves the analysis of the processes used to learn about the judgments of another person while also considering environmental features, the other person's responses in relation to the
environment, and the learner's processes with respect to the environment and the other human judge. This analysis measures how a human judge may conflict with, compromise toward, and adapt to another judge. Researchers in interpersonal conflict and IPL have applied judgment analysis techniques to examine conflict between different judges working on the same task (e.g., [40]–[42], [58], [72]–[75]). As people can learn both individually and interpersonally, the IPL methodology studies how two people learn to make inductive judgments from each other and from the task environment. Representative results from IPL studies include the findings that conflicts between human judges can occur because they consider different cues, use different judgment policies, or inconsistently apply these policies and thus create false disagreement [40], [72], [73].

A typical IPL study engages two people in a common task about which they have differing partially valid beliefs. Using the multiple-cue probability learning paradigm [76], [77], the participants first separately attempt to learn the relationship of some distal criterion variable by attending to multiple fallible indicators. Different belief structures can be specifically constructed via disparate individual training at the start of the study. For example, each person can be trained by using different scenarios that encourage the utilization of different cues or the same cues but weighted differently [40]. When brought together, the pair makes judgments, often in task conditions that differ from their individual training or experience and thus require them to compromise. Because, initially, each of them is partially correct for different reasons, they can learn from each other as well as from the task environment. The two judges are later separated and asked to make new judgments as well as to predict what the other person would have estimated. Therefore, experimental design can be used to disambiguate between task learning and IPL by creating appropriate experimental conditions (e.g., [41] and [77]). HAJL adapts the three phases used in IPL studies to capture different types of interaction with automation, which are detailed in the following sections.

Fig. 2. Triple-system lens model. (a) Independent judgments and their conflict. (b) Triple-system lens model parameters.

Fig. 3. IL phase lens model of independent human and automation judgments, and human judgment provided with automation's judgment: Conflict, compromise, and adaptation.
1) First Phase—Training: As with IPL, HAJL first focuses on training the individual judge. In this case, only the human judge participates (the automation's judgment algorithms having been established during its design). The IPL paradigm often intentionally creates conflict between judges by training them in different environments or to use different judgment strategies. The resulting cognitive conflict is often the desired subject of study [40]. In contrast, studies using HAJL would more likely train the human to make individual judgments in a realistic environment rather than artificially create conflict with the automation. In this way, HAJL can serve to predict whether conflict between a trained human and automation will occur. This training may be conducted in any of several ways as suited to the situation, including being fit into an established training program or employing specific training interventions (where the human judges are trained to a criterion or are provided with a set number of training trials). This phase also captures the human's individual judgment policy at the end of training for comparison with judgments in the subsequent phases. Alternatively, it is possible that no formal training is provided and that the human judge brings prior domain experience directly to the next phase.

2) Second Phase—IL: In interactive learning (IL), the human and automation first make independent initial judgments. The correspondence between their initial judgments measures their conflict, where lower values indicate more conflict (Fig. 3). Unlike the human judges tested in IPL, automation typically cannot "negotiate" a joint judgment. However, it can highlight the cues it relies upon or otherwise portray its judgment policy, and can supply this information to the human. Thus, in each trial, the human provides a "joint" judgment using the automation's judgment as an additional cue. This phase mimics a human judge making judgments prior to and after the automation provides its feedback. For example, air traffic controllers do not have to wait for a response from a conflict-alerting system before they consider whether two aircraft may conflict. Correspondence between the human's initial and joint judgments provides a measure of compromise, where lower values indicate more compromise. Correspondence between the automation's judgment and the joint judgment measures the extent to which the human adapts to the automation.

3) Third Phase—Prediction: In the prediction phase, the human judge makes a judgment independent of the automation. He also predicts the automation's judgment. These judgments and the automation's allow the following three comparisons
to be made: predictive accuracy of the human (i.e., correspondence between the automation and his prediction of it), assumed similarity (i.e., correspondence between the human's judgment and his prediction of the automation), and actual similarity (i.e., correspondence between the human and the automation's judgments) (Fig. 4). This phase is important because it allows the analyst to determine if the human judge can predict the automation, a skill known to be critical to good performance [45] as well as to establishing trust and appropriate reliance.

Fig. 4. Prediction phase lens model of independent human and automation judgments, and human prediction of automation's judgment: Actual similarity, assumed similarity, and predictive accuracy.

HAJL Measure Summary: Each phase provides measures of human judgment and interaction with the automation (Table II). These measures can be examined directly, and they can also be decomposed via the LME to identify which underlying aspects of judgment behavior are yielding differences in judgment performance.

TABLE II
SUMMARY OF HAJL MEASURES BY PHASE

TABLE III
SUMMARY OF INFERENCES FROM HAJL MEASURES BY PHASE

Training phase measures: The training phase, beyond its utility in training the human judge, establishes the ability to predict what the human judgment (Y_s) would have been without the interaction with the automation in later phases. The human judge's achievement is calculated as

r_as = ρ(Y_s, Y_e).   (4)

In addition, the training phase provides a prediction of the conflict between the human and the automation's judgments (Y_a) formed independently

r_af = ρ(Y_s, Y_a).   (5)

IL phase measures: The IL phase provides direct measures of human interaction with the automation, including the conflict between their independent judgments, the compromise that the human judge makes between the initial (Y_si) and joint (Y_sj) judgments, and the adaptation to the automation's judgments in forming a joint judgment. The initial human-judgment achievement is calculated as

r_ai = ρ(Y_si, Y_e).   (6)

The joint judgment achievement is calculated as

r_aj = ρ(Y_sj, Y_e).   (7)

Conflict is calculated as

r_af = ρ(Y_si, Y_a).   (8)

Compromise is calculated as

r_am = ρ(Y_si, Y_sj).   (9)

Adaptation is calculated as

r_ad = ρ(Y_sj, Y_a).   (10)
The initial and joint judgments’ linear models and lens model parameters may also provide useful information. For example, high cue utilization of the automation’s judgments by the human when forming a joint judgment suggests that the human is using the automation as a heuristic replacement for other cues, indicative of automation bias [14]. Beyond these direct measures of interaction, comparisons between the measures can illustrate whether the human’s reliance on the automation enhances performance (Table III). Underreliance is indicated
by a lack of compromise toward reliable automation when the human's initial judgment achievement is low. Overreliance is indicated by a high degree of compromise toward the automation when its achievement is relatively low. The best interaction would result in a joint judgment whose achievement is higher than both the individual's initial judgment achievement and the automation's achievement.

Prediction phase measures: In the prediction phase, the human judge provides a judgment of the criterion as well as a prediction of what the automation would have yielded if it were available (Y_sa). Thus, similarity and predictive accuracy measures are calculable (Table II). Human judgment achievement is again calculated as

r_as = ρ(Y_s, Y_e).   (11)

Assumed similarity is calculated as

r_aa = ρ(Y_s, Y_sa).   (12)

Actual similarity is calculated as

r_af = ρ(Y_s, Y_a).   (13)

Predictive accuracy is calculated as

r_ap = ρ(Y_sa, Y_a).   (14)
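As an implementation-oriented summary, the sketch below (Python with numpy; the dictionary layout and key names are assumptions chosen for illustration) computes the correlational measures in (4)–(14) from a participant's logged judgments, the automation's judgments, and the environmental criterion:

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two equal-length arrays of judgments."""
    return np.corrcoef(a, b)[0, 1]

def hajl_measures(train, il, pred):
    """HAJL measures for one participant.

    Each argument is a dict of per-trial arrays:
      train: "Ys" (human), "Ya" (automation), "Ye" (criterion)
      il:    "Ysi" (initial), "Ysj" (joint), "Ya", "Ye"
      pred:  "Ys" (human), "Ysa" (prediction of automation), "Ya", "Ye"
    """
    return {
        # Training phase
        "achievement_train":    corr(train["Ys"], train["Ye"]),   # (4)
        "predicted_conflict":   corr(train["Ys"], train["Ya"]),   # (5)
        # IL phase
        "achievement_initial":  corr(il["Ysi"], il["Ye"]),        # (6)
        "achievement_joint":    corr(il["Ysj"], il["Ye"]),        # (7)
        "conflict":             corr(il["Ysi"], il["Ya"]),        # (8)
        "compromise":           corr(il["Ysi"], il["Ysj"]),       # (9)
        "adaptation":           corr(il["Ysj"], il["Ya"]),        # (10)
        # Prediction phase
        "achievement_pred":     corr(pred["Ys"], pred["Ye"]),     # (11)
        "assumed_similarity":   corr(pred["Ys"], pred["Ysa"]),    # (12)
        "actual_similarity":    corr(pred["Ys"], pred["Ya"]),     # (13)
        "predictive_accuracy":  corr(pred["Ysa"], pred["Ya"]),    # (14)
    }
```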
An ability to predict the automation can provide a basis for the human to trust it (e.g., [20] and [50]). It can also provide insight into the extent to which the human's interaction with the automation is based on accurate understanding.

III. METHOD FOR EMPIRICAL INVESTIGATION

We present an initial test case employing HAJL to investigate performance in a simplified air traffic conflict prediction task with information analysis automation. The purpose of this paper is to investigate HAJL's utility in diagnosing less-than-perfect judgment behavior when a human judge interacts with information analysis automation. Patterns of human–automation interaction are investigated for individual judges. Two interventions allow investigation of the methodology's sensitivity in measuring human–automation interaction across individuals. Simplification [78], the notion that a task should be simplified so that the trainee can develop skills, was implemented by increasing the environmental predictability (by decreasing the sensor noise) in the training phase. With respect to the display design intervention, there is flexibility in how the automation represents its judgments to the human judge (e.g., [9] and [13]). In this paper, the amount of cognitive feedback [69] provided by the automation regarding its judgment strategy was manipulated.

A. Experimental Task: Aircraft Conflict Prediction

Under many of the concepts for aviation such as "free flight" [79], pilots will be expected to maintain safe separation from other aircraft, rather than relying on air traffic controllers. Cockpit display of traffic information (CDTI) and information analysis automation such as conflict-alerting systems are being investigated to support pilot judgment and decision making at
this task (e.g., [80]). However, pilot interaction with information analysis automation must be effective and safe, requiring careful analysis. HAJL is intended to support such an analysis.

Using a cockpit simulator in straight and level cruise flight, participants monitored the progress of their own aircraft (the "ownship"), flown by an autopilot, relative to traffic. White noise was added to the traffic information: Its standard deviation was 19 kn for the traffic's indicated airspeed, 3.8° for its heading, and 0.11 nmi for its horizontal position. This uncertainty degraded environmental predictability to 0.886, which was perceptually apparent on the CDTI.

B. Experimental Procedure

This experiment included the three HAJL phases with a separate briefing for each phase. No debriefings were conducted at the end of any of the phases. In each phase, trials ran for a random preview time uniformly distributed between 15 and 30 s before participants made judgments. The participants were asked to provide judgments about the risk of losing safe separation, i.e., the probability that the traffic will be closer than 5 nmi at the point of closest approach (PCA). In the training phase, participants made individual judgments without the aid of the automation (no output from the automation was available). In the IL phase, participants formed an initial judgment independently, and then, once shown the automation's judgment, a second joint judgment. In these two phases, once judgments were entered, participants could then view the remainder of the flight in faster than real time to assess the actual conflict outcome. In the briefing before the IL phase, the participants were told that the automation's predictions were based on what the display showed when the scenario froze and that, if the sensor and display noise were particularly high at that time, they would impact the automation's judgments. In the prediction phase, the information analysis automation was not available to the participants, who provided an individual judgment and a prediction of the automation's judgment. As the participants were not told that they would be asked to predict the automation in this later phase, the experiment measured incidental (the type of learning that happens in naturalistic settings [40]) as opposed to instructed learning about the automation.

In the training phase, each participant judged 180 trials, which were distributed in four sessions over the first two days. In the IL phase, each judged 180 trials, which were distributed in four sessions over the third and fourth days. In the prediction phase, each participant judged 45 trials in one session on the fifth day.

C. Participants and Apparatus

Sixteen paid undergraduate engineering student volunteers participated: eight male and eight female. The experimental apparatus used a tailorable part-task aviation software suite [38], [81] with three display areas shown on one computer monitor: the primary flight display (PFD), the CDTI (Fig. 5), and the data entry area. Because ownship was in straight and level cruise flight controlled by an autopilot, it maintained a constant heading (174°), airspeed (400 KIAS), and altitude (30 000 ft); therefore, the information on the PFD never changed substantially.
Fig. 5. Cockpit display of traffic information with automation feedback (labels added).
TABLE IV
HORIZONTAL TRAJECTORY UNCERTAINTY

The egocentric CDTI updated once a second. Ownship position was depicted by a white airplane. A compass ring with tick marks every 5° appeared 40 nmi from ownship. Five white-dashed range circles at 5, 10, 20, 30, and 50 nmi, respectively, indicated distance from ownship. Traffic was depicted by a green triangle pointing in the direction of track heading with a text depiction of the traffic's indicated airspeed in knots. All aircraft were at the same altitude as ownship, so altitude was not displayed. Simulated noise was manifested using both graphic and numeric means: The traffic symbols moved with lateral position error, the orientation of the symbol changed with course error, and the displayed indicated airspeed fluctuated with airspeed error.

The data entry area collected the judgments and provided trial control. To enter a value, a participant entered the probability by using the computer mouse via a slide bar. In the IL phase, the automation's judgments were also shown in this area at the appropriate time.

The automation estimated the probability of losing safe separation based on current aircraft states and the uncertainty in the environment. Aircraft positions were projected to the predicted PCA by extrapolating the errors in current aircraft position, velocity, and heading. The variance in horizontal position was calculated as a function of uncertainty parameters (Table IV). Due to uncertainty in speed and course tracking, the position error increased with distance to the PCA. At the predicted PCA, the predicted horizontal miss distance (ĥ_miss) was determined. By using the predicted horizontal miss distance as the mean, the probability of conflict was determined from the cumulative distribution function of the predicted horizontal miss distance around the 5-nmi safe separation boundary.
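A minimal sketch of this type of conflict-probability computation appears below (Python with numpy and scipy). The uncertainty parameters and the linear error-growth model are placeholders standing in for the Table IV values and the full algorithm in [38]; only the overall structure (projection to the PCA, predicted miss distance, and a cumulative distribution function evaluated at the 5-nmi boundary) follows the description above.

```python
import numpy as np
from scipy.stats import norm

SEPARATION_NMI = 5.0  # safe separation boundary

def predicted_pca(own_pos, own_vel, tfc_pos, tfc_vel):
    """Time to the predicted PCA (s) and both positions at that time.
    Positions in nmi and velocities in nmi/s, as 2-D numpy vectors."""
    rel_pos = tfc_pos - own_pos
    rel_vel = tfc_vel - own_vel
    rel_speed_sq = rel_vel @ rel_vel
    t_pca = 0.0 if rel_speed_sq == 0.0 else max(0.0, -(rel_pos @ rel_vel) / rel_speed_sq)
    return t_pca, own_pos + own_vel * t_pca, tfc_pos + tfc_vel * t_pca

def conflict_probability(own_pos, own_vel, tfc_pos, tfc_vel,
                         sigma0_nmi=0.11, sigma_growth_nmi_per_s=0.01):
    """Probability that the horizontal miss distance at the PCA is below 5 nmi.

    sigma0_nmi and sigma_growth_nmi_per_s are illustrative stand-ins for the
    Table IV uncertainty parameters: the position error standard deviation
    grows with the time remaining to the PCA because of speed and course
    tracking uncertainty."""
    t_pca, own_pca, tfc_pca = predicted_pca(own_pos, own_vel, tfc_pos, tfc_vel)
    h_miss = np.linalg.norm(tfc_pca - own_pca)   # predicted horizontal miss distance
    sigma = sigma0_nmi + sigma_growth_nmi_per_s * t_pca
    # CDF of a normal distribution centered on h_miss, evaluated at the boundary.
    return norm.cdf(SEPARATION_NMI, loc=h_miss, scale=sigma)
```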
D. Scenarios

Scenarios were designed around conflict geometries known to affect human judgment performance (e.g., [82]). Six relative headings (±45°, ±90°, and ±135°) and five indicated airspeeds (300, 350, 400, 450, and 500 kn) were selected to create the traffic encounter geometries. For each of the 30 heading-speed combinations, six trials were created with bearings selected to create conflict probability frequency distributions matching those created by aircraft distributed uniformly in horizontal space. The time to PCA was selected randomly from a uniform distribution between 20 and 340 s.

E. Independent Variables and Experiment Design

The first between-subject experimental manipulation investigated HAJL's ability to assess a training phase intervention. In the "no simplification" training condition, the environmental predictability was 0.886 (achieved using the noise parameters defined in Table IV). In the "simplification" training condition, the noise in traffic state variables was halved, increasing the environmental predictability to 0.941. Regardless of their training condition, all participants experienced the no simplification condition in the subsequent IL and prediction phases.

The second between-subject experimental manipulation tested HAJL's ability to provide insight into a display design intervention. Although the automation's judgment strategy was fixed, there was flexibility in how its judgments were represented (e.g., [9], [13], and [26]). In the "outcome only" condition, the automation's output was available as a number. In the "automation feedback" condition, where the automation's judgment strategy information was also available, two additional elements were added to the CDTI (Fig. 5): future positions (both traffic and ownship) at the PCA and probability contours. The automation's estimate of ownship's future position at the
PCA, with an associated 5-nmi range ring, was represented directly on the CDTI. Its estimate of the traffic's position at the PCA was surrounded by a position error ellipse representing approximately two standard deviations. Color-coded probability contours, also shown on the CDTI, represented the distribution through horizontal space of the automation's estimated probability of conflict. More details on the automation's feedback can be found in [38]. This intervention was used in the IL phase, when the participants made joint judgments with the automation.

These two manipulations were crossed to create four experimental conditions. Two male and two female participants were assigned to each condition, creating a nested-factorial design with participants nested within gender, noise level in the training phase, and automation feedback in the IL phase.

F. Dependent Variables and Judgment Analysis Methods

All of the measures in Table II were acquired. Participants' judgments were recorded, including their individual judgments (all phases), their joint judgment formed with reference to the automation (the IL phase), and their prediction of the automation's judgment (prediction phase). In addition, the automation's judgments were recorded in all phases, whether or not they were shown to the participants.

The achievement of these judgments was assessed and decomposed using the LME with the following two cues: 1) ĥ_miss and 2) the signed reciprocal of the standard deviation of the horizontal position error. This second cue captures the noise (i.e., position error) as well as its relationship to the estimated probability of conflict. The residuals resulting from a regression model, fitted to the estimated probabilities using ĥ_miss as the only predictor variable, follow a U-shaped pattern when ĥ_miss is 5 nmi or less. The pattern is inverted when ĥ_miss is greater than 5 nmi. Because the sign of the error changes depending upon whether the miss distance is greater or less than 5 nmi, the standard deviation is negated when the miss distance is less than this amount. These cues were not explicitly available to the participants but could be inferred by projecting the aircraft positions and observing the simulated noise manifestations described earlier. For more details, see [38].

For the probability of conflict measures, a transformation was used to stabilize the variance before performing analyses [83]

y = sin⁻¹(√x).   (15)

As suggested by Cooksey [39], before performing the analyses of correlations, the parameters were transformed using Fisher's r-to-z transformation.
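Both transformations are one-liners; a sketch (Python with numpy; not the authors' analysis code) is:

```python
import numpy as np

def arcsine_transform(p):
    """Variance-stabilizing transform (15) for probability-of-conflict judgments."""
    return np.arcsin(np.sqrt(p))

def fisher_r_to_z(r):
    """Fisher r-to-z transform applied to the correlation-based HAJL measures."""
    return np.arctanh(r)
```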
G. Hypotheses for the Idiographic Analyses

In the IL phase, measures from initial and joint judgments, conflict, compromise, and adaptation can illustrate patterns of interaction with the automation. For example, a high-valued conflict measure (where initial judgments are similar to the automation, resulting in a lack of conflict) should coincide with high judgment achievement if both the human and the automation are good at the task. With high-achieving automation, adaptation should be high, as the joint judgments should match the automation. Similarly, with a reliable automated judge, low achievers should compromise more, and high performers should correctly maintain their initial judgments.

In the prediction phase, the participants were expected to perform better on the conflict prediction task than the prediction of the automation due to their experience with it in the previous phases (as well as being rewarded for the conflict prediction task). Thus, predictive accuracy and actual similarity should be low. If the human judges are well calibrated concerning their understanding of the automation, and they perform poorly at the task, assumed similarity should be low.

H. Hypotheses for the Nomothetic Analyses

There are two independent variables of interest in this paper: a training intervention and a display design intervention. Given that uncertainty affects performance in many task domains [39], [58], [84], the participants in the simplification condition should outperform those in the no simplification condition in the training phase. For the IL and prediction phases, those trained in the simplification condition should perform better assuming that they will have built up better skills [78].

A simplification strategy in the training phase has implications for cognitive conflict in the IL phase. Human judges trained in the simplified condition may be better able to perform the task. If this learning occurs and transfers to the IL phase, a human judge may have less conflict with the automated judge (assuming that the automated judge is reliable). Conversely, human judges trained in the more difficult environment may not learn the task as well. Even though they are in the same environment as in the IL phase, they may ironically have more conflict with the automated judge. That is, due to potentially lower task knowledge and/or cognitive control, these human judges have a higher chance of both actual and false disagreements with the automated judge.

Those in the automation feedback display condition in the IL phase should adapt to the automation more because they had more information concerning the automation's judgment strategy. If the automation feedback display provided the sense that one understood the automation better, one may be more likely to use and learn from it. Similarly, in the prediction phase, those in the automation feedback display condition should perform better at predicting the automation's judgments because they had better display information (i.e., more feedback concerning its judgment strategy) during the IL phase. Thus, they should have higher predictive accuracy and actual similarity. If the automation feedback display provided the sense that one understood the automation better, these participants should have higher assumed similarity.
IV. RESULTS

All results are reported as significant at the α = 0.05 level; trends are reported at the α = 0.10 level.

A. Representative Results of the Idiographic Analyses

The following results are derived from the data from four participants with different judgment behaviors in order to
demonstrate how HAJL can provide insight into an individual's judgment performance.

Fig. 6. Idiographic analyses of individual participants (y-axis defined in the legend). (a) Summary of measures for participant 6. (b) Summary of measures for participant 11. (c) Summary of measures for participant 12. (d) Summary of measures for participant 13.

Participant 6—Poor Judgment Achievement Performance: Participant 6 trained in the no simplification training condition and viewed the automation feedback during the IL phase. In all phases, her probability of conflict judgment achievement was low; it was, in some sessions, near zero or negative [Fig. 6(a)]. Poor cognitive control contributed to her poor
performance, as indicated by the high correlation between achievement and the cognitive control measure. Also, given the fluctuations in the linear knowledge and unmodeled agreement across the nine sessions, she may not have settled on a judgment strategy. These measures indicate that this participant would need additional training before making such judgments in actual operations with or without the aid of the automation.
In the IL phase, her initial judgments were very different from the automation's, as shown by the low values of the conflict measure. Her joint judgment achievement approached that of the automation, which is reflected in the progressively lower compromise measures. This indicated that the initial and joint judgments were dissimilar, which is a common pattern. However, her interaction with the automation did not noticeably improve either her linear knowledge or her initial judgment achievement. Her joint judgment achievement may reflect reliance on the automation more than learning from and about it.

She had low probability of conflict judgment achievement in the prediction phase. She also had low predictive accuracy. Her predictions of the automation were very close to her probability of conflict judgments, producing an assumed similarity of 0.981. This suggests an incorrect belief that her judgments were similar to the automation. In reality, the differences between her and the automation's judgments were very high, resulting in an actual similarity measure of 0.155.

Participant 11—Better at Predicting the Automation Than Judging Traffic Conflicts: Participant 11 trained in the simplification condition and viewed the outcome only design in the IL phase. He exhibited a high level of linear knowledge throughout the phases. However, his low cognitive control often bounded his individual judgment achievement [Fig. 6(b)].

In the IL phase, participant 11's joint judgments were better than his initial judgments. However, his patterns of compromise and adaptation to the automation were quite different in that he only partially adapted to, and partially compromised with, the automation's judgments. Thus, he established joint judgments where achievement fell between those of his individual judgments and the automation's judgments. Given his relatively low initial judgment achievement, this was not the most performance-enhancing strategy.

In the prediction phase, participant 11 exhibited another unique pattern of behavior. Unlike any other participant, participant 11's achievement was low (0.201), but his predictive accuracy was quite high (0.699). Thus, he was better at predicting the automation than predicting the probability of conflict judgments. The result was surprising, given the fact that he had had more motivation, experience, and feedback with the latter type of judgment. Participant 11 also assumed that his judgments were more similar to the automation than they really were (assumed similarity of 0.47 compared with an actual similarity of 0.27).

A comparison of the lens model parameters for his probability of conflict judgments and predictions of the automation identifies the source of the behavior pattern. At both tasks, he had high values of linear knowledge (0.923 for his individual judgments and 0.977 for his prediction of the automation's judgments). However, while his cognitive control at the primary judgment task was low (0.375), his cognitive control when predicting the automation's judgments was almost twice as high (0.689). This increase in cognitive control enabled his higher achievement in the prediction task. If this insight were identified as part of a training program, his conflict prediction performance would not necessarily be improved by encouraging him to replace his current judgment strategy with that used in predicting the automation. Instead, his achievement would be improved by focusing on his cognitive control.
Participant 12—Highest Performing Joint Judgments: Participant 12 trained in the "simplification" condition and viewed the outcome only design when interacting with the automation during the IL phase. His probability of conflict judgments had the highest achievement of all of the participants. He developed linear knowledge quickly. With the exception of the first session, he also maintained a relatively high level of cognitive control [Fig. 6(c)].

In the IL phase, with the change in environmental predictability (experienced in the fifth session), participant 12 appeared to have some difficulty in maintaining his cognitive control, but he improved with practice. In contrast with all the other participants, participant 12's probability of conflict judgment achievement improved in successive sessions to the point where his initial judgment and joint judgment achievements were higher than that of the automation. These results suggest that he was not only learning from the automation but also exploiting it in a sophisticated manner (considering it when it was of more value and ignoring it otherwise).

Through successive sessions, the value of unmodeled agreement rose progressively, suggesting that participant 12 may have developed a more sophisticated use of the cues than captured in a linear model. This effect was also suggested in his debriefing where he described considering the magnitude of the noise-generated error in his judgment. To investigate this further, the noise in the displayed indicated airspeed at judgment time was coded on the basis of its absolute value: zero if the error was less than 20 kn, one if between 20 and 40, two if between 40 and 60, three if between 60 and 80, and four if higher. The errors in the participant and automation's judgments were calculated as the difference from the environmental criterion. A one-way analysis of variance (ANOVA) investigating the judgment errors for the participant's initial judgments failed to show any significance with respect to the error level. This indicates that the participant did not have greater error with the particularly noisy trials. A similar result occurred for the participant's joint judgments. However, the automation's judgment performance was sensitive to the noise in the displayed speed (F4,175 = 5.240, p < 0.001). Although Tukey's pairwise comparisons failed to show that the automation had worse performance in the highest error level, the trend was in that direction.

To test if the participant "ignored" the automation's output when it had greater error, a one-way ANOVA investigated the difference between the joint judgment and the automation's judgment as a function of the automation's judgment error. The ANOVA was significant (F14,165 = 14.52, p < 0.001). Tukey's pairwise comparisons failed to show any significant differences. However, there was a trend toward more of a difference when the automation had more error. Thus, the participant appeared to disregard the automation when its judgments were based on sensor inputs with large error due to noise.

To see what the effect would be of adding the noise in the displayed speed to the cue set, a new lens model analysis was conducted for the participant's final IL session. There was a slight improvement (i.e., reduction) in the unmodeled agreement C, meaning that the linear portion of the LME better accounted for the judgment achievement when it incorporated noise in the displayed speed (Table V).
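A sketch of this error-level coding and one-way ANOVA (Python with numpy and scipy; the function and argument names are illustrative assumptions) is:

```python
import numpy as np
from scipy.stats import f_oneway

def speed_error_level(error_kn):
    """Code the absolute displayed-airspeed error into levels 0-4 using 20-kn bins."""
    return int(min(abs(error_kn) // 20, 4))

def judgment_error_anova(judgments, criterion, speed_errors_kn):
    """One-way ANOVA of judgment error across displayed-speed error levels."""
    errors = np.asarray(judgments) - np.asarray(criterion)
    levels = np.array([speed_error_level(e) for e in speed_errors_kn])
    groups = [errors[levels == k] for k in np.unique(levels)]
    return f_oneway(*groups)   # F statistic and p-value
```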
TABLE V
LENS MODEL PARAMETERS FOR PARTICIPANT 12 IN THE FINAL IL-PHASE SESSION—WITH AND WITHOUT ERROR IN DISPLAYED SPEED
Participant 12’s ability to predict the automation was as good as his individual judgment ability. His predictions suggest that he thought that his individual judgments were more similar to the automation than they really were (assumed similarity of 0.891 versus actual similarity of 0.547). Participant 13—Better at Predicting Traffic Conflicts Than at Predicting the Automation: Participant 13 trained in the nosimplification training condition and viewed the automation feedback when interacting with the automation during the IL phase. As shown in Fig. 6(d), Participant 13’s linear knowledge quickly reached a high level during the training phase, and his judgment achievement was limited by his ability to maintain cognitive control. However, when introduced to the automation feedback in the IL phase, he apparently changed his strategy, as seen in the drop in linear knowledge in the fifth session. His linear knowledge and cognitive control improved again in subsequent sessions. In the prediction phase, participant 13 was unique in his high judgment achievement and relatively lower predictive accuracy. This difference was largely due to his linear knowledge, which was lower at the prediction task (0.812) than in making individual judgments (0.956). A smaller contribution was apparent in his cognitive control for the prediction task (0.509), which was lower than his cognitive control for individual judgments (0.597). B. Results of Nomothetic Analysis To demonstrate the sensitivity of the HAJL parameters in identifying significant aggregate effects, a simplification training intervention and an automation feedback design intervention were tested. Linear mixed models with repeated-measure ANOVA analyses were conducted, where the repeated measurements were within the sessions. The fixed effects included gender, session, design condition, and training condition. Participants (nested within gender, training, and design condition) were modeled as random effects. The Wald test, calculated as the ratio of the parameter estimate to its standard error, was used to analyze the random effects’ covariance matrices [85]. 1) Training Phase: The task of predicting conflicts was difficult even for those in the simplification training condition. The participants’ judgment achievement was, in general, quite low; even by the fourth session, the mean was only 0.31 (Fig. 7). None of the fixed effects was significant at the 0.05 level, not even for training condition. However, the mean judgment achievement for the simplification condition was nominally higher than that for the no simplification condition (0.28 versus 0.26). As is common in judgment analysis studies [86],
771
participants were found to be a significant source of variation (Wald Z = 2.013, p = 0.044). 2) IL Phase: In the IL phase, the participants’ initial and joint judgments were compared with each other, with the automation’s judgments, and with the environmental criterion (Fig. 8). Overall, initial judgment achievement was quite low (mean = 0.208). This initial judgment achievement was lower than performance in the training phase for the no simplification training condition participants. As the environmental predictability in the IL phase was the same for the no simplification training condition in the training phase, this result was not expected. The achievement of the joint judgments was much higher (mean = 0.672) than the initial judgment achievement. This result indicates that the participants benefited from the automation’s information. The conflict measures (between the initial and automation’s judgments) were low, illustrating how different the participants’ initial judgments and automation’s judgments were. The low compromise measure (between the initial and subsequent joint judgments) and high adaptation to the automation’s judgments also demonstrated that the participants changed their judgments after seeing the automation’s information. No fixed effects were found in the achievement of either the initial or the joint judgments. A significant design effect was found in the conflict measure (F1,9.094 = 5.719, p = 0.040); a higher conflict measure (less conflict) was found with the outcome only design. Likewise, a significant design effect was found in the compromise measure (F1,9.315 = 5.611, p = 0.041), with less compromise shown in the outcome only design condition. These results indicate that, with the outcome feedback, participants had initial judgments closer to the automation’s and therefore could achieve the same performance without having to make great changes in the joint judgment. This result was not expected, as those with the automation’s detailed feedback were expected to learn more from the automation and therefore have less conflict and need for compromise. In the adaptation measure, a significant session effect was found (F3,12.767 = 6.063, p = 0.008), identifying greater adaptation over time. A trend was found for participants being a significant source of variation for initial judgments (Wald Z = 1.754, p = 0.079), joint judgments (Wald Z = 1.713, p = 0.087), conflict (Wald Z = 1.667, p = 0.093), compromise (Wald Z = 1.919, p = 0.055), and adaptation (Wald Z = 1.732, p = 0.083). 3) Prediction Phase: In each prediction phase trial, participants provided their probability of conflict judgment and a prediction of the automation, where their predictive accuracy is the correspondence between their prediction and the automation’s judgment. In general, the achievement of participants’ probability of conflict judgments tracked their predictive accuracy closely (Fig. 9). A Spearman correlation comparing the two sets of measures was significant at the 0.001 level. Assumed similarity is the correspondence between participants’ probability of conflict judgment and their prediction of the automation. In general, participants assumed that their judgments were similar to the automation (mean = 0.779). Actual similarity is the correspondence between participants’ prediction and the actual automation’s judgment. In contrast to
Fig. 7. Mean with 95% confidence intervals of probability of conflict-judgment achievement in the training phase.
Fig. 8. Mean with 95% confidence intervals of IL-phase measures (y-axis defined in the legend).
In contrast to their assumed similarity, actual similarity tended to be quite low (mean = 0.144); a Spearman correlation comparing the two sets of measures was not significant. This result demonstrates that participants’ judgment behaviors were not well calibrated with respect to the automation’s judgments. In this phase, the participants completed one session of trials, establishing only one correlation measure per participant. Therefore, a simplified general linear model ANOVA analysis of the transformed achievement values was completed without modeling a participant effect. A main effect of design condition was found in the probability of conflict judgment achievement (F(1, 9) = 5.106, p = 0.050). Surprisingly, participants who had experienced the outcome-only design in the previous IL phase performed better than those who had viewed the automation feedback. The same design effect was found in the predictive accuracy measure (F(1, 9) = 12.37, p = 0.007). No main effects were found in the assumed similarity measure. In the actual similarity measure, there was a trend toward a design effect (F(1, 9) = 4.95, p = 0.053), with the outcome-only design corresponding to higher actual similarity. These results indicate that the display of probability contours and future position estimates did not help participants as expected.
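As an illustration of this aggregate prediction-phase analysis, the sketch below computes the Spearman comparison and a fixed-effects-only ANOVA on transformed achievement. It is a sketch under stated assumptions, not the authors’ analysis: the data file and column names are hypothetical, and the Fisher r-to-z transform is assumed as the unspecified transformation.

```python
# Sketch of the prediction-phase aggregate analysis described above
# (one correlation measure per participant, no participant random effect);
# file and column names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

pp = pd.read_csv("prediction_phase.csv")  # hypothetical file: one row per participant

# Rank-order agreement between judgment achievement and predictive accuracy.
rho, p_value = spearmanr(pp["achievement"], pp["predictive_accuracy"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")

# Fixed-effects-only ANOVA on transformed achievement (no participant term,
# because each participant contributes a single correlation value).
pp["achievement_z"] = np.arctanh(pp["achievement"])
fit = smf.ols("achievement_z ~ C(design) + C(training) + C(gender)", data=pp).fit()
print(sm.stats.anova_lm(fit, typ=2))
```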
V. CONCLUSION

While a thorough validation of the HAJL methodology would require demonstrations spanning several domains, types of information analysis automation, and types of judgments, this paper has provided a first step by fully exercising HAJL in a judgment task that is realistic (it is representative of future air traffic control environments) and requires interaction with information analysis automation.
Fig. 9. Mean with 95% confidence intervals of prediction-phase measures (y-axis defined in the legend).
The results of this experiment demonstrate HAJL’s ability not only to capture individual judgment achievement, interaction with automation, and understanding of the automation’s judgments but also to identify the mechanisms underlying these performance measures: cognitive control, knowledge, conflict, compromise, adaptation, actual similarity, and assumed similarity.

Analysis of HAJL’s measures includes capabilities for an idiographic examination of each participant’s judgments. Herein, this idiographic assessment found substantially different patterns of behavior, ranging from very poor unaided judgment masked by reliance on the automation (participant 6) to strong unaided individual-judgment achievement complemented by a sophisticated strategy for selectively incorporating the automation’s judgment (participant 12). HAJL’s measures also allow all of the following contributors to each participant’s achievement to be analyzed: reliance on the automation, understanding of it, levels of cognitive control, levels of linear knowledge, and the cues upon which judgments are based. An advantage of HAJL’s measures is that interventions can be identified that address each participant’s weaknesses (e.g., the poor individual judgment performance of participant 6). Similarly, the measures can help identify individuals’ strengths and capture their specific expertise (e.g., participant 12’s strategy for accounting for noise in the automation’s judgment). Combining HAJL measures (including those generated using the LME, reproduced below) allows assessment of the extent to which poor cognitive control or limited knowledge may have constrained performance. HAJL provides measures of a participant’s interaction with automation during the IL phase and of a participant’s understanding of the automation during the prediction phase. These measures could be used to enhance performance by showing them directly to the participants or by using them as the basis for training interventions tailored to each participant’s needs (see [36] for such model-based interventions).
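For reference, the lens model equation (LME) mentioned above decomposes judgment achievement into the knowledge and control terms discussed throughout this section. The form below is the standard decomposition due to Tucker [68] (see also [67]); it is reproduced here under the assumption that HAJL’s linear knowledge and cognitive control terms follow this conventional notation, which the earlier sections of the paper define precisely:

$$ r_a = G\,R_e\,R_s + C\,\sqrt{1 - R_e^2}\,\sqrt{1 - R_s^2} $$

where $r_a$ is judgment achievement (the correlation between judgments and the criterion), $R_e$ is environmental predictability (the multiple correlation of the criterion with the cues), $R_s$ is cognitive control (the multiple correlation of the judgments with the cues), $G$ is linear knowledge (the correlation between the predictions of the two linear models), and $C$ is the unmodeled knowledge term (the correlation between the residuals of the two models).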
When using HAJL for idiographic analysis, a number of variants in the methodology may be introduced. In this paper, participants completed a specified set of trials. However, other experimental designs could have participants continue making judgments until their respective measures of performance (such as individual judgment achievement in the training phase, joint judgment achievement in the IL phase, and predictive accuracy in the prediction phase) have stabilized or met some minimum criterion. Ultimately, a subset of the HAJL phases may be conducted (and the corresponding HAJL measures collected) as part of larger training programs (or during real operations) and analyzed automatically in order to identify problematic behaviors requiring intervention.

HAJL may also serve to guide research on and design of training programs, automation, and displays intended to improve judgment performance. HAJL provides direct measures of judgment and of interaction with automation with sufficient sensitivity to discern differences between participants, effects of interventions, and changes over time. Given a representative sample of participants, nomothetic analysis of these measures can assess the overall effect of interventions on a population.

In this paper, HAJL identified several patterns of judgment behavior of interest to designers of information analysis automation. First, participants with higher probability of conflict judgment achievement tended to have higher predictive accuracy. Second, participants generally thought that their individual judgments were closer to the automation’s than they really were, as shown by the often substantial differences between assumed and actual similarities. These patterns may share an underlying cause: if participants predicted the automation’s judgments to be close to their own, then higher individual judgment achievement would yield predictions closer to the automation’s own high-achievement judgments.
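This proposed explanation can be illustrated with a small simulation, a sketch under simple assumptions (synthetic numbers, not study data, with hypothetical variable names): when a participant’s prediction of the automation largely mirrors the participant’s own judgment, predictive accuracy rises and falls with individual achievement, because the automation’s judgments track the criterion closely.

```python
# Illustration only (not study data): when predictions of the automation mirror
# the participant's own judgment, predictive accuracy tracks individual achievement.
import numpy as np

rng = np.random.default_rng(1)
n_trials = 200
criterion = rng.uniform(0.0, 1.0, n_trials)              # true probability of conflict
automation = criterion + rng.normal(0.0, 0.1, n_trials)  # high-achievement automated judge

for own_noise in (0.8, 0.4, 0.2):                         # decreasing noise = increasing achievement
    judgment = criterion + rng.normal(0.0, own_noise, n_trials)
    prediction = judgment + rng.normal(0.0, 0.1, n_trials)  # prediction mirrors own judgment
    achievement = np.corrcoef(judgment, criterion)[0, 1]
    predictive_accuracy = np.corrcoef(prediction, automation)[0, 1]
    print(f"achievement = {achievement:.2f}, predictive accuracy = {predictive_accuracy:.2f}")
```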
Nomothetic analysis of participant behavior in this paper investigated the training and automation feedback design interventions. Of note, the participants who viewed the automation feedback did not adapt to the automation more, learn to improve their individual judgment achievement over time, or predict the automation’s judgments more accurately. Instead, they showed more conflict between their individual judgments and the automation’s judgments, had poorer individual judgment achievement, and had poorer predictive accuracy in the prediction phase. The automation feedback was intended to provide participants with better information. However, these results suggest that participants may have used the feedback passively, i.e., as an aid that justified greater compromise toward the automation, rather than as a means of learning from and about the automation. This may have been due to the nature of the feedback that the participants viewed: the probability contours may have been too confusing, or the participants may have needed more explanation in order to use them effectively. It may also have been due to insufficient training; participants did not maintain their performance from the training phase into the IL phase, as evidenced by the low initial judgment achievement in the IL phase.

Despite the ineffectiveness of these particular displays and training interventions, the results illustrate the insight that HAJL’s detailed measures can provide into the many factors that contribute to effective human–automation interaction. Given that no other methodology provides such insight into all of these factors, and given HAJL’s ability to highlight both individual and group behaviors, HAJL will be extremely useful for the design and evaluation of human–automation interaction systems as well as for research into the nature of human–automation interaction.

ACKNOWLEDGMENT

The authors would like to thank the students who participated in the experimental trials. They would also like to thank the three anonymous reviewers and M. L. Bolton, who provided substantive suggestions for improving this paper.

REFERENCES

[1] R. Parasuraman, T. B. Sheridan, and C. D. Wickens, “A model for types and levels of human interaction with automation,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 3, pp. 286–296, May 2000.
[2] J. K. Kuchar, “Methodology for alerting-system performance evaluation,” J. Guid. Control Dyn., vol. 19, no. 2, pp. 438–444, Mar./Apr. 1996.
[3] T. S. Newman and A. K. Jain, “A survey of automated visual inspection,” Comput. Vis. Image Underst., vol. 67, no. 2, pp. 231–262, Mar. 1995.
[4] E. J. Bass, D. J. Castaño, W. M. Jones, and S. T. Ernst-Fortin, “The effect of providing automated clear air turbulence assessments to commercial airline pilots,” Int. J. Aviat. Psychol., vol. 11, no. 4, pp. 317–339, Jan. 2001.
[5] E. J. Bass, S. T. Ernst-Fortin, R. L. Small, and J. T. Hogans, Jr., “Architecture and development environment of a knowledge-based monitor that facilitates incremental knowledge-base development,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 34, no. 4, pp. 441–449, Jul. 2004.
[6] A. R. Pritchett, “Reviewing the role of cockpit alerting systems,” Hum. Factors Aerosp. Saf., vol. 1, no. 1, pp. 5–38, 2001.
[7] F. J. Seagull and P. M. Sanderson, “Anesthesia alarms in context: An observational study,” Hum. Factors, vol. 43, no. 1, pp. 66–78, 2001.
[8] T. A. Dingus, D. V. McGehee, N. Manakkal, S. K. Jahns, C. Carney, and J. M. Hankey, “Human factors field evaluation of automotive headway maintenance/collision warning devices,” Hum. Factors, vol. 39, no. 2, pp. 216–229, Jun. 1997.
[9] Y. Seong and A. M. Bisantz, “Judgment and trust in conjunction with automated decision aids: A theoretical model and empirical investigation,” in Proc. 46th Annu. Meeting Human Factors Ergonom. Soc., 2002, pp. 423–427.
[10] L. Bainbridge, “Ironies of automation,” in New Technology and Human Error, J. Rasmussen, K. Duncan, and J. Leplat, Eds. New York: Wiley, 1987, pp. 271–283.
[11] D. D. Woods, “The alarm problem and directed attention in dynamic fault management,” Ergonomics, vol. 38, no. 11, pp. 2371–2393, Nov. 1995.
[12] R. Parasuraman and V. Riley, “Humans and automation: Use, misuse, disuse, abuse,” Hum. Factors, vol. 39, no. 2, pp. 230–253, Jun. 1997.
[13] L. Adelman, M. Christian, J. Gualtieri, and K. L. Johnson, “Examining the effects of cognitive consistency between training and displays,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 28, no. 1, pp. 1–16, Jan. 1998.
[14] K. L. Mosier, L. J. Skitka, S. Heers, and M. Burdick, “Automation bias: Decision making and performance in high-tech cockpits,” Int. J. Aviat. Psychol., vol. 8, no. 1, pp. 47–63, Jan. 1998.
[15] R. Parasuraman, “Vigilance, monitoring, and search,” in Handbook of Perception and Human Performance, vol. 2, K. Boff, L. Kaufman, and J. Thomas, Eds. New York: Wiley, 1986, pp. 43.1–43.39.
[16] E. L. Wiener and R. E. Curry, “Flight-deck automation: Promises and problems,” Ergonomics, vol. 23, no. 10, pp. 995–1011, 1980.
[17] D. W. Corcoran, J. L. Dennett, and A. Carpenter, “Cooperation of listener and computer in a recognition task: II. Effects of computer reliability and ‘dependent’ versus ‘independent’ conditions,” J. Acoust. Soc. Amer., vol. 52, pt. 2, no. 6, pp. 1613–1619, Dec. 1972.
[18] R. Parasuraman, R. Molloy, and I. L. Singh, “Performance consequences of automation-induced ‘complacency’,” Int. J. Aviat. Psychol., vol. 3, no. 1, pp. 1–23, Jan. 1993.
[19] B. M. Muir and N. Moray, “Trust in automation: Part II. Experimental studies of trust and human intervention in a process control simulation,” Ergonomics, vol. 39, no. 3, pp. 429–460, Mar. 1996.
[20] J. D. Lee and N. Moray, “Trust, control strategies and allocation of function in human–machine systems,” Ergonomics, vol. 35, no. 10, pp. 1243–1270, Oct. 1992.
[21] J. D. Lee and N. Moray, “Trust, self-confidence, and operators’ adaptation to automation,” Int. J. Human-Comput. Stud., vol. 40, no. 1, pp. 153–184, Jan. 1994.
[22] I. L. Singh, R. Molloy, and R. Parasuraman, “Automation-induced ‘complacency’: Development of the complacency-potential rating scale,” Int. J. Aviat. Psychol., vol. 3, no. 2, pp. 111–122, Apr. 1993.
[23] J. Orasanu and U. Fischer, “Finding decisions in natural environments: The view from the cockpit,” in Naturalistic Decision Making, C. E. Zsambok and G. Klein, Eds. Hillsdale, NJ: Lawrence Erlbaum, 1997, pp. 343–357.
[24] E. M. Hickling, “Modern nuclear power plants: Alarm system design,” in Human Factors in Alarm Design, N. Stanton, Ed. London, U.K.: Taylor & Francis, 1994, pp. 165–178.
[25] J. M. Noyes and A. R. Starr, “Civil aircraft warning systems: Future directions in information management and presentation,” Int. J. Aviat. Psychol., vol. 10, no. 2, pp. 169–188, Apr. 2000.
[26] A. R. Pritchett and B. Vándor, “Designing situation displays to promote conformance to automatic alerts,” in Proc. Annu. Meeting Human Factors Ergonom. Soc., 2001, pp. 311–315.
[27] C. D. Wickens, K. S. Gempler, and E. M. Morphew, “Workload and reliability of predictor displays in aircraft traffic avoidance,” Transp. Hum. Factors, vol. 2, no. 2, pp. 99–126, 2000.
[28] A. R. Pritchett, “Pilot performance at collision avoidance during closely spaced parallel approaches,” Air Traffic Control Q., vol. 7, no. 1, pp. 47–75, 1999.
[29] A. Kirlik, “Modeling strategic behavior in human–automation interaction: Why an ‘aid’ can (and should) go unused,” Hum. Factors, vol. 35, no. 2, pp. 221–242, Jun. 1993.
[30] A. D. Andre and H. Cutler, “Displaying uncertainty in advanced navigation systems,” in Proc. 42nd Annu. Meeting Human Factors Ergonom. Soc., 1998, pp. 31–35.
[31] A. Degani and A. Kirlik, “Modes in human–automation interaction: Initial observations about a modeling approach,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 1995, pp. 3443–3450.
[32] A. Degani, M. Shafto, and A. Kirlik, “Modes in human–machine systems: Review, classification, and application,” Int. J. Aviat. Psychol., vol. 9, no. 2, pp. 125–138, 1999.
[33] A. Kirlik, “Requirements for psychological models to support design: Towards ecological task analysis,” in Global Perspectives on the Ecology of Human–Machine Systems, vol. 1, J. M. Flach, P. A. Hancock, J. K. Caird, and K. J. Vicente, Eds. Hillsdale, NJ: Lawrence Erlbaum, 1995, pp. 68–120.
[34] K. J. Vicente and J. Rasmussen, “The ecology of human–machine systems II: Mediating ‘direct perception’ in complex work domains,” Ecol. Psychol., vol. 2, no. 3, pp. 207–249, 1990.
[35] D. D. Woods, “The cognitive engineering of problem representation,” in Human–Computer Interaction in Complex Systems, G. R. S. Weir and J. L. Alty, Eds. New York: Academic, 1991, pp. 169–188.
[36] R. Strauss, “A methodology for measuring the judgmental components of situation awareness,” Ph.D. dissertation, Georgia Inst. Technol., Atlanta, GA, 2000, unpublished.
[37] A. M. Bisantz and A. R. Pritchett, “Measuring judgment in complex, dynamic environments: A lens-model analysis of collision detection behavior,” Hum. Factors, vol. 45, pp. 266–280.
[38] E. J. Bass, “Human-automated judgment learning: A research paradigm based on interpersonal learning to investigate human interaction with automated judgments of hazards,” Ph.D. dissertation, 2002, Dissertation Abstracts International, vol. 63, no. 03B, UMI No. 3046873.
[39] R. W. Cooksey, Judgment Analysis: Theory, Methods, and Application. New York: Academic, 1996.
[40] K. R. Hammond, M. Wilkins, and F. J. Todd, “A research paradigm for the study of interpersonal learning,” Psychol. Bull., vol. 65, no. 4, pp. 221–232, Apr. 1966.
[41] T. C. Earle, “Interpersonal learning,” in Human Judgment and Social Interaction, L. Rappoport and D. A. Summers, Eds. New York: Holt, Rinehart and Winston, 1973, pp. 240–266.
[42] K. R. Hammond, “The cognitive conflict paradigm,” in Human Judgment and Social Interaction, L. Rappoport and D. A. Summers, Eds. New York: Holt, Rinehart and Winston, 1973, pp. 188–205.
[43] P. J. Smith, C. E. McCoy, and C. Layton, “Brittleness in the design of cooperative problem-solving systems: The effects on user performance,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 27, no. 3, pp. 360–371, May 1997.
[44] T. R. Stewart and C. M. Lusk, “Seven components of judgmental forecasting skill: Implications for research and improvement of forecasts,” J. Forecast., vol. 13, no. 7, pp. 579–599, Dec. 1994.
[45] E. L. Wiener, “Human factors of advanced technology (Glass Cockpit) transport aircraft,” NASA-Ames Res. Center, Moffett Field, CA, NASA Contractor Report 177528, 1989.
[46] M. S. Cohen, R. Parasuraman, and J. T. Freeman, “Trust in decision aids: A model and its training implications,” in Proc. Command Control Res. Technol. Symp., 1998, pp. 1–37.
[47] A. J. Masalonis, “Effects of training operators on situation-specific automation reliability,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2003, pp. 1595–1599.
[48] E. J. Bass and A. R. Pritchett, “Human-automated judgment learning: A methodology to investigate human interaction with automated judges,” in Proc. 46th Annu. Meeting Human Factors Ergon. Soc., Baltimore, MD, Sep. 30–Oct. 4, 2002, pp. 362–366.
[49] E. J. Bass and A. R. Pritchett, “Human-automated judgment learning: Enhancing interaction with automated judgment systems,” in Working with Technology in Mind, A. Kirlik, Ed. London, U.K.: Oxford Univ. Press, 2006, pp. 114–126.
[50] B. M. Muir, “Trust in automation: Part I. Theoretical issues in the study of trust and human intervention in automated systems,” Ergonomics, vol. 37, no. 11, pp. 1905–1922, 1994.
[51] B. Brehmer, “Social judgment theory and the analysis of interpersonal conflict,” Psychol. Bull., vol. 83, pp. 985–1003, 1976.
[52] L. I. Dalgleish, “Decision making in child abuse cases: Application of social judgment theory and signal detection theory,” in Human Judgment: The SJT View, B. Brehmer and C. R. B. Joyce, Eds. Amsterdam, The Netherlands: North-Holland, 1988, pp. 317–360.
[53] R. S. Wigton, “Applications of judgment analysis and cognitive feedback to medicine,” in Human Judgment: The SJT View. Advances in Psychology, vol. 54, B. Brehmer and C. R. B. Joyce, Eds. Amsterdam, The Netherlands: North-Holland, 1988, pp. 427–442.
[54] T. R. Stewart, “A decomposition of the correlation coefficient and its use in analyzing forecasting skill,” Weather Forecast., vol. 5, no. 4, pp. 661–666, Dec. 1990.
[55] T. R. Stewart, K. F. Heideman, W. R. Moniger, and P. Reagan-Cirincione, “Effects of improved information on the components of skill in weather forecasting,” Org. Behav. Human Decis. Process., vol. 53, no. 2, pp. 107–134, Nov. 1992.
[56] R. W. Cooksey and P. Freebody, “Cue subset contributions in the hierarchical multivariate lens model: Judgments of children’s reading achievement,” Org. Behav. Human Decis. Process., vol. 39, no. 1, pp. 115–132, Feb. 1987.
[57] R. W. Cooksey, P. Freebody, and G. R. Davidson, “Teacher’s prediction of children’s early reading achievement: An application of social judgment theory,” Amer. Educ. Res. J., vol. 23, no. 1, pp. 41–64, 1986.
[58] K. R. Hammond, T. R. Stewart, B. Brehmer, and D. O. Steinmann, “Social judgement theory,” in Human Judgment and Decision Processes, M. F. Kaplan and S. Schwartz, Eds. New York: Academic, 1975, pp. 272–312.
[59] E. Brunswik, “The conceptual framework of psychology,” in International Encyclopedia of Unified Science, vol. 1. Chicago, IL: Univ. Chicago Press, 1952.
[60] E. Brunswik, Perception and the Representative Design of Psychological Experiments. Berkeley, CA: Univ. California Press, 1956.
[61] H. J. Einhorn, “The use of nonlinear, noncompensatory models in decision making,” Psychol. Bull., vol. 73, pp. 221–230, 1970.
[62] G. E. Campbell, W. L. Buff, and A. E. Bolton, “The diagnostic utility of fuzzy system modeling for application in training systems,” in Proc. 44th Annu. Meeting Human Factors Ergonom. Soc., 2000, pp. 370–373.
[63] L. Rothrock and A. Kirlik, “Inferring rule-based strategies in dynamic judgment tasks: Toward a noncompensatory formulation of the lens model,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 33, no. 1, pp. 58–72, Jan. 2003.
[64] L. R. Goldberg, “Simple models or simple processes? Some research on clinical judgments,” Amer. Psychol., vol. 23, no. 7, pp. 483–496, Jul. 1968.
[65] R. M. Dawes, “The robust beauty of improper linear models in decision making,” Amer. Psychol., vol. 34, no. 7, pp. 571–582, 1979.
[66] H. J. Einhorn, D. N. Kleinmuntz, and B. Kleinmuntz, “Linear regression and process-tracing models of judgment,” Psychol. Rev., vol. 86, no. 5, pp. 465–485, Sep. 1979.
[67] C. J. Hursch, K. R. Hammond, and J. L. Hursch, “Some methodological considerations in multiple-cue probability,” Psychol. Rev., vol. 71, no. 1, pp. 42–60, Jan. 1964.
[68] L. R. Tucker, “A suggested alternative formulation in the developments by Hursch, Hammond, and Hursch, and by Hammond, Hursch and Todd,” Psychol. Rev., vol. 71, pp. 528–530, 1964.
[69] W. K. Balzer, M. E. Doherty, and R. O’Connor, “Effects of cognitive feedback on performance,” Psychol. Bull., vol. 106, no. 3, pp. 410–433, 1989.
[70] A. M. Bisantz, A. Kirlik, P. Gay, D. A. Phipps, N. Walker, and A. D. Fisk, “Modeling and analysis of a dynamic judgment task using a lens model approach,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 6, pp. 1–12, Nov. 2000.
[71] E. J. Bass, P. Kvam, R. H. Campbell, D. J. Castaño, and W. M. Jones, “Measuring the seat of the pants: Commercial airline pilot turbulence assessments in a full-motion simulator,” Int. J. Aviat. Psychol., vol. 12, no. 2, pp. 123–136, Apr. 2001.
[72] B. Brehmer, “Effects of task predictability and cue validity on interpersonal learning of inference tasks involving both linear and nonlinear relations,” Organ. Behav. Hum. Perform., vol. 10, pp. 24–46, 1973.
[73] J. Mumpower and K. R. Hammond, “Entangled task dimensions: An impediment to interpersonal learning,” Organ. Behav. Hum. Perform., vol. 11, pp. 377–389, 1974.
[74] B. Brehmer and K. R. Hammond, “Cognitive sources of interpersonal conflict: Analysis of interactions between linear and nonlinear cognitive systems,” Organ. Behav. Hum. Perform., vol. 10, pp. 290–313, 1973.
[75] M. Helenius, “Socially induced cognitive conflict: A study of disagreement over childrearing policies,” in Human Judgment and Social Interaction, L. Rappoport and D. A. Summers, Eds. New York: Holt, Rinehart and Winston, 1973, pp. 208–217.
[76] J. Smedslund, Multiple-Probability Learning: An Inquiry Into the Origins of Perception. Oslo, Norway: Oslo Univ. Press, 1955.
[77] K. R. Hammond, C. J. Hursch, and F. J. Todd, “Analyzing the components of clinical inference,” Psychol. Rev., vol. 71, pp. 438–456, Nov. 1964.
[78] D. C. Wightman and G. Lintern, “Part-task training for tracking and manual control,” Hum. Factors, vol. 27, pp. 267–283, 1985.
[79] RTCA, “Final Report of RTCA Task Force 3: Free Flight Implementation,” RTCA, Inc., Washington, DC, 1995.
[80] L. Yang and J. K. Kuchar, “Prototype conflict alerting logic for free flight,” J. Guid. Control Dyn., vol. 20, no. 4, pp. 768–773, Jul./Aug. 1997.
[81] A. R. Pritchett and C. A. Ippolito, “Software architecture for a reconfigurable flight simulator,” in Proc. AIAA Model. Simul. Technol. Conf., 2000, pp. 1–10. [Online]. Available: http://www.aiaa.org/store
[82] J. D. Smith, S. R. Ellis, and E. C. Lee, “Perceived threat and avoidance maneuvers in response to cockpit traffic displays,” Hum. Factors, vol. 26, pp. 33–48, 1984.
[83] G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for Experiments. New York: Wiley, 1978.
[84] D. Kahneman, P. Slovic, and A. Tversky, Judgment under Uncertainty: Heuristics and Biases. Cambridge, U.K.: Cambridge Univ. Press, 1982.
[85] R. I. Jennrich and M. D. Schluchter, “Unbalanced repeated-measures models with structured covariance matrices,” Biometrics, vol. 42, no. 4, pp. 805–820, Dec. 1986.
[86] B. Brehmer, “The psychology of linear judgment models,” Acta Psychol., vol. 87, pp. 137–154, 1994.
Ellen J. Bass received the B.S. Eng. and B.S. Econ. degrees from the University of Pennsylvania, Philadelphia, the M.S. degree from the State University of New York at Binghamton, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta. She is currently an Assistant Professor of systems engineering with the Department of Systems and Information Engineering, University of Virginia, Charlottesville. She has 25 years of industry and research experience in human-centered systems engineering in the domains of aviation, meteorology, bioinformatics, and clinical informatics. Her research focuses on modeling human judgment and decision making in dynamic domains in order to inform the design of decision support systems.
Amy R. Pritchett received the B.S., M.S., and D.Sc. degrees in aeronautics and astronautics from the Massachusetts Institute of Technology, Cambridge. She is currently an Associate Professor, holding the David S. Lewis Associate Professorship of cognitive engineering, in the Schools of Aerospace Engineering and Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta. Her research focuses on design that supports human cognitive performance in aviation and education. Dr. Pritchett was awarded the 2007 AIAA Lawrence Sperry Award, given to a top young aerospace engineer, for her aviation applications of cognitive engineering.