
Constructing Validity Using Validity Centered Design James B. Olsen, Alpine Testing Solutions C. Victor Bunderson, Edumetrics Institute

Presentation Outline
1. Historical Roots
2. Traditional Test Development Process
3. Extensions for Computerized & Online Testing
4. Performance Work Models vs. Objectives
5. Computerized Educational Measurement
6. Taxonomies for Learning and Performance
7. Validity Centered Design
8. Conclusions and Future Recommendations

Historical Roots
E. F. Lindquist, Considerations in Objective Test Construction (1951): “For the present, it seems best to attempt to incorporate in the achievement test situation as much as possible of the same complexity that characterizes the criterion situation…In such tests the most important consideration is that the test questions require the examinee to do the same things, however complex, that he is required to do in the criterion situations.” (Italics in original)

Historical Roots
E. F. Lindquist (1969), The Impact of Machines on Educational Measurement: “During these years [1949-1969] there has been little or no experimentation with, or successful development of, new and improved types of tests measuring hitherto unmeasured objectives…[Test developers] have accepted restrictions upon test development that may definitely have prevented the use of, or even the search for, new, and possibly improved types of test exercises…If he could not find an item type well adapted to a specific purpose, he was free to use his ingenuity to invent or develop new and improved types of test exercises to serve that purpose.”

Historical Roots
“Over the next decade or two, computer and audiovisual technology will dramatically change the way individuals learn as well as the way they work. Technology will also have a profound impact on the ways in which knowledge, aptitudes, competencies, and personal qualities are assessed and even conceptualized.”
1. New and more varied interactive delivery systems
2. Heightened individuality in learning and thinking
3. Increased premium on adaptive learning
4. Heightened emphasis on individuality in assessment
5. Increased premium on adaptive measurement, dynamic measurement
Samuel Messick (1988), The once and future issues of validity: Assessing the meaning and consequences of measurement.

Constructing Validity Throughout the Test Development Process
1. Test Planning
2. Content/Task Definition
3. Job/Practice Analysis
4. Test Specification
5. Item Development
6. Test Design and Assembly
7. Test Production
8. Test Administration
9. Test Scoring
10. Performance Standard Setting
11. Reporting Results
12. Item Banking
13. Technical Reports/Validation

Instruction/Assessment: A Cyclical Design Process

Key Test Score Validity Questions
• What complex of Knowledge, Skills, and Abilities should be assessed?
• What behaviors or performances will reveal the constructs and skills to be tested?
• What tasks or situations should elicit those behaviors?
(Messick, 1994)

Extensions for Computerized and Online Testing
• Create and evaluate testable theories of the content domain.
• Conduct both Job/Task Analysis and Job/Task Synthesis.
• Content Domain Analysis and Domain Modeling
• Create knowledge and skill competency models and order these competencies by content categories (content levels) and substantive processes (thinking skills).

Content Domain Analysis and Modeling
• Domain Analysis: Determine required dimensions and ordering of content, thinking skills, and knowledge and performance tasks within and across dimensions
• Domain and Work Modeling: Create models of realistic job-like performance situations

Domain Theory
The term “domain theory” in educational measurement was used in a comprehensive sense by Messick (1995): “A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serve as a guide to the rational development and scoring of performance tasks and other assessment formats. At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.” (Italics added by author)

Performance Work Models vs. Objectives
While instructional objectives have provided a cornerstone for the practice and science of instruction, they have also locked us into a lexically based conceptual system…A performance work model is an integrated unit of practice with one or more elements of knowledge and skill that allows replication of both information and interactions. (Bunderson, Gibbons, Olsen and Kearsley, 1981)

Types of Performance Work Models
Work models provide settings in which the learner can 1) converse using new vocabulary and concepts, 2) perform new procedures, or 3) make predictions and solve new problems. These settings should have visible results so that the learner, teacher, or other individuals can obtain information for judging the success or failure of the performances.

Performance Work Model Synthesis
Performance work model synthesis is discussed in two technical papers and a book chapter:
• Bunderson, C.V., Gibbons, A.S., Olsen, J.B., & Kearsley, G.S. (1981). Work models: Beyond instructional objectives. Instructional Science, 10, 205-215.
• Gibbons, A.S., Bunderson, C.V., Olsen, J.B., & Robertson, J. (1995). Work models: Still beyond instructional objectives. Machine-Mediated Learning, 5(3&4), 221-236.
• Gibbons, A.S., & Fairweather, P.G. (1998). Instructional strategy III: Fragmentation and integration. Chapter 15 in Gibbons & Fairweather, Computer-Based Instruction: Design and Development. Englewood Cliffs, NJ: Educational Technology Publications, 278-296.

Expand the Test Blueprint Build

a interactive test specification blueprint that can be shared internationally and modified by any SME Include exemplary and multiple item types that SMEs deem useful for each major content division Use creativity and rapid prototyping to select and display relevant assessment situations and item types
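To make the blueprint idea concrete, here is a minimal sketch (not from the original presentation) of a shareable blueprint represented as target item counts indexed by content division and cognitive level; the division names, levels, and counts are hypothetical placeholders.

```python
# Minimal sketch of a test blueprint: target item counts indexed by
# (content division, cognitive level). All names and counts below are
# hypothetical placeholders, not taken from the presentation.

blueprint = {
    ("Network Configuration", "Apply"):    12,
    ("Network Configuration", "Analyze"):   6,
    ("Security Fundamentals", "Remember"):  8,
    ("Security Fundamentals", "Evaluate"):  4,
}

def blueprint_summary(blueprint):
    """Total items per content division, so SMEs can review the balance."""
    totals = {}
    for (division, _level), count in blueprint.items():
        totals[division] = totals.get(division, 0) + count
    return totals

if __name__ == "__main__":
    for division, total in blueprint_summary(blueprint).items():
        print(f"{division}: {total} items")
```

An SME-editable spreadsheet or web form could map directly onto this structure, with each cell open to review and revision.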

Expand the Item Banking System Focus

on Item Ideas before Item Writing Write and store items in an item/task/simulation/rubric banking system. Generate and critique item ideas from multiple perspectives before you start to write the item or task. Extend item bank to include multimedia components
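As an illustration of what an item/task/simulation/rubric bank record might hold, here is a hedged sketch in Python; every field name, and the draft simulation item itself, is a hypothetical example rather than a prescribed schema.

```python
# Minimal sketch of one record in an item/task/simulation/rubric bank.
# Field names are illustrative assumptions, not a prescribed schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BankEntry:
    item_id: str
    item_type: str                 # e.g., "multiple_choice", "simulation"
    content_division: str          # blueprint content category
    cognitive_level: str           # blueprint thinking-skill level
    stem: str                      # item text or task prompt
    options: List[str] = field(default_factory=list)  # empty for performance tasks
    rubric: Optional[str] = None   # scoring rubric for constructed/performance items
    media: List[str] = field(default_factory=list)    # paths/URLs to multimedia assets
    idea_notes: str = ""           # item ideas critiqued before writing

# Example: an item idea captured and critiqued before full item writing.
draft = BankEntry(
    item_id="NET-017",
    item_type="simulation",
    content_division="Network Configuration",
    cognitive_level="Apply",
    stem="Configure the router so the branch office can reach the file server.",
    rubric="Full credit: correct route and mask; partial credit: correct route, wrong mask.",
    media=["topology_diagram.png"],
    idea_notes="Reviewed from instructor, SME, and candidate perspectives.",
)
print(draft.item_id, draft.item_type, draft.content_division)
```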

Item Idea Visualization Process

Expand the Test Review
• Dynamic review by SMEs of test items and tasks within the test delivery system
• Use an interactive web-conferencing system with a WYSIWYG display
Review criteria: item accuracy, item relevancy, item importance, item grammar and syntax, item bias

Scientific Research on Validity, Reliability, Fairness
• Create testable domain theories that can be verified and validated.
• Collect ongoing evidence of exam validity, reliability, and fairness.
• Continually improve the exam and items based on the evidence collected.
Analyses include test analysis, item analysis, IRT calibrations, differential item functioning, and suggested revisions (a small item-analysis sketch follows below).
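A minimal sketch of the routine item analysis mentioned above: classical difficulty (proportion correct) and point-biserial discrimination computed from a tiny 0/1 response matrix. The matrix is an illustrative placeholder, not pilot data; IRT calibration and DIF analyses would normally be run with specialized psychometric software.

```python
# Minimal sketch of classical item analysis: item difficulty (p-value) and
# point-biserial discrimination. The 0/1 response matrix below is an
# illustrative placeholder, not real pilot data.
import statistics

responses = [  # rows = examinees, columns = items (1 = correct, 0 = incorrect)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

def item_analysis(responses):
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]   # total score per examinee
    sd_total = statistics.pstdev(totals)
    mean_total = statistics.mean(totals)
    results = []
    for j in range(n_items):
        scores = [row[j] for row in responses]
        p = sum(scores) / len(scores)          # difficulty: proportion correct
        # Point-biserial: correlation between item score and total score.
        cov = sum((s - p) * (t - mean_total) for s, t in zip(scores, totals)) / len(scores)
        sd_item = statistics.pstdev(scores)
        r_pb = cov / (sd_item * sd_total) if sd_item > 0 and sd_total > 0 else float("nan")
        results.append({"item": j + 1, "difficulty": round(p, 2), "point_biserial": round(r_pb, 2)})
    return results

for row in item_analysis(responses):
    print(row)
```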

Explore alternative scoring algorithms for complex items and tasks
• Dichotomous scoring
• Polytomous scoring
• Partial credit scoring
• Weighted items, options, logical answer expressions, subscores
• Scoring on multiple dimensions
(A brief scoring sketch follows below.)
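A hedged sketch contrasting three of the scoring approaches listed above: dichotomous scoring, partial credit scoring against a rubric, and a weighted composite. The rubric points, task steps, and weights are hypothetical.

```python
# Minimal sketch contrasting three of the scoring approaches listed above.
# Rubric points, task steps, and weights are hypothetical, for illustration only.

def dichotomous_score(response, key):
    """1 if exactly correct, else 0."""
    return 1 if response == key else 0

def partial_credit_score(steps_completed, rubric_points):
    """Sum rubric points for each task step the examinee completed."""
    return sum(rubric_points[step] for step in steps_completed)

def weighted_composite(item_scores, item_weights):
    """Weighted sum of item scores, e.g., to emphasize critical tasks or options."""
    return sum(item_scores[i] * item_weights[i] for i in item_scores)

# Usage with placeholder data:
print(dichotomous_score("B", "B"))                                                   # 1
print(partial_credit_score({"route", "mask"}, {"route": 2, "mask": 1, "test": 1}))   # 3
print(weighted_composite({"item1": 1, "item2": 0.5}, {"item1": 1.0, "item2": 2.0}))  # 2.0
```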

Explore New Testing and Measurement Models
• Computerized Mastery Testing
• Computerized Adaptive Testing
• Testlet Adaptive Testing
• Linear on the Fly Testing
• Multistage Testing
• Decision Theoretic Testing
• Performance Testing
• Blended Testing Approaches
(A minimal adaptive-testing sketch follows below.)
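To illustrate one of the models listed above, here is a minimal sketch of a computerized adaptive testing loop under a two-parameter logistic (2PL) IRT model: administer the unused item with maximum Fisher information at the current ability estimate, then re-estimate ability. The item parameters, the grid-based maximum-likelihood update, and the simulated examinee are simplifying assumptions for illustration, not an operational algorithm.

```python
# Minimal sketch of computerized adaptive testing with a 2PL model:
# pick the unused item with maximum information at the current theta,
# then re-estimate theta by maximum likelihood over a coarse grid.
# Item parameters and the simulated true ability are placeholders.
import math, random

items = [  # (discrimination a, difficulty b)
    (1.2, -1.0), (0.8, -0.5), (1.5, 0.0), (1.0, 0.5), (1.3, 1.0), (0.9, 1.5),
]

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Grid-search MLE of theta given (item_index, score) pairs."""
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(theta):
        ll = 0.0
        for j, u in responses:
            p = p_correct(theta, *items[j])
            ll += math.log(p) if u == 1 else math.log(1.0 - p)
        return ll
    return max(grid, key=loglik)

random.seed(0)
true_theta, theta_hat, used, responses = 0.7, 0.0, set(), []
for _ in range(4):  # administer four items adaptively
    j = max((k for k in range(len(items)) if k not in used),
            key=lambda k: info(theta_hat, *items[k]))
    used.add(j)
    u = 1 if random.random() < p_correct(true_theta, *items[j]) else 0
    responses.append((j, u))
    theta_hat = estimate_theta(responses)
    print(f"item {j}, response {u}, theta estimate {theta_hat:.1f}")
```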

Computerized Educational Measurement (Bunderson, Inouye and Olsen, 1989)
• Generation 1 (Computerized Testing): traditional tests converted for administration by computer.
• Generation 2 (Computerized Adaptive Testing): adaptation by item difficulty, discrimination, or item time.
• Generation 3 (Continuous Measurement): instruction closely integrated with measured assessment.
• Generation 4 (Intelligent Measurement): adaptation to individual learner profiles, expertise, and aptitude/trait complexes (e.g., Lee Cronbach, Richard Snow, and recently Phillip Ackerman) and to job performance or competence levels.

Computerized Educational Measurement (Bunderson, Inouye and Olsen, 1989)
• Test security is a natural driver from Generation 1 to 2.
• Measurement of competence is a natural driver from Generation 2 to 3.
• Measurement of generated work models is also a natural driver from Generation 2 to 3.
• Dynamic simulations can be viewed as a 3rd or 4th Generation form of adapting the display and the processing level in response to changes in the simulated system.
• Movement from academic items to performance testing is another significant branch of Generation 4.

Computers in Educational Assessment (U.S. Congress, OTA; Bunderson, Olsen and Greenberg, 1990)
1. Increase the frequency and variety of help systems [progress assessment] compared to high stakes assessment.
2. Greatly increase the frequency of formative assessment.
3. Increase the use of human judgment and measures of more complex, integrated, and strategic objectives.
4. Foster use of new item types and portable assessment devices.
5. Create infrastructure of integrated learning and assessment systems.
6. Encourage development of professional skills in integrating assessment and instruction.
7. Stimulate investment in improving technology-based assessment practice.
8. Maintain high professional standards as the field evolves.

Integration of Instruction and Assessment
“In the highly individualized systems of computer-based and computer-monitored instruction of the future, it will be almost impossible to distinguish the testing materials from the teaching materials.” (E. F. Lindquist, 1969, p. 368)
“Make measurement do a better job of facilitating learning for all individuals.” (Robert Linn, 1989, p. 9)
“NOT as a process quite apart from instruction, but an integral part of it.” (Ralph Tyler, 1951, p. 47)

Revised Bloom’s Taxonomy for Teaching and Learning (2001)

Performance Testing Design and Development Taxonomy
Levels of fidelity with typical item types:
• Knowledge Based Testing: Multiple Choice, Drag and Drop, Hot Area, Fill-in-the-Blank
• Performance Based Testing: Scenarios, Case Study, Essay
• Performance Testing: Simulations, Emulations, Live Application, Virtual Reality, Board Level
Supporting practice analysis techniques: Job/Task Analysis (JTA); Outcome/Competency Analysis
Related frameworks aligned to the levels of fidelity:
• Bloom’s Taxonomy: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation
• Gagne’s Learning Outcomes (correlated to learning theories and mental models): Verbal Information, Classify Concepts, Problem Solving, Cognitive Strategy
• Guilford’s Mental Processes: Memory, Cognition, Convergent Production, Divergent Production, Evaluation
• Revised Bloom’s Taxonomy (Anderson & Krathwohl)
• PTC Design & Delivery Focus Area

Validity Centered Design
• A validity model for enhancing licensure and certification test constructs and consequential validity.
• The validity centered design model can be employed with selected response items, constructed response items, and performance items.

Validity Centered Design (Bunderson, C.V. 2003; Olsen, J. 2005)
I. Design for Usability, Appeal, Positive Expectations (user-centered design)
  1. Overall appeal, relevance
  2. Usability
  3. Value and positive consequences (perceived)
II. Design for Inherent Construct Validity
  4. Content
  5. Thinking processes
  6. Structure (number of dimensions)
III. Design for Criterion-Related Validity
  7. Generalizability
  8. External convergent/discriminant
  9. Consequential (+/-)

Validity-Centered Design
• VCD creates a domain theory of measurable progressive attainments (competencies) along one or more measured scales of progress.
• The design side of VCD uses rapid prototyping continuing into well-evaluated cycles of Plan → Implement → Evaluate → Revise.
• It is not a sequential Analysis → Design → Develop → Implement → Evaluate model, but a cyclically improving model.

Five Subsystems of Learning Progress Systems
1. The Learning Progress Map or interpretive framework
2. The Measurement System to track and post progress on the Map
3. The Instructional System, both on-line and blended aspects
4. The Evaluation Plan and Process
5. The Implementation Plan and Process

What problem are we solving? Compare Two Test Information Curves
[Figure: two test information curves plotted from Low to High, one for a Certification or Mastery test with a cut score, one for a Learning Progress, Multiple Purposes test. A computational sketch of the contrast follows below.]
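As the figure's labels suggest, a certification or mastery test concentrates information near the cut score, while a learning-progress instrument spreads information along the scale. The sketch below illustrates that contrast by summing 2PL item information for two hypothetical item banks; all parameters are placeholders.

```python
# Minimal sketch of the contrast in the figure above: test information is the
# sum of 2PL item information, so items clustered at a cut score give a peaked
# curve (certification/mastery), while items spread along the scale give a
# flatter curve (learning progress). Item parameters are placeholders.
import math

def item_info(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def test_info(theta, bank):
    return sum(item_info(theta, a, b) for a, b in bank)

cut = 0.5
mastery_bank = [(1.3, cut)] * 10                             # difficulties clustered at the cut
progress_bank = [(1.3, -2.0 + 0.4 * k) for k in range(10)]   # difficulties spread -2.0 .. +1.6

for theta in [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]:
    print(f"theta {theta:+.1f}: mastery {test_info(theta, mastery_bank):5.2f}  "
          f"progress {test_info(theta, progress_bank):5.2f}")
```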

Student Assessment vs. Domain Measurement
[Figure: a test information curve (Low to High) for a single student assessment event.]
Measuring progress in a domain for multiple purposes (using continuous data):
• assessing moving progress
• basis for feedback to learners
• measures to evaluate design alternatives: instruction, adapting to individuals, implementation plans (takes on the challenge of building unidimensional scales)

What products result?
• Validity Centered Design
• Continuous progress measurement with immediate feedback and instruction, and cycles of system improvement based on principled design experiments
• Learning Progress Systems

Score Interpretation
• Validity Centered Design places emphasis on developing an interpretable framework for students and teachers, across the entire substantive content or proficiency domain.
• Requires construction of essentially unidimensional scales.
• Requires considerable interface design work to make it appealing, easy to use, clearly interpretable, and with useful navigation features.

Approaches to Validity: Consider the Measurement, Interpretation, Action Cycle
[Diagram: a cycle of Observe/Compare, Measure, Interpret, and Take Action, with a Qualitative Study element.]
It’s not the measurement instrument that is valid or invalid; it’s the inferences made from it (interpretation and action).

Need to add other elements for a complete validity argument:
• Evidential basis, test interpretation: Construct validity
• Evidential basis, test use: Construct validity + Relevance/utility
• Consequential basis, test interpretation: Construct validity + Value implications
• Consequential basis, test use: Construct validity + Relevance/utility + Value implications + Social consequences
“Constructing Construct Validity” (Messick, 1998)

Validity Centered Design
Validity centered design aspires to do more than guide and design an assessment system. It also includes a measurement system. Validity centered design may be used for multiple ongoing, cyclical evaluations of:
• not only the students,
• but also the measurement system itself,
• the instructional materials delivered on the same computers as the measurement system,
• the adaptive research system that includes adaptation to individual differences, and
• the strategic implementation of any learning and content management systems.

Virtual Laboratory Performance Tests
Virtual Laboratory Demonstration: open the hyperlink above. Login: sas01; Password: welcome. Click the checkmark to submit.

Conclusions and Recommendations
• Conduct a job or practice analysis to effectively ground the assessment.
• Use an analysis process to define the tasks and subtasks, followed by a synthesis process to create meaningful clusters or aggregates of the tasks and subtasks.
• Organize these meaningful clusters of tasks and subtasks in work models or performance models.

Conclusions and Recommendations
• Determine the most appropriate types of content display representation, including media, display, and response information.
• Ensure high fidelity to the thinking processes and content domain elements required in the work or performance models.
• Develop a test blueprint or specification that identifies the content domain dimensions and elements as well as the thinking processes or level of cognitive demand.

Conclusions and Recommendations
• Select and determine the appropriate balance between item and task types. The costs and benefits of using each item or task type should be investigated. Include the item and test types based on this reasoned analysis.
• The assessment designer should create sufficient test items of each required item type and pilot test the items with a representative sample of individuals from the target population for the assessment.

Conclusions and Recommendations
• Analyze item and test results from the pilot tests.
• Select specific items, tasks, and scoring criteria that should be used for the operational assessment.
• Design and validate scoring models for the exam, subtests, tasks, and items.
• Set a performance standard or proficiency levels for the exam.
• Conduct ongoing psychometric analysis of the items and tests.

Conclusions and Recommendations
• Conduct ongoing validity, reliability, fairness, and usability analyses of exam scores and interpretations.
• Investigate the value of evidence centered design and validity centered design approaches.
• Develop a plan and operational procedures to ensure appropriate security for the test items, tasks, scoring criteria, and examinee results.

Final Comment
“Science [validity] is an ongoing subject with new research always modifying old ideas.” Richard Feynman, letter to Sandor Solt, April 25, 1969