Constructing Validity Using Validity Centered Design James B. Olsen, Alpine Testing Solutions C. Victor Bunderson, Edumetrics Institute
Presentation Outline
1. Historical Roots
2. Traditional Test Development Process
3. Extensions for Computerized & Online Testing
4. Performance Work Models vs. Objectives
5. Computerized Educational Measurement
6. Taxonomies for Learning and Performance
7. Validity Centered Design
8. Conclusions and Future Recommendations
Historical Roots E. F. Lindquist Considerations in Objective Test Construction (1951) “For the present, it seems best to attempt to incorporate in the achievement test situation as much as possible of the same complexity that characterizes the criterion situation…In such tests the most important consideration is that the test questions require the examinee to do the same things, however complex, that he is required to do in the criterion situations.” (Italics in original)
Historical Roots E. F. Lindquist (1969) The Impact of Machines on Educational Measurement “During these years [1949-1969] there has been little or no experimentation with, or successful development of, new and improved types of tests measuring hitherto unmeasured objectives…[Test developers] have accepted restrictions upon test development that may definitely have prevented the use of, or even the search for, new, and possibly improved types of test exercises…If he could not find an item type well adapted to a specific purpose, he was free to use his ingenuity to invent or develop new and improved types of test exercises to serve that purpose.”
Historical Roots
“Over the next decade or two, computer and audiovisual technology will dramatically change the way individuals learn as well as the way they work. Technology will also have a profound impact on the ways in which knowledge, aptitudes, competencies, and personal qualities are assessed and even conceptualized.”
1. New and more varied interactive delivery systems
2. Heightened individuality in learning and thinking
3. Increased premium on adaptive learning
4. Heightened emphasis on individuality in assessment
5. Increased premium on adaptive measurement, dynamic measurement
Samuel Messick (1988), The once and future issues of validity: Assessing the meaning and consequences of measurement.
Constructing Validity Throughout the Test Development Process
1. Test Planning
2. Content/Task Definition
3. Job/Practice Analysis
4. Test Specification
5. Item Development
6. Test Design and Assembly
7. Test Production
8. Test Administration
9. Test Scoring
10. Performance Standard Setting
11. Reporting Results
12. Item Banking
13. Technical Reports/Validation
Instruction/Assessment: A Cyclical Design Process
Key Test Score Validity Questions
• What complex of knowledge, skills, and abilities should be assessed?
• What behaviors or performances will reveal the constructs and skills to be tested?
• What tasks or situations should elicit those behaviors? (Messick, 1994)
Extensions for Computerized and Online Testing Create and evaluate testable theories of the content domain. Conduct both Job/Task Analysis and Job/Task Synthesis.
Content Domain Analysis Domain Modeling
Create knowledge and skill competency models and order these competencies by content categories (content levels) and substantive processes (thinking skills).
Content Domain Analysis and Modeling
Domain Analysis: Determine the required dimensions and ordering of content, thinking skills, knowledge, and performance tasks within and across dimensions.
Domain and Work Modeling: Create models of realistic, job-like performance situations.
Domain Theory The term “domain theory” in educational measurement was used in a comprehensive sense by Messick (1995). “A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serve as a guide to the rational development and scoring of performance tasks and other assessment formats. At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.” (Italics added by author)
Performance Work Models vs. Objectives While instructional objectives have provided a cornerstone for the practice and science of instruction, they have also locked us into a lexically based conceptual system…A performance work model is an integrated unit of practice with one or more elements of knowledge and skill that allows replication of both information and interactions. (Bunderson, Gibbons, Olsen and Kearsley, 1981)
Types of Performance Work Models Work models provide settings in which the learner can 1) converse using new vocabulary and concepts, 2) perform new procedures, or 3) make predictions and solve new problems. These settings should have visible results so that the learner, teacher or other individuals can obtain information for judging the success or failure of the performances.
Performance Work Model Synthesis
Performance work model synthesis is discussed in two technical papers and a book chapter:
Bunderson, C. V., Gibbons, A. S., Olsen, J. B., & Kearsley, G. S. (1981). Work models: Beyond instructional objectives. Instructional Science, 10, 205-215.
Gibbons, A. S., Bunderson, C. V., Olsen, J. B., & Robertson, J. (1995). Work models: Still beyond instructional objectives. Machine-Mediated Learning, 5(3&4), 221-236.
Gibbons, A. S., & Fairweather, P. G. (1998). Instructional strategy III: Fragmentation and integration. In Gibbons & Fairweather, Computer-Based Instruction: Design and Development (Ch. 15, pp. 278-296). Englewood Cliffs, NJ: Educational Technology Publications.
Expand the Test Blueprint
• Build an interactive test specification blueprint that can be shared internationally and modified by any SME.
• Include exemplary and multiple item types that SMEs deem useful for each major content division.
• Use creativity and rapid prototyping to select and display relevant assessment situations and item types.
Expand the Item Banking System
• Focus on item ideas before item writing.
• Write and store items in an item/task/simulation/rubric banking system.
• Generate and critique item ideas from multiple perspectives before you start to write the item or task.
• Extend the item bank to include multimedia components.
Item Idea Visualization Process
Expand the Test Review
• Conduct dynamic review by SMEs of test items and tasks within the test delivery system.
• Use an interactive web-conferencing system with a WYSIWYG display.
• Review for item accuracy, item relevancy, item importance, item grammar and syntax, and item bias.
Scientific Research on Validity, Reliability, Fairness
• Create testable domain theories that can be verified and validated.
• Collect ongoing evidence of exam validity, reliability, and fairness.
• Continually improve the exam and items based on the evidence collected.
Analyses include test analysis, item analysis, IRT calibrations, differential item functioning, and suggested revisions (see the item analysis sketch below).
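As a concrete illustration of the item-level evidence listed above (not part of the original presentation), here is a minimal classical item analysis sketch in Python: proportion-correct difficulty and point-biserial discrimination against the rest-of-test score. The function name and sample response data are hypothetical.

import numpy as np

def classical_item_analysis(responses):
    """Classical item statistics for a persons-by-items matrix of 0/1 scores.

    Returns per-item difficulty (proportion correct) and discrimination
    (point-biserial correlation with the rest-of-test score).
    """
    responses = np.asarray(responses, dtype=float)
    n_persons, n_items = responses.shape
    difficulty = responses.mean(axis=0)          # classical p-values
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest_score = total - responses[:, j]     # exclude the item itself
        discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return difficulty, discrimination

# Example: 5 examinees by 3 items (hypothetical data)
data = [[1, 0, 1],
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 1]]
p, r = classical_item_analysis(data)
print("difficulty:", p)
print("discrimination:", r)

In practice these classical statistics would sit alongside IRT calibrations and DIF analyses; the sketch only shows the simplest layer of that evidence.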
Explore alternative scoring algorithms for complex items and tasks
• Dichotomous scoring
• Polytomous scoring
• Partial credit scoring
• Weighted items, options, logical answer expressions, subscores
• Scoring on multiple dimensions
(See the scoring sketch below.)
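To make the list above concrete, here is an illustrative sketch only (not the authors' scoring engine) contrasting dichotomous, partial-credit, and weighted-option scoring for a single item. All item data and function names are hypothetical.

def dichotomous_score(response, key):
    """All-or-nothing scoring: 1 if the response matches the key exactly, else 0."""
    return 1 if response == key else 0

def partial_credit_score(selected, key_set, max_points):
    """Award a fraction of max_points for each correct element selected,
    with a simple penalty for incorrect selections (floored at zero)."""
    selected, key_set = set(selected), set(key_set)
    correct = len(selected & key_set)
    incorrect = len(selected - key_set)
    raw = (correct - incorrect) / len(key_set)
    return max(0.0, raw) * max_points

def weighted_option_score(selected_option, option_weights):
    """Polytomous scoring: each option carries its own weight
    (e.g., best answer = 2, acceptable = 1, wrong = 0)."""
    return option_weights.get(selected_option, 0)

# Example usage with hypothetical item data
print(dichotomous_score("B", "B"))                                 # 1
print(partial_credit_score({"A", "C", "D"}, {"A", "B", "C"}, 4))   # partial credit
print(weighted_option_score("C", {"A": 0, "B": 1, "C": 2}))        # 2

Scoring on multiple dimensions or with logical answer expressions would extend these functions, but the same separation of response capture from scoring rule applies.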
Explore New Testing and Measurement Models
• Computerized mastery testing
• Computerized adaptive testing
• Testlet adaptive testing
• Linear on-the-fly testing
• Multistage testing
• Decision-theoretic testing
• Performance testing
• Blended testing approaches
(See the adaptive item selection sketch below.)
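One widely used ingredient of computerized adaptive testing is maximum-information item selection under a two-parameter logistic (2PL) IRT model. The sketch below illustrates that general idea only; it is not the specific algorithm behind any model named above, and the item pool, ability updates, and function names are hypothetical.

import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_next_item(theta, item_pool, administered):
    """Pick the unadministered item with maximum information at theta."""
    candidates = [(i, item_information(theta, a, b))
                  for i, (a, b) in enumerate(item_pool) if i not in administered]
    return max(candidates, key=lambda pair: pair[1])[0]

# Tiny hypothetical pool of (discrimination a, difficulty b) pairs
pool = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.2)]
administered = set()
theta = 0.0                      # provisional ability estimate
for _ in range(3):
    j = select_next_item(theta, pool, administered)
    administered.add(j)
    # In a real CAT, theta would be re-estimated (e.g., by maximum likelihood
    # or EAP) after scoring the response to item j; here we simply nudge theta
    # to illustrate the adaptive loop.
    theta += 0.3
    print("administered item", j, "new theta", round(theta, 2))

Multistage, testlet, and linear on-the-fly designs replace the item-by-item loop with preassembled modules or forms, but all trade off measurement precision against exposure and assembly constraints.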
Computerized Educational Measurement (Bunderson, Inouye and Olsen, 1989)
Generation 1: Computerized Testing, with traditional tests converted for administration by computer.
Generation 2: Computerized Adaptive Testing, with adaptation by item difficulty, discrimination, or item time.
Generation 3: Continuous Measurement, with instruction closely integrated with measured assessment.
Generation 4: Intelligent Measurement, with adaptation to individual learner profiles, expertise, and aptitude/trait complexes (e.g., Lee Cronbach, Richard Snow, and recently Phillip Ackerman) and to job performance or competence levels.
Computerized Educational Measurement (Bunderson, Inouye and Olsen, 1989)
• Test security is a natural driver between Generations 1 and 2.
• Measurement of competence is a natural driver from Generation 2 to 3.
• Measurement of generated work models is also a natural driver from Generation 2 to 3.
• Dynamic simulations can be viewed as a 3rd- or 4th-Generation form of adapting the display and the processing level in response to changes in the simulated system.
• Movement from academic items to performance testing is another significant branch of Generation 4.
Computers in Educational Assessment (U.S. Congress, OTA; Bunderson, Olsen and Greenberg, 1990)
1. Increase the frequency and variety of help systems [progress assessment] compared to high-stakes assessment.
2. Greatly increase the frequency of formative assessment.
3. Increase the use of human judgment and measures of more complex, integrated, and strategic objectives.
4. Foster use of new item types and portable assessment devices.
5. Create an infrastructure of integrated learning and assessment systems.
6. Encourage development of professional skills in integrating assessment and instruction.
7. Stimulate investment in improving technology-based assessment practice.
8. Maintain high professional standards as the field evolves.
Integration of Instruction and Assessment
“In the highly individualized systems of computer-based and computer-monitored instruction of the future, it will be almost impossible to distinguish the testing materials from the teaching materials.” (E. F. Lindquist, 1969, p. 368)
“Make measurement do a better job of facilitating learning for all individuals.” (Robert Linn, 1989, p. 9)
“NOT as a process quite apart from instruction, but an integral part of it.” (Ralph Tyler, 1951, p. 47)
Revised Bloom’s Taxonomy for Teaching and Learning (2001)
Performance Testing Design and Development Taxonomy

Level of Fidelity and Item Types:
• Knowledge-Based Testing: Multiple Choice, Drag and Drop, Hot Area, Fill-in-the-Blank
• Performance-Based Testing: Scenarios, Case Study, Essay
• Performance Testing: Simulations, Emulations, Live Application, Virtual Reality, Board Level

Supporting Practice Analysis Techniques: Job/Task Analysis (JTA); Outcome/Competency Analysis

Related taxonomies shown in the table:
• Bloom's Taxonomy: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation
• Gagne's Learning Outcomes (correlated to learning theories and mental models): Verbal Information, Classify Concepts, Problem Solving, Cognitive Strategy
• Guilford's Mental Processes: Memory, Cognition, Convergent Production, Divergent Production, Evaluation
• Revised Bloom's Taxonomy (Anderson & Krathwohl)
• PTC Design & Delivery Focus Area
Validity Centered Design
A validity model for enhancing licensure and certification test constructs and consequential validity. The validity centered design model can be employed with selected-response items, constructed-response items, and performance items.
Validity Centered Design (Bunderson, C.V. 2003; Olsen, J. 2005)
I. Design for Usability, Appeal, Positive Expectations (user-centered design)
1. Overall appeal, relevance
2. Usability
3. Value and positive consequences (perceived)
II. Design for Inherent Construct Validity
4. Content
5. Thinking processes
6. Structure (number of dimensions)
III. Design for Criterion-Related Validity
7. Generalizability
8. External convergent/discriminant
9. Consequential (+/-)
Validity-Centered Design
VCD creates a domain theory of measurable progressive attainments (competencies) along one or more measured scales of progress. The design side of VCD uses rapid prototyping continuing into well-evaluated cycles of Plan-Implement-Evaluate-Revise. It is not a sequential Analyze-Design-Develop-Implement-Evaluate model, but a cyclically improving model.
Five Subsystems of Learning Progress Systems
1. The Learning Progress Map, or interpretive framework
2. The Measurement System to track and post progress on the Map
3. The Instructional System – both on-line and blended aspects
4. The Evaluation Plan and Process
5. The Implementation Plan and Process
What problem are we solving? Compare two test information curves.
[Figure: two test information curves (test information, low to high), one for a certification or mastery test with a cut score, one for a test serving learning progress and multiple purposes.]

Student Assessment vs. Domain Measurement
[Figure: test information (low to high) for a single student assessment event compared with measuring progress in a domain.]
Measuring progress in a domain for multiple purposes (using continuous data):
• assessing moving progress
• basis for feedback to learners
• measures to evaluate design alternatives: instruction, adapting to individuals, implementation plans (takes on the challenge of building unidimensional scales)
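A standard IRT formulation behind the test information comparison above (added for reference, not from the original slides): the test information function is the sum of the item information functions, and the standard error of the ability estimate shrinks where information is high. For 2PL items:

\[
I(\theta) = \sum_{i=1}^{n} I_i(\theta), \qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr], \qquad
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}
\]

A certification or mastery test concentrates information near the cut score, whereas a learning-progress instrument spreads information across the proficiency scale to support the multiple purposes listed above.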
What products result? Validity Centered Design yields Learning Progress Systems: continuous progress measurement with immediate feedback and instruction, and cycles of system improvement based on principled design experiments.
Score Interpretation
Validity Centered Design places emphasis on developing an interpretable framework for students and teachers across the entire substantive content or proficiency domain. This requires construction of essentially unidimensional scales, and considerable interface design work to make the framework appealing, easy to use, clearly interpretable, and equipped with useful navigation features.
Approaches to validity: consider the measurement, interpretation, action cycle.
[Cycle diagram: Observe/Compare → Measure → Interpret → Take Action; Qualitative Study]
It’s not the measurement instrument that is valid or invalid; it’s the inferences made from it (interpretation and action).
Need to add other elements for a complete validity argument
Evidential Basis × Test Interpretation: Construct Validity
Evidential Basis × Test Use: Construct Validity + Relevance/Utility
Consequential Basis × Test Interpretation: Construct Validity + Value Implications
Consequential Basis × Test Use: Construct Validity + Relevance/Utility + Value Implications + Social Consequences
“Constructing Construct Validity” (Messick 1998)
Validity Centered Design Validity centered design aspires to do more than guide and design an assessment system. It also includes a measurement system. Validity centered design may be used for multiple ongoing, cyclical evaluations of not only the students, but also the measurement system itself, the instructional materials delivered on the same computers as the measurement system, the adaptive research system that includes adaptation to individual differences, and the strategic implementation of any learning and content management systems.
Virtual Laboratory Performance Tests
Virtual Laboratory Demonstration
Open the above hyperlink. Login: sas01. Password: welcome. Click the checkmark to submit.
Conclusions and Recommendations
Conduct a job or practice analysis to effectively ground the assessment. Use an analysis process to define the tasks and subtasks, followed by a synthesis process to create meaningful clusters or aggregates of the tasks and subtasks. Organize these meaningful clusters of tasks and subtasks in work models or performance models.
Conclusions and Recommendations
Determine the most appropriate types of content display representation, including media, display, and response information. Ensure high fidelity to the thinking processes and content domain elements required in the work or performance models. Develop a test blueprint or specification that identifies the content domain dimensions and elements as well as the thinking processes or level of cognitive demand.
Conclusions and Recommendations
Select and determine an appropriate balance between item and task types. The costs and benefits of using each item or task type should be investigated. Include the item and test types based on this reasoned analysis. The assessment designer should create sufficient test items of each required item type and pilot test the items with a representative sample of individuals from the target population for the assessment.
Conclusions and Recommendations
Analyze item and test results from the pilot tests. Select specific items, tasks, and scoring criteria that should be used for the operational assessment. Design and validate scoring models for the exam, subtests, tasks, and items. Set a performance standard or proficiency levels for the exam. Conduct ongoing psychometric analysis of the items and tests.
Conclusions and Recommendations
Conduct ongoing validity, reliability, fairness, and usability analyses of exam scores and interpretations. Investigate the value of evidence centered design and validity centered design approaches. Develop a plan and operational procedures to ensure appropriate security for the test items, tasks, scoring criteria and examinee results.
Final Comment “Science [validity] is an ongoing subject with new research always modifying old ideas.” Richard Feynman, letter to Sandor Solt April 25, 1969