Student Assessment
Educational psychologists have been major players in the measurement of student performance, virtually defining a national achievement curriculum. The keystone is the standardized test. Alternatives have appeared under labels like authentic or performance-based assessment, along with packages for placement in special programs (learning handicapped, emotionally disturbed, gifted, English-language learners).
Linking these assessments are criteria and methods to establish validity and reliability and an overarching system of constructs and standards. Validity assures that an assessment measures a well-defined construct. For example, a reading test should test “reading.” Reliability refers to the trustworthiness of an instrument. Groups of judges rating student compositions must agree among themselves.
Several tensions trouble the field of student assessment, reflecting the importance of schooling for the individual and the society. Most significant is locus of control, pitting the classroom teacher against more distant authorities (the school district, the state, or even the federal government). In developed countries, centralized testing often determines admission to secondary and postsecondary schooling. Local control of U.S. schools is a long-standing tradition. This tradition has been challenged, however, as states and the federal government provide increased funding for public education and concomitant demands for accountability. Some teachers have resisted pressure to “teach to the test,” offering alternative methods of their own devising.
Multiple-Choice Methods
The standardized achievement test is, without doubt, the most important creation of educational psychologists. It affects most individuals throughout their lives.
From kindergartners’ school readiness to examinations for entry to graduate programs, individuals are judged by marks on an answer sheet.
Test development begins with construct definition, typically in terms of behavioral objectives. For example, identify the topic sentence in a paragraph or calculate the sum of four 2-digit numbers in column format. Writing and revising items is the next step. The item stem poses the question, and the choices provide answers. One is correct and the others reflect degrees of wrongness. Plausible alternatives increase item difficulty. Scripted instructions determine test administration, including time allocations, scoring information, and interpretation. Publishers conduct extensive trial runs, assuring users of test reliability and providing normative data like averages and percentiles. They also offer scoring services. The teacher does little more than distribute booklets, read instructions, and package booklets for shipment.
Test development is only part of a larger enterprise. Test theorists and publishers rely on psychometric methods to transform scores from “percentage correct” to normative indicators like grade-level equivalent, percentile, and normal curve equivalent. These indicators provide test users with general measures that compare individuals with a larger population. Classical psychometrics began with the normal “bell-shaped” curve but now employs a wide range of techniques, including factor analysis, item-response theory, and generalizability design and analysis.
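The two most common of these transformations can be sketched in a few lines. The function names and the small norm sample below are illustrative, not drawn from any published instrument; the normal curve equivalent (NCE) scale, however, is conventionally defined with a mean of 50 and a standard deviation of 21.06, so that percentile ranks 1, 50, and 99 coincide with NCE scores of 1, 50, and 99.

```python
from statistics import NormalDist

def percentile_rank(raw_score, norm_group):
    """Percent of the norm group at or below this raw score (ties count half)."""
    below = sum(1 for s in norm_group if s < raw_score)
    ties = sum(1 for s in norm_group if s == raw_score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

def normal_curve_equivalent(pr):
    """Map a percentile rank onto the NCE scale (mean 50, SD 21.06)."""
    z = NormalDist().inv_cdf(pr / 100.0)  # z-score matching the percentile
    return 50.0 + 21.06 * z

# Hypothetical norm-group raw scores, for illustration only
norms = [42, 48, 51, 55, 55, 58, 61, 64, 67, 73]
pr = percentile_rank(58, norms)   # 55.0
nce = normal_curve_equivalent(pr)
```

Unlike the percentile scale, which bunches scores near the middle of the distribution, the NCE scale is equal-interval, which is why it is preferred for averaging and comparing group performance.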
Criterion-referenced methods appeared in the 1950s as an alternative to normative indicators; the focus is on whether students meet absolute, predefined standards. Standard setting begins with professional judgment about performance levels, or what constitutes adequate and exceptional achievement. Tests remain the same, but scores are interpreted differently.
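The contrast with norm-referencing can be made concrete: rather than ranking a student against a norm group, the same percent-correct score is read against fixed cut points. The cut scores below are invented for illustration; in practice they come from panels of professional judges.

```python
def performance_level(pct_correct,
                      cuts=((90, "exceptional"), (70, "adequate"))):
    """Interpret a percent-correct score against predefined standards.
    Cut points are hypothetical; real ones are set by expert panels."""
    for cut, label in cuts:          # check the highest standard first
        if pct_correct >= cut:
            return label
    return "below standard"
```

Under this interpretation, a score of 75 means “adequate” regardless of how many other test-takers scored higher or lower.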
Standardized tests serve various purposes in schools. They compare students and schools. For example, students with high test scores are admitted to prestigious universities or may be identified in third grade as gifted. Parents search out schools with high achievement scores. School improvement and program effectiveness are gauged by standardized measures; a few points up or down can leave educators celebrating or depressed.
Tests influence other facets of schooling. In the elementary grades, teachers monitor reading and mathematics achievement by curriculum-embedded, end-of-unit tests that mirror standardized instruments. Test batteries accompany high school and college textbooks, allowing instructors the convenience of cut-and-paste examinations, and publishers provide do-it-yourself manuals for constructing tests.
Performance-Based Assessment
Standardized tests have had critics from the outset. Critics charge that the tests measure low-level skills and that score differences due to socioeconomic status, ethnicity, and language background signify test bias. Not until the 1970s did alternatives emerge. Movements like whole language, hands-on math, and discovery-based science militated against standardization and externally mandated tests, stressing instead the teacher’s role in adapting instruction to student needs and the validity of portfolios, exhibitions, and projects.
Performance is the distinctive feature that sets these methods apart from multiple-choice tests. The techniques span a wide range. At one end are on-demand writing tests; students have an hour or less to write a composition on a predetermined prompt, with no resources, no questions, no chance to revise. At the other extreme are free-form portfolios, collections assembled over weeks or months to demonstrate learning. Individual students decide what to put in the folder and may even judge the quality of the collection.
How are performance samples evaluated? Olympic games like diving and gymnastics serve as metaphors. Judges confer about the characteristics of a quality performance and then evaluate each participant on rating scales or rubrics for specific performance (analytic ratings) and for overall quality (holistic ratings). Psychometric techniques apply to some of these judgments; interrater agreement provides an index of consistency, for instance. Performance-based methods possess considerable face validity; students must directly “do” what they have learned, rather than simply select a correct answer. Trustworthiness is more problematic. Although raters can learn to make consistent judgments, students may look very different depending on the task.
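Interrater consistency of this kind is often summarized with a chance-corrected statistic such as Cohen’s kappa. The sketch below assumes two raters assigning holistic rubric scores to the same set of compositions; the scores themselves are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal distribution
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:              # both raters used a single category
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1.0 - expected)

# Two raters scoring eight compositions on a 1-4 holistic rubric
a = [4, 3, 3, 2, 4, 1, 2, 3]
b = [4, 3, 2, 2, 4, 1, 2, 4]
```

The correction matters: raw percent agreement overstates consistency whenever raters cluster their scores in a few rubric levels, since some matches would occur by chance alone.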
Performance-based methods sprang from practice rather than policy, from classrooms rather than state houses, from teachers rather than publishers. They require human judgment and are expensive. Nonetheless, substantial efforts are underway to adapt these methods for large-scale assessment. On-demand writing tests are now commonplace. Several states complement multiple-choice tests with projects and portfolios, and Vermont relies entirely on these approaches. The demand for high standards provides continuing impetus for the use of performance assessment, the argument being that there is no substitute for demonstrating competence in complex and demanding tasks. For teachers, the connection to classroom practice is compelling, as is the opportunity to gauge student interest and motivation.
Current assessment practice varies from the primary grades through graduate school. Young students are just learning the school game, and standardized tests reflect early home preparation more than individual potential; performance assessments, therefore, are more appropriate. From the late elementary grades through entrance to postsecondary education, multiple-choice tests reach a peak. Afterward, performance samples, such as application letters, thesis papers, and dissertations, become critical.
Assessment for Categorical Placement
This topic does not fit under the previous headings but has become increasingly important because of government funding of categorical programs like special education. Regulations govern assessment practices, but psychologists play important roles in setting local policy and actual implementation. Government funding for disadvantaged students depends on family characteristics like poverty more than on achievement. Assessment is important as part of the debate about accountability for program effectiveness, that is, whether the investment is justified by student learning.
Categorical programs depend heavily on assessment for selection of students, determination of appropriate services, and exit to regular education. For these assessments, professionals (often psychologists) employ regulated (and expensive) clinical methods, combining teacher recommendations, standardized instruments, interviews and observations, and family consultations.
Teacher Assessment
Only in recent decades has the evaluation of teachers emerged as a significant research topic. Assessment methods vary across levels of teacher development: admission to preservice programs, initial licensure, and induction leading to tenure. The trend is to use standardized procedures for entry-level decisions (e.g., admission to training programs) and performance-based methods for professional advancement decisions (e.g., tenure).
Because of concerns about applicant quality, college students planning to enter teaching must now demonstrate basic skills in many states. The multiple-choice tests resemble those given to high school students, with the same advantages and limitations. High failure rates by underrepresented minorities mean that many potential teacher candidates are denied access to the field. The tests have been challenged as biased and unrelated to teaching potential; the counterargument is that every teacher should possess a minimum level of competence.
Following preservice preparation and during the first few years of service, teachers are in turn licensed and then inducted into tenure positions. During these steps, which most states regulate heavily, candidates undergo serious and sustained evaluation. Prior to 1990, the National Teacher Examination (NTE), a multiple-choice test covering teaching practices and content knowledge, often served for licensure. The NTE was criticized as lacking validity because it did not assess “real teaching.” In the late 1980s, Educational Testing Service introduced Praxis, a combination of computer-based tests of basic skills, paper-pencil exercises of subject-matter knowledge, and performance-based observations. Praxis has greater face validity and appears more closely linked to practice.
Professional preparation in teaching is “thin” compared with other fields. You can track the progress of doctors, nurses, lawyers, and accountants by certificates on office walls. Once a teacher has acquired tenure, however, opportunities for professional development are scarce and go unrecognized. In 1987, the National Board for Professional Teaching Standards was formed to develop and promote methods for assessing excellent teaching. Teachers desiring to move beyond initial licensure can now apply for an intensive experience composed of ten performance exercises; the teacher prepares six at the local school, and four are administered during a one-day session in an assessment center. The classroom exercises include instructional videotapes and student work samples, which the candidate must analyze and interpret. At the assessment center, the candidate reviews prescribed lesson materials and designs sample lessons. Panels of expert teachers rate each portfolio and award certificates of accomplishment. The standards are high, and pass rates have been modest. Some states now give certificated teachers pay incentives, but the movement has yet to catch on.
Two final issues warrant brief mention. The first is reliance on student achievement as an indicator of teaching effectiveness. Teacher associations like the National Education Association and the American Federation of Teachers oppose this policy, arguing that student scores reflect many factors the teacher cannot control. States increasingly hold schools responsible for achievement standards. Although the focus is the school, teachers share incentive payments for exceptional school-wide performance and must deal with the consequences of low scores.
The second issue centers on teacher knowledge of assessment procedures. Externally mandated tests receive most attention, but teachers also rely on their own observations and classroom assessments to judge student learning. How trustworthy are teacher judgments? How knowledgeable are teachers about standardized tests? Surveys show that teachers receive little preparation in assessment concepts and methods and typically rely on intuition and prepackaged methods. Some educators have proposed the concept of “assessment as inquiry” to support classroom-based methods like portfolios and exhibitions, but with little effect on practice thus far.
Administrator Assessment
Teacher evaluation has not captured the same attention as student assessment, but even less attention has been given to the assessment of principals and superintendents. One might think that school leaders should be required to demonstrate their knowledge and skill, both to enter their positions and as part of continuing professional development. In fact, work in this area is sparse, with few contributions by psychologists. The research foundations are limited but are emerging around leadership concepts and practical needs.
Administrators typically attend more to budgets and personnel matters than to teaching and student learning, except when schools stand out as exceptional or in dire straits. Research suggests that effective schools are correlated with strong administrative leadership; unfortunately, less is known about how to assess or support leadership. The criterion for effectiveness has typically been standardized student performance. Analogous to an assembly-line model, the administrator’s task is to increase the output. Newer models stress human relations and organizational integrity, but much remains to be done.
What Has Endured and What Is Valuable?
Standardized multiple-choice tests will most likely remain the primary indicators of student achievement. Performance-based methods for large-scale accountability, a closer link between classroom assessment and local reporting of student achievement, and clinical strategies like the best of those found in categorical programs all offer alternative assessment models for the future. The new methods have stimulated public debate about the outcomes of schooling and about the trustworthiness of methods for judging the quality of educational programs. Equity issues are a significant element in these debates. Assessment data show that U.S. schools are doing reasonably well for students in affluent neighborhoods but are failing families in the inner cities and poor rural areas. Indicators can serve to blame victims or to guide improvements. We have much yet to learn about methods for supporting the second strategy.
Bibliography
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Author.
- American Federation of Teachers, National Council on Measurement in Education, & National Education Association. (1990). Standards for teacher competence in educational assessment of students. Educational Measurement: Issues and Practice, 9(4), 30–32.
- Berliner, D. C., & Calfee, R. C. (Eds.). (1996). Handbook of educational psychology. New York: Macmillan. Part 2 of the Handbook includes chapters on individual differences among students, emphasizing a broad span of assessment concepts and practices in the achievement domain, along with motivation, attitudes, and aptitudes, ranging from preschool through adulthood. Chapter 23 describes methods for teacher evaluation from selection through licensing and induction and on to professional certification, including descriptions of the NBPTS and Praxis.
- Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971). Handbook of formative and summative evaluation of student learning. New York: McGraw-Hill. A classic presentation of a broad range of testing and assessment methods based on behavioral principles that undergirded the design of standardized tests, as well as many classroom and textbook assessments from the 1960s up through the present time.
- Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage. Describes in readable prose (except for a few technical asides) the concepts and methods underlying standardized tests, along with techniques for addressing problems of group bias.
- Candoli, I. C., Cullen, K., & Stufflebeam, D. L. (Eds.). (1997). Superintendent performance evaluation: Current practice and directions for improvement. Boston, MA: Kluwer. Part of a series that uses the Personnel evaluation standards as a foundation.
- Glaser, R., & Linn, R. (1997). Assessment in transition. Stanford, CA: National Academy of Education. The focus of this paperback is the National Assessment of Educational Progress, the “nation’s report card.” But the book also covers a broad range of issues in the assessment of student achievement in nontechnical language and sets the stage for discussions of state and national policy about how to find out how students are doing in our schools.
- Herman, J. L., Aschbacher, P. R., & Winters, L. (1992). A practical guide to alternative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.
- Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of educational programs, projects, and materials. New York: McGraw-Hill. Several organizations have established standards for educational assessment practices. Implementation of the standards is voluntary in most instances, but the quality of the recommendations is uniformly high.
- Linn, R. L. (Ed.). (1989). Educational Measurement (3rd ed.). New York: Macmillan. The technical foundations for measuring student achievement, based largely on multiple-choice tests. Although the techniques have broader applications, most of the examples assume “right-wrong” answers. The handbook covers validity and reliability, methods for scaling achievement, along with special chapters on cognitive psychology and measurement, computers and testing, and practical applications of test scores.
- Mitchell, J. V., Jr., Wise, S. L., & Plake, B. S. (Eds.). Assessment of teaching: Purposes, practices, and implications for the profession. Hillsdale, NJ: Erlbaum. Describes a wide range of methods for assessing teaching knowledge and practice for selection and tenure decisions at the local level, grounded in the concept that improving education depends on improving teaching.
- Nettles, M. T., & Nettles, A. L. (Eds.). (1995). Equity and excellence in educational testing and assessment. Boston, MA: Kluwer.
- Office of Technology Assessment. (1992). Testing in American schools: Asking the right questions (OTA-SET-519). Washington, DC: U.S. Government Printing Office. An up-to-date history of standardized testing.
- Phye, G. D. (Ed.). (1997). Handbook of classroom assessment: Learning, adjustment, and achievement. San Diego, CA: Academic Press.
- Richardson, V. (Ed.). Handbook of research on teaching (4th ed.). New York: Macmillan. This series offers an important historical perspective on evaluation of teachers and teachers’ evaluations of students. The first edition discusses various methods for studying teaching but does not connect these with evaluation per se. The second edition contains a chapter on assessment of teacher competence as well as a chapter on observation as a method for teacher evaluation. The third edition includes a chapter on the “measurement of teaching,” which describes relations between teacher activities and student performance on standardized tests.
- Shinkfield, A. J., & Stufflebeam, D. L. (Eds.). (1995). Teacher evaluation: Guide to effective practice. Boston, MA: Kluwer Academic Publishers. Offers a review of current research along with practical suggestions.
- Stiggins, R. J. (1994). Student-centered classroom assessment. New York: Merrill.
- Wiggins, G. P. (1993). Assessing student performance. San Francisco: Jossey-Bass.