Standardized Testing

Since the early 20th century, the United States has been the foremost developer and consumer of testing technology in the world. Tests have been used widely by the U.S. military, government and civilian employers, and educational institutions to improve selection, placement, and promotion decisions. However, the pervasiveness of testing in American life, starting as early as age six, has called into question the purported benefits of testing, led to intense scrutiny of organizational decisions, and raised concerns about the general impact of testing on society. Although some of these criticisms are certainly justified, standardized tests, the most common targets of public rebuke, are among the best assessment devices available and, in our view, do not deserve the bad rap they have been given in the popular press.

The term standardized tests originally referred to tests using uniform administration procedures. Over time, the term has evolved to describe tests that measure constructs related to academic achievement and aptitude, that are administered to a very large number of examinees on a regular basis (usually in a group format), and that have a variety of normative information available for interpreting scores. Today, all modern standardized tests are (a) constructed, validated, and normed using large and diverse samples, (b) routinely updated to reflect changes in curricula and social context, (c) administered under uniform conditions to eliminate extraneous sources of variation in scores, and (d) examined using advanced psychometric methods (e.g., item response theory) to detect and eliminate measurement and predictive bias. All of these features help make standardized tests reliable and valid assessments of the constructs they are intended to measure. The tests are continuously being improved and revised to incorporate advances in psychometric theory, substantive research, and testing technology.

Standardized tests can be roughly grouped into three general types: (a) educational achievement and aptitude tests, (b) military and civil service classification tests, and (c) licensure and certification exams. Each type of test has a different purpose, but the main psychometric features are similar. In the sections that follow, a brief overview of each test type is provided, followed by a discussion of the important issues regarding standardized test use and future development.

Educational Achievement and Aptitude Tests

Large-scale educational assessments constitute by far the largest portion of standardized tests. These include instruments designed to measure student achievement in primary and secondary schools, as well as those developed to assess a student’s academic aptitude to perform successfully at a university (at both the undergraduate and graduate levels). The best-known primary and secondary school test batteries are the Iowa Tests of Basic Skills, the Metropolitan Achievement Tests, and the Comprehensive Tests of Basic Skills. Each of these instruments aims to provide thorough and integrative coverage of major academic skills and curricular areas and contains subtests covering different topics (e.g., reading, science) and grade ranges. The advantage of these batteries over earlier objective achievement tests is that their subtests have been normed on the same sample of students, which allows for relatively straightforward comparisons within and across individuals and groups. Collectively, these tests are referred to as achievement tests, emphasizing the retrospective purpose of the assessment. Their main goal is to gain information about a student’s learning accomplishments and to identify deficiencies as early as possible.

College admission tests, on the other hand, are often called aptitude tests because their main purpose is to make predictions about future academic performance. The two most widely taken exams are the Scholastic Aptitude Test (SAT) and the ACT Assessment (American College Testing Program), which are used mainly for undergraduate university admissions. Tests for admission to graduate and professional programs include the Graduate Record Examinations (GRE), the Graduate Management Admission Test (GMAT), the Law School Admission Test (LSAT), and the Medical College Admission Test (MCAT).

The GRE, SAT, GMAT, and LSAT all measure basic verbal, mathematical, and analytical skills acquired over long periods of time; however, good performance on these tests does not depend heavily on recently acquired content knowledge. On the other hand, the ACT, MCAT, GRE subject tests, and the SAT II tests do require knowledge in specific content areas, and thus they are much more closely tied to educational curricula. Consequently, it has been argued that, despite their prospective use, tests such as the ACT are more appropriately referred to as achievement tests. Yet, as many researchers have noted, the distinction between aptitude and achievement is a fine and perhaps unnecessary one. So-called aptitude and achievement test scores tend to correlate about .9 because individuals high in general ability also tend to acquire content knowledge very quickly. On the whole, it is safe to say that all of these tests measure an examinee’s current repertoire of knowledge and skills related to academic performance.

Military and Civil Service Classification Tests

Military classification tests are the earliest examples of standardized tests developed in the United States. As part of the World War I effort, a group of psychologists developed and implemented the Army Alpha and Army Beta exams, which were designed to efficiently screen and place a large number of draftees. High-quality multiple aptitude test batteries, such as the Army General Classification Test (AGCT), emerged during World War II and were instrumental in the area of aviation selection.

The most prominent successor of the AGCT, the Armed Services Vocational Aptitude Battery (ASVAB), is now widely used to select and classify recruits into hundreds of military occupational specialties. This is accomplished, in part, by using 10 subtests (general science, arithmetic reasoning, word knowledge, paragraph comprehension, numeric operations, coding speed, auto and shop information, mathematics knowledge, mechanical comprehension, and electronics information) to measure an array of specific skills rather than a few broad dimensions. What chiefly distinguishes the ASVAB from other general aptitude tests is its stronger mechanical-spatial emphasis and a unique speeded component, both of which enhance its usefulness in predicting performance in technical and clerical jobs.

In the civilian sector, the General Aptitude Test Battery (GATB) was developed by the U.S. Department of Labor in 1947 for screening and referral of job candidates by the United States Employment Service. The GATB uses 12 subtests to measure three general abilities (verbal, numerical, spatial) and five specialized factors, which include clerical perception, motor coordination, and finger dexterity. As with the ASVAB, the inclusion of these subtests, in addition to measures of math, verbal, and general mental ability, makes the GATB predictive of performance in a diverse array of occupations, ranging from high-level, cognitively complex jobs to low-level, nontechnical positions.

Licensure and Certification Exams

Licensure and certification exams represent the third type of standardized tests. These tests are similar to achievement tests in that they assess examinees’ knowledge and skills, but their main purpose is to determine whether examinees meet some minimal level of professional competency. Whereas achievement test scores are generally interpreted with respect to normative standards (e.g., a large representative group of examinees who took the test in 1995), licensure and certification exam scores are meaningful only in relation to a cut score that is tied directly to performance through a standard-setting procedure.

The most popular standard-setting procedure is the Angoff method (named for William H. Angoff), whereby subject-matter experts are asked to indicate the probability that a minimally competent professional would correctly answer each item. This information is combined across items and experts to determine the cut score used for licensure and certification decisions. The key is that scores are interpreted with respect to a defined set of skills that must be mastered. Consequently, in any given year, it is possible that all or no examinees will pass the test. In practice, however, passing rates are often similar from year to year because the average skill level of examinees and educational curricula are slow to change and because test developers may make small adjustments to passing scores to correct for rater effects and to ensure a steady flow of professionals into the field.
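The aggregation step described above can be illustrated with a minimal sketch. It assumes one simple combination rule: each expert's item probabilities are summed to give that expert's implied raw cut score, and these are averaged across the panel. The function name `angoff_cut_score` and the ratings are hypothetical, and operational programs typically add steps not shown here, such as rater discussion rounds and adjustments for rater effects.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw-score cut point from Angoff-style ratings.

    ratings[e][i] is expert e's judged probability that a minimally
    competent professional answers item i correctly.
    """
    if not ratings:
        raise ValueError("at least one expert is required")
    n_items = len(ratings[0])
    for row in ratings:
        if len(row) != n_items:
            raise ValueError("every expert must rate every item")
        if any(not 0.0 <= p <= 1.0 for p in row):
            raise ValueError("ratings must be probabilities in [0, 1]")

    # Summing one expert's probabilities gives the raw score that expert
    # expects from a minimally competent examinee.
    expert_cuts = [sum(row) for row in ratings]

    # Averaging across experts is one simple way to combine the panel
    # (an assumption of this sketch, not a prescribed rule).
    return mean(expert_cuts)


# Hypothetical panel: 3 experts rating a 4-item exam.
panel = [
    [0.50, 0.75, 1.00, 0.25],
    [0.50, 0.50, 1.00, 0.50],
    [0.75, 0.75, 0.75, 0.25],
]
cut = angoff_cut_score(panel)
print(cut)
```

Because the cut score depends only on the experts' judgments about the items, and not on how a particular group of examinees performs, nothing in the calculation fixes the proportion of examinees who pass.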

Although many licensure and certification exams still contain a number of multiple-choice items similar in form to those on traditional educational tests, some recently revised exams, such as the Architect Registration Examination (ARE) and the American Institute of Certified Public Accountants Exam (the CPA exam), also include some innovative simulation-type items that are designed to mimic the actual tasks performed by professionals in the field. For example, items might require examinees to locate information in an Internet database, enter values and perform calculations using a spreadsheet, design a structure or mechanical system, or write a narrative report conveying a problem and proposed solution to a client. These types of items not only increase the realism and face validity of the tests but also enhance the measurement of integrative, critical-thinking skills, which are difficult to assess using traditional items.

Current and Future Issues in Standardized Testing

For discussion purposes, standardized tests have been divided into three groups, but there are important issues that cut across domains. The greatest overall concern in standardized testing is fairness. Criticisms of standardized tests are fueled by differences in test scores across demographic groups. The popular belief is that these differences result from measurement bias (i.e., a psychometric problem with the instruments). However, most studies suggest that these differences result not from bias but from impact, a “true” difference in proficiency across demographic groups. For example, a recent study that examined the relative contributions of bias and impact to observed score differences on the ACT English subtest found that test bias (i.e., differential test functioning) was associated with only .10 of the observed total 12.6 raw score point difference across groups of Black and White examinees. Thus, impact, not bias, poses the biggest problem for college admissions decisions. To the extent that these findings are generalizable, it seems that fairness concerns are best addressed by devoting more attention to the motivational and educational factors influencing test performance rather than searching for a fundamental flaw in the assessment devices.

An issue that is closely connected with bias and fairness is test validity. Many critics have argued that standardized tests do not predict academic or on-the-job performance, and so other types of assessments should be used. However, predictive efficacy is complicated by measurement artifacts (e.g., range restriction and unreliability) that limit the size of the correlations between standardized test scores and performance criteria. Meta-analytic studies, which attempt to correct for these artifacts, have demonstrated that standardized tests are valid predictors of a wide array of outcomes. Four-year grade point averages and work samples do provide comparable validities, but they involve observation over a much longer period of time and, more importantly, make normative comparisons difficult when examinees come from very different backgrounds. On the other hand, tests such as the GRE and SAT make it possible to assess thousands of examinees in a single testing session and provide a common yardstick for comparing examinees from urban schools and community colleges to the most prestigious and selective institutions.
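To make the role of these artifacts concrete, the sketch below applies two standard textbook corrections: the correction for attenuation due to criterion unreliability, and the correction for direct range restriction on the predictor (often called Thorndike's Case II). The function names and the numerical values are hypothetical. The text does not say which corrections or what order of application the cited meta-analyses used, so this is an illustration of the idea rather than a reproduction of their method.

```python
import math


def correct_for_criterion_unreliability(r_xy, r_yy):
    """Disattenuate an observed validity r_xy for criterion reliability r_yy."""
    if not 0.0 < r_yy <= 1.0:
        raise ValueError("criterion reliability must be in (0, 1]")
    return r_xy / math.sqrt(r_yy)


def correct_for_range_restriction(r_xy, u):
    """Correct r_xy for direct range restriction on the predictor.

    u is the ratio of the unrestricted (applicant) standard deviation to
    the restricted (selected group) standard deviation of test scores.
    """
    if u <= 0.0:
        raise ValueError("u must be positive")
    return (r_xy * u) / math.sqrt(1.0 + r_xy ** 2 * (u ** 2 - 1.0))


# Hypothetical values: an observed validity of .30 in a selected sample.
observed = 0.30
print(correct_for_criterion_unreliability(observed, r_yy=0.64))
print(correct_for_range_restriction(observed, u=2.0))
```

Both corrections raise the estimate whenever the criterion is measured with error or the selected group is more homogeneous than the applicant pool, which is why uncorrected correlations tend to understate predictive validity.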

Another issue in standardized testing that has received considerable attention among researchers and test developers is the desire to make exams more accessible to test takers while maintaining a reasonable level of test security. Historically, most standardized tests were offered only a few times per year in a proctored group session format. Security was handled by coordinating testing sessions nationally, using at least one new form per administration, and limiting the public disclosure of items and answers. A test taker who missed a testing session, or who knew in advance that he or she could not attend, typically had to wait several months for the next opportunity. Needless to say, examinees viewed such timing constraints unfavorably.

Fortunately, advancements in computer technology and psychometric theory now offer many solutions to this problem. Perhaps the most promising development is the widespread availability of computerized adaptive tests (CAT), which allow each examinee to receive a unique sequence of items chosen from a large item pool; items are selected individually or in groups, in real time, to provide near-maximum information about an examinee’s estimated proficiency level. Because the number of items in the testing pool is usually very large (sometimes in the thousands) and item-selection algorithms incorporate stochastic features that provide exposure control, it is unlikely that an examinee would encounter overlapping items upon retesting. Hence, unless there is a substantial coordinated effort among test takers to expose the pool, test security can be maintained reasonably well while offering exams on a more frequent, flexible basis than was possible with paper-and-pencil formats. A related benefit is that scores can be given to examinees immediately upon test completion. Examples of standardized tests that now use some variation of CAT technology are the GRE, ASVAB, and CPA exams.
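The item-selection idea described above can be sketched as follows. The sketch assumes a two-parameter logistic (2PL) item response model and a simple "randomesque" exposure-control rule, in which the next item is drawn at random from the `k` most informative unused items. The names `item_information` and `select_next_item`, the item pool, and the choice of model and exposure rule are all assumptions for illustration; operational programs use their own, generally more elaborate, algorithms.

```python
import math
import random


def item_information(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_next_item(theta, pool, administered, k=5, rng=None):
    """Pick the next item for an examinee with estimated proficiency theta.

    pool maps item id -> (a, b). The item is drawn at random from the k
    most informative items not yet administered, so that examinees of
    similar proficiency do not all see the same sequence.
    """
    rng = rng or random
    candidates = [item for item in pool if item not in administered]
    if not candidates:
        raise ValueError("item pool is exhausted")
    ranked = sorted(
        candidates,
        key=lambda item: item_information(theta, *pool[item]),
        reverse=True,
    )
    return rng.choice(ranked[:k])


# Hypothetical three-item pool with equal discrimination.
pool = {"i1": (1.0, -1.0), "i2": (1.0, 0.0), "i3": (1.0, 1.5)}
first = select_next_item(0.0, pool, administered=set(), k=1)
second = select_next_item(0.0, pool, administered={first}, k=1)
print(first, second)
```

With `k=1` the rule reduces to pure maximum-information selection. Larger values of `k` give up a little measurement precision per item in exchange for lower item overlap across examinees.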

The last concern in standardized testing is the emerging desire to broaden the scope of aptitudes and skills measured by standardized tests. This effort is being driven largely by organizations that use test score information to make important personnel or admissions decisions. The use of innovative simulation-type items, such as those in the ARE and CPA exams, seems to allow for the assessment of skills that are difficult, if not impossible, to measure using traditional multiple-choice items.

In addition, some testing programs (e.g., military) are seeking to augment cognitive test batteries with subtests measuring noncognitive variables, such as personality and vocational interests, in order to improve not only the prediction of performance but also outcomes such as retention, organizational loyalty, and group cohesion. Of course, making these variables a fundamental part of the decision-making process is not easy because noncognitive assessments are notoriously susceptible to several forms of response distortion (e.g., faking). However, given the number and quality of studies currently being conducted to address this issue, the day when noncognitive subtests become a key component of standardized test batteries may not be far away.

Conclusion

Standardized tests play an important role in American society. The information provided by these tests facilitates the diagnosis, screening, and classification of large numbers of examinees from diverse backgrounds. Standardized tests were created with the aims of precision, efficiency, and predictive efficacy in mind, and many researchers and practitioners argue that the tests embody these ideals well, particularly in comparison to other types of psychological assessments. Although this entry has focused on standardized testing in the United States, other countries will likely face similar issues as global competition demands more efficient screening and placement of individuals in emerging economies.

References:

  1. Drasgow, F. (2002). The work ahead: A psychometric infrastructure for computerized adaptive tests. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 1-35). Hillsdale, NJ: Lawrence Erlbaum.
  2. Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127, 162-181.
  3. Murphy, K. R., & Davidshofer, C. O. (2005). Psychological testing: Principles and applications (6th ed.). Upper Saddle River, NJ: Prentice Hall.
  4. Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262-274.
  5. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item/test functioning (DIF/DTF) on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89, 497-508.
  6. Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Prentice Hall.
