An achievement test is any test designed to assess an individual’s attainment of specific knowledge or skills in a specified content area in which the individual has received some level of instruction or training. However, achievement tests are often confused with aptitude tests. Aptitude tests may not differ in form from achievement tests, but they typically differ in both use and interpretation. Aptitude tests are typically designed to estimate an individual’s future performance on a task and/or his or her aptitude to develop new skills or knowledge if provided instruction or training. Essentially, achievement tests assess current performance after specific training; aptitude tests assess the potential for future performance.
For more than 100 years, achievement and aptitude testing have steadily gained momentum and support from psychologists, educators, policymakers, and the general public. Recent laws at both federal and state levels clearly demonstrate the emerging importance of achievement and aptitude testing in a data-driven political system. However, despite this apparent support, considerable confusion remains about the nature, use, and appropriate interpretation of achievement tests.
Because achievement and aptitude tests are frequently used in combination with one another, a brief discussion of aptitude tests is warranted. The Sixteenth Mental Measurements Yearbook (MMY) groups aptitude and ability tests into a single classification (i.e., Intelligence and Scholastic Aptitude) that includes measures of general or specific knowledge and aptitudes or cognitive abilities. As mentioned, aptitude tests are essentially measurements of an individual’s performance on a selected task or tasks, which are then used to predict that same individual’s future performance. They can assist external parties in predicting performance in selection processes, and help individuals gain a better understanding of their abilities in making life decisions (e.g., career or educational choices). The MMY includes a plethora of assessments under this category, including those of verbal and nonverbal reasoning; critical, abstract, and creative thinking; cognitive and mental abilities (including traditional intelligence tests); memory aptitudes; and learning aptitude, potential, and efficiency.
The predictions made from aptitude test results are not always limited to tasks or situations similar to those initially measured; some aptitude tests are used to predict performance on seemingly unrelated tasks and skills, or in entirely different situations. For instance, high school students interested in particular careers might be given aptitude tests measuring their aptitudes for those careers. The students’ results can then be used to advise them about academic or training programs available beyond high school (rather than simply whether they should pursue the specified careers).
Aptitude tests can also vary in the number of aptitudes measured by a single instrument. Multiaptitude batteries are aptitude tests that measure a broad array of ability areas (e.g., verbal reasoning, numerical reasoning, and mechanical reasoning) during a single administration. These batteries are used primarily for intellectual, educational, and vocational assessment, and they are well suited to show individuals’ relative strengths and weaknesses. For that reason, multiaptitude batteries are generally more useful in career and academic counseling than are single-aptitude assessments.
One of the more common multiaptitude batteries is the Armed Services Vocational Aptitude Battery (ASVAB), which was first developed in 1966 and is now on forms 23 and 24. Most, if not all, new recruits to the U.S. Armed Forces take the ASVAB. It measures aptitudes for general academic areas and career areas that are involved in most civilian and military careers. Scores from the eight subtests can be used to locate possible career options in OCCU-FIND—a manual listing more than 400 occupations, including about 150 military careers.
Although multiaptitude batteries are more useful than single-aptitude tests in certain circumstances, there are also instances in which more specialized aptitude tests are preferable. For instance, broad-based multiaptitude tests such as the Wechsler Adult Intelligence Scale (WAIS) can predict a variety of cognitive and mental aptitudes. However, they do not measure all possible cognitive abilities, and they do not necessarily provide the most accurate predictions of future performance in specialized tasks, such as mechanics, art, and music. In fact, the MMY provides a classification of specialized assessment instruments (e.g., fine arts, mathematics, reading, science, and social studies) and lists both aptitude and ability tests within each group. The following examples provide some indication of the broad array of specialized aptitude tests available to psychologists:
- The Mechanical Aptitude Test is a 45-minute test that measures high school students’ and adults’ mechanical abilities, such as comprehension of mechanical tasks, use of tools and materials, and matching tools with operations.
- The O’Connor Finger Dexterity Test assesses psychomotor aptitudes (i.e., ability to perform bodily movements), and it is used to predict how well a person would be able to perform certain motor tasks in various situations (e.g., rapid assembly work, watch repair).
- The Meier Art Tests are examples of specialized assessments for artistic aptitude. Among these tests is one for Aesthetic Perception. This test presents an examinee with four versions of the same artistic work that differ along an important aesthetic dimension (e.g., proportion or form). The individual ranks the works in order of merit, and the results can be used to predict the individual’s future success in tasks involving these aesthetic concepts.
- The Seashore Measures of Musical Talents is a 60-minute assessment of musical aptitude. The assessment battery includes a listening test with six subtests measuring dimensions of auditory discrimination (e.g., pitch, loudness, rhythm, and tonal memory).
Admissions tests are some of the most commonly used assessments within the realm of aptitude and achievement tests, yet they are also the most difficult to define according to traditional definitions. Confusion can arise when applying the standard definition of either achievement or aptitude tests to scholastic assessments, because scholastic ability/aptitude tests combine the predictive goals of aptitude tests with the performance assessment goals of achievement tests. As such, it is not uncommon for classification systems to place admissions tests into a hybrid category.
The SAT is one of the three most common admissions tests and a prime example of the confusion about whether admissions tests are achievement or aptitude tests. Originally introduced in 1901, the SAT is now taken by over 2 million students annually and is accepted by nearly every American college and university as the entrance examination component of the admissions process. Debate about the purpose and usefulness of the SAT led to several changes in its name throughout the 20th century. The SAT was first introduced as the “Scholastic Achievement Test,” renamed the “Scholastic Aptitude Test” in 1941, and became the “Scholastic Assessment Test” in 1990. Following the 1994 revision, and continuing with the most recent 2005 revision, “SAT” is no longer an acronym. The test is presently known as the “SAT Reasoning Test.”
The other widely used hybrid tests are the ACT and Graduate Records Examination (GRE). Similar to the SAT, the “American College Test” was renamed “ACT” in 1996. Most American colleges and universities use the ACT and GRE, respectively, to make decisions about the admission of applicants to undergraduate and graduate programs of study.
Although hybrid tests incorporate elements of achievement tests, more traditionally defined achievement tests are clearly differentiated from aptitude and ability tests. The focus of achievement tests on measuring acquired knowledge makes them the primary type of instrument used in educational programs at all levels. Although this essential element is consistent across achievement tests, these tests can be further categorized using several nonexclusive characteristics.
Standardized Versus Nonstandardized Achievement Tests
One characteristic that can be used to distinguish among achievement tests is whether the test has been standardized. Standardized achievement tests are those that have been administered, revised, and tested to establish an average level of performance. Standardization allows an individual’s test results to be compared to those of other test takers. Because the individual’s achievement is compared to that of a reference group, scores on standardized achievement tests are generally reported as percentile ranks. Scores may also be reported as grade-level equivalents (e.g., an eighth-grade student who receives a grade equivalent of 10 has scored as well as the average tenth-grade student).
Although standardized tests are generally considered more robust and valid measures of achievement, the majority of achievement tests used in educational settings are nonstandardized. Such nonstandardized tests include exams, tests, and other instances where the intent is simply to indicate how much an individual learned, without referencing a specific performance standard established by a reference group. Toward this end, nonstandardized tests essentially assess individual achievement as a proportion of the maximum potential level of achievement, as defined by the trainer, educator, or external test developer. Nonstandardized tests can be scored more subjectively (e.g., essay tests and short-answer tests) or more objectively (e.g., multiple-choice and matching tests), but the ultimate score will always be a proportion of the total potential achievement. Typically, scores are reported as pass or fail, a percentage of the total possible score (e.g., 93%), a letter grade (e.g., A, B), or a number grade (e.g., 17 out of 32).
Norm-Referenced Versus Criterion-Referenced Achievement Tests
As mentioned above, standardized achievement tests require referencing an individual’s performance to an established standard level of performance. There are two methods for establishing these standardized performance levels: norm referencing (also known as nomothetic and standards referencing) and criterion referencing (also known as idiographic and domain referencing). Norm-referenced achievement tests compare each individual’s achievement to the achievement of others taking the same measure. As such, achievement level is based on the average performance of the norm group, rather than on the actual percentage of correct answers. To facilitate such comparisons of individual scores to the norm group, norm-referenced tests are typically created to mimic the normal curve. Individuals are then provided a scaled score or percentile rank according to the normal curve. Some of the most common norm-referenced tests are the California Achievement Test (CAT), Comprehensive Test of Basic Skills (CTBS), and Tests of Academic Proficiency (TAP).
There are several criticisms of norm-referenced achievement tests. For instance, because norm-referenced achievement tests are designed for national or international use, there is a possibility that the content being tested is not covered by the education or training actually provided to the individual. When this difficulty becomes salient, instructors sometimes change the material they teach, which leads to the criticism that some instructors are “teaching to the test.” In addition to content, critics note that the norms of many achievement tests are too old to measure achievement according to current standards and/or teaching methods. Furthermore, the norms may be too limited to provide meaningful normative comparisons for all demographic groups, specifically those of culture or ethnicity. There are also arguments that such assessments may sacrifice accuracy or breadth in order to ensure that examinees’ scores conform to a normal distribution. In addition, a mathematical property of the normal curve is that changes in the number of correct answers do not lead to the same change in the percentile rank for all individuals. These arguments have led major test makers to address criticisms through redesign and/or renorming of their achievement tests, and to emphasize that norm-referenced achievement tests should not be the sole basis for making critical decisions about students’ retention or graduation.
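The mathematical property noted above — that the same change in the number of correct answers does not produce the same change in percentile rank for all individuals — can be illustrated with a short sketch. The example below assumes a hypothetical test whose raw scores are normally distributed with a mean of 50 and a standard deviation of 10 (both values are illustrative, not drawn from any actual instrument):

```python
import math

def percentile(score, mean=50.0, sd=10.0):
    """Percentile rank of a raw score under a normal distribution
    (hypothetical test with mean 50 and SD 10)."""
    z = (score - mean) / (sd * math.sqrt(2))
    return 100 * 0.5 * (1 + math.erf(z))  # normal CDF via the error function

# A 5-point raw-score gain near the mean moves the percentile rank
# far more than the same 5-point gain in the upper tail.
gain_near_mean = percentile(55) - percentile(50)  # ~19 percentile points
gain_in_tail = percentile(75) - percentile(70)    # ~1.7 percentile points
print(round(gain_near_mean, 1), round(gain_in_tail, 1))
```

Near the mean, where most examinees cluster, a 5-point raw gain shifts the percentile rank by roughly 19 points; the same 5-point gain in the upper tail shifts it by less than 2 points. Identical improvements in performance thus translate into very different changes in reported rank.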
Although most achievement tests are norm referenced, their limitations have led to the continued use of criterion-referenced tests in certain situations. Criterion-referenced tests compare each individual’s performance to a predetermined standard or criterion level, rather than to a norm group. They focus on mastery of a given objective or skill, and typically include many items measuring a single objective. Because criterion-referenced tests are scored against an absolute standard, usually the percentage of correct answers, they are more common in the daily assessment of individuals in educational settings. Unlike norm-referenced achievement tests that force individuals into a normal curve, criterion-referenced tests do not limit the number of examinees who can demonstrate outstanding performance and mastery.
In an effort to draw upon the strengths of both norm-referenced and criterion-referenced tests, some achievement tests incorporate both standardization procedures. Scores on such tests are reported in terms of both how the examinees compare to others and how well they mastered the assessed content. For instance, the TerraNova (also known as the California Achievement Test, Sixth Edition or CAT/6) indicates both a student’s grade equivalency and that student’s level of mastery.
Individually Versus Group-Administered Tests
The majority of achievement tests can be administered to a group of individuals. This is particularly useful in educational settings where thousands of students might take the same instrument within a similar time frame. However, information such as behavioral observations can be obtained only during an individual test administration. Individualized achievement assessment is particularly useful for assessing the vocational rehabilitation of adults and the learning disabilities of children and adolescents. There are several individually administered achievement tests, as well as many that can be administered either to individuals or to groups.
Survey Achievement Batteries
Achievement tests also vary in terms of the number of achievement domains being assessed. Survey achievement batteries, which assess a broad array of areas, are the most widely used format. The survey achievement battery typically comprises a number of subject-based subtests. These batteries are most commonly used to assess achievement in the areas emphasized in K-12 education, thereby providing educators with information about student achievement across the educational curriculum with a single administration. One of the most popular survey achievement batteries is the Iowa Tests of Basic Skills, designed for students in kindergarten through Grade 8. This battery assesses achievement in areas such as vocabulary, reading comprehension, language, mathematics, spelling, science, maps and diagrams, and reference materials.
Examples of Achievement Tests
While there are hundreds of achievement tests, the following provide examples of the more frequently administered instruments:
- The Woodcock-Johnson® III (WJ III) is a widely used comprehensive system (i.e., hybrid battery) for measuring general intellectual ability (or g), specific cognitive abilities, scholastic aptitude, oral language, and academic achievement. These variables are measured through two distinct batteries: The WJ III Tests of Cognitive Abilities and the WJ III Tests of Achievement. The WJ III can be administered to any individual over the age of 2. Because of its breadth and its assessment of achievement, the WJ III is often used to diagnose learning disabilities, guide educational programs, assess growth, and identify discrepancies between an individual’s levels of aptitude and achievement.
- The Wechsler Individual Achievement Test—Second Edition (WIAT-II) is a test of reading and mathematics achievement that is suitable for individuals age 4 and older. The WIAT-II evaluates both the correctness of the response and the process by which the examinee arrived at the response, thus allowing for a more accurate assessment of problem-solving skills than other achievement measures. The WIAT-II is also conormed with the Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV), which is the most commonly used test of intellectual and cognitive functioning for children.
- The Wide Range Achievement Test 4 (WRAT4) is designed to assess individuals between the ages of 5 and 75. The WRAT4 assesses the achievement of reading, spelling, and arithmetic skills.
- The Kaufman Test of Educational Achievement, Second Edition (K-TEA II) provides a broad assessment of academic achievement. It can be administered in a longer (five-subtest) comprehensive form or a brief (three-subtest) screening form to students in first through twelfth grade. Both the comprehensive and brief versions provide an assessment of key academic skills in reading, mathematics, written language, and oral language.
Federally Mandated Statewide Achievement Tests
On January 8, 2002, the No Child Left Behind Act of 2001 (NCLB) was enacted, emphasizing standards and accountability within educational systems. The NCLB was intended to hold states, school districts, and schools accountable for the adequate education of American youth. Specific initiatives within NCLB were designed to assess and reduce the achievement gaps that existed between ethnic/racial minority students and majority group students, and among students from differing socioeconomic statuses. NCLB thus created the need for ongoing assessment of achievement on a national scale.
Since 1969, the National Assessment of Educational Progress (NAEP) has provided the only national assessment of students’ achievement in major academic areas. However, the NAEP originally assessed only a national sample of fourth-grade students every 2 to 4 years. Currently, under the NCLB, the NAEP assesses a national sample of fourth- and eighth-grade students every 2 years. There are, however, clear limitations to such a national norm-based test. In particular, it is impossible for a single measure to accurately test all individual academic standards as defined by each state’s education department. As such, the NCLB requires all states and territories receiving federal funding to develop and implement a procedure to independently obtain achievement data on all public school students.
The purpose of the state assessment requirement is to provide an independent, objective measure of the educational progress of each student, school, school district, and state/territory. These assessments are expected to measure how well each student achieves the individual state’s academic standards in reading, mathematics, and science. Academic standards have been developed by every state to indicate what students at particular grade levels should learn in specific subject areas. The underlying assumption is that students will perform well on the state achievement tests if teachers are competent and cover the material required by the standards. Hence, education departments argue that there should be no need to “teach to the test” and/or engage in specific test preparation or coaching.
The NCLB mandates that the state assessment procedure must include at least one criterion-referenced or norm-referenced assessment, may include multiple assessments, must address the depth and breadth of the state academic standards, and must be reliable and valid. Because all students are expected to achieve the same high levels of learning, the NCLB requires states to hold all public elementary and secondary school students to the same academic content and achievement standards. Thus, the same rigorous test must be used throughout the entire state. Beginning with the 2007-2008 academic year, all states must administer annual achievement tests for reading/language arts and mathematics in each of Grades 3 through 8, and at least once in Grades 10 through 12. In addition, annual achievement tests for science must be given at least once in each of the following: Grades 3 through 5, Grades 6 through 9, and Grades 10 through 12.
As with other high-stakes tests, data obtained from state achievement tests can have a significant effect on the future of the test takers, and aggregated results can have subsequent effects on schools, school districts, and even states. The most basic use of statewide achievement tests is to provide teachers and administrators with information about individual students and to tailor services to address a student’s demonstrated difficulties. The “high stakes” designation of such achievement tests comes from the fact that many states use a student’s performance to determine whether he or she has gained enough knowledge to progress to the next grade level. Although the Standards for Educational and Psychological Testing stress the importance of taking several points of data into consideration when making such decisions, many states allow achievement test results to override all other measures.
Statewide achievement data can also provide information on the curriculum being used and the quality of instruction being provided. In this way, the statewide assessment also becomes “high stakes” for teachers and their curriculum. Poor results could indicate a need to revise the curriculum to better achieve the desired academic standards, or a need to provide the teacher with additional training to improve the quality of instruction. In some states, teachers who have a large percentage of low-achieving students may be removed from their positions, as may principals and superintendents. Conversely, teachers who have a large percentage of high-achieving students sometimes receive additional monetary compensation as a reward for their success.
Statewide assessments also become “high stakes” for the schools and school districts themselves, as the NCLB requires that the scores of all eligible students be aggregated to determine whether the school or school district made “adequate yearly progress” during the course of a specified period of time, generally 2 years. In so doing, the state specifies a minimum level of improvement in student performance that schools must achieve. This minimum is based on the performance of the lowest-achieving demographic group or lowest-achieving schools in the state. The state then sets a threshold for adequate progress, and this threshold is raised at least once every 3 years. The goal is that, at the end of 12 years, all students in every state will demonstrate adequate levels of achievement on the respective state assessments. Schools that do not make adequate yearly progress can be required to develop a corrective action plan or to fund options for students to attend another school or to receive additional tutoring. In addition, the school district could initiate a restructuring that results in the replacement of all or most of the staff, or in the assumption of school operations by the state or a private company.
Pitfalls in Statewide Achievement Testing
Although achievement tests have demonstrated usefulness in a variety of situations, controversy and criticisms still exist about how these tests are developed, standardized, and utilized. One such criticism is that improvements in the reliability, accuracy, and validity of many achievement tests have narrowed their range of application. For instance, some test developers have improved norm referencing by standardizing their tests on multiple norm groups based on age, ethnicity/race, and seasonal norms. As a consequence, each form of the test is applicable to a more narrowly defined group. Therefore, it is important to select achievement tests that reliably and validly measure the intended subject and that yield results that are generalizable to the population being tested.
Although achievement tests are useful tools to help guide decisions, one of the strongest criticisms of achievement testing surfaces when educational decisions are made solely on the basis of achievement test results. Such criticism becomes particularly salient when critics focus upon differential treatment of cultural groups (e.g., ethnicity, income level, special needs). For instance, the American Civil Liberties Union (ACLU) has expressed concern about the federal requirement that the same test be given to those with special needs and/or limited English proficiency. Other critics argue that achievement tests do not uniformly indicate achievement across other cultural groups, such as ethnicity and gender. Although well intended, such criticisms are generally antiquated, as development or revisions of the most widely used achievement tests have addressed many previous criticisms about the differential applicability to various populations. Indeed, meta-analyses on hundreds of thousands of examinees indicate that the most widely used achievement tests do not differ in their predictive accuracy as a function of ethnicity or gender.
More specific to statewide achievement testing, critics have argued that such assessments evaluate the curriculum and instructional methods used by teachers, rather than students’ abilities to learn the information provided. Such critics argue that negative consequences of achievement test results (e.g., holding a student back a grade) could punish the student for the failure of the school or teacher. Critics further argue that this becomes particularly problematic in the nation’s poorest schools, where the quality of education is substantially lower than in more affluent schools. Although there is little research to refute the claim that curricula and teacher quality are positively correlated with student achievement, counterarguments to such criticism often reference research studies and meta-analyses that reveal a plethora of other variables affecting student achievement. Such variables include per-pupil expenditures, class size, parent involvement, student motivation, school attendance rate, and student satisfaction with school. Furthermore, it is important to note that the intent of statewide achievement testing is not to make educational decisions for individual students, but to identify schools and school districts that require additional assistance and/or resources to ensure students make gains in academic achievement.
Because achievement tests can measure only a limited amount of information, they are unable to assess the full range of information learned by the student or the student’s ability to apply that information in real-world situations. Relatedly, critics argue that achievement tests lead teachers to overemphasize memorization and de-emphasize thinking and the application of knowledge. Because of the limited nature of achievement tests, “teaching to the test” can be effective in increasing achievement scores, but it also narrows and weakens academic curricula. Critics argue that this can lead schools to remove courses that do not clearly promote the rote memorization necessary to score high on achievement tests (e.g., physical education, art, and music). Indeed, as the federal government has defined “improvement” in terms of achievement test results and has tied funding to this definition, critics argue that schools are shifting from education to test coaching. That said, proponents of statewide achievement testing note a distinct difference between “teaching to the test” and “teaching the test,” the former being when teachers align their curriculum to proven indicators of student success and the latter being when teachers provide only the exact information on the test. Indeed, proponents argue that teaching knowledge and skills based on a standard curriculum with specific indicators, which are then assessed by the achievement test, constitutes “curriculum alignment” and is believed to lead to higher quality schools and curricula.
For more than a century, achievement and aptitude tests have gained popularity and drawn considerable attention from psychologists, researchers, educators, and the general public. Although achievement tests were initially heavily criticized and limited in their applicability, the past four decades have seen dramatic improvements in their quality, reliability, validity, and generalizability. Such improvements have been further enhanced and expedited by state and federal funding in the wake of federal law requiring the statewide achievement testing of all public school children. Psychologists and educators can now choose from hundreds of research-based achievement tests with demonstrated reliability and validity. It is not surprising that the growing popularity and the high-stakes nature of achievement testing in many settings (e.g., educational, forensic, diagnostic) have led to new criticisms and a rehashing of older criticisms. However, nearly every criticism of achievement testing has been largely refuted by psychological and educational research. Ultimately, when selected, applied, and interpreted appropriately and professionally, achievement tests provide the best means for assessing the acquired knowledge and skills of all individuals in a wide variety of settings.
- Aiken, L. R., & Groth-Marnat, G. (2006). Psychological testing and assessment (12th ed.). Boston: Pearson Education.
- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). The Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
- American Psychological Association, Joint Committee on Testing Practices. (2005). Code of Fair Testing Practices in Education. Washington, DC: Author.
- Carpenter, S. (2001). The high stakes of educational testing. Monitor on Psychology, 32(4).
- Darling-Hammond, L. (1999). Teacher quality and student achievement: A review of state policy evidence. Seattle, WA: Center for the Study of Teaching and Policy.
- Gronlund, N. E. (1998). Assessment of student achievement (6th ed.). Needham Heights, MA: Allyn & Bacon.
- Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. The Journal of Economic Perspectives, 16(4), 91-114.
- Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3-13.
- National Council on Measurement in Education. (1995). Code of Professional Responsibilities in Educational Measurement. Madison, WI: Author.
- Resnick, M. (2003). NCLB action alert: Tools & tactics for making the law work. Alexandria, VA: National School Boards Association.
- Spies, R. A., & Plake, B. S. (Eds.). (2005). The Sixteenth Mental Measurements Yearbook. Lincoln, NE: Buros Institute of Mental Measurements.