Computer-Assisted Testing

Computer-assisted testing is the use of computers to support assessment and testing processes. This entry focuses on the history, varieties, and future directions of computer-assisted testing.

History

Computer-assisted testing began in the early 1950s when optical scanners were adapted to read special answer sheets and score tests. This resulted in the widespread use of multiple-choice tests in a variety of testing applications. As mainframe computers became more available, the use of computers in testing expanded.

The first expansion focused on extracting more information from scores on tests with multiple scores. Thus, in addition to scoring tests, computers began to interpret test scores and analyze test data. Score profiles on a number of tests were interpreted by experts, and their knowledge was embodied in computer-generated interpretive reports for instruments scored on multiple scales. Notable examples include the Minnesota Multiphasic Personality Inventory and the Strong Interest Inventory. Interpretive reports have been expanded and improved over the years and are in prominent use today for a number of educational and psychological instruments.

The second expansion occurred in the late 1960s. As computers became more accessible in education, mainframe computers were equipped with multiple terminals that could display information on cathode-ray tube (CRT) screens and accept responses by keyboard. These “dumb” terminals were connected to mainframe computers by dial-up modems that functioned at speeds of 10 to 30 characters per second. Rudimentary “time-sharing” software polled the terminals for responses and transmitted information to the terminals. These hardware configurations gave rise to the first generation of computer-assisted instruction (CAI).

CAI in the 1960s and 1970s consisted of computers functioning as “page turners,” with very basic branching logic to support the instructional process. A screen was presented to the student, the student made a response, and rudimentary computer software determined the next screen to present to the student. Computer-based testing, using the same page-turning approach, was a natural result of this process.

Initially, time-shared computers administered tests on a question-by-question basis. However, communication between terminals and mainframe computers was very slow. The response times of the time-shared systems were unpredictable, and sometimes delays of a minute or more occurred between test questions. This problem seriously affected the standardization of the testing process and the acceptability of CAI. As a consequence, neither CAI nor computer-based testing was very successful in those years.

The development of the minicomputer in the early 1970s was the major hardware advance that allowed computer-assisted testing to flourish. Minicomputers were small (relative to mainframes, but large by today’s standards) and provided a single user with complete access to the hardware and software. As a consequence, software dedicated to the testing process could be written and run independently of other applications. This allowed almost complete control over the system response time between test questions and faster throughput time, resulting in better standardization of the testing process. These capabilities were further enhanced as the personal computer (PC) became widely available in the mid-1980s. Today’s PCs using multithreading and high-speed microchips allow computers to perform extensive computations in fractions of a second.

Varieties of Computer-Assisted Testing

Conventional Testing

The simplest application of computers in test delivery is the administration of conventional tests in which all examinees receive the same test questions in the same order, usually a question at a time. Although this seems like a trivial advance over paper-and-pencil tests, it has a number of advantages. First, all instructions are presented by computer, prior to the examinee receiving the test questions, typically along with some practice questions. This ensures that each examinee has read and understood the instructions. Second, scores can be made available to the examinee or test administrator immediately after completion of the test. Furthermore, all examinee responses are recorded electronically, thus eliminating the need to optically scan test answer sheets. The amount of time it takes the examinee to respond to each question can be recorded. This information can be useful in evaluating the examinee’s attention to the task, and it provides information about the examinee’s processing time that might be useful for evaluating his or her performance. No paper is used in the testing process, thereby reducing the expense of reproducing the test materials and filing paper records. Finally, the testing process can be enhanced with audio, video, and color, thus making it possible to measure traits not easily measured in paper-and-pencil test administration.

Branched or Response-Contingent Testing

Branched or response-contingent testing is useful in measuring variables that can be evaluated through a problem-solving scenario or sequence of steps. In this approach, a problem situation is presented to the examinee with a number of alternatives. Each alternative “branches” to a different second stage in the problem-solving process. Each subsequent question branches in turn, leading to further changes in the situation presented to the examinee. As a consequence, each examinee can follow a different pathway through the problem-solving process, some of which lead to an appropriate solution to the problem whereas others do not.

These “situational” tests are typically scored in terms of the adequacy and efficiency with which an examinee arrives (or does not arrive) at a solution to the presenting problem. Perhaps the most successful implementation of computer-assisted branched tests is in medical training. In this application, hypothetical patients are presented to medical students, along with information that they can access about the “patient.” The student attempts to “cure” the patient by ordering various medical tests and evaluations, drawing conclusions from the information made available interactively through the test, and requesting additional information as needed. The exercises vary in level of difficulty and in the information supplied to challenge the student’s knowledge and skills.
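The branching logic behind such situational tests can be sketched as a small state graph. The scenario, node names, options, and scoring below are hypothetical illustrations (loosely echoing the medical-training example), not drawn from any actual test:

```python
# Hypothetical branched scenario encoded as a graph: each state offers
# options, and each option branches to a different next state.
BRANCHES = {
    "start":  {"prompt": "Patient reports chest pain.",
               "options": {"order_ecg": "ecg", "prescribe_rest": "worse"}},
    "ecg":    {"prompt": "ECG shows an abnormality.",
               "options": {"refer_cardiology": "solved", "discharge": "worse"}},
    "worse":  {"prompt": "The patient's condition deteriorates.",
               "options": {"order_ecg": "ecg"}},
    "solved": {"prompt": "Problem resolved.", "options": {}},
}

def run_branched_test(choices):
    """Follow an examinee's choices through the scenario.

    Returns (reached_solution, steps_taken): situational tests are typically
    scored on whether, and how efficiently, a solution is reached.
    """
    state, steps = "start", 0
    for choice in choices:
        options = BRANCHES[state]["options"]
        if choice not in options:   # invalid action: the path ends unsolved
            break
        state = options[choice]
        steps += 1
        if state == "solved":
            break
    return state == "solved", steps

# An efficient path solves the problem in two steps; a detour takes three.
print(run_branched_test(["order_ecg", "refer_cardiology"]))                    # (True, 2)
print(run_branched_test(["prescribe_rest", "order_ecg", "refer_cardiology"]))  # (True, 3)
```

Because every alternative maps to its own next state, two examinees making different choices traverse entirely different pathways, which is what distinguishes this design from a fixed-order conventional test.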

Partially Adaptive Testing

Adaptive tests are designed to adapt to each examinee as the testing process is implemented. Branched or response-contingent tests are adaptive in that sense, but partially and fully adaptive tests take this process further.

Partially adaptive tests operate from a bank of questions that is structured by difficulty. The simplest of these tests consists of subsets of questions grouped into short tests, or testlets, comprising questions of differing average difficulty levels. A testlet of medium difficulty is administered, one question at a time, and immediately scored. Examinees who score high on the testlet then receive a more difficult testlet. Those who score low are then administered an easier testlet. If only two testlets are given to an individual, the test is a two-stage test. A multistage test involves the administration of three or more testlets, with the difficulty of each subsequent testlet based on the examinee’s score on the previous testlet.

In the testlet approach, branching is based on the examinee’s score on each testlet. One variation of this approach involves branching after each question is administered. This allows examinees to move more quickly toward questions that are consistent with their ability level. Other possible partially adaptive structures have also been developed, but they are seldom used because they do not make good use of a question bank. The exception is branched testlets that are used for measuring skills such as reading comprehension, where a number of questions are asked about a given reading passage.
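The basic two-stage testlet routing described above can be sketched as follows; the testlets, answer keys, and routing cutoff are hypothetical, and `answer_fn` stands in for the examinee's responses:

```python
# Hypothetical question bank structured by difficulty: a medium routing
# testlet plus an easier and a harder second-stage testlet.
TESTLETS = {
    "medium": ["m1", "m2", "m3", "m4"],
    "easy":   ["e1", "e2", "e3", "e4"],
    "hard":   ["h1", "h2", "h3", "h4"],
}

def two_stage_test(answer_fn, cutoff=3):
    """Administer the medium testlet, score it, then branch.

    answer_fn(question) -> True/False simulates the examinee's response.
    Returns (second_testlet_name, total_number_correct).
    """
    # Stage 1: every examinee receives the medium-difficulty routing testlet.
    stage1 = sum(answer_fn(q) for q in TESTLETS["medium"])
    # Stage 2: high scorers branch to a harder testlet, low scorers to an easier one.
    second = "hard" if stage1 >= cutoff else "easy"
    stage2 = sum(answer_fn(q) for q in TESTLETS[second])
    return second, stage1 + stage2

# A simulated examinee who answers everything correctly is routed upward:
print(two_stage_test(lambda q: True))   # ('hard', 8)
```

A multistage test simply repeats the routing step: each additional testlet is chosen from the examinee's score on the previous one.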

Fully Adaptive Testing

Fully adaptive testing, based on a family of mathematical models called item response theory (IRT), is currently the most widely used approach to adaptive testing. A fully adaptive computerized adaptive test (CAT) has the following five requirements and characteristics:

  1. It uses a question bank in which all questions have been calibrated by an appropriate IRT model. The IRT family includes models for questions that are scored in two categories (e.g., multiple choice scored as correct or incorrect, true or false, yes or no) and rating scale questions that are scored in multiple categories.
  2. Preexisting information about each examinee (e.g., his or her school grade) can be used as a starting point for selecting questions.
  3. Questions are administered one at a time, and the examinee’s score is estimated after each question is answered.
  4. After each question is administered, the entire question bank is searched and the question that will provide the most precise measurement of that examinee (given the examinee’s score at that point in the test) is selected for administration.
  5. This process of selecting and administering a question and rescoring is repeated until a suitable termination criterion is reached. Fully adaptive CATs can be terminated when the examinee’s score reaches a prespecified level of precision, when there are no more useful questions in the bank for measuring a given examinee, or when the examinee has been reliably classified with respect to one or more cutting scores.
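The five steps above can be sketched as a minimal CAT under the two-parameter logistic (2PL) IRT model. The calibrated bank, parameter values, and termination settings below are hypothetical; items are selected by maximum Fisher information at the current score estimate, the score is re-estimated after every response, and testing stops when the standard error falls below a target or the bank is exhausted:

```python
import math

# Hypothetical calibrated bank: (discrimination a, difficulty b) per question.
BANK = [(1.2, -2.0), (1.0, -1.0), (1.5, 0.0), (1.1, 1.0),
        (1.3, 2.0), (0.9, -0.5), (1.4, 0.5), (1.0, 1.5)]

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information an item contributes at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Crude maximum-likelihood score estimate over a coarse grid."""
    grid = [g / 10.0 for g in range(-40, 41)]
    def loglik(theta):
        return sum(math.log(p_correct(theta, *BANK[i]) if correct
                            else 1.0 - p_correct(theta, *BANK[i]))
                   for i, correct in responses)
    return max(grid, key=loglik)

def adaptive_test(answer_fn, start_theta=0.0, se_target=0.5, max_items=8):
    """answer_fn(item_index) -> True/False simulates the examinee."""
    theta, responses, used = start_theta, [], set()
    while len(used) < max_items:
        # Step 4: search the whole bank for the most informative unused item.
        item = max((i for i in range(len(BANK)) if i not in used),
                   key=lambda i: information(theta, *BANK[i]))
        used.add(item)
        responses.append((item, answer_fn(item)))
        # Step 3: re-estimate the score after each response.
        theta = estimate_theta(responses)
        # Step 5: terminate once the score is sufficiently precise.
        test_info = sum(information(theta, *BANK[i]) for i, _ in responses)
        if test_info > 0 and 1.0 / math.sqrt(test_info) < se_target:
            break
    return theta, len(responses)

# Simulated examinee who answers items with difficulty below 0.8 correctly:
theta, n_items = adaptive_test(lambda i: BANK[i][1] < 0.8)
```

Operational CATs use larger banks and more refined estimators (e.g., Bayesian EAP scoring), but the select-administer-rescore cycle is the same.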

Fully adaptive CATs based on IRT are dramatically shorter than conventional tests, and they reduce the time required for test administration by 50% to 90%. They can measure individuals at much higher levels of precision than conventional tests of the same length. Furthermore, for tests with questions scored in one of two categories (e.g., correct or incorrect), most examinees will answer about 50% of the questions correctly regardless of how high or low their score is. Low-ability examinees are likely to experience the test as “easier” than similar tests that they have taken, because the CAT will have adapted to their ability level by giving them easier questions. Conversely, high-ability examinees will experience the test as more difficult than many they have taken. As a consequence, the “psychological environment” of the test is better equated for all examinees, resulting in an appropriately challenging testing environment. Fully adaptive CAT has been implemented in a number of major testing programs.

Sequential Testing

Many have referred to sequential tests as CATs, but they are a distinct set of procedures. Sequential tests are typically used to make a classification decision (e.g., to hire or not to hire, to graduate or not to graduate, or whether someone is or is not depressed) using one or more prespecified cutoff scores. Typically, the questions in the test are ranked in order of how much precision they contribute to making the classification decision. Then, the questions are administered in ranked order until a classification can be made. In contrast to fully adaptive CAT, questions are not selected based on the examinee’s trait level—indeed, sequential tests generally are not designed to measure continuous traits. Since test termination is individualized in sequential testing, however, sequential tests might differ in length among a group of examinees.
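One simple way to individualize termination is curtailment: stop as soon as the questions that remain cannot change the classification. The ranked item list and cutoff below are hypothetical:

```python
# Hypothetical bank, already ranked by how much precision each question
# contributes to the classification decision (most informative first).
RANKED_ITEMS = ["q7", "q3", "q1", "q5", "q2", "q8", "q4", "q6"]
CUTOFF = 5  # number-correct score required to be classified as "pass"

def sequential_classify(answer_fn):
    """Administer ranked questions until the classification is certain.

    answer_fn(question) -> True/False simulates the examinee's response.
    Returns ('pass' or 'fail', number_of_questions_used).
    """
    correct = 0
    for used, question in enumerate(RANKED_ITEMS, start=1):
        correct += answer_fn(question)
        remaining = len(RANKED_ITEMS) - used
        if correct >= CUTOFF:             # passing is already guaranteed
            return "pass", used
        if correct + remaining < CUTOFF:  # passing is no longer reachable
            return "fail", used
    return "fail", len(RANKED_ITEMS)

# Test length is individualized: clear cases terminate early.
print(sequential_classify(lambda q: True))   # ('pass', 5)
print(sequential_classify(lambda q: False))  # ('fail', 4)
```

Note that, unlike the CAT sketch, no trait level is ever estimated; the only output is the classification, and different examinees see different numbers of questions.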

Current Issues and Future Directions

Since the Internet became widely available in the late 1990s, a considerable number of tests have been delivered through it. Although 20 years of research demonstrated that rigorously designed computer-administered tests were equivalent to or superior to paper-and-pencil tests, the developers of most Internet or Web-based tests have given little thought to equivalence (i.e., Internet or Web-based tests have not been rigorously designed). Consequently, substantial differences might exist between tests delivered on a PC and those delivered through the Web. These differences can affect the standardization and validity of some tests. Some of these factors include:

  • Different browsers use different settings for fonts, colors, and other display characteristics to deliver Web-based tests. These potentially render a given question differently to different examinees. In addition, differences in screen size and resolution reduce the equivalence of Web-delivered tests to PC-delivered tests. On a PC, the test administration software standardizes the display for all examinees, and a standard monitor can be used throughout a testing room.
  • Web access and response time vary greatly from question to question. Some of the factors that affect response time include the speed of the examinee’s connection and the amount of traffic on the Web at the instant the examinee responds and receives a new question. Response time is further affected by the speed of the Web server and the other demands on the Web server. For CATs, the computational server time necessary to estimate trait level and select the next question is yet another factor that affects response time. By contrast, on a PC, only a single person is being tested at a time and between-question response time is virtually instantaneous, thus better standardizing test delivery.
  • When tests are administered in an uncontrolled environment, such as might occur with Web delivery, environmental variables present during test delivery can affect the test performance of individuals. A basic principle of well-standardized testing is that paper-and-pencil tests are to be administered in a quiet and comfortable environment. For the most part PC-based tests also have been administered in testing rooms with a carefully controlled environment. When tests are delivered through the Web, however, a wide variety of extraneous factors might be present that interfere with—and potentially invalidate—the resulting scores. In addition, when individuals take tests without supervision, it is impossible to know who is actually taking the test, what materials they are accessing during test administration, and who is assisting them during testing.

Clearly, considerable research needs to be done to evaluate the comparability of Web-delivered tests to PC-delivered and paper-and-pencil tests. Before Web-delivered tests can be assumed to be replacements for other testing modes, the effects of their lack of standardization and of the physical conditions of testing on test scores must be evaluated. Furthermore, Web-delivered tests must be delivered under supervised conditions to ensure test integrity and validity.

Although computer-assisted testing has made possible the development of new kinds of tests that can take advantage of the multimedia capabilities of PCs, that promise has yet to be realized. Very few computer-administered tests focus on the measurement of new abilities, skills, and personal characteristics that cannot be measured by paper-and-pencil tests. Unrealized possibilities include the development of tests to measure personality characteristics in new ways (e.g., using interactive scenarios and video) and new approaches to measuring individual differences in traits such as memory, reasoning, and complex perceptual skills. These developments, combined with fully adaptive CAT, will help computer-assisted testing to realize its full potential.

