Classical Test Theory

Measurement is the process of quantifying the characteristics of a person or object. Theories of measurement help to explain measurement results (i.e., scores), thereby providing a rationale for how they are interpreted and treated mathematically and statistically. Classical test theory (CTT) is a measurement theory used primarily in psychology, education, and related fields. It was introduced at the beginning of the 20th century and has evolved since then. The majority of tests in psychology and education have been developed based on CTT. This theory is also referred to as true score theory, classical reliability theory, or classical measurement theory.

Classical test theory is based on a set of assumptions regarding the properties of test scores. Although different models of CTT rest on slightly different sets of assumptions, all models share a fundamental premise: the observed score of a person on a test is the sum of two unobservable components, true score and measurement error. True score is generally defined as the expected value of a person’s observed score if the person were tested an infinite number of times on an infinite number of equivalent tests. The true score therefore reflects the stable characteristic of the object of measurement (i.e., the person). Measurement error is defined as random “noise” that causes the observed score to deviate from the true score.
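
In symbols, the decomposition described above can be written as follows. The subscripts p (person) and f (test form or administration) are notation assumed here for illustration and do not come from a particular CTT model.

```latex
X_{pf} = T_p + E_{pf}, \qquad
T_p = \mathbb{E}_f\!\left[X_{pf}\right], \qquad
E_{pf} = X_{pf} - T_p
```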

Assumptions of Classical Test Theory

Classical test theory assumes linearity; that is, the regression of the observed score on the true score is linear. This linearity assumption underlies the practice of creating tests from linear combinations of items or subtests. In addition, the following assumptions are often made in classical test theory:

  • The expected value of measurement error within a person is zero.
  • The expected value of measurement error across persons in the population is zero.
  • True score is uncorrelated with measurement error in the population of persons.
  • The variance of observed scores across persons is equal to the sum of the variances of true score and measurement error.
  • Measurement errors of different tests are not correlated.
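
As a brief sketch of how the first four statements follow from the definitions, write the observed score as X = T + E, with the true score T defined as the within-person expectation of X. The subscripts p (person) and f (form or administration) are notation assumed here for illustration.

```latex
\begin{aligned}
\mathbb{E}_f[E_{pf}] &= \mathbb{E}_f[X_{pf}] - T_p = 0
  && \text{(within a person)} \\
\mathbb{E}[E] &= \mathbb{E}_p\!\big[\mathbb{E}_f[E_{pf}]\big] = 0
  && \text{(across persons)} \\
\operatorname{Cov}(T, E) &= \mathbb{E}_p\!\big[T_p\,\mathbb{E}_f[E_{pf}]\big]
  - \mathbb{E}[T]\,\mathbb{E}[E] = 0 \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 + 2\operatorname{Cov}(T, E)
  = \sigma_T^2 + \sigma_E^2
\end{aligned}
```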

The first four assumptions can be readily derived from the definitions of true score and measurement error. Thus, they are shared by all models of CTT. The fifth assumption is also adopted by most models because it is needed to estimate reliability. All of these assumptions are generally considered “weak assumptions,” that is, assumptions that are likely to hold true in most data. Some models of CTT make additional, stronger assumptions that, although not needed for deriving most formulas central to the theory, provide estimation convenience:

  • Measurement error is normally distributed within a person and across persons in the population.
  • Distributions of measurement error have the same variance across all levels of true score.
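
The assumptions listed above can be checked numerically in a small simulation. The sketch below is illustrative only: the true score mean of 50, the standard deviations, the sample size, and the function name simulate_ctt are all hypothetical choices, and errors are drawn as normal with constant variance, that is, under the two stronger assumptions.

```python
import numpy as np


def simulate_ctt(n_persons=100_000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate observed = true + error with normal, homoscedastic errors.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    true = rng.normal(50.0, true_sd, n_persons)    # one true score per person
    error = rng.normal(0.0, error_sd, n_persons)   # same error variance at every true score
    observed = true + error                        # fundamental CTT decomposition
    return true, error, observed


true, error, observed = simulate_ctt()

# Expected value of error across persons should be close to zero.
mean_error = error.mean()

# True score and error should be (nearly) uncorrelated.
corr_true_error = np.corrcoef(true, error)[0, 1]

# Observed variance should be close to true variance plus error variance.
var_gap = observed.var() - (true.var() + error.var())

print(f"mean error:              {mean_error:.4f}")
print(f"corr(true, error):       {corr_true_error:.4f}")
print(f"var(X) - var(T) - var(E): {var_gap:.4f}")
```

In a finite sample the three quantities are only approximately zero; they shrink toward zero as the number of simulated persons grows.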

Important Concepts in Classical Test Theory

Reliability and Parallel Tests

True score and measurement error, by definition, are unobservable. However, researchers often need to know how well observed test scores reflect the true scores of interest. In CTT, this is achieved by estimating the reliability of the test, defined as the ratio of true score variance to observed score variance. Alternatively, reliability is sometimes defined as the square of the correlation between the true score and the observed score. Although they are expressed differently, these two definitions are equivalent and can be derived from assumptions underlying CTT.
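
The equivalence of the two definitions can be sketched in a few lines, using X = T + E and the assumption that true score and error are uncorrelated. The symbols for the variances and the correlation are notation assumed here.

```latex
\begin{aligned}
\operatorname{Cov}(X, T) &= \operatorname{Cov}(T + E,\, T)
  = \sigma_T^2 + \operatorname{Cov}(E, T) = \sigma_T^2 \\
\rho_{XT}^2 &= \frac{\operatorname{Cov}(X, T)^2}{\sigma_X^2\,\sigma_T^2}
  = \frac{\sigma_T^4}{\sigma_X^2\,\sigma_T^2}
  = \frac{\sigma_T^2}{\sigma_X^2}
\end{aligned}
```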

To estimate reliability, CTT relies on the concept of parallel test forms. Two tests are considered parallel if they have the same observed variance in the population of persons and any person has the same true score on both tests. If these conditions hold, it can be shown that the correlation between two parallel tests provides an estimate of the reliability of the tests.
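
The parallel-forms result can be illustrated with a short simulation. This is a minimal sketch under assumed values: the variable names, the true score and error standard deviations, and the normal distributions are hypothetical choices, not part of the theory itself. Two forms share each person's true score and have independent errors with equal variance, so they are parallel in the sense described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons = 200_000
true_sd, error_sd = 10.0, 5.0  # hypothetical values for illustration

# Each person has one true score, shared by both forms.
true = rng.normal(50.0, true_sd, n_persons)

# Parallel forms: same true score, independent errors with equal variance.
form_a = true + rng.normal(0.0, error_sd, n_persons)
form_b = true + rng.normal(0.0, error_sd, n_persons)

# Reliability as the ratio of true score variance to observed score variance.
theoretical_reliability = true_sd**2 / (true_sd**2 + error_sd**2)

# Correlation between the two parallel forms estimates the same quantity.
parallel_forms_r = np.corrcoef(form_a, form_b)[0, 1]

print(f"theoretical reliability:    {theoretical_reliability:.3f}")
print(f"parallel-forms correlation: {parallel_forms_r:.3f}")
```

With these assumed values the variance ratio is 100 / 125 = 0.8, and the simulated correlation between the two forms should fall close to that figure.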

Validity Versus Reliability

The definition of true score implies an important notion in CTT: the true score of a person on a measure is not necessarily the same as that person’s value on the construct of interest. Validity concerns how well observed scores on a test reflect a person’s true standing on the construct that the test is meant to measure. As such, validity is conceptually distinct from reliability. Reliability reflects the strength of the link between the observed score and the true score, whereas validity indexes the link between the observed score and the construct of interest. Reliability nevertheless sets an upper bound on validity: under CTT assumptions, the correlation between a test and any criterion cannot exceed the square root of the test’s reliability. Hence, a test with low reliability cannot have high validity.
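
The bound on validity can be made explicit. As a sketch, let Y be a criterion measure standing in for the construct of interest (a hypothetical stand-in, as are the symbols below), and assume that the measurement error in X is uncorrelated with Y. Writing the reliability of X as the ratio of true score variance to observed score variance:

```latex
\begin{aligned}
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(T + E,\, Y) = \operatorname{Cov}(T, Y) \\
\rho_{XY} &= \frac{\operatorname{Cov}(T, Y)}{\sigma_X\,\sigma_Y}
  = \rho_{TY}\,\frac{\sigma_T}{\sigma_X}
  = \rho_{TY}\,\sqrt{\rho_{XX'}} \\
|\rho_{XY}| &\le \sqrt{\rho_{XX'}}
  \qquad \text{since } |\rho_{TY}| \le 1
\end{aligned}
```

For example, a test with reliability 0.49 could correlate at most 0.70 with any such criterion.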

Beyond Classical Test Theory

As useful as it is, CTT has certain limitations. It has been criticized for its nonspecific concept of measurement error. Its assumption that the regression of observed score on true score is linear has also been questioned on both theoretical and empirical grounds. Accordingly, more sophisticated theories have been proposed to address those limitations. In particular, generalizability theory explicitly considers the contributions of multiple sources of measurement error to observed scores and offers methods for estimating those effects. Item response theory postulates a nonlinear regression of a person’s responses to a test item on his or her latent ability (a concept that is similar to true score in CTT). These measurement theories offer certain advantages over CTT, but they are more complex and depend on stronger assumptions. Therefore, CTT remains popular because of its simplicity and, more importantly, its robustness to violations of its basic assumptions.
