Reliability

Reliability can be defined as the extent to which scores of a measure are free from the effect of measurement error. Measurement error is reflected in random deviations of the scores observed on a measure from respondents’ true scores, which are the expected values of respondents’ scores if they completed the measure an infinite number of times. Mathematically, reliability is quantified as the ratio of true score variance to observed score variance or, equivalently, the square of the correlation between true scores and observed scores. Based on these indexes, reliability can range from zero (no true score variance) to one (no measurement error).
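The two equivalent definitions above can be written out under the usual classical assumptions. The notation is introduced here for illustration only and is not part of the original text: X is the observed score, T the true score, E the measurement error, and the error is assumed to be uncorrelated with the true score.

```latex
\begin{align*}
X &= T + E, \qquad \operatorname{Cov}(T, E) = 0
  \;\Rightarrow\; \sigma_X^2 = \sigma_T^2 + \sigma_E^2 \\[4pt]
\rho_{XT} &= \frac{\operatorname{Cov}(X, T)}{\sigma_X \sigma_T}
           = \frac{\sigma_T^2}{\sigma_X \sigma_T}
           = \frac{\sigma_T}{\sigma_X} \\[4pt]
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2} = \rho_{XT}^2
\end{align*}
```

When the true score variance is zero the ratio is zero, and when the error variance is zero the ratio is one, which gives the range stated above.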

Reliability is important for both practical and theoretical purposes. Practically, it enables estimation of the standard error of measurement, an index of accuracy of a person’s test score. Theoretically, reliability contributes to theory development by allowing researchers to correct for the biasing effect of measurement error on observed correlations between measures of psychological constructs and by providing researchers with an assessment of whether their measurement process needs to be improved (e.g., if reliability is low).
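As a rough illustration of these two uses, the sketch below applies the standard classical test theory formulas for the standard error of measurement and for the correction for attenuation. The function names and the numeric inputs are hypothetical and chosen only for the example.

```python
import math


def standard_error_of_measurement(sd_observed: float, reliability: float) -> float:
    """SEM = SD of observed scores * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_observed * math.sqrt(1.0 - reliability)


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the true score correlation from an observed correlation."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must be greater than 0 and at most 1")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs: a test with SD = 15 and reliability = .91,
# and an observed correlation of .30 between measures with
# reliabilities of .80 and .70.
sem = standard_error_of_measurement(15.0, 0.91)
corrected = correct_for_attenuation(0.30, 0.80, 0.70)
print(round(sem, 2), round(corrected, 2))
```

The corrected correlation is only as appropriate as the reliability estimates that go into it, so the choice of coefficient (discussed below) matters.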

Sources of Measurement Error

Multiple sources of measurement error can influence a person’s observed score. The following sources are common in psychological measures.

Random Response Error

Random response error is caused by momentary variations in attention or mental efficiency, or by distractions, within a given occasion. It is specific to the moment at which a person responds to an item on a measure. For example, a person might provide different answers to the same item appearing in different places on a measure.

Transient Error

Whereas random response error occurs within an occasion, transient error occurs across occasions. Transient errors are produced by temporal variations in respondents’ mood and feelings across occasions. For example, any given respondent might score differently on a measure administered on two occasions. Theoretically, such temporal differences are random, and thus not part of a person’s true score, because they do not correlate with scores from the measure completed on other occasions (i.e., they are occasion specific).

Specific Factor Error

Specific factor error reflects idiosyncratic responses to some element of the measurement situation. For example, when responding to test items, respondents might interpret item wording differently. Theoretically, specific factors are not part of a person’s true score because they do not correlate with scores on other elements (e.g., items) of the measure.

Rater Error

Rater error arises only when a person’s observed score (rating) is obtained from another person or set of persons (raters). Rater error arises from the rater’s idiosyncratic perceptions of a ratee’s standing on the construct of interest. Theoretically, idiosyncratic rater factors are not part of a person’s true score because they do not correlate with ratings provided by other raters (i.e., they are rater specific).

Types of Reliability Coefficients

Reliability is indexed with a reliability coefficient. There are several types of reliability coefficients, and they differ with regard to the sources of observed score variance that they treat as true score and error variance. Sources of variance that are treated as error variance in one type of coefficient may be treated as true score variance in other types.

Internal Consistency

This type of reliability coefficient is found most frequently in psychological research (e.g., Cronbach’s alpha, split-half). Internal consistency reliability coefficients, also known as coefficients of equivalence, require only one administration of a measure and index the effects of specific factor error and random response error on observed scores. They reflect the degree of consistency between item-level scores on a measure. Because all items on a given measure are administered on the same occasion, they share a source of variance (i.e., transient error) that may be unrelated to the target construct of interest but nonetheless contributes to true score variance in these coefficients (because it is a shared source of variance across items).
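A minimal sketch of how Cronbach's alpha is commonly computed from a single administration follows. It uses the usual formula, alpha = k / (k - 1) × (1 − sum of item variances / variance of total scores). The response matrix is invented for illustration (rows are respondents, columns are items).

```python
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondent rows of item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item (columns of the matrix).
    item_variances = [statistics.variance(item) for item in zip(*scores)]
    # Sample variance of each respondent's total score.
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from five people to three items.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 5, 5],
    [1, 2, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because every item here was answered on the same occasion, any transient error is shared across items and is counted as true score variance by this coefficient, as the paragraph above notes.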

Test-Retest

Test-retest reliability coefficients, also known as coefficients of stability, index the effects of random response error and transient error on observed scores. Test-retest coefficients reflect the degree of stability in test scores across occasions and can be thought of as the correlation between the same test administered on different occasions. Because the same test is administered on each occasion, the scores from each occasion share a source of variance (i.e., specific factor error) that may be unrelated to the target construct of interest but nonetheless contributes to true score variance in these coefficients (because it is a shared source of variance across occasions).

Coefficients of Equivalence and Stability

Coefficients of equivalence and stability index the effects of specific factor error, transient error, and random response error on observed scores. These coefficients reflect the consistency of scores across items on a test and the stability of test scores across occasions; they can be thought of as the correlation between two parallel forms of a measure administered on different occasions. The use of different forms enables estimation of specific factor error and random response error, and the administration on different occasions enables estimation of transient error and random response error. The coefficient of equivalence and stability can therefore be seen as a combination of the coefficient of equivalence and the coefficient of stability. Because it accounts for all three sources of measurement error, leaving none of them to contribute to the estimate of true score variance, it is the recommended reliability estimate for most self-report measures.
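The difference between the three coefficients can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not in the original text: each source of error is an independent normal deviate, specific factor error is modeled at the level of a whole form rather than individual items, and the standard deviations are arbitrary.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
N = 20000
# Assumed standard deviations of each variance source (hypothetical).
SD_TRUE, SD_TRANSIENT, SD_SPECIFIC, SD_RANDOM = 1.0, 0.5, 0.5, 0.5


def draw(sd):
    return [rng.gauss(0.0, sd) for _ in range(N)]


true = draw(SD_TRUE)
transient = {occasion: draw(SD_TRANSIENT) for occasion in (1, 2)}  # per occasion
specific = {form: draw(SD_SPECIFIC) for form in ("A", "B")}        # per form


def administer(form, occasion):
    """Observed score = true + transient + specific + fresh random error."""
    return [
        t + tr + sp + rng.gauss(0.0, SD_RANDOM)
        for t, tr, sp in zip(true, transient[occasion], specific[form])
    ]


a1, b1 = administer("A", 1), administer("B", 1)
a2, b2 = administer("A", 2), administer("B", 2)

ce = pearson(a1, b1)   # equivalence: different forms, same occasion
cs = pearson(a1, a2)   # stability: same form, different occasions
ces = pearson(a1, b2)  # equivalence and stability: different forms and occasions
print(round(ce, 2), round(cs, 2), round(ces, 2))
```

With these assumed variances, the coefficient of equivalence and the coefficient of stability each count one shared error source as true score variance, so both come out higher than the coefficient of equivalence and stability.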

Intrarater Reliability

Intrarater reliability coefficients—a type of internal consistency coefficient that is specific to ratings-based measures—index the effects of specific factor error and random response error on observed score variance. These coefficients reflect the degree of consistency between items rated by a given rater on one occasion. Because the items are rated by the same rater (intrarater) on the same occasion, they share two sources of variance (i.e., rater error and transient error) that may be unrelated to the construct of interest but nonetheless contribute to true score variance in these coefficients (because they are shared sources of variance across items).

Interrater Reliability

Like intrarater reliability coefficients, interrater reliability coefficients are specific to ratings-based measures. However, interrater reliability coefficients index the effects of rater error and random response error on observed score variance. They reflect the degree of consistency in ratings provided by different raters and can be thought of as the correlation between ratings from different raters using a single measure on one occasion. Because the same ratings measure is administered to different raters (interrater) on the same occasion, the ratings share two sources of variance (i.e., specific factor error and transient error) that may be unrelated to the target construct of interest but nonetheless contribute to true score variance in these coefficients (because they are shared sources of variance across raters).

Estimating Reliability Coefficients

Methods for estimating the coefficients just described are provided by two psychometric theories: classical test theory and generalizability (G) theory. Researchers who adopt a classical test theory approach to the estimation of coefficients often calculate Pearson correlations between elements of the measure (e.g., items, raters, or occasions) and then use the Spearman-Brown prophecy formula to adjust the estimate for the number of items, raters, or occasions across which observations on the measure were gathered. Conversely, researchers who adopt a G-theory approach first estimate the components of the reliability coefficient (i.e., true score variance, or universe score variance in G-theory terms, and error variance) and then form a ratio of these estimates to arrive at an estimated reliability coefficient (a generalizability coefficient in G-theory terms).
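The two approaches can be sketched side by side. The first function is the Spearman-Brown prophecy formula. The second is a simplified generalizability coefficient for a single-facet design, in which the error variance for one element is divided by the number of elements averaged. Function names and input values are hypothetical, and the variance components are taken as given rather than estimated from data.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Reliability of a composite of k elements, given the reliability
    (e.g., the average correlation) of a single element."""
    return k * r_single / (1.0 + (k - 1.0) * r_single)


def generalizability_coefficient(var_universe: float,
                                 var_error_single: float,
                                 n_elements: int) -> float:
    """Ratio of universe score variance to itself plus the error
    variance that remains after averaging over n_elements."""
    return var_universe / (var_universe + var_error_single / n_elements)


# Hypothetical case: two raters correlate .50; how reliable is the
# average of four such raters?
ctt_estimate = spearman_brown(0.50, 4)

# The same case in variance-component terms: a single-rater
# reliability of .50 implies equal universe score and error variance.
g_estimate = generalizability_coefficient(1.0, 1.0, 4)
print(round(ctt_estimate, 2), round(g_estimate, 2))
```

In this simple case the two routes agree; they diverge in practice mainly in how the components are estimated and in how many sources of error the design allows to be separated.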

Factors Affecting Reliability Estimates

Several factors can affect the magnitude of reliability coefficients that researchers report for a measure. Their potential impact on any given estimate must be considered in order for an appropriate interpretation of the estimate to be made.

Measurement Design Limitations

The magnitude of a reliability coefficient depends partly on the sources of variance that are treated as error. Unfortunately, not all measurement designs allow estimation of all types of reliability coefficients. Thus, even though a researcher may wish to consider a source of variance in his or her measure as error, it may not always be possible to account for it in the measurement design. For example, researchers cannot index the amount of transient error variance in observed scores if the measure (or at least parts of it) was not administered on multiple occasions. In such a case, the researcher may have to report a reliability coefficient that overestimates the true reliability of the measure.

Constructs Being Measured

Items measuring different constructs may be differentially susceptible to sources of measurement error. For example, items for broader constructs (e.g., conscientiousness) are likely to be more strongly affected by specific factor error than items for narrower constructs (e.g., orderliness). Similarly, items measuring stable personality constructs (e.g., the Big Five) may be less susceptible to transient error than items measuring affect-related constructs.

Heterogeneity of the Sample

It is well known that range restriction attenuates correlations between variables. Because reliability coefficients can be interpreted as the square of the correlation between observed scores and true scores, they, too, are subject to range restriction. Reliability estimates tend to be higher when they are obtained from a sample of persons who vary greatly on the construct being measured and lower when the persons in the sample do not vary greatly on the construct.
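One common way to see this effect is to assume that the error variance stays the same in both groups while the observed score variance shrinks in the more homogeneous sample. The sketch below rests on that assumption, which is not stated in the original text, and uses hypothetical values.

```python
def reliability_in_restricted_sample(rel_unrestricted: float,
                                     sd_unrestricted: float,
                                     sd_restricted: float) -> float:
    """Expected reliability in a sample with a different observed SD,
    assuming the error variance is unchanged across samples."""
    # Error variance implied by the unrestricted sample.
    error_variance = (1.0 - rel_unrestricted) * sd_unrestricted ** 2
    # The same error variance is a larger share of a smaller total variance.
    return 1.0 - error_variance / sd_restricted ** 2


# Hypothetical case: reliability of .90 in a broad sample (SD = 10),
# re-estimated in a homogeneous sample with half the spread (SD = 5).
restricted = reliability_in_restricted_sample(0.90, 10.0, 5.0)
print(round(restricted, 2))
```

Under these assumed values, the same measure appears considerably less reliable in the homogeneous sample even though its error variance has not changed.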

Test Length

Scores on a measure are typically formed by summing or averaging responses across items. Because specific factor errors associated with items are uncorrelated, their contributions to the observed score variance when summed or averaged diminish in proportion to the number of items included in the measure. Hence, all else being equal, the more items on the measure, the higher its reliability.
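The reasoning can be made explicit for the case of averaging. This is a simplified sketch that assumes each of the k items equals the same true score T plus an item-specific error E_i, with equal error variances and uncorrelated errors; the symbols are introduced here for illustration.

```latex
\begin{align*}
\bar{X} &= \frac{1}{k}\sum_{i=1}^{k} (T + E_i) = T + \frac{1}{k}\sum_{i=1}^{k} E_i \\[4pt]
\operatorname{Var}(\bar{X}) &= \sigma_T^2 + \frac{1}{k^2}\sum_{i=1}^{k} \sigma_E^2
  = \sigma_T^2 + \frac{\sigma_E^2}{k}
  \qquad \text{(cross terms vanish because } \operatorname{Cov}(E_i, E_j) = 0 \text{ for } i \neq j\text{)} \\[4pt]
\text{reliability of } \bar{X} &= \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2 / k}
\end{align*}
```

As k grows, the error term shrinks toward zero and the ratio approaches one, which is the same relationship expressed by the Spearman-Brown prophecy formula.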
