An important criterion on which psychological measures are judged is the degree to which their scores reflect persons’ true standing on an attribute of interest, such as cognitive ability and conscientiousness. Measurement theories recognize that scores on a measure reflect at least two components: a true component and an error component. Although theories differ in terms of the way they define these components, the degree of relation between them, and the types of error on which they focus, they all share a concern for measurement error. Generalizability theory (G-theory) is a measurement theory that provides methods for estimating the contribution of multiple sources of error to scores and quantifying their combined effect with a single index—a generalizability coefficient (G-coefficient).
Fundamentals of G-Theory
At the root of G-theory is the idea that the variability in persons’ scores because of error (i.e., error variance) can be partitioned into components, each reflecting a different source of error. For example, in attempting to measure a person’s level of interpersonal skill using an interview, error might arise from
- the specific question asked, such as differences in how interviewees interpret the question;
- the specific interviewer conducting the interview, such as differences in the familiarity of the interviewer with each interviewee; and
- the particular occasion on which the interview was conducted, such as the mood of the interviewee on the day of the interview.
All the differences noted previously could influence a person’s interview score for reasons that have nothing to do with the person’s interpersonal skills. By taking a fine-grained approach to examining error, G-theorists gain critical insight into the factors that decrease the quality of their measures.
Partitioning Variance in G-Theory
Within G-theory, variance in scores is typically partitioned through analysis of variance (ANOVA). The type of ANOVA conducted follows from the measurement design, which describes how a given attribute is measured. G-theorists describe measurement designs in terms of facets of measurement—the set of measurement conditions under which data on the objects of measurement (the entities being measured) are gathered. Continuing with the interview example, facets of measurement might include questions, interviewers, and occasions, whereas the objects of measurement would be interviewees. In G-theory, facets and objects of measurement serve as factors in an ANOVA model that is used to generate estimates of their contributions (as well as their interactions’ contributions) to variance in scores.
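For a simple crossed design (interviewees crossed with questions), these variance components can be recovered from the two-way ANOVA mean squares using the standard expected-mean-squares solution. The sketch below uses made-up ratings; the data and the resulting component estimates are illustrative only:

```python
import numpy as np

# Hypothetical interview ratings: rows = interviewees (objects of
# measurement), columns = questions (a single facet), fully crossed.
scores = np.array([
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 2],
    [4, 4, 5, 4],
], dtype=float)
n_p, n_q = scores.shape

grand = scores.mean()
p_means = scores.mean(axis=1)   # each interviewee's mean score
q_means = scores.mean(axis=0)   # each question's mean score

# Sums of squares for a two-way ANOVA without replication
ss_p = n_q * np.sum((p_means - grand) ** 2)
ss_q = n_p * np.sum((q_means - grand) ** 2)
ss_res = np.sum((scores - p_means[:, None] - q_means[None, :] + grand) ** 2)

ms_p = ss_p / (n_p - 1)
ms_q = ss_q / (n_q - 1)
ms_res = ss_res / ((n_p - 1) * (n_q - 1))

# Variance components from the expected mean squares
var_p = max((ms_p - ms_res) / n_q, 0.0)   # universe (true) score variance
var_q = max((ms_q - ms_res) / n_p, 0.0)   # question main effect
var_pq = ms_res                           # interaction + residual error

print(var_p, var_q, var_pq)
```

Negative component estimates are conventionally truncated to zero, as done here with `max`.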
Defining True Variance and Error Variance in G-Theory
Estimates of variance attributable to the object of measurement, facets, and their interactions are often referred to as variance components. The variance component associated with the object of measurement is interpreted as an estimate of true variance—the amount of variability in scores that is attributable to differences between objects of measurement (e.g., interviewees) on the attribute of interest (e.g., interpersonal skill). G-theorists refer to such variance as universe score variance. Whether a particular variance component is interpreted as error depends on the types of inferences the researcher wants to draw regarding the objects of measurement and the facets of measurement across which the researcher wants to generalize scores.
To illustrate this dependence, consider the interview example discussed earlier. If inferences are restricted to the relative ordering of interviewees on interpersonal skill, only those sources of variance that lead to different orderings of interviewees would be defined as error. In G-theory such error is referred to as relative error. Relative error is evidenced by interactions between the objects of measurement (e.g., interviewees) and the facets of measurement (e.g., questions, interviewers). For example, the larger the interviewee-by-question interaction, the more the relative ordering of interviewees on interpersonal skill differs depending on the question asked. When error is defined in relative terms, G-coefficients reflect the consistency with which the objects of measurement are ordered on the attribute of interest across facets such as interview questions, interviewers, and occasions. Technically, a G-coefficient is defined as the ratio of universe score variance to universe score variance plus error variance, and it ranges from 0 to 1.
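This ratio can be sketched numerically. For a one-facet design (interviewees crossed with questions), only the interaction component contributes to relative error, and averaging over more questions shrinks it. The variance components below are assumed values for demonstration, not estimates from real data:

```python
# Assumed variance components from a one-facet (questions) G-study;
# the numbers are illustrative only.
var_p = 0.50    # universe score variance (interviewees)
var_pq = 0.30   # interviewee-by-question interaction + residual

n_q = 6  # number of questions whose scores are averaged

# Relative error variance: only sources that reorder interviewees,
# divided by the number of questions averaged over
rel_error = var_pq / n_q
g_coef = var_p / (var_p + rel_error)
print(round(g_coef, 3))  # 0.50 / 0.55 -> 0.909
```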
Alternatively, someone might wish to make inferences about persons’ true standing on some attribute compared with a fixed standard such as a cut score or performance standard. Such absolute comparisons are labeled criterion-referenced comparisons. In this case any source of variation causing an observed score to differ from a true score would be defined as error. In G-theory this type of error is referred to as absolute error. Absolute error includes not only interactions between the objects and facets of measurement but also main effects of the facets (e.g., variation in mean interpersonal skill scores across questions because of differences in question difficulty). Facet main effects do not contribute to relative error because they do not affect the relative ordering of objects of measurement; rather, they only affect the distance between objects’ observed scores and true scores. When error is defined in absolute terms, the resulting coefficients (often called phi-coefficients) reflect an estimate of absolute agreement regarding the standing of the objects of measurement on the attribute of interest across facets of measurement.
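The difference between the two error definitions comes down to whether the facet main effect enters the denominator. Using the same kind of assumed, illustrative variance components as before, a phi-coefficient can be computed alongside the relative G-coefficient:

```python
# Assumed variance components from a one-facet (questions) G-study;
# all values are illustrative only.
var_p = 0.50    # universe score variance (interviewees)
var_q = 0.10    # question main effect (difficulty differences)
var_pq = 0.30   # interviewee-by-question interaction + residual

n_q = 6  # number of questions averaged over

rel_error = var_pq / n_q                 # reordering sources only
abs_error = (var_q + var_pq) / n_q       # also includes question difficulty

g_coef = var_p / (var_p + rel_error)     # relative G-coefficient
phi_coef = var_p / (var_p + abs_error)   # phi-coefficient

print(round(g_coef, 3), round(phi_coef, 3))  # 0.909 0.882
```

Because absolute error can only add variance to the denominator, the phi-coefficient is never larger than the corresponding G-coefficient.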
Decisions regarding which sources of variance are defined as error also depend on the facets across which the researcher wishes the scores to generalize. Returning to the interview example, to generalize interpersonal skill scores across questions, any inconsistency in the relative ordering of interviewees (or in mean score differences, if absolute error is a concern) across questions would be considered error.
Although the aforementioned example describes generalizing across a single facet (i.e., questions), there may be a need to generalize across other facets as well, such as interviewers. When considering two or more sources of error, there is the potential for interactions between the sources. For example, an interviewee’s interpersonal skill score may depend not only on the question used to assess interpersonal skill but also on the specific interviewer who rated the interviewee’s response to that question (i.e., an interviewee-by-question-by-interviewer interaction).
Limitations in Measurement Designs
A key insight made clear by G-theory is that not all measurement designs allow researchers to estimate the sources of error that may be of concern to them. For example, assume that in implementing the interview described previously, the same interviewer conducts one interview with each interviewee. Although error may arise from the particular interviewer used, as well as the particular occasion on which the interview was conducted, it is not possible to estimate the contribution of these sources of error to observed interview scores based on this measurement design. To determine whether the relative ordering of interviewees on interpersonal skill differs across interviewers or occasions, one would need to obtain ratings for each interviewee from multiple interviewers on multiple occasions. Thus the fact that this particular measurement design involved only one interviewer and a single administration of the interview prevents assessing the generalizability of interview scores across interviewers and occasions.
The measurement design in the aforementioned example is also problematic in that the estimate for true variance in interpersonal skill (if variance in observed interview scores were decomposed) partially reflects variance arising from the interviewee-by-interviewer and interviewee-by-occasion interactions. To eliminate the variance attributable to these interactions from the estimate of true variance requires multiple interviewers to rate each interviewee on multiple occasions. Thus just because a given measurement design prohibits estimating the impact of a source of error on observed scores does not imply that the error is eliminated from a measure. Indeed, the error is still present but hidden from the researcher’s view and, in this example, inseparable from the estimate of true variance.
G-Theory as a Process
When introduced by Lee J. Cronbach and his colleagues more than 40 years ago, the application of G-theory was conceptualized in terms of a two-step process for developing and implementing a generalizable measurement procedure. The first step in the process is to conduct a generalizability study (G-study). The purpose of the G-study is to gather data on a given measure using a measurement design that allows the researcher to generate estimates of all error sources of concern (and therefore avoid the limitations raised in the previous section). With such estimates, the researcher could estimate what the generalizability of the measure would be under various potential measurement conditions. For example, based on findings from the G-study, researchers could estimate the number of observations needed for each facet of their measurement design (e.g., number of questions, interviewers, occasions) to achieve a desired level of generalizability in their measure. The second step, called a decision study (D-study), would involve implementing the measurement procedure identified via the G-study to gather data on the persons for whom decisions are to be made.
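The projection step described above can be sketched as a simple search: given variance-component estimates from a G-study, increase the number of conditions on a facet until a target level of generalizability is reached. The components below are assumed, illustrative values:

```python
# D-study projection from assumed G-study variance components
# (illustrative values, not real estimates).
var_p = 0.50   # universe score variance (interviewees)
var_pq = 0.30  # interviewee-by-question interaction + residual

target = 0.80  # desired relative G-coefficient

# Find the smallest number of questions that reaches the target
for n_q in range(1, 11):
    g = var_p / (var_p + var_pq / n_q)
    if g >= target:
        print(n_q, round(g, 3))  # 3 questions suffice here: 3 0.833
        break
```

The same loop could be run over numbers of interviewers or occasions in a multifacet design, with the relevant interaction components entering the error term.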
Although practical constraints often obviate this two-stage approach, it can have substantial value for improving industrial/organizational (I/O) research. Specifically, it forces researchers to give forethought to, and acquire knowledge of, the sources of error that are of concern to them in their measures. With such knowledge researchers can take steps to improve measurement procedures by identifying where refinements might have the most impact (targeting the largest sources of error) prior to having to implement their measure to make real decisions (e.g., whom to hire, whom to promote). In other words, G-theory offers a clear process for building improved measures.
- Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
- Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
- DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models. Psychological Methods, 3, 412-423.
- DeShon, R. P. (2002). Generalizability theory. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data analysis (pp. 189-220). San Francisco: Jossey-Bass.
- Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: American Council on Education and Macmillan.
- Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.