Differential Item Functioning

Differential item functioning (DIF) is the preferred psychometric term for what is otherwise known as item bias. An item displays DIF when test takers who possess the same amount of an ability or trait, but belong to different subgroups, do not have the same likelihood of answering the item correctly. In other words, differentially functioning items elicit different responses from test takers of the same ability level. Because subgroups of test takers are often defined in terms of demographic membership (e.g., sex, race, socioeconomic status), items displaying DIF are sometimes considered “biased” against a particular subgroup. Consider, for example, the items on a standardized test of verbal ability. If the content of one item is sports related, then boys may have an unfair advantage over girls of the same general verbal ability level. The item favors boys because it measures a trait other than, or in addition to, verbal ability: in this case, sports knowledge.

It is important to note, however, that the mere presence of item score differences among subgroups does not necessarily indicate the presence of DIF. To return to the example of standardized tests, we would expect 12th-grade examinees to perform better on a verbal ability test than 9th-grade examinees taking the same test. Score differences between these groups would arise not because the test is biased against 9th graders, but because of true overall differences in verbal ability. A true between-group difference in ability is referred to as impact, which is conceptually distinct from DIF. Statistical procedures for DIF detection separate the two by comparing examinees who are matched on ability, so that differences stemming from item bias can be distinguished from differences stemming from true group differences.
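The distinction between impact and DIF can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and not part of the original discussion: the Rasch-type response function, the three ability levels, the group ability distributions, and the item difficulties. It shows that an item with identical difficulty for both groups can still produce a raw score gap (impact), while the gap at each matched ability level is zero; an item that is harder for one group at equal ability shows a gap at every matched level (DIF).

```python
import math

def p_correct(theta, difficulty):
    """Rasch-type probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical ability levels and the share of each group at each level.
ABILITIES = [-1.0, 0.0, 1.0]
GROUP_WEIGHTS = {
    "grade9": [0.5, 0.3, 0.2],   # assumed: more examinees at low ability
    "grade12": [0.2, 0.3, 0.5],  # assumed: more examinees at high ability
}

def marginal_p(group, difficulty_by_group):
    """Overall proportion correct for a group (the item p-value)."""
    b = difficulty_by_group[group]
    return sum(w * p_correct(t, b)
               for w, t in zip(GROUP_WEIGHTS[group], ABILITIES))

def conditional_gaps(difficulty_by_group):
    """Grade 12 minus grade 9 probability at each matched ability level."""
    return [
        p_correct(t, difficulty_by_group["grade12"])
        - p_correct(t, difficulty_by_group["grade9"])
        for t in ABILITIES
    ]

fair_item = {"grade9": 0.0, "grade12": 0.0}  # same difficulty: no DIF
dif_item = {"grade9": 0.5, "grade12": 0.0}   # harder for grade 9 at equal ability

raw_gap_fair = marginal_p("grade12", fair_item) - marginal_p("grade9", fair_item)
print("fair item, raw gap (impact):", round(raw_gap_fair, 3))
print("fair item, matched gaps:", conditional_gaps(fair_item))
print("DIF item, matched gaps:", [round(g, 3) for g in conditional_gaps(dif_item)])
```

The raw gap for the fair item comes entirely from the assumed ability distributions, which is why comparing raw proportions correct cannot by itself establish DIF.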

Levels of Analysis

Differential item functioning is a statistical phenomenon that can occur in any item of a test. As DIF accumulates across several items, it can produce differential functioning in clusters of items called bundles. The items constituting a bundle may refer to a common reading passage, assess a common skill, share the same grammatical structure, or be of the same item type. This cumulative effect of DIF is called DIF amplification, and it allows item-level effects to influence examinee scores at multiple levels of analysis. For instance, prior research on DIF amplification has demonstrated that items on a history test that individually favored females yielded a substantially greater advantage when examined together as a bundle. Subgroup differences on such bundles indicate the presence of differential bundle functioning (DBF).

In addition, item-level effects can be examined across all test items simultaneously. Differential test functioning (DTF) occurs when test takers of the same ability do not receive the same overall test score. Because modern researchers and practitioners of applied psychology are most often interested in a bottom-line verdict (e.g., does it make sense to compare the results of this employee opinion survey across male and female workers?), greater emphasis is placed on test-level analyses that allow items favoring one group to cancel the DIF of items favoring another group.
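The cancellation idea can be sketched numerically. The example below is a minimal illustration under assumed conditions, not a DTF procedure from the literature: it uses a hypothetical two-item test, a Rasch-type response function, and made-up group-specific difficulties. One item favors the reference group and the other favors the focal group, and the expected test score (the sum of the item probabilities) is compared across groups at a single ability level.

```python
import math

def p_correct(theta, difficulty):
    """Rasch-type probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical two-item test with group-specific difficulties.
ITEMS = {
    "item_a": {"reference": 0.0, "focal": 0.5},   # harder for focal group
    "item_b": {"reference": 0.0, "focal": -0.5},  # easier for focal group
}

def item_gap(name, theta):
    """Reference minus focal probability of success on one item."""
    d = ITEMS[name]
    return p_correct(theta, d["reference"]) - p_correct(theta, d["focal"])

def expected_score(group, theta):
    """Expected number-correct score: sum of item probabilities."""
    return sum(p_correct(theta, d[group]) for d in ITEMS.values())

theta = 1.0  # one matched ability level, chosen arbitrarily
gaps = {name: item_gap(name, theta) for name in ITEMS}
test_gap = expected_score("reference", theta) - expected_score("focal", theta)

for name, gap in gaps.items():
    print(name, "item-level gap:", round(gap, 3))
print("test-level gap:", round(test_gap, 3))
```

In this sketch the test-level gap is much smaller than either item-level gap, though not exactly zero; how completely opposing items cancel depends on the ability level and on the assumed item parameters.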

Detecting Differential Item Functioning

Early approaches to DIF detection relied on analysis of variance (ANOVA) models that treated DIF as an item-by-group interaction and focused on the percentage of examinees responding correctly to each item (p-values). This approach has been criticized as simply an omnibus test of p-value differences, which confound DIF with impact. At present, many DIF detection techniques exist that more effectively operationalize DIF. These techniques can be grouped into those that operate directly on the raw item responses of test takers (nonparametric methods) and those that evaluate the estimated parameters of item-response theory models (parametric methods). Among the more popular nonparametric techniques are the Mantel-Haenszel method, logistic regression procedures, and the simultaneous item bias test (SIBTEST). Popular parametric techniques include Lord’s chi-square statistic, the likelihood ratio technique, and the differential functioning of items and tests (DFIT) framework.
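As an illustration of how one of these methods conditions on ability, the following is a minimal sketch of the Mantel-Haenszel common odds ratio. It assumes examinees have already been grouped into strata by a matching variable such as total test score, and that each stratum is summarized as a 2 x 2 table of counts. The counts, the function names, and the stratum layout are hypothetical; the sketch also omits the accompanying chi-square significance test. The delta transformation shown (multiplying the log odds ratio by -2.35) is the scale commonly used at Educational Testing Service.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio pooled across matched-score strata.

    Each stratum is a tuple of counts:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator

def mh_delta(alpha):
    """Delta-scale index; negative values mean the item favors the reference group."""
    return -2.35 * math.log(alpha)

# Hypothetical counts for one item in low, middle, and high score strata.
item_strata = [
    (20, 30, 10, 40),
    (30, 20, 20, 30),
    (40, 10, 30, 20),
]

alpha = mantel_haenszel_odds_ratio(item_strata)
delta = mh_delta(alpha)
print("common odds ratio:", alpha)
print("delta-scale index:", round(delta, 3))
```

An odds ratio near 1 (a delta near 0) suggests that matched examinees in the two groups have similar odds of answering correctly. In the made-up data above, the reference group has higher odds of success within every stratum, so the pooled ratio exceeds 1.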

References:

  1. Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased items. Thousand Oaks, CA: Sage.
  2. Raju, N. S., & Ellis, B. B. (2002). Differential item and test functioning. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data analysis (pp. 156–188). San Francisco: Jossey-Bass.