Test Translation and Adaptation

The translation and adaptation of psychological tests used for practice and research requires careful attention to issues of bias and equivalence. Thorough translation methods help reduce bias and enhance equivalence of multilingual versions of a test. Of equal importance is statistical verification of equivalence.

Equivalence addresses the question of comparability of observations and test scores across cultures. Lonner described four types: functional, conceptual, metric, and linguistic equivalence. These range from issues of comparability of behavior and concepts across cultures to issues of test item characteristics (form, meaning, structure). Van de Vijver also discussed four types of equivalence. Construct nonequivalence refers to constructs being so dissimilar across cultures that they cannot be compared. Construct equivalence occurs when a scale measures the same underlying construct and nomological network across cultural groups, although the construct may not be defined the same way. With measurement unit equivalence, the measurement scales for the instruments are equivalent (e.g., interval level), but their origins are different across groups. Equivalence at this level may limit comparability of two language versions of an instrument. The two versions may appear to share the same scale (both use interval scales), but because of differential familiarity with the response format used (e.g., a Likert scale), their origins are not identical. The same holds if the two cultural groups vary in response style (e.g., acquiescence). The highest level of equivalence is scalar equivalence, or full score comparability. Equivalent instruments at this level measure a concept with the same interval or ratio scale across cultures, and the origins of the scales are the same. At this level, bias has been ruled out and direct cross-cultural comparisons of scores on an instrument can be made.
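
The difference between measurement unit equivalence and scalar equivalence can be made concrete with a simple linear scoring model. The sketch below is only an illustration: the symbols X_g (observed score in cultural group g), T (standing on the construct), a_g (scale origin), and b_g (scale unit) are assumed notation and do not come from the sources discussed above.

```latex
% Hypothetical linear scoring model for cultural group g (assumed notation)
X_g = a_g + b_g\,T, \qquad g \in \{1, 2\}

% Measurement unit equivalence: same unit, different origins
b_1 = b_2 = b, \qquad a_1 \neq a_2

% Scalar equivalence (full score comparability): same unit and same origin
b_1 = b_2 = b, \qquad a_1 = a_2 = a

% Under measurement unit equivalence, within-group differences are comparable
X_g(T) - X_g(T') = b\,(T - T')

% but a cross-group difference is confounded with the offset between origins
X_1(T_1) - X_2(T_2) = (a_1 - a_2) + b\,(T_1 - T_2)
```

Under these assumptions, a difference in response style or in familiarity with the response format would show up as a nonzero offset a_1 - a_2, which is why direct score comparisons across groups require scalar equivalence.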

Bias negatively influences equivalence and refers to factors limiting comparability of test scores across cultural groups. Construct bias occurs when a construct is not identical across cultural groups (e.g., incomplete construct coverage). Method bias may limit scalar equivalence and can stem from specific characteristics of the instrument (e.g., differential response styles) or from its administration. Item bias can result from poor translation or item formulation, or from item content that is not equally relevant across cultural groups.

Use of proper translation procedures can minimize bias and help establish equivalence. The International Test Commission (ITC) published translation guidelines to encourage attention to the cross-cultural validity of translated or adapted instruments. The context guidelines emphasize minimizing construct, method, and item bias, and assessing construct similarity or equivalence across cultural groups before embarking on instrument translation. The development guidelines refer to the translation process itself, while the administration guidelines suggest ways to minimize method bias. The interpretation guidelines recommend verification of equivalence between language versions of an instrument.

Two general approaches have been identified for translating or adapting tests. In the applied approach, items are literally translated. Item content is not adapted to the new cultural context, and the linguistic and psychological appropriateness of the items is assumed. With the adaptation approach, some items may be literally translated while others require modification of wording and content to enhance their appropriateness to a new culture. This approach is chosen if there is concern about construct bias. For both approaches, attention to equivalence and absence of bias is important. Building on the ITC guidelines and the work of others, the following should be considered when translating or adapting tests.

Bilingual persons fluent in the original and target languages should perform the translation. A single person or committee can be used. Employing test translators who are familiar with the target culture, the construct being assessed, and principles of assessment minimizes item biases that may result from literal translations.

After the translation team has agreed on the best translation, the measure should be independently back-translated by additional person(s) into the original language. The back-translated version is then compared to the original for linguistic equivalence. If the two versions are not identical, the researcher works with the team to revise problematic items through a translation/back-translation process until agreement is reached about equivalence. This process, however, does not guarantee a good scale translation, as it often leads to literal translation at the cost of readability and naturalness of the translated version. To minimize this problem, an expert in linguistics should be consulted. Test instructions also need to go through the translation/back-translation process.

Once there is judgmental evidence of the equivalence of the two language versions, the translated scale needs to be pretested. One approach is administering both versions to bilingual persons. Item responses can then be compared using statistical methods. If item differences are discovered between versions, the translations are reviewed and changed accordingly. Additionally, a small group of bilingual individuals can be employed to rate each item from both versions on a predetermined scale with regard to the similarity of meaning conveyed. Problematic items are then refined until satisfactory.
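
As an illustration of the bilingual pretest, the following minimal sketch compares responses to each item across the two versions with a paired t statistic. The source does not say which statistical method to use, so the paired t test, the item names, the ratings, and the critical value are all assumptions made for the example.

```python
from math import copysign, sqrt
from statistics import mean, stdev

# Hypothetical critical value: two-tailed t, alpha = .05, df = 5 (six respondents).
T_CRITICAL = 2.571


def paired_t(original, translated):
    """Paired t statistic for one item answered in both versions by the same people."""
    diffs = [o - t for o, t in zip(original, translated)]
    avg, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # Every respondent shifted by the same amount: no shift gives 0, otherwise +/-inf.
        return 0.0 if avg == 0 else copysign(float("inf"), avg)
    return avg / (sd / sqrt(len(diffs)))


# Made-up 5-point ratings from six bilingual respondents: item -> (original, translated).
responses = {
    "item_a": ([4, 3, 5, 2, 4, 3], [4, 3, 4, 2, 5, 3]),
    "item_b": ([4, 4, 5, 3, 4, 5], [2, 3, 3, 2, 2, 3]),
}

t_values = {item: paired_t(orig, trans) for item, (orig, trans) in responses.items()}

# Items whose responses differ systematically between versions are sent back for review.
flagged = [item for item, t in t_values.items() if abs(t) > T_CRITICAL]
print(flagged)
```

A flagged item is only a candidate for review. With samples this small, and with ordinal ratings, the statistic is a screening aid and not a formal test of equivalence.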

A small sample of participants speaking the target language can also provide verbal or written feedback about each item. The researcher may, for instance, randomly select scale items and ask probing questions (e.g., What do you mean by your response?). Responses that do not fit an item are scrutinized and the translation is changed. This method provides insight into how well the meaning of the original items has fared in the translation. Another method involves respondents rating item clarity and appropriateness on a predetermined scale. Unclear or ill-fitting items are changed. Finally, a focus group approach can be used in which participants respond to the translated version and discuss with the researcher(s) the meaning they associated with the items and their perceptions of the items’ clarity and cultural appropriateness. Item wording can be changed based on participants’ responses.

Along with the judgmental evidence just mentioned, statistical analyses must be performed to verify equivalence and lack of bias. Cronbach’s alpha, item-total scale correlations, and item means and variances provide information about an instrument’s psychometric properties. Significantly different reliability coefficients across versions, for example, may indicate item or construct bias. Comparing these statistics across different language versions of an instrument offers preliminary data about equivalence.
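
The preliminary statistics named above can be computed with a few lines of code. The sketch below calculates Cronbach's alpha and corrected item-total correlations (each item against the sum of the remaining items) for two language versions. The data are invented for illustration, and the function names are hypothetical helpers, not part of any published procedure.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def corrected_item_total(rows):
    """Correlation of each item with the sum of all other items."""
    k = len(rows[0])
    return [
        pearson([r[i] for r in rows], [sum(r) - r[i] for r in rows])
        for i in range(k)
    ]


# Invented 4-item scale, six respondents per language version.
original = [[4, 5, 4, 5], [2, 3, 2, 2], [3, 3, 4, 3],
            [5, 4, 5, 5], [1, 2, 1, 2], [3, 4, 3, 3]]
translated = [[4, 4, 2, 5], [2, 3, 4, 2], [3, 3, 1, 3],
              [5, 5, 3, 4], [1, 2, 5, 2], [3, 4, 2, 3]]

for name, data in (("original", original), ("translated", translated)):
    print(name, round(cronbach_alpha(data), 3),
          [round(r, 2) for r in corrected_item_total(data)])
```

In this made-up example the third item of the translated version correlates negatively with the rest of the scale and pulls alpha down, which is the kind of pattern that would prompt a closer look at that item's translation.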

Construct, conceptual, and measurement equivalence can also be assessed at the scale level using factor analysis, multidimensional scaling, and cluster analysis. Scalar or full score equivalence is more difficult to establish than construct and measurement unit equivalence, and various biases (e.g., item and method bias) may threaten this level of equivalence. Item bias can be detected by studying the distribution of item scores for all cultural groups. Item response theory, in which differential item functioning is examined, may be used for this purpose, as can analysis of variance (ANOVA), logistic regression, multiple-group structural equation modeling (SEM) invariance analyses, and multiple-group mean and covariance structures analysis. Last, factors contributing to method bias can be assessed and statistically held constant when measuring constructs across cultures.
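
One simple way to compare factor-analytic results across language versions is Tucker's congruence coefficient, which indexes how similar two sets of factor loadings are. The source names factor analysis but not this particular index, so its use here is an assumption, and the loadings below are invented. The sketch also assumes the loadings come from comparable solutions (e.g., after rotating one solution toward the other).

```python
from math import sqrt


def tucker_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors for the same items."""
    num = sum(a * b for a, b in zip(loadings_a, loadings_b))
    den = sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    return num / den


# Invented loadings of five items on one factor in each language version.
original_loadings = [0.72, 0.68, 0.75, 0.61, 0.70]
translated_loadings = [0.70, 0.65, 0.20, 0.63, 0.71]

phi = tucker_phi(original_loadings, translated_loadings)
print(round(phi, 3))
```

Values close to 1 suggest that the factor is similar in both versions. Cutoffs of roughly .90 to .95 are often used as rules of thumb, but a single low-loading item, such as the third item here, can leave the coefficient fairly high, so item-level checks remain necessary.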

There are many examples of psychological measures translated from English to other languages. For instance, the Minnesota Multiphasic Personality Inventory-2 (MMPI-2), including the adolescent form (MMPI-A), is available in nearly 20 languages. Multilingual versions of the Myers-Briggs Type Indicator, Strong Interest Inventory, California Psychological Inventory, Sixteen Personality Factor Questionnaire (16 PF), Self-Directed Search, Millon Clinical Multiaxial Inventory-III, Revised NEO Personality Inventory (NEO PI-R), Hare Psychopathy Checklist-Revised, Beck Depression Inventory, State-Trait Anxiety Inventory, and Wechsler intelligence tests are also available.

Often, information about availability and psychometric properties of translations and adaptations of tests can be accessed from the tests’ developers or distributors. It is unclear, however, how translations of the measures mentioned above were performed and whether the tests were adapted for different cultural and linguistic contexts.

If one uses a test that has been translated into the language of a specific target population, but that has not been specifically developed and normed for that population, there is little to guarantee equivalence across such factors as item difficulty and relevance, cultural bias, comprehension/decoding, and validity within a differing cultural context. Beyond language, culture, and relevance, even the factor structure of a specific test cannot be assumed to hold in an adapted translation. This has, for instance, been observed in discussions regarding a 5- or 6-factor solution for the NEO PI-R in some specific cultural/linguistic adaptations. Psychologists worldwide, however, are striving to develop culturally sensitive and linguistically accurate translations of existing English-language instruments. They are also developing measures for particular national and ethnic populations.


  1. Ægisdóttir, S., Gerstein, L. H., & Çinarbaş, D. C. (2007, October 9). Methodological issues in cross-cultural counseling research: Equivalence, bias and translations. The Counseling Psychologist. Retrieved from http://tcp.sagepub.com/content/early/2007/10/09/0011000007305384.abstract
  2. Brislin, R. W. (1986). The wording and translation of research instruments. In W. J. Lonner & J. W. Berry (Eds.), Field methods in cross-cultural research (pp. 137-164). Beverly Hills, CA: Sage.
  3. Byrne, B. M. (2004). Testing for multigroup invariance using AMOS graphics: A road less traveled. Structural Equation Modeling: A Multidisciplinary Journal, 11, 272-300.
  4. Cleary, T. A., & Hilton, T. L. (1968). An investigation of item bias. Educational and Psychological Measurement, 28, 61-75.
  5. Hambleton, R. K. (2001). The next generation of the ITC test translation and adaptation guidelines. European Journal of Psychological Assessment, 17, 164-172.
  6. Hambleton, R. K., & de Jong, J. H. A. L. (2003). Advances in translating and adapting educational and psychological tests. Language Testing, 20, 127-134.
  7. Little, T. D. (2000). On the comparability of constructs in cross-cultural research: A critique of Cheung and Rensvold. Journal of Cross-Cultural Psychology, 31, 213-219.
  8. Lonner, W. J. (1985). Issues in testing and assessment in cross-cultural counseling. The Counseling Psychologist, 13, 599-614.
  9. van de Vijver, F. J. R., & Hambleton, R. K. (1996). Translating tests: Some practical guidelines. European Psychologist, 1, 89-99.
  10. van de Vijver, F. J. R., & Leung, K. (1997). Methods and data analysis for cross-cultural research. Thousand Oaks, CA: Sage.
