A core interest of personnel psychology is whether some intervention in selection, training, or motivation relates to some criterion. A criterion is an evaluative standard that is used as a yardstick for measuring employees’ success or failure on the job. In many instances, the criterion of interest will be job performance, but a criterion could also be a particular attitude, ability, or motivation that reflects an operational statement of goals or desired outcomes of the organization. Implicit in this definition is that the criterion is a social construct defined by organization leaders who are responsible for formulating and translating valued organizational outcomes.

The Ultimate Criterion

Ultimately, we are interested in predicting the ultimate criterion, that is, the full domain of employees’ performance, including everything that defines success on the job. Given the totality of this definition, the ultimate criterion remains a strictly conceptual construct that cannot be measured or observed. To approach it, however, and to describe the connection between the outcomes valued by the organization and the employee behaviors that lead to these outcomes, J. F. Binning and G. V. Barrett introduced the concept of behavior-outcome links. In practice, these links should be based on a thorough job analysis, in the form of either a job description that analyzes the actual job demands or a job specification that reveals the constructs required for good performance.

The Operational Criterion

The conceptual nature of the ultimate criterion requires practitioners to deduce and develop the criterion measure or operational criterion, an empirical measure that reflects the conceptual criterion as well as possible. Using this operational criterion as a proxy for the conceptual criterion of interest, the usual approach in personnel psychology is to establish a link between performance on a predictor and performance on the operational criterion as an indication of the predictor’s criterion-related validity. Operational criteria might include the following:

  • Objective output measures (e.g., number of items sold)
  • Quality measures (e.g., number of complaints, number of errors)
  • Employees’ lost time (e.g., occasions absent or late)
  • Trainability and promotability (e.g., time to reach a performance standard or promotion)
  • Subjective ratings of performance (e.g., ratings of knowledge, skills, abilities, personal traits or characteristics, performance in work samples, or behavioral expectations)
  • Indications of counterproductive behaviors (e.g., disciplinary transgressions, personal aggression, substance abuse, or voluntary property damage)
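
The idea of criterion-related validity can be illustrated numerically: it is commonly expressed as the Pearson correlation between predictor scores and scores on the operational criterion. The following sketch computes such a validity coefficient from scratch; all scores are hypothetical, invented purely for illustration.

```python
# Illustrative sketch: criterion-related validity as the Pearson
# correlation between a predictor (e.g., a selection test) and an
# operational criterion (e.g., items sold). All data are hypothetical.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for eight employees.
test_scores = [52, 61, 45, 70, 58, 66, 49, 63]   # predictor
items_sold  = [30, 38, 27, 44, 33, 41, 29, 36]   # operational criterion

validity = pearson_r(test_scores, items_sold)
# A value near 1 indicates that employees who scored high on the
# predictor also tend to score high on the operational criterion.
```

The coefficient, of course, speaks only to the operational criterion; whether it says anything about the conceptual criterion depends on how well the two are linked.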

In practice, operational criteria should satisfy at least three independent requirements.

  1. Operational criteria must be relevant to the organization’s prime objectives. Although this may sound obvious in theory, in practice, criterion choices are often based on convenience (e.g., using data from performance records that are “lying around anyway”), habit, or copying what others have used. Though recorded output data such as sales volume might be easily accessible, they may represent a more suitable criterion measure for some organizations (e.g., car dealers striving for high short-term sales figures) than for others (e.g., car dealers that depend on the word-of-mouth recommendations generated by superior customer service).
  2. Operational criteria must be sensitive in discriminating between effective and ineffective employees. This requires (a) a linkage between performance on the operational criterion and the employee’s actual performance on the job (i.e., the link between the operational and the conceptual criterion) and (b) variance among employees. Speed of production, for example, may be an unsuitable criterion when speed is constrained by the pace of an assembly line; likewise, the number of radiation accidents is likely to be low across all nuclear power plant engineers, making this theoretically relevant criterion practically useless.
  3. Operational criteria need to be practicable: the best-designed evaluation system will fail if management is confronted with extensive data recording and reporting without seeing a commensurate return for the extra effort.
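
The sensitivity requirement in point 2 has a simple statistical face: a criterion on which all employees score identically has zero variance, and the validity coefficient, which divides by the criterion’s standard deviation, is then undefined. A short sketch with hypothetical numbers:

```python
# Illustrative sketch of the sensitivity requirement: a criterion with
# no variance among employees cannot discriminate, and no predictor can
# show criterion-related validity against it. All data are hypothetical.

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

line_output = [40, 40, 40, 40, 40]   # output paced by an assembly line
free_output = [34, 46, 40, 52, 38]   # output under self-paced work

# A Pearson validity coefficient divides by the criterion's standard
# deviation, so it is undefined when the criterion variance is zero;
# only the self-paced measure could serve as a sensitive criterion.
paced_spread = variance(line_output)
free_spread = variance(free_output)
```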

The Criterion Problem

The criterion problem describes the difficulties involved in conceptualizing and measuring how the conceptual criterion of interest, a construct that is multidimensional and dynamic, can best be captured by an operational criterion. This problem, according to Binning and Barrett, is even bigger if the job analysis on which the criterion is based is of poor quality and if the link between the operational and the conceptual criteria has been weakly rationalized.

Criterion Deficiency and Contamination

Any operational criterion will suffer from at least one of two difficulties: First, criterion deficiency is a formidable and pervasive problem because operational criterion measures usually fail to assess all of the truly relevant aspects of employees’ success or failure on the job. Second, operational criteria may be contaminated because many of the measures are additionally influenced by other external factors beyond the individual’s control. One of the most persistent reasons a measured criterion is deficient is the multidimensionality of the ultimate criterion, which combines static, dynamic, and individual dimensionality. Two other reasons for both contamination and deficiency are the unreliability of performance itself and the unreliability of performance observation. Finally, reasons associated primarily with criterion contamination are stable biases influencing criterion evaluations.

Static dimensionality implies that the same individual may be high on one facet of performance but simultaneously low on another. Thus, although an employee may do terrific work in terms of classical task performance (i.e., activities that transform raw materials into the goods and services produced by the organization or that help with this process), the same employee may show relatively poor contextual or organizational citizenship behaviors (i.e., behaviors that contribute to the organization’s effectiveness by providing a good environment in which task performance can occur, such as volunteering, helping, cooperating with others, or endorsing and defending the organization to outsiders). Moreover, the same individual might engage in counterproductive behaviors or workplace deviance (i.e., voluntary behaviors that violate organizational norms and threaten the well-being of the organization and its members, such as stealing, avoiding work, or spreading rumors about the organization).

Another aspect of static dimensionality addresses whether performance is observed under typical performance conditions (i.e., day-in, day-out, when employees are not aware of any unusual performance evaluation and when they are not encouraged to perform their very best) or under maximum performance conditions (i.e., short-term evaluative situations during which the instruction to maximize efforts is plainly obvious, such as work samples). The distinction is important because performance on the exact same task can differ dramatically between situations, not only in absolute terms but also in terms of employee ranking of performance. A likely reason is that typical versus maximum performance situations influence the relative impact of motivation and ability on performance.

Temporal or dynamic dimensionality implies that criteria change over time. This change may take any of three forms. First, the average of the criterion may change because performers, as a group, may grow better or worse with time on the job. Second, the rank order of scores on the criterion may change as some performers remain relatively stable in their performance while others strongly increase or decrease their performance over time. Third, the validity of any predictor of performance may change over time because of changing tasks or changing subjects. The changing task model assumes that because of technological developments, the different criteria for effective performance may change in importance while individuals’ relative abilities remain stable (e.g., a pub starts accepting a new billing method and expects employees to know how to handle it). Alternatively, the changing subjects model assumes that it is not the requirements of the task but each individual’s level of ability that changes over time (e.g., because of increased experience and job knowledge or decreasing physical fitness).

Individual dimensionality implies that individuals performing the same job may be considered equally good, but the nature of their contributions to the organization may be quite different. For example, two very different individuals working in the same bar may end up with the same overall performance evaluation if, for example, one of them does a great job making customers feel at home while the other one does a better job at keeping the books in order and the bills paid.

Criterion reliability, the consistency or stability with which the criterion can be measured over time, is a fundamental consideration in human resource interventions. However, such reliability is not always given. More precisely, there are two major sources of unreliability. Intrinsic unreliability results from personal inconsistency in performance, whereas extrinsic unreliability results from variability that is external to job demands or the individual, such as machine downtimes, the weather (e.g., in construction work), or delays in supplies, assemblies, or information (in the case of interdependent work) that may contaminate ratings at some times but not necessarily at others. Practically, there is little remedy for criterion unreliability except to search for its causes and to sample and aggregate multiple observations across the time span and the domain to which one wants to generalize.
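
Why aggregation helps can be shown with a small simulation: if a worker’s true performance level is constant but every single observation is perturbed by intrinsic and extrinsic noise, averaging many observations shrinks that noise. The noise magnitudes below are illustrative assumptions, not empirical values.

```python
import random

# Minimal sketch of why aggregating observations mitigates criterion
# unreliability. A worker's true performance is held constant; each
# observation adds intrinsic noise (personal inconsistency) and
# extrinsic noise (downtime, weather, supply delays). All numbers are
# illustrative assumptions.

random.seed(42)
TRUE_SCORE = 100.0

def one_observation():
    intrinsic = random.gauss(0, 8)   # personal inconsistency
    extrinsic = random.gauss(0, 6)   # factors beyond the worker's control
    return TRUE_SCORE + intrinsic + extrinsic

def aggregated(k):
    """Average of k independent observations of the same worker."""
    return sum(one_observation() for _ in range(k)) / k

def sd(values):
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

spread_single = sd([one_observation() for _ in range(200)])
spread_of_20 = sd([aggregated(20) for _ in range(200)])
# The spread of 20-observation averages is far smaller than the spread
# of single observations, so the average tracks the true score better.
```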

Besides lack of reliability in the criterion itself, lack of reliability in the observation is another cause of discrepancy between the conceptual and the operationalized criterion—that is, the criterion measurement may result in very different results depending on how it is rated and by whom. Thus, objective data and ratings by supervisors, peers, direct reports, customers, or the employee may greatly diverge in their evaluations of an employee’s performance for diverse reasons. Although an employee’s direct supervisor may seem to be the best person available to judge performance against the organization’s goals, he or she may actually observe the employee only rarely and thus lack the basis to make accurate judgments. Such direct observation is more frequent among peers, yet relationships (e.g., friendships, coalitions, in-groups versus out-groups) among peers are likely to contaminate ratings. Direct reports, in contrast, primarily observe the evaluated individual in a hierarchical situation that may not be representative of the evaluated individual’s overall working situation, and, if they fear for their anonymity, they, too, may present biased ratings. For jobs with a high amount of client interaction, clients may have a sufficiently large observational basis for evaluations, yet they may lack the insight into the organization’s goals needed to evaluate the degree to which employees meet these goals.

Operational criteria can be contaminated because of diverse errors and biases. Error, or random variation in the measurement, lowers the criterion reliability and is best addressed through repeated collection of criterion data. Biases, however, represent a systematic deviation of an observed criterion score from the same individual’s true score. Because they are likely to persist across measures, biases may even increase statistical indications of the operational criterion’s reliability. If the same biases are equally related to the measure used to predict the criterion (e.g., performance in a specific personnel selection procedure), they may also increase the predictor’s statistical criterion-related validity, even though this validity is not based on the actual criterion of interest but on the bias persisting across measurements.
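
The claim that a shared bias can inflate criterion-related validity can be demonstrated with a simulation: generate bias-free predictor and criterion components that are entirely unrelated, add the same bias term (e.g., a supervisor’s general impression) to both, and observe a sizable correlation nonetheless. All distributions are illustrative assumptions.

```python
import random

# Sketch (illustrative assumptions throughout): a bias shared by
# predictor and criterion inflates observed criterion-related validity.
# The bias-free components below are generated independently, so any
# observed correlation is produced entirely by the shared bias.

random.seed(7)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n = 500
bias = [random.gauss(0, 2) for _ in range(n)]        # shared rater bias
predictor = [random.gauss(0, 1) + b for b in bias]   # e.g., interview rating
criterion = [random.gauss(0, 1) + b for b in bias]   # e.g., job-performance rating

r_observed = pearson_r(predictor, criterion)
# r_observed is well above zero even though, stripped of the bias,
# predictor and criterion are unrelated.
```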

Among the diverse possible biases, biases resulting from knowledge of predictor information or group membership and biases in ratings are particularly prominent. If, for example, a supervisor has been involved in the selection of a candidate, his or her impression during the selection process is likely to influence the evaluation of this candidate’s later fit or performance on the job. Group membership may incur bias if the organization provides support to some groups but not others (e.g., management training for women only) or if different standards are established for different groups (e.g., quota promotions). Finally, biases in criterion ratings may result from personal biases or prejudices on the part of the rater, from rating tendencies (leniency, central tendency, or severity), or from the rater’s inability to distinguish among different dimensions of the criterion (halo). These effects will become more severe as employees’ opportunities to demonstrate their proficiency become more unequal and as the rater’s observation becomes more inaccurate.


References

  1. Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
  2. Cascio, W. F. (1998). Applied psychology in human resource management (5th ed.). Upper Saddle River, NJ: Prentice Hall.
  3. Guion, R. M. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum.