What is intelligence? That is hard to say. It seems to depend very much on who is asked. The subject has engaged thinkers for at least as long as people have been writing down their thoughts, and possibly for much longer. Intelligence-related terms are used all the time. If someone mentions a friend who is very smart, for example, the listener will surely have some general idea of what is meant. The concept of stupidity is equally intuitive. Coming to a real agreement on all the different things that entails, however, may prove much harder. Does being intelligent mean knowing a lot of facts? Does it mean being able to solve math problems quickly? Is fast reaction time important? Does a high score on an IQ test ensure survival for a week in the jungle with just a spear? Most introductory psychology textbooks attempt to provide a quick theoretical definition. Here’s a favorite: intelligence is “the capacity to understand the world and the resourcefulness to cope with its challenges.”
In psychological science it is very important to come up with an operational deﬁnition of a construct before studying it empirically. To operationally deﬁne something (or operationalize it) means to deﬁne it in a way that will allow its measurement. This principle has allowed psychologists to sidestep all the sticky philosophical arguments about the nature of intelligence by deﬁning it thusly: intelligence is what intelligence tests measure. That is the deﬁnition most research on intelligence that actually involves measurement has used, at any rate. This introduces both an elegant simplicity and an infuriating circularity to the argument, however. Consider the next logical question: what do intelligence tests measure? Intelligence, of course. What’s intelligence again? And so on.
Clearly, a better answer is needed to the question: What do intelligence tests really measure? Answering it may ﬁrst require a digression about psychological tests and how they work. A basic deﬁnition: a psychological test is an objective, standardized measure of a sample of behavior. Each portion of that deﬁnition is important, if intelligence tests are to be properly understood.
First, the term standardized: a standardized measure is a procedure that is carried out in exactly the same way every time somebody takes the test. This means that the instructions given must be the same, and all other test conditions such as time limit and type of location, as well as such subtleties as ambient temperature and lighting, should be kept as constant as possible.
To say that it is objective means that scoring is just as standardized as the rest of the test conditions. Personal opinions and feelings of the person scoring the test must not be allowed to inﬂuence the score that is given. This means that the manual for administering the test has to be very speciﬁc about which answers are correct and which are not, so no personal judgment is involved.
The third part of the deﬁnition is probably the most important, and the most frequently forgotten: a test score is simply a sample of behavior. It is not a measure, in other words, of the person’s overall ability in all things, but rather it is just a measure of how that person was able to perform on a particular occasion, at a particular place and time. Anyone who has ever taken an exam with insufﬁcient sleep or under really noisy conditions is aware that a test may not always give a true measure of a person’s ability.
Given this deﬁnition, it seems as though a psychological test should be easy to devise; people try all the time. A few things separate the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV), from the Cosmopolitan magazine’s “What kind of lover are you?” quiz, however. In addition to the criteria described above, to qualify as a good psychological test, an assessment instrument needs to establish reliability, validity, and adequate norms.
Test reliability means much the same thing as it does with people: consistency or dependability. If a person takes a test today that shows them to be well adjusted and normal, and the same test a week later reveals that person to be a likely serial killer, there is a reliability problem. Test-retest reliability is the most common measure, and it is just what it sounds like. It’s the mathematical correlation between the results of a test taken by a group of subjects at two different times. If a strong positive relationship exists between the two sets of scores, that means that people tended to do about the same both times, and therefore the test is reliable. Sometimes reliability is established using alternate forms instead: two different but equivalent versions of a test are constructed, and subjects then take both tests. If they perform about the same on both, the test is reliable. This one is actually fairly challenging to use, since it requires designing two versions of a test that really are just like each other. One other kind of reliability measure is widely used: split-half reliability. This is measured by dividing a test in half after people take it and comparing the score on the first half of the items to the score on the second half. This one is also very tricky to use, and it is useful only with tests that are trying to measure a single thing; otherwise there would be many different and non-equivalent ways to split the test into two sets of items.
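The two correlation-based measures described above can be sketched in a few lines of Python. This is only an illustration: the function names `pearson_r` and `split_half_r` are made up for this example, and all of the scores are invented rather than taken from any real test.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def split_half_r(item_scores):
    """Correlate each person's first-half total with their second-half total.

    item_scores holds one list per person, with 1 for a correct item
    and 0 for an incorrect one.
    """
    half = len(item_scores[0]) // 2
    first = [sum(person[:half]) for person in item_scores]
    second = [sum(person[half:]) for person in item_scores]
    return pearson_r(first, second)

# Invented scores for five people who took the same test a week apart.
first_session = [98, 105, 112, 87, 120]
second_session = [101, 103, 115, 90, 118]

# A value close to +1 means people scored about the same both times.
test_retest_r = pearson_r(first_session, second_session)
print(f"test-retest r = {test_retest_r:.2f}")
```

A correlation near zero on either measure would signal the kind of reliability problem described above.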
Another important gauge of a test’s usefulness is validity. This is often deﬁned as the extent to which a test measures what it claims to measure. A slightly more complex, yet far more precise deﬁnition comes from the Standards for Educational and Psychological Testing (a joint venture of the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education): “A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful.” This deﬁnition is important because the primary purpose of psychological tests, especially intelligence tests, is to make inferences about people.
Several different kinds of validity are important for intelligence tests. Predictive validity refers to the extent to which a test can predict future behaviors. This is measured by examining the correlation between the test and a future criterion; for example, SAT scores are used to predict college grade point averages. This is also sometimes called criterion-related validity, and the criterion used is often the score on another test of the same construct. A new IQ test, for example, will not be taken seriously if it is not shown that performance on the new test correlates highly with performance on a better-established test, like the WISC or Stanford-Binet. Content validity is the extent to which a test adequately samples the behaviors being measured. Consider a math class in which multiplication and long division have been the primary focus. Would a cumulative final exam that only covers subtraction be a valid measure of whether students have learned the material?
Finally, construct validity refers to the extent to which scores on the test actually represent the desired theoretical construct. A construct is just a broad, vague psychological concept, such as leadership ability or intelligence; generally speaking, something too complex to really be measured by a single number. Construct validity can be established in a variety of ways, including research to determine whether intervention effects or developmental changes have effects on test scores that are consistent with theory. For example, if there is a new measure of depression and a group of depressed people takes the test both before and after a treatment that is already well established as effective, then depression scores on the test had better go down. If the scores don’t, it clearly isn’t measuring what it was meant to measure. Similarly, every major IQ test has a vocabulary subtest. If the same test is given to a group of second-graders and a group of sixth-graders, the older children should certainly get higher scores. If they do not, clearly something other than vocabulary is being measured.
Now, assume a person has just taken a new intelligence test, and the score is a 50. What does that mean? Until further information about the test is available, it means nothing at all. The only way to know what a 50 means is to give the test to a whole lot of people and make up a frequency distribution to see how most people do. If the mean (average) on the test is a 30, the person who scored 50 did extremely well. If the mean is a 70, that person has done rather poorly. The results from giving the test to a lot of people to give meaning to scores are called norms. Based on norms, the average score on both of the major intelligence tests for children is 100, and all other scores are judged accordingly.
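The point about norms can be made concrete with a small Python sketch. The two norm samples below are invented so that their means match the 30 and 70 in the example above, and the helper name `standing` is hypothetical; real norm samples are far larger.

```python
from statistics import mean, stdev

def standing(score, norm_sample):
    """Return (z_score, percentile_rank) of a score relative to a norm sample."""
    z = (score - mean(norm_sample)) / stdev(norm_sample)
    below = sum(1 for x in norm_sample if x < score)
    percentile = 100 * below / len(norm_sample)
    return z, percentile

# Invented norm samples for two imaginary tests.
hard_test_norms = [20, 25, 30, 30, 35, 40]   # mean of 30
easy_test_norms = [60, 65, 70, 70, 75, 80]   # mean of 70

z_hard, pct_hard = standing(50, hard_test_norms)
z_easy, pct_easy = standing(50, easy_test_norms)

# The same raw score of 50 sits far above one group and far below the other.
print(f"hard test: z = {z_hard:+.2f}, percentile rank = {pct_hard:.0f}")
print(f"easy test: z = {z_easy:+.2f}, percentile rank = {pct_easy:.0f}")
```

The raw score is identical in both cases; only the comparison group changes its meaning.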
Intelligence tests have been around for about a century. Alfred Binet (1857–1911) created the first modern intelligence test in 1905, along with Théodore Simon. His reason for doing so is important to know: the Paris school system wanted to identify students who required remedial classes. At the time, special education (indeed, the idea of mental retardation itself) was very new, and France was far more proactive in seeking to help these children than the United States was during the same period. As French law came to require that children in need of special help should receive it, it became clear that an objective way of identifying those children was necessary (more objective, at least, than asking teachers to select the children and remove them to a different classroom). To put it more crudely, the test was designed to identify people of low intelligence, not to identify people of normal or high intelligence.
By 1916, however, there was an American version, the Stanford-Binet, written by Lewis Terman. From the original version to the present, the test has always yielded a single score, which is considered a measure of what is sometimes called g, or general intelligence. This is the famous IQ, and it stands for intelligence quotient. It used to really be a quotient: mental age over chronological age. For example, a five-year-old child with a mental age of five would have an IQ of 5/5, or 1. A five-year-old with a mental age of three would have an IQ of 3/5, or 0.6. To remove those pesky decimals and make the numbers easier to deal with, it became standard practice to multiply the result by 100, so the two foregoing examples would have IQs of 100 and 60, respectively. The Stanford-Binet no longer estimates mental age, but the term IQ is still widely used, even though the score is no longer literally a quotient. Also, the modern test produces more than just one score.
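The arithmetic of the old ratio IQ is simple enough to write out. The following Python sketch reproduces the two worked examples above; the function name `ratio_iq` is just a label chosen for this illustration.

```python
def ratio_iq(mental_age, chronological_age):
    """Historical ratio IQ: mental age over chronological age, times 100."""
    return round(100 * mental_age / chronological_age)

# The two five-year-olds from the text.
print(ratio_iq(5, 5))  # mental age equal to chronological age
print(ratio_iq(3, 5))  # mental age two years behind
```

The first call returns 100 and the second returns 60, matching the figures given above.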
The Stanford-Binet test is now in its fifth edition, and it follows a fairly complicated model of intelligence. The test is structured to measure five separate factors of intelligence in both verbal and nonverbal domains, making for a total of ten subtests. The five factors are: fluid reasoning, knowledge, quantitative reasoning, visual-spatial processing, and working memory. In addition to ten subtest scores, therefore, the test provides five factor scores and the familiar Full Scale IQ, as well as separate verbal and nonverbal IQ scores. Whereas the original test was intended for children, the current Stanford-Binet is normed on a sample ranging from two-year-old children all the way up to eighty-five-year-old adults.
The Stanford-Binet is the modern descendant of the original test, but the most frequently given of all psychological tests is its chief rival, the WAIS-III (Wechsler Adult Intelligence Scales, Third Edition). The WAIS originated in David Wechsler’s belief that the Stanford-Binet, having been designed for children, was not the best test for adults. Wechsler was working with adult psychiatric patients at Bellevue Hospital, and he quickly realized that the scoring system for the Binet test (and the early Stanford-Binet) simply made no sense with an adult population. While it may make sense to say that a seven-year-old has a mental age of five, for example, it would be meaningless to say that a thirty-eight-year-old patient has a mental age of only thirty-five. Wechsler therefore published his own test in 1939, designed for adults and scored using what he called a deviation IQ rather than a calculated mental/chronological ratio.
Scores were compared to a set of norms for the person’s age, and the score was assigned according to where that person stood in comparison to other adults his or her own age. The WAIS is normed on a sample ranging in age from sixteen to eighty-nine. It’s appropriate for ages sixteen through seventy-two. Its eleven subtests are organized into two scales, verbal and performance. When it is scored, it produces, in addition to a full-scale IQ, a verbal IQ and a performance IQ.
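A deviation IQ of the kind just described can be sketched as follows. The sketch assumes the Wechsler convention of a mean of 100 and a standard deviation of 15, and it assumes the norms for an age group can be summarized by a raw-score mean and standard deviation; the norm figures used here are invented, and the function name `deviation_iq` is hypothetical.

```python
def deviation_iq(raw_score, age_group_mean, age_group_sd, iq_mean=100, iq_sd=15):
    """Place a raw score on the IQ scale by its distance from the age-group mean."""
    z = (raw_score - age_group_mean) / age_group_sd
    return round(iq_mean + iq_sd * z)

# Invented norms for one adult age group: raw-score mean 50, standard deviation 10.
print(deviation_iq(50, 50, 10))  # exactly average for the age group
print(deviation_iq(60, 50, 10))  # one standard deviation above average
print(deviation_iq(30, 50, 10))  # two standard deviations below average
```

Under these assumptions the three calls return 100, 115, and 70. No mental age is involved at any point, which is why the approach works for adults.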
Following the success of the WAIS, Wechsler designed his own child tests as well, producing the WISC (Wechsler Intelligence Scales for Children, now in its fourth edition) and the WPPSI (Wechsler Preschool and Primary Scales of Intelligence, now revised and known as WPPSI-R). The WISC-IV is meant for ages six to sixteen, and the WPPSI covers the age range from three to seven years. Both the WISC-IV and WPPSI-R are very similar structurally to the WAIS, with a few differences among the subtests to reﬂect the ability differences between the age groups; and they produce the same pattern of IQ scores: full scale, verbal, and performance.
In addition to the Stanford-Binet and Wechsler tests, there are many others, but they are used a small fraction of the number of times that those two are used each year. Some fairly solid intelligence tests assess infants, the best of which is probably the Bayley Scales of Infant Development-II. The Bayley can be used with extremely young children, with norm tables that begin with the age of one month and range up to forty-two months. It is an ingenious test that mostly consists of engaging the child in age-appropriate play and carefully observing the child for a wide range of developmental milestones. IQ is actually quite unstable in the ﬁrst few years of life, and although high scores on the Bayley are no guarantee of anything, unusually low scores can accurately predict later test scores or school performance; for instance, the test does an excellent job of detecting children who will test in the mentally retarded range of IQ scores later in life.
Although these tests are all clearly based in a particular theoretical notion of intelligence involving multiple factors, they still primarily serve to produce a single number, usually still called IQ, that is intended to provide a global measure of general ability. In this way they largely fail, at least in how the scores are used, to reﬂect the wide diversity of opinions visible in the work of various intelligence theorists over the years. They do, however, reﬂect the views of one inﬂuential thinker, Charles Spearman, who proposed that intelligence consisted largely of a single underlying general ability that he called g. Beyond Spearman’s inﬂuential view lie many other points of view.
The very ﬁrst attempts to measure intelligence, predating Binet, come from the late nineteenth century, a period sometimes called the Brass Instruments era of psychology, named for the serious machine-shop skills necessary among psychologists who had to build all their own apparatus rather than purchase products that didn’t exist yet. The period was also when Sir Francis Galton and his American disciple, James McKeen Cattell, believed intelligence required keen sensory abilities and a fast reaction time, and they sought to measure intelligence indirectly by taking various physical measurements, especially reaction time.
This sensory keenness approach to intelligence largely died out around the time that Binet published his ﬁrst test, but it still has some modern adherents such as Arthur Jensen, who uses a device called the Reaction-Time/Movement-Time (RT-MT) apparatus in his attempts at culture-reduced study of intelligence. The device consists of a set of small buttons with lights next to them, arrayed in a fan shape on a console, with a single button at the base of the fan. The subject rests a hand on the button at the bottom and waits. When a light comes on, the subject moves that hand to strike the button adjacent to the light as rapidly as possible. The device measures both reaction time (how long the subject takes to remove the hand from the ﬁrst button) and movement time (the interval between taking the hand off the ﬁrst button and pressing the second one). Jensen claims fairly high correlations between these measurements and traditional measures of intelligence, but this technique has not caught on widely.
Raymond Cattell produced an inﬂuential intelligence theory in the 1940s by proposing that, rather than a single g factor, there are two major kinds of intelligence, which he called ﬂuid and crystallized intelligence. Fluid intelligence is nonverbal and fairly immune to cultural bias, consisting primarily of a person’s inherent capacity to learn and solve problems. This is the kind of intelligence required when a task calls for adaptation to a new situation. Crystallized intelligence consists of what a person has learned, and thus consists of knowledge rather than problem-solving skills (though ﬂuid intelligence is required to increase crystallized intelligence). Much of the research inspired by Cattell’s theory is concerned with the alleged decline of intelligence in old age. A large number of studies suggest that, while ﬂuid intelligence may decline with age, crystallized intelligence does not.
The Russian psychologist A. R. Luria proposed a very different theory that also relied on the concept of two different kinds of mental processing. Based on his studies of brain-injured soldiers, he decided there were two different kinds of mental activity reﬂected in intelligence tests: simultaneous and successive processing. Simultaneous processing is what occurs when a task requires the execution of several different mental operations at the same time. Spatial tasks are a good example. When drawing, a person has to grasp the overall shape of what is being drawn, while also drawing its components individually. Successive processing, in which only one mental operation is carried out at a time, makes sense for solving math problems, but would be a disaster as an approach to drawing even a simple shape like a triangle. The person would have to draw lines of predetermined speciﬁc lengths, at certain angles to each other, and simply hope that they lined up.
Some theories have been somewhat more complicated, however. In the 1930s, Thurstone proposed a set of seven primary mental abilities (PMAs), and Guilford in the late 1960s proposed his structure-of-intellect model, which could have as many as 150 distinct factors. Theories proposing multiple types of intelligence to replace the single general-factor approach have enjoyed a renaissance of sorts in recent years with Robert Sternberg’s triarchic model of intelligence. The triarchic model proposes the existence of three independent kinds of intelligence: analytical, creative, and practical. Analytical intelligence is the stuff tested by standard IQ tests that present clearly defined problems with a single acceptable “right” answer. Creative intelligence concerns whether a person reacts adaptively in novel situations and generates new ideas. Sternberg has criticized the standard tests for failing to test this at all. Practical intelligence, the other area neglected by the usual tests, is the intelligence used in dealing with everyday problems, which often have no single right answer, but rather a multitude of possible solutions, some better than others.
In 1983 Howard Gardner pushed the number of kinds of intelligence higher still in his theory of multiple intelligences. According to Gardner, there are at least seven relatively independent kinds of human intelligence: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, and intrapersonal. He further states that the number has not been definitively established, and there may be more. He has recently proposed several additional candidates: naturalistic, spiritual, and existential intelligence. While these have so far proved very difficult to measure, Gardner’s ideas have proven quite popular with educators, because they represent an escape from reliance on single test scores that may fail to value some of the very real skills and aptitudes that distinguish one human being from another.