by Keith Wright, Ph.D., director of psychometrics, The Enrollment Management Association
Given the heightened focus on standardized testing in today’s media, it is not surprising that misperceptions exist about what these tests are designed to do. One of the major misconceptions is that standardized tests are not fair and are biased against particular groups.
No one can deny that average test performance varies across geographic, gender, racial, and ethnic groups. However, well-designed tests do not create differences; they only measure differences. From a testing/psychometric standpoint, these performance differences do not by themselves make a test “unfair” or “biased.” Those terms have very specific meanings in testing. This article describes how these terms have been defined and how organizations such as The Enrollment Management Association (EMA) and the Educational Testing Service (ETS) analyze tests for potential bias.
In any high-stakes decision process, fairness is paramount from both an ethical and a legal standpoint. There are many definitions of fairness in the field of testing. Here we adopt the concept defined by ETS: “Fairness requires treating people with impartiality regardless of personal characteristics such as gender, ethnicity, or ...disability. With respect to assessments, fairness requires that construct-irrelevant personal characteristics of test takers have no appreciable effect on test results or their interpretation” (ETS, 2002, p. 17). In other words, test takers’ performance on an assessment should depend on the abilities (constructs) being measured, not on their group membership (e.g., male vs. female, white vs. African American).
Bias vs. Impact
In contrast, test bias can be viewed as a systematic difference between test takers or test items that should be equal (Camilli, 2006). Theoretically, if two groups of test takers are of equal proficiency on the construct being measured (in practice, equality is determined by total test score), they should have the same probability of answering a test item correctly. If one group has a higher or lower probability after being matched on ability, this could indicate test bias. Here we need to distinguish bias from impact: impact is the observed difference in average performance on a test or an item between two groups. Tests are often criticized for bias whenever there are observed score differences between two groups (e.g., male vs. female, white vs. African American). This is NOT correct. Impact ≠ Bias.
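The distinction can be made concrete with a small numerical sketch. The counts below are invented for illustration (the group labels "A" and "B", the three score levels, and every number are assumptions, not real test data). They are built so that the two groups differ in overall success on an item, yet perform identically once test takers are matched on total score.

```python
# Hypothetical counts for one test item:
# (group, total-score level) -> (number answering correctly, number of test takers)
counts = {
    ("A", 1): (2, 10), ("A", 2): (10, 20), ("A", 3): (56, 70),
    ("B", 1): (14, 70), ("B", 2): (10, 20), ("B", 3): (8, 10),
}
levels = (1, 2, 3)


def overall_rate(group):
    """Proportion of the whole group answering the item correctly."""
    correct = sum(c for (g, _), (c, _n) in counts.items() if g == group)
    total = sum(n for (g, _), (_c, n) in counts.items() if g == group)
    return correct / total


def matched_rate(group, level):
    """Proportion correct among members of `group` at one total-score level."""
    correct, total = counts[(group, level)]
    return correct / total


# Impact: the unmatched difference in average item performance (about 0.36 here).
impact = overall_rate("A") - overall_rate("B")

# Matched comparison: the difference at each score level (zero at every level here).
matched_gaps = {
    level: matched_rate("A", level) - matched_rate("B", level) for level in levels
}
```

In this constructed example the item shows substantial impact, because group A has more high scorers, but no sign of bias, because matched test takers succeed at the same rate.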
Analyzing Potential Test Bias
When analyzing potential test bias between groups, the appropriate statistical tool is differential item functioning (DIF) analysis. DIF is “a difference in item performance between two comparable groups of examinees; that is, the groups that are matched with respect to the construct being measured by the test” (Dorans & Holland, 1993, p. 35). Again, if two groups of test takers are of equal proficiency on the construct being measured, they should have the same probability of answering the test item correctly. If one group has a higher or a lower probability after being matched on ability, the item is functioning differentially for the two groups. This is not desirable in testing; strict guidelines are enforced to remove items that show DIF.
We can examine DIF by plotting the results in graphs like those in Figures 1 and 2. Figure 1 shows the percentage of people answering a particular item correctly, by total test score and by group. People at any given test score answer this item correctly with the same frequency, regardless of which group they belong to. Thus, this item shows no differential item functioning.
In Figure 2, in contrast, we see an item that the reference (or majority) group is more likely than the focal (or underrepresented) group to answer correctly, even after we control for overall ability as measured by total test score. Something about this item is unrelated to total test score but related to group membership. This item needs to be carefully examined.
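The numbers behind plots of this kind can be tabulated directly. The following sketch computes, for each total score, the proportion of each group answering an item correctly. It then summarizes the gap between the two curves with a standardized difference, weighting each score level by the number of focal-group test takers there (the standardization approach named in the Dorans and Holland chapter title). The function names, the group labels, and the records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def conditional_p(records):
    """records: iterable of (group, total_score, correct), with correct 0 or 1.

    Returns {score: {group: (proportion correct, number of test takers)}}.
    """
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for group, score, correct in records:
        cell = tally[score][group]
        cell[0] += correct
        cell[1] += 1
    return {
        score: {g: (c / n, n) for g, (c, n) in groups.items()}
        for score, groups in sorted(tally.items())
    }


def std_p_dif(table, focal="focal", reference="reference"):
    """Focal-weighted average of (focal - reference) proportion correct."""
    weighted_gap = 0.0
    weight = 0
    for groups in table.values():
        if focal in groups and reference in groups:
            p_focal, n_focal = groups[focal]
            p_ref, _ = groups[reference]
            weighted_gap += n_focal * (p_focal - p_ref)
            weight += n_focal
    if weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_gap / weight


def expand(group, score, n_correct, n_total):
    """Build individual records from summary counts (helper for this example)."""
    return [(group, score, 1)] * n_correct + [(group, score, 0)] * (n_total - n_correct)


# Hypothetical item resembling the pattern described for Figure 2: a constant gap
# between matched groups at every score level.
records = (
    expand("reference", 10, 3, 10) + expand("focal", 10, 1, 10)
    + expand("reference", 20, 6, 10) + expand("focal", 20, 4, 10)
    + expand("reference", 30, 9, 10) + expand("focal", 30, 7, 10)
)
table = conditional_p(records)  # one row per total score, one curve per group
index = std_p_dif(table)        # about -0.2: matched focal test takers succeed less often
```

An index near zero corresponds to overlapping curves, as described for Figure 1; a value well away from zero marks an item for closer review.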
It is important to note that professional organizations such as the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) have for many years jointly developed standards on fairness and other important areas of educational and psychological testing (AERA, APA, & NCME, 1997). Many high-quality testing programs, including the SSAT, have adopted these fairness standards and incorporated them into their operational procedures to ensure that (a) appropriate tests are developed for the intended examinees; (b) tests are administered and scored precisely; and (c) test results are reported and interpreted correctly. In addition, content experts convene to review all test items for inappropriate terminology, stereotyping language, ethnocentrism, elitist or patronizing tone, or inflammatory material.
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1997). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
- Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (pp. 221-256). Westport, CT: Praeger Publishers.
- Dorans, N., & Holland, P. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.
- Educational Testing Service. (2002). ETS standards for quality and fairness. Princeton, NJ: Educational Testing Service.