Introduction to Validity
When designing and using tests and other methods of assessing people, it is important that the test and its use are valid.
"Validity of a test refers to the degree to which a test measures what it purports to measure".
"The validity of a test is defined as the extent to which the test measures what it is designed to measure".
(Aiken, Psychological Testing and Assessment)
Validity is based on empirical evidence.
A gap between what a test is meant to measure and what it actually measures can arise in two particular circumstances:
- The design of the test is insufficient for the intended purpose, and
- The test is used in a context or fashion which was not intended in the design.
In psychometrics, validity has a particular application known as test validity: "the degree to which evidence and theory support the interpretations of test scores" ("as entailed by proposed uses of tests").
In the area of scientific research design and experimentation, validity refers to whether a study is able to scientifically answer the questions it is intended to answer.
In clinical fields, the validity of a diagnosis and associated diagnostic tests may be assessed.
It is generally accepted that the concept of scientific validity addresses the nature of reality and as such is an epistemological and philosophical issue as well as a question of measurement. The use of the term in logic is narrower, relating to the truth of inferences made from premises.
Validity is "the best available approximation to the truth of a given proposition, inference, or conclusion".
The first thing we have to ask is: "validity of what?" When we think about validity in research, most of us think about research components. We might say that a measure is a valid one, or that a valid sample was drawn, or that the design had strong validity. But all of those statements are technically incorrect. Measures, samples and designs don't 'have' validity -- only propositions can be said to be valid. Technically, we should say that a measure leads to valid conclusions or that a sample enables valid inferences, and so on. It is a proposition, inference or conclusion that can 'have' validity.
We make lots of different inferences or conclusions while conducting research. Many of these are related to the process of doing research and are not the major hypotheses of the study. Nevertheless, like the bricks that go into building a wall, these intermediate process and methodological propositions provide the foundation for the substantive conclusions that we wish to address. For instance, virtually all social research involves measurement or observation. And, whenever we measure or observe we are concerned with whether we are measuring what we intend to measure or with how our observations are influenced by the circumstances in which they are made. We reach conclusions about the quality of our measures -- conclusions that will play an important role in addressing the broader substantive issues of our study. When we talk about the validity of research, we are often referring to the many conclusions we reach about the quality of different parts of our research methodology.
We subdivide validity into four types. Each type addresses a specific methodological question. In order to understand the types of validity, you have to know something about how we investigate a research question. Because all four validity types are really only operative when studying causal questions, we will use a causal study to set the context.
The figure shows that there are really two realms that are involved in research. The first, on the top, is the land of theory. It is what goes on inside our heads as researchers. It is where we keep our theories about how the world operates. The second, on the bottom, is the land of observations. It is the real world into which we translate our ideas -- our programs, treatments, measures and observations. When we conduct research, we are continually flitting back and forth between these two realms, between what we think about the world and what is going on in it. When we are investigating a cause-effect relationship, we have a theory (implicit or otherwise) of what the cause is (the cause construct). For instance, if we are testing a new educational program, we have an idea of what it would look like ideally. Similarly, on the effect side, we have an idea of what we are ideally trying to affect and measure (the effect construct). But each of these, the cause and the effect, has to be translated into real things, into a program or treatment and a measure or observational method. We use the term operationalization to describe the act of translating a construct into its manifestation. In effect, we take our idea and describe it as a series of operations or procedures. Now, instead of it only being an idea in our minds, it becomes a public entity that anyone can look at and examine for themselves. It is one thing, for instance, for you to say that you would like to measure self-esteem (a construct). But when you show a ten-item paper-and-pencil self-esteem measure that you developed for that purpose, others can look at it and understand more clearly what you intend by the term self-esteem.
Now, back to explaining the four validity types. They build on one another, with two of them (conclusion and internal) referring to the land of observation on the bottom of the figure, one of them (construct) emphasizing the linkages between the bottom and the top, and the last (external) being primarily concerned about the range of our theory on the top. Imagine that we wish to examine whether use of a World Wide Web (WWW) Virtual Classroom improves student understanding of course material. Assume that we took these two constructs, the cause construct (the WWW site) and the effect (understanding), and operationalized them -- turned them into realities by constructing the WWW site and a measure of knowledge of the course material. Here are the four validity types and the question each addresses:
- Reliability and validity
Validity is often assessed along with reliability - the extent to which a measurement gives consistent results.
An early definition of test validity identified it with the degree of correlation between the test and a criterion. Under this definition, one can show that reliability of the test and the criterion places an upper limit on the possible correlation between them (the so-called validity coefficient). Intuitively, this reflects the fact that reliability involves freedom from random error and random errors do not correlate with one another. Thus, the less random error in the variables, the higher the possible correlation between them. Under these definitions, a test cannot have high validity unless it also has high reliability. However, the concept of validity has expanded substantially beyond this early definition and the classical relationship between reliability and validity need not hold for alternative conceptions of reliability and validity.
Within classical test theory, predictive or concurrent validity (correlation between the predictor and the predicted) cannot exceed the square root of the correlation between two versions of the same measure — that is, reliability limits validity.
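The ceiling described above can be sketched in a few lines of Python. This is only an illustration under classical test theory assumptions: the function name `max_validity` and the reliability values are hypothetical, chosen for the example rather than taken from any study.

```python
import math

def max_validity(test_reliability, criterion_reliability=1.0):
    """Largest test-criterion correlation allowed by classical test theory.

    The bound is sqrt(r_xx * r_yy), where r_xx and r_yy are the
    reliabilities of the test and of the criterion.
    """
    for r in (test_reliability, criterion_reliability):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(test_reliability * criterion_reliability)

# Hypothetical reliabilities, for illustration only.
ceiling_perfect_criterion = max_validity(0.81)        # criterion measured without error
ceiling_noisy_criterion = max_validity(0.81, 0.64)    # criterion is itself unreliable
print(ceiling_perfect_criterion, ceiling_noisy_criterion)
```

With a perfectly reliable criterion the bound reduces to the square root of the test's own reliability; an unreliable criterion lowers the ceiling further.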
Types of Validity
1. Content validity
Content Validity is based on the extent to which a measurement reflects the specific intended domain of content.
(Carmines & Zeller, 1991, p. 20).
Content validity is a non-statistical type of validity that involves “the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured” (Anastasi & Urbina, 1997, p. 114). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?
Content validity evidence involves the degree to which the content of the test matches a content domain associated with the construct. For example, a test of the ability to add two numbers should include a range of combinations of digits. A test with only one-digit numbers, or only even numbers, would not have good coverage of the content domain. Content-related evidence typically involves subject matter experts (SMEs) evaluating test items against the test specifications.
A test has content validity built into it by careful selection of which items to include (Anastasi & Urbina, 1997). Items are chosen so that they comply with the test specification which is drawn up through a thorough examination of the subject domain. Foxcraft et al. (2004, p. 49) note that by using a panel of experts to review the test specifications and the selection of items the content validity of a test can be improved. The experts will be able to review the items and comment on whether the items cover a representative sample of the behavior domain.
Content validity is illustrated using the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straight-forward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.
A test has content validity if it sufficiently covers the area that it is intended to cover. This is particularly important in ability or attainment tests that validate skills or knowledge in a particular domain.
In psychometrics, content validity (also known as logical validity) refers to the extent to which a measure represents all facets of a given social construct. For example, a depression scale may lack content validity if it only assesses the affective dimension of depression but fails to take into account the behavioral dimension. An element of subjectivity exists in relation to determining content validity, which requires a degree of agreement about what a particular personality trait such as extraversion represents. Disagreement about what a personality trait represents will prevent a measure from achieving high content validity.
Content validity is related to face validity, but they should not be confused. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Content validity requires more rigorous statistical tests than face validity, which only requires an intuitive judgment. Content validity is most often addressed in academic and vocational testing, where test items need to reflect the knowledge actually required for a given topic area (e.g., history) or job skill (e.g., accounting). In clinical settings, content validity refers to the correspondence between test items and the symptom content of a syndrome.
One widely used method of measuring content validity was developed by C. H. Lawshe. It is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is. Lawshe (1975) proposed that each of the subject matter expert raters (SMEs) on the judging panel respond to the following question for each item: "Is the skill or knowledge measured by this item 'essential,' 'useful, but not essential,' or 'not necessary' to the performance of the construct?" According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity. Greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential. Using these assumptions, Lawshe developed a formula termed the content validity ratio:
CVR = (ne - N/2)/(N/2)
Here CVR = content validity ratio, ne = number of SME panelists indicating "essential", and N = total number of SME panelists. This formula yields values which range from +1 to -1; positive values indicate that more than half the SMEs rated the item as essential, and a value of 0 means that exactly half did. The mean CVR across items may be used as an indicator of overall test content validity.
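As a minimal sketch of Lawshe's formula, the Python below computes the CVR for each item and the mean CVR across items. The function name, the panel size of 10, and the vote counts are hypothetical and serve only to illustrate the arithmetic.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR = (ne - N/2) / (N/2)."""
    if n_panelists <= 0:
        raise ValueError("the panel must have at least one member")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must be between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel: number of "essential" ratings each item received
# from N = 10 subject matter experts.
N = 10
essential_votes = {"item_1": 10, "item_2": 8, "item_3": 5, "item_4": 2}

cvr_by_item = {item: content_validity_ratio(ne, N)
               for item, ne in essential_votes.items()}
mean_cvr = sum(cvr_by_item.values()) / len(cvr_by_item)

for item, cvr in cvr_by_item.items():
    print(item, cvr)
print("mean CVR:", mean_cvr)
```

In this made-up panel the four items receive CVRs of 1.0, 0.6, 0.0 and -0.6, so the mean CVR is about 0.25. Whether a given CVR is high enough to retain an item depends on the panel size, as the minimum-value table indicates.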
The minimum values of the CVR to ensure that agreement is unlikely to be due to chance can be found in the following table:
Content under-representation occurs when important areas are missed. Construct-irrelevant variation occurs when irrelevant factors contaminate the test.
- Representation validity
2. Face Validity
Face validity means that the test appears to be valid. This is validated using common-sense rules, for example that a mathematical test should include some numerical elements.
A test can appear to be invalid but actually be perfectly valid, for example where correlations have been found between seemingly unrelated items and the desired outcome. Successful pilots in WW2, for instance, were very often found to have had an active childhood interest in flying model planes.
A test that does not have face validity may be rejected by test-takers (if they have that option) and by people who are choosing which test to use from among a set of options.
Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support (Fink, 1995).
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Indeed, when a test is subject to faking (malingering), low face validity might make the test more valid.
Face validity is very closely related to content validity. Content validity depends on a theoretical basis for judging whether a test assesses all domains of a certain criterion (e.g., does assessing addition skills yield a good measure of mathematical skills? To answer this, you have to know which kinds of arithmetic skills mathematical skills include), whereas face validity relates to whether a test appears to be a good measure or not. This judgment is made on the "face" of the test, so it can also be made by an amateur.
Face validity is a starting point, but a test should NEVER be assumed to be valid for any given purpose on that basis alone, as "experts" have been wrong before -- the Malleus Maleficarum (Hammer of Witches) had no support for its conclusions other than the self-imagined competence of two "experts" in "witchcraft detection," yet it was used as a "test" to condemn and burn at the stake perhaps 100,000 women as "witches."
3. Criterion Related Validity
Criterion-related validity is like construct validity, but now relates the test to some external criterion, such as particular aspects of the job.
There are dangers with the external criterion being selected based on its convenience rather than being a full representation of the job. For example an air traffic control test may use a limited set of scenarios.
Criterion validity evidence involves the correlation between the test and a criterion variable (or variables) taken as representative of the construct. In other words, it compares the test with other measures or outcomes (the criteria) already held to be valid. For example, employee selection tests are often validated against measures of job performance (the criterion), and IQ tests are often validated against measures of academic performance (the criterion).
If the test data and criterion data are collected at the same time, this is referred to as concurrent validity evidence. If the test data is collected first in order to predict criterion data collected at a later point in time, then this is referred to as predictive validity evidence.
Criterion related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has been demonstrated to be valid.
For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. By comparing the scores on the written driving test with the scores from the hands-on driving test, the written test can be validated by using a criterion related strategy in which the hands-on driving test is compared to the written test.
In psychometrics, criterion validity is a measure of how well one variable or set of variables predicts an outcome based on information from other variables, and will be achieved if a set of measures from a personality test relate to a behavioral criterion that psychologists agree on. A typical way to achieve this is in relation to the extent to which a score on a personality test can predict future performance or behavior. Another way involves correlating test scores with another established test that also measures the same personality characteristic.
Criterion or concrete validity is the extent to which the measures are demonstrably related to concrete criteria in the "real" world. This type of validity is often divided into "concurrent" and "predictive" subtypes. The term "concurrent validity" is reserved for demonstrations relating a measure to other concrete criteria assessed simultaneously. "Predictive validity" refers to the degree to which any measure can predict future or independent past events. These variables are often represented as “intermediate” and “ultimate” criteria. For example, let us say we are conducting a study on success in college. If we find out there is a high correlation between student grades in high-school math classes and their success in college (which can be measured by many possible variables), we would say there is high criterion-related validity between the intermediate variable (grades in high-school math classes) and the ultimate variable (success in college). Essentially, the grades students received in high-school math can be used to predict their success in college.
- Concurrent validity
The two measures are taken at the same time. This is in contrast to predictive validity, where one measure occurs earlier and is meant to predict some later measure.
Concurrent validity refers to the degree to which the operationalization correlates with other measures of the same construct that are measured at the same time. Returning to the selection test example, this would mean that the tests are administered to current employees and then correlated with their scores on performance reviews.
Concurrent validity is measured by comparing two tests done at the same time, for example a written test and a hands-on exercise that seek to assess the same criterion. This can be used to limit criterion errors.
- Predictive validity
The validity coefficient
The validity coefficient is calculated as a correlation between the two items being compared, very typically success in the test as compared with success in the job.
A validity coefficient of 0.6 or above is considered high. Because few tests reach that level, very few tests give strong indications of job performance.
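To make the calculation concrete, the sketch below computes a validity coefficient as a Pearson correlation between test scores and a job-performance criterion. The helper `pearson_r` and both data lists are hypothetical, invented for illustration; real validation studies use far larger samples and typically report much lower correlations.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: selection-test scores and later supervisor ratings
# for the same eight employees (made-up values).
test_scores = [52, 61, 47, 70, 58, 65, 44, 73]
job_ratings = [3.1, 3.4, 2.9, 4.2, 3.0, 3.9, 3.2, 4.0]

validity_coefficient = pearson_r(test_scores, job_ratings)
print(round(validity_coefficient, 2))
```

The same calculation serves for concurrent and predictive designs; the two differ only in when the criterion data are collected.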
Predictive validity refers to the degree to which the operationalization can predict (or correlate with) other measures of the same construct that are measured at some time in the future. Again, with the selection test example, this would mean that the tests are administered to applicants, all applicants are hired, their performance is reviewed at a later time, and then their scores on the two measures are correlated.
In psychometrics, predictive validity is the extent to which a score on a scale or test predicts scores on some criterion measure.
For example, the validity of a cognitive test for job performance is the correlation between test scores and, for example, supervisor performance ratings. Such a cognitive test would have predictive validity if the observed correlation were statistically significant.
Predictive validity shares similarities with concurrent validity in that both are generally measured as correlations between a test and some criterion measure. In a study of concurrent validity the test is administered at the same time as the criterion is collected. This is a common method of developing validity evidence for employment tests: A test is administered to incumbent employees, then a rating of those employees' job performance is obtained (often, as noted above, in the form of a supervisor rating). Note the possibility for restriction of range both in test scores and performance scores: The incumbent employees are likely to be a more homogeneous and higher performing group than the applicant pool at large.
In a study of predictive validity, the test scores are collected first; then at some later time the criterion measure is collected. Here the example is slightly different: Tests are administered, perhaps to job applicants, and then after those individuals work in the job for a year, their test scores are correlated with their first-year job performance scores. Another relevant example is SAT scores: These are validated by collecting the scores during the examinees' senior year of high school and then waiting a year (or more) to correlate the scores with their first-year college grade point average. Thus predictive validity provides somewhat more useful data about test validity because it has greater fidelity to the real situation in which the test will be used. After all, most tests are administered to find out something about future behavior.
As with many aspects of social science, the magnitude of the correlations obtained from predictive validity studies is usually not high. A typical predictive validity study for an employment test might obtain a correlation in the neighborhood of r=.35. Higher values are occasionally seen and lower values are very common. Nonetheless the utility (that is, the benefit obtained by making decisions using the test) provided by a test with a correlation of .35 can be quite substantial.
Concurrent validity and predictive validity are two types of criterion-related validity. The difference between concurrent validity and predictive validity rests solely on the time at which the two measures are administered. Concurrent validity applies to validation studies in which the two measures are administered at approximately the same time. For example, an employment test may be administered to a group of workers and then the test scores can be correlated with the ratings of the workers' supervisors taken on the same day or in the same week. The resulting correlation would be a concurrent validity coefficient.
4. Construct Validity
Underlying many tests is a construct or theory that is being assessed. For example, there are a number of constructs for describing intelligence (spatial ability, verbal reasoning, etc.) which the test will individually assess.
Constructs can be about causes, about effects and the cause-effect relationship.
If the construct is not valid then the test on which it is based will not be valid. For example, there have been historical constructs that intelligence is based on the size and shape of the skull.
Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.
To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested.
(Carmines & Zeller, 1991, p. 23)
Construct validity refers to the extent to which operationalizations of a construct (e.g. practical tests developed from a theory) actually measure what the theory says they do. For example, to what extent is an IQ questionnaire actually measuring "intelligence"?
Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct. Such lines of evidence include statistical analysis of the internal structure of the test including the relationships between responses to different test items. They also include relationships between the test and measures of other constructs. As currently understood, construct validity is not distinct from the support for the substantive theory of the construct that the test is designed to measure. As such, experiments designed to reveal aspects of the causal role of the construct also contribute to construct validity evidence.
- Convergent Validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with.
- Divergent or Discriminant Validity
Discriminant validity describes the degree to which the operationalization does not correlate with other operationalizations that it theoretically should not be correlated with.
Construct Validity: Assuming that there is a causal relationship in this study, can we claim that the program reflected well our construct of the program and that our measure reflected well our idea of the construct of the measure?
In simpler terms, did we implement the program we intended to implement and did we measure the outcome we wanted to measure? In yet other terms, did we operationalize well the ideas of the cause and the effect? When our research is over, we would like to be able to conclude that we did a credible job of operationalizing our constructs -- we can assess the construct validity of this conclusion.
Construct validity refers to the degree to which inferences can legitimately be made from the operationalizations in your study to the theoretical constructs on which those operationalizations were based. Like external validity, construct validity is related to generalizing. But, where external validity involves generalizing from your study context to other people, places or times, construct validity involves generalizing from your program or measures to the concept of your program or measures. You might think of construct validity as a "labeling" issue. When you implement a program that you call a "Head Start" program, is your label an accurate one? When you measure what you term "self esteem" is that what you were really measuring?
I would like to tell two major stories here. The first is the more straightforward one. I'll discuss several ways of thinking about the idea of construct validity, several metaphors that might provide you with a foundation in the richness of this idea. Then, I'll discuss the major construct validity threats, the kinds of arguments your critics are likely to raise when you make a claim that your program or measure is valid. In most research methods texts, construct validity is presented in the section on measurement. And, it is typically presented as one of many different types of validity (e.g., face validity, predictive validity, concurrent validity) that you might want to be sure your measures have. I don't see it that way at all. I see construct validity as the overarching quality with all of the other measurement validity labels falling beneath it. And, I don't see construct validity as limited only to measurement. As I've already implied, I think it is as much a part of the independent variable -- the program or treatment -- as it is the dependent variable. So, I'll try to make some sense of the various measurement validity types and try to move you to think instead of the validity of any operationalization as falling within the general category of construct validity, with a variety of subcategories and subtypes.
The second story I want to tell is more historical in nature. During World War II, the U.S. government involved hundreds (and perhaps thousands) of psychologists and psychology graduate students in the development of a wide array of measures that were relevant to the war effort. They needed personality screening tests for prospective fighter pilots, personnel measures that would enable sensible assignment of people to job skills, psychophysical measures to test reaction times, and so on. After the war, these psychologists needed to find gainful employment outside of the military context, and it's not surprising that many of them moved into testing and measurement in a civilian context. During the early 1950s, the American Psychological Association began to become increasingly concerned with the quality or validity of all of the new measures that were being generated and decided to convene an effort to set standards for psychological measures. The first formal articulation of the idea of construct validity came from this effort and was couched under the somewhat grandiose idea of the nomological network. The nomological network provided a theoretical basis for the idea of construct validity, but it didn't provide practicing researchers with a way to actually establish whether their measures had construct validity. In 1959, an attempt was made to develop a method for assessing construct validity using what is called a multitrait-multimethod matrix, or MTMM for short. In order to argue that your measures had construct validity under the MTMM approach, you had to demonstrate that there was both convergent and discriminant validity in your measures. You demonstrated convergent validity when you showed that measures that are theoretically supposed to be highly interrelated are, in practice, highly interrelated. And, you showed discriminant validity when you demonstrated that measures that shouldn't be related to each other in fact were not. 
While the MTMM did provide a methodology for assessing construct validity, it was a difficult one to implement well, especially in applied social research contexts and, in fact, has seldom been formally attempted. When we examine carefully the thinking about construct validity that underlies both the nomological network and the MTMM, one of the key themes we can identify in both is the idea of "pattern." When we claim that our programs or measures have construct validity, we are essentially claiming that we as researchers understand how our constructs or theories of the programs and measures operate in theory and we claim that we can provide evidence that they behave in practice the way we think they should. The researcher essentially has a theory of how the programs and measures relate to each other (and other theoretical terms), a theoretical pattern if you will. And, the researcher provides evidence through observation that the programs or measures actually behave that way in reality, an observed pattern. When we claim construct validity, we're essentially claiming that our observed pattern -- how things operate in reality -- corresponds with our theoretical pattern -- how we think the world works. I call this process pattern matching, and I believe that it is the heart of construct validity. It is clearly an underlying theme in both the nomological network and the MTMM ideas. And, I think that we can develop concrete and feasible methods that enable practicing researchers to assess pattern matches -- to assess the construct validity of their research. The section on pattern matching lays out my idea of how we might use this approach to assess construct validity.
5. Experimental validity
The validity of the design of experimental research studies is a fundamental part of the scientific method, and a concern of research ethics. Without a valid design, valid scientific conclusions cannot be drawn. There are several different kinds of experimental validity.
6. Conclusion validity
Conclusion Validity: In this study, is there a relationship between the two variables?
In the context of the example we're considering, the question might be worded: in this study, is there a relationship between the WWW site and knowledge of course material? There are several conclusions or inferences we might draw to answer such a question. We could, for example, conclude that there is a relationship. We might conclude that there is a positive relationship. We might infer that there is no relationship. We can assess the conclusion validity of each of these conclusions or inferences.
One aspect of the validity of a study is statistical conclusion validity - the degree to which conclusions reached about relationships between variables are justified. This involves ensuring adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures. Conclusion validity is only concerned with whether there is any kind of relationship at all between the variables being studied; it may only be a correlation.
7. Internal validity
Researchers should be concerned with both Internal and External validity.
Internal Validity: Assuming that there is a relationship in this study, is the relationship a causal one? Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured) and (2) the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.
Internal validity is an inductive estimate of the degree to which conclusions about causal relationships can be made (e.g. cause and effect), based on the measures used, the research setting, and the whole research design. Good experimental techniques, in which the effect of an independent variable on a dependent variable is studied under highly controlled conditions, usually allow for higher degrees of internal validity than, for example, single-case designs. Factors jeopardizing internal validity include:
- History, the specific events occurring between the first and second measurements in addition to the experimental variables
- Maturation, processes within the participants as a function of the passage of time (not specific to particular events), e.g., growing older, hungrier, more tired, and so on.
- Testing, the effects of taking a test upon the scores of a second testing.
- Instrumentation, changes in calibration of a measurement tool or changes in the observers or scorers may produce changes in the obtained measurements.
- Statistical regression, operating where groups have been selected on the basis of their extreme scores.
- Selection, biases resulting from differential selection of respondents for the comparison groups.
- Experimental mortality, or differential loss of respondents from the comparison groups.
- Selection-maturation interaction and similar interactions, e.g., in multiple-group quasi-experimental designs.
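The statistical regression threat in the list above can be made concrete with a short simulation. This is a minimal sketch with hypothetical numbers: every simulated person has a stable true score, the pretest and posttest each add independent measurement error, and no treatment of any kind is applied. A group selected for extremely low pretest scores still appears to "improve" at posttest.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 2000                # hypothetical number of participants

# Stable true scores plus independent measurement error at each testing.
true_score = [rng.gauss(50, 10) for _ in range(n)]
pre = [t + rng.gauss(0, 10) for t in true_score]
post = [t + rng.gauss(0, 10) for t in true_score]

# Select the lowest-scoring tenth of the sample on the pretest.
cutoff = sorted(pre)[n // 10]
selected = [i for i in range(n) if pre[i] < cutoff]

pre_mean = sum(pre[i] for i in selected) / len(selected)
post_mean = sum(post[i] for i in selected) / len(selected)
gain = post_mean - pre_mean  # apparent "improvement" with no treatment

# The sample as a whole does not change between the two testings.
overall_change = sum(post) / n - sum(pre) / n

print("selected group size:", len(selected))
print("selected group pretest mean:", round(pre_mean, 1))
print("selected group posttest mean:", round(post_mean, 1))
print("overall change:", round(overall_change, 2))
```

Because the extreme group was chosen partly on the basis of unlucky measurement error, its posttest mean moves back toward the overall mean. A study that gave this group a treatment and had no comparison group could mistake that movement for a treatment effect.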
8. External validity
External validity refers to the extent to which the results of a study are generalizable or transferable. (Most discussions of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include a reference here to transferability because many qualitative research studies are not designed to be generalized.)
External validity concerns the extent to which the (internally valid) results of a study can be held to be true for other cases, for example to different people, places or times. In other words, it is about whether findings can be validly generalized. If the same research study were conducted in those other cases, would it get the same results?
A major factor in this is whether the study sample (e.g. the research participants) is representative of the general population along relevant dimensions. Other factors jeopardizing external validity are:
- Reactive or interaction effect of testing, a pretest might increase the scores on a post-test
- Interaction effects of selection biases and the experimental variable.
- Reactive effects of experimental arrangements, which would preclude generalization about the effect of the experimental variable upon persons being exposed to it in non-experimental settings
- Multiple-treatment interference, where effects of earlier treatments are not erasable.
We are likely to make some claims that our research findings have implications for other groups and individuals in other settings and at other times. When we do, we can examine the external validity of these claims.
Notice how the question that each validity type addresses presupposes an affirmative answer to the previous one. This is what we mean when we say that the validity types build on one another.
For any inference or conclusion, there are always possible threats to validity -- reasons the conclusion or inference might be wrong. Ideally, one tries to reduce the plausibility of the most likely threats to validity, thereby leaving as most plausible the conclusion reached in the study. For instance, imagine a study examining whether there is a relationship between the amount of training in a specific technology and subsequent rates of use of that technology. Because the interest is in a relationship, it is considered an issue of conclusion validity. Assume that the study is completed and no significant correlation between amount of training and adoption rates is found. On this basis it is concluded that there is no relationship between the two. How could this conclusion be wrong -- that is, what are the "threats to validity"? For one, it's possible that there isn't sufficient statistical power to detect a relationship even if it exists. Perhaps the sample size is too small or the measure of amount of training is unreliable. Or maybe assumptions of the correlational test are violated given the variables used. Perhaps there were random irrelevancies in the study setting or random heterogeneity in the respondents that increased the variability in the data and made it harder to see the relationship of interest. The inference that there is no relationship will be stronger -- have greater conclusion validity -- if one can show that these alternative explanations are not credible. The distributions might be examined to see if they conform with assumptions of the statistical test, or analysis conducted to determine whether there is sufficient statistical power.
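The statistical power check mentioned above can be sketched as follows. This is an approximation under stated assumptions, not a prescribed procedure: it uses the Fisher z transformation with a normal approximation for a two-sided test of a Pearson correlation, and the function name `correlation_power`, the assumed true correlation of 0.3, and the sample sizes are all hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test that a correlation is zero.

    r     : assumed true correlation in the population
    n     : sample size (must be greater than 3)
    alpha : significance level of the test
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the Fisher z approximation, atanh(sample r) is roughly
    # normal with mean atanh(r) and standard error 1 / sqrt(n - 3).
    shift = math.atanh(abs(r)) * math.sqrt(n - 3)
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for sample_size in (30, 85, 200):
    power = correlation_power(0.3, sample_size)
    print(f"n = {sample_size}: approximate power = {power:.2f}")
```

If the approximate power is low for the smallest relationship that would matter in practice, a non-significant correlation says little, and the conclusion that there is no relationship has weak conclusion validity.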
The theory of validity, and the many lists of specific threats, provide a useful scheme for assessing the quality of research conclusions. The theory is general in scope and applicability, well-articulated in its philosophical suppositions, and virtually impossible to explain adequately in a few minutes. As a framework for judging the quality of evaluations it is indispensable and well worth understanding.
9. Ecological validity
Ecological validity is the extent to which research results can be applied to real life situations outside of research settings. This issue is closely related to external validity but covers the question of to what degree experimental findings mirror what can be observed in the real world (ecology = the science of interaction between organism and its environment). To be ecologically valid, the methods, materials and setting of a study must approximate the real-life situation that is under investigation.
Ecological validity is partly related to the issue of experiment versus observation. Typically in science, there are two domains of research: observational (passive) and experimental (active). The purpose of experimental designs is to test causality, so that you can infer A causes B or B causes A. But sometimes, ethical and/or methodological restrictions prevent you from conducting an experiment (e.g. how does isolation influence a child's cognitive functioning?). Then you can still do research, but it's not causal, it's correlational. You can only conclude that A occurs together with B.
Both techniques have their strengths and weaknesses. To get an experimental design you have to control for all interfering variables. That's why you often conduct your experiment in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant) you lose ecological validity because you establish an artificial lab setting.
On the other hand with observational research you can't control for interfering variables (low internal validity) but you can measure in the natural (ecological) environment, at the place where behavior normally occurs.
10. Diagnostic validity
In clinical fields such as medicine, the validity of a diagnosis, and associated diagnostic tests or screening tests, may be assessed.
In regard to tests, the validity issues may be examined in the same way as for psychometric tests as outlined above, but there are often particular applications and priorities. In laboratory work, the medical validity of a scientific finding has been defined as the 'degree of achieving the objective' - namely of answering the question which the physician asks. An important requirement in clinical diagnosis and testing is sensitivity and specificity - a test needs to be sensitive enough to detect the relevant problem if it is present (and therefore avoid too many false negative results), but specific enough not to respond to other things (and therefore avoid too many false positive results).
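Sensitivity and specificity as described above can be computed directly from the four cells of a table that compares test results with true status. The sketch below uses the standard definitions; the function name and the counts are hypothetical and chosen only for illustration.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from diagnostic test counts.

    tp : true positives  (problem present, test positive)
    fn : false negatives (problem present, test negative)
    tn : true negatives  (problem absent, test negative)
    fp : false positives (problem absent, test positive)
    """
    sensitivity = tp / (tp + fn)  # share of real cases the test detects
    specificity = tn / (tn + fp)  # share of non-cases the test clears
    return sensitivity, specificity


# Hypothetical counts: 100 people with the condition, 200 without it.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=160, fp=40)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A test with few false negatives has high sensitivity, and a test with few false positives has high specificity. The two usually have to be traded off against each other when a cut-off is chosen.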
In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:
- Content validity may refer to symptoms and diagnostic criteria;
- Concurrent validity may be defined by various correlates or markers, and perhaps also treatment response;
- Predictive validity may refer mainly to diagnostic stability over time;
- Discriminant validity may involve delimitation from other disorders.
Robins and Guze proposed in 1970 what became influential formal criteria for establishing the validity of psychiatric diagnoses. They listed five criteria:
- Distinct clinical description (including symptom profiles, demographic characteristics, and typical precipitants).
- Laboratory studies (including psychological tests, radiology and postmortem findings).
- Delimitation from other disorders (by means of exclusion criteria).
- Follow-up studies showing a characteristic course (including evidence of diagnostic stability).
- Family studies showing familial clustering.
These were incorporated into the Feighner Criteria and Research Diagnostic Criteria that have since formed the basis of the DSM and ICD classification systems.
Kendler in 1980 distinguished between:
- Antecedent validators (familial aggregation, premorbid personality, and precipitating factors).
- Concurrent validators (including psychological tests).
- Predictive validators (diagnostic consistency over time, rates of relapse and recovery, and response to treatment).
Kendell and Jablensky (2003) emphasized the importance of distinguishing between validity and utility, and argued that diagnostic categories defined by their syndromes should be regarded as valid only if they have been shown to be discrete entities with natural boundaries that separate them from other disorders.
Kendler (2006) emphasized that to be useful, a validating criterion must be sensitive enough to validate most syndromes that are true disorders, while also being specific enough to invalidate most syndromes that are not true disorders. On this basis, he argues that a Robins and Guze criterion of "runs in the family" is inadequately specific because most human psychological and physical traits would qualify - for example, an arbitrary syndrome comprising a mixture of "height over 6 ft, red hair, and a large nose" will be found to "run in families" and be "hereditary", but this should not be considered evidence that it is a disorder. Kendler has further suggested that "essentialist" gene models of psychiatric disorders, and the hope that we will be able to validate categorical psychiatric diagnoses by "carving nature at its joints" solely as a result of gene discovery, are implausible.