
Establishing Causal Links

CAUSAL STUDIES

The search for causal explanations is of central importance in every area of scientific research. The first step in understanding something often involves speculating about what its cause or causes might be and then finding a way to test those speculations. In the last few years, for example, there has been a dramatic increase in the number of American children who are obese. What factors might be responsible for this increase? Too much fast food? Too little exercise? Some other factor? Some combination of factors? Causal experiments or, as they are usually called, causal studies are the main tool by which researchers confront such questions.

The design of causal studies and the assessment of their results present researchers with a series of problems over and above those we have encountered in thinking about how to design an experiment. The first stems from the fact that causation is not an all-or-nothing matter. Not every subject exposed to a cause will necessarily exhibit the effect. To establish a causal link, what must be shown is that significantly different levels of the effect are obtained in those exposed and not exposed to the cause. And this can be a problem. Causal studies generally involve a limited number of subjects. How can researchers be sure that a difference in levels of effect in their study groups is not due to the fact that relatively large differences often occur by chance in relatively small groups? How can they be sure, that is, that whatever effect a study uncovers is due to the suspected cause and not to random chance?

Second, effects are rarely associated with a single causal factor. How, then, can researchers be sure that other causal factors have not influenced the results? This problem is all the more severe because experimental and control subjects will usually be drawn from a population that may have been exposed to these other factors.


Finally, causal research cannot always be undertaken in a way that conforms to the model of a tightly controlled experiment outlined in Chapter 4. Researchers cannot, for example, investigate the possible link between diet and obesity in children by exposing young children to unhealthy food. Some other way must be found to test for this link.

Causal studies typically investigate the impact of suspected causes on the members of large populations, as in the example above. Often they deal with issues of considerable practical importance, like the problem of obesity in American children. For this reason, causal studies are the most widely covered of all science stories in the popular media. Near the end of the chapter we will turn our attention to the ways in which causal studies are handled in the mass media. Unfortunately, media coverage of causal research leaves a lot to be desired. Facts needed to make sense of a study will often be omitted and study outcomes oversimplified. And more often than not, these shortcomings result from a failure to appreciate the issues we will consider in this chapter.

RULING OUT CHANCE

In 2001 a study was undertaken to determine whether St. John’s wort, a popular herbal remedy, can counter the effects of depression. The study involved 200 women, most in their early 40s, who had suffered from major depression for two years, though none were classified as being severely depressed. The women were assigned at random to receive either an extract of St. John’s wort or a placebo. Those taking the St. John’s wort began with three tablets a day, each containing a commonly recommended dosage. If they did not improve after four weeks, the dose was increased to four tablets. After eight weeks, the patients were evaluated by psychiatrists who did not know whether the patients were in the experimental or control group. Twenty-seven percent of those who took St. John’s wort showed marked improvement, compared with 19% in the placebo group. Is this difference large enough to show that St. John’s wort is an effective treatment for depression?

Before we can answer this question, we need to know something about the thinking involved in estimating the accuracy of samples taken at random from large populations. This is because the subjects in a causal study are a sample. In the St. John’s wort study, the 200 subjects had been selected from the larger population composed of women suffering from depression. And the question at issue is whether, based on the study’s findings about this sample, we can conclude that St. John’s wort counters depression in the larger population. Or is the difference reported in the study nothing more than the kind of chance variation often found in small samples?

Statisticians have been able to determine that the accuracy of a sample is, to a large extent, a function of sample size. The larger a sample, the greater the chances it will accurately mirror what is true of the population from which it was taken. This is a rough approximation of something statisticians call the law of large numbers. Flip a fair coin 10 times, and it should come down “heads” half the time. But in a series of 10 flips, chances are not all that bad that “heads” will come up three or four times or perhaps six or seven times. Now flip the same coin 1,000 times. Chances are very high that the coin will come down “heads” somewhere in the neighborhood of 500 times; chances are slim that we will get a result very much higher or lower than 500. (To see why, imagine what the chances would be of flipping a coin 1,000 times and getting only one or two “heads” as opposed to flipping it 10 times with the same result.)
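For readers who want to see the law of large numbers in action, here is a minimal simulation sketch in Python. The seed and flip counts are arbitrary choices of ours, not part of the original discussion:

```python
# A minimal simulation of the law of large numbers, using only the
# Python standard library. The seed is an arbitrary choice.
import random

random.seed(42)

for flips in (10, 1000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips} flips: {heads} heads ({heads / flips:.0%})")

# Typical output: a run of 10 flips often strays well away from 50%,
# while a run of 1,000 lands very close to it.
```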

This fact about samples accounts for a notion that is indispensable in estimating the chances that a sample is accurate: margin of error. Imagine that a poll has just been taken of registered voters in your state. Five hundred voters were selected at random, telephoned, and asked if they intend to vote in the upcoming election; 52% said yes. In reporting this result, the pollsters will probably say something like “this sample has a margin of error of +/− (plus or minus) 4%.” As a general rule, what this means is that if a sample of this size (500) were taken 20 times, then in 19 of the 20 samples (95% of the samples) the outcome would be within 4%, one way or the other, of the result obtained in the sample actually taken. So, in the case of our poll, there is a 95% chance that between 48% and 56% (52% +/− our 4% margin of error) of all registered voters will vote in the upcoming election.

Suppose instead that the sample had involved 1,500 voters with the same results. Remember, the larger a random sample, the greater the chances its results will be accurate. What this means is that as sample size increases, the margin of error decreases. For a sample of 1,500, the margin of error is about +/− 2%. Chances are 19 out of 20 that something between 50% and 54% of the registered voters will turn out in the next election.
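The figures above follow from a standard formula: at the 95% confidence level, the margin of error for a sample proportion p from a sample of size n is roughly 1.96 × √(p(1 − p)/n). Here is a short sketch of ours, assuming the worst-case proportion p = 0.5 on which figures like those in Table 5.1 are typically based:

```python
# A sketch of the margin-of-error formula at the 95% confidence level
# (z = 1.96), using the worst-case proportion p = 0.5.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (500, 1500):
    print(f"n = {n}: +/-{margin_of_error(n):.1%}")

# n = 500:  +/-4.4%, the poll's reported "plus or minus 4%"
# n = 1500: +/-2.5%, matching the Table 5.1 entry
```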

Table 5.1 gives the margins of error for a number of common sample sizes. In each case, there is a 95% chance that the sample outcome will reflect that of the population from which it was taken. Our choice of the 95% confidence level is somewhat, though not entirely, arbitrary. For example, we could just as easily have given margins of error at a lower confidence level, say, the 80% level. As you might suspect, the margins of error for this lower confidence level would be smaller. If we are willing to tolerate a greater chance that we are wrong, we can venture a more accurate estimate. Table 5.1 tells us that in a sample of 100, for example, we can be 95% sure that the population value will be within +/−10% of the sample outcome. Were we to restrict ourselves to a conclusion we could be 80% sure would be true, the margin of error would shrink to about 7% either way. However, most sampling is based on the 95% confidence level. Unless we have information to the contrary, it is a safe bet that a reported margin of error is at this level.

To assess the results of a causal study, we will need to make use of what we have found out about margin of error. As we noted earlier, experimental and control groups are, in a sense, samples. The 200 women in the St. John’s wort study are representative of the larger population of all people who fit the profile of the study subjects: women in their 40s who suffer from moderate levels of depression. (Unless there is good reason to think this particular group is unusually susceptible to bouts of depression, the population can perhaps be expanded to include all people who suffer moderate depression.) So the issue we must resolve is whether the difference in the outcomes for the samples that make up the experimental and control groups is large enough to indicate a causal link and not just the kind of random statistical variation associated with sampling.

TABLE 5.1

Sample Size        Approximate Margin of Error (%)*

    25                 +/−22
    50                 +/−14
   100                 +/−10
   250                 +/−6
   500                 +/−4
  1000                 +/−3
  1500                 +/−2.5
  2000                 +/−2

*The interval surrounding the actual sample outcome containing 95% of all possible sample outcomes.

In the St. John’s wort study, 27% of the 100 experimental subjects improved, as did 19% of the 100 control subjects. Is this difference enough? Look back at Table 5.1. The margin of error for samples of 100 is about +/−10%. This tells us there is a 95% chance that in the population from which the experimental sample was taken, somewhere between 17% and 37% could be expected to improve. In the population corresponding to the control group, somewhere between 9% and 29% would improve. Figure 5.1 shows that there is considerable overlap between these two intervals.

FIGURE 5.1   Confidence intervals for groups of 100. Control group: 9% to 29%. Experimental group: 17% to 37%.

This tells us that chances are quite high that the difference we have discovered is due to random statistical fluctuations in the sampling process. This result does not mean that there is no link between the suspected causal agent and the effect we are testing. It is entirely possible that a causal link exists but that the level of effect is too small to measure using groups of this size. What we can conclude, however, is that this particular experiment has not conclusively established such a link. Had the difference between levels of effect in our two groups been 20% or more, we would have concluded that the difference was due to something other than the random statistical fluctuations associated with sampling. Quite possibly, it would have been due to the fact that St. John’s wort can counter depression.
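As a rough check on this reasoning, here is a minimal Python sketch of the interval comparison just described. The sketch is ours, not the study’s: it uses the normal-approximation margin of error for each observed proportion, which comes out slightly tighter than the rounded Table 5.1 figures, but the verdict for groups of 100 is the same.

```python
# A sketch of the interval comparison for the St. John's wort study.
import math

def interval(p, n, z=1.96):
    """Approximate 95% confidence interval for a sample proportion p."""
    moe = z * math.sqrt(p * (1 - p) / n)
    return (p - moe, p + moe)

exp_lo, exp_hi = interval(0.27, 100)   # experimental group: 27% improved
ctl_lo, ctl_hi = interval(0.19, 100)   # control group: 19% improved

print(f"experimental: {exp_lo:.1%} to {exp_hi:.1%}")
print(f"control:      {ctl_lo:.1%} to {ctl_hi:.1%}")

# The intervals overlap when each one's lower bound lies below the
# other's upper bound.
print("overlap:", exp_lo < ctl_hi and ctl_lo < exp_hi)   # True
```

Rerunning the comparison with interval(0.27, 1000) and interval(0.19, 1000) yields intervals that no longer overlap, which anticipates the discussion of larger groups below.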

Interestingly enough, the difference in levels of effect found in the St. John’s wort study would have been sufficient to suggest a causal link had the experimental and control groups been larger! Imagine that the study had been carried out on groups containing 1,000 subjects each. Table 5.1 tells us the margin of error for groups of this size is +/−3%. The corresponding intervals (24% to 30% for the experimental group, 16% to 22% for the control group) are represented in Figure 5.2. Note that there is a clear gap between the two intervals. It is not unusual for causal researchers to expand study sizes when their initial evidence for a causal link is tentative. A borderline result may become much less ambiguous if it can be shown to persist as larger samples are investigated. Nor is it unusual for researchers to combine the results of several small studies in an attempt to create groups of sufficient size to suggest a link that may be less clearly indicated in any of the individual studies. This approach is called meta-analysis. The findings of any such analysis are at best tentative, since the combined experimental and control groups come from many studies, and combining them tends to obscure differences among the studies over which they range.

Causal experiments do not always involve experimental and control groups of the same size. Even where the groups differ in size, we set minimal levels of difference in much the same way. Suppose, for example, that we have an experimental group of 50 subjects and a control group of 100. In constructing our intervals we need only make sure to work with the proper margins of error, which will be different in each case. Since we are working with percentages, we should encounter no difficulty in comparing the intervals.

When the results of causal experiments are reported, researchers often speak of differences that are or are not statistically significant. A difference in the outcome of two samples will be statistically significant when there is little or no overlap between the confidence intervals for the experimental and control groups. Thus a difference that is statistically significant is one which is highly unlikely to be due to normal sample fluctuations; chances are slim that two groups, chosen at random, would accidentally differ by the amount we observed in our experiment. Conversely, a result that is not statistically significant suggests there is a great deal of overlap and that the observed difference in levels of effect may well be due to random sample fluctuations.

As we have noted, causal research is nearly always conducted at the 95% confidence level. A statistically significant result, then, is one that would occur by chance only one time in twenty. In working with the results of a study done at this confidence level, Table 5.1 can help us to decide whether a result is or is not statistically significant. But a note of caution is in order here. The intervals in Table 5.1 can give us a rough approximation of whether a difference in experimental and control group outcomes is significant. However, they are a bit off. The percentage difference required to achieve statistical significance is a bit less than the difference suggested by Table 5.1. For example, a difference of just over 13% will be statistically significant for groups of 100 or so. (The required differences decrease even more when levels of the effect are very near to 0% or 100%.) Table 5.1 suggests that a 20% difference would be required. The amount of overestimation in Table 5.1 decreases as the size of experimental and control groups increases. Table 5.1 suggests a 6% difference is required to achieve statistical significance for samples of about 1,000, when in fact just over a 4% difference will do the trick. We can correct for the inaccuracy in Table 5.1 if we adopt the following rules of thumb in working with reported differences between experimental and control groups:

1.      If there is no overlap in the intervals for the two, the difference is statistically significant.

2.      If there is some overlap in the intervals (i.e., the intervals have less than one-third of their values in common), the difference is probably statistically significant. The greater the overlap, the smaller the chances the difference is significant.

3.      If there is a good deal of overlap (more than one-third of all values), the difference is probably not statistically significant.
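These three rules of thumb can be captured in a small function. This is a sketch under one assumption the text leaves open: we take “one-third of their values in common” to mean the shared width as a fraction of the narrower interval.

```python
# A sketch of the three rules of thumb for interval overlap.

def overlap_fraction(a, b):
    """Shared width of intervals a and b, relative to the narrower one."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / min(a[1] - a[0], b[1] - b[0])

def verdict(a, b):
    f = overlap_fraction(a, b)
    if f == 0.0:
        return "statistically significant"
    if f < 1 / 3:
        return "probably statistically significant"
    return "probably not statistically significant"

# The St. John's wort intervals for groups of 100 (Table 5.1 margins):
print(verdict((17, 37), (9, 29)))   # probably not statistically significant
```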

In the jargon of the causal researcher, failure to establish a causal link is often called a failure to reject the null hypothesis. The null hypothesis is simply the claim that there is no difference between levels of effect in the real populations from which the samples were taken. An experiment that succeeds in establishing a large enough difference in levels of effect between experimental and control groups will often be said to reject the null hypothesis. But in the study of St. John’s wort, such a difference has not been observed. On the basis of the study, in other words, we cannot reject the null hypothesis.

Imagine now that we are about to design a causal study. Does A cause B in Cs? We select a number of Cs, assign them at random to experimental and control groups, and administer A to the experimental subjects and a placebo to the controls. How large a difference in levels of B should we expect to find at the end of the study if there is a causal link? The answer should now be clear: whatever it takes to allow us to reject the null hypothesis for samples of the size with which we are working.
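For a concrete sense of “whatever it takes,” the smallest statistically significant difference between two groups can be approximated with the standard two-proportion formula. The sketch below is ours; it uses the worst-case assumption p = 0.5, which roughly reproduces the figures quoted earlier for groups of 100 and 1,000.

```python
# A sketch of the approximate smallest difference between two group
# proportions that is statistically significant at the 95% level.
import math

def min_significant_difference(n1, n2, p=0.5, z=1.96):
    """Approximate threshold difference for two groups of sizes n1, n2."""
    return z * math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

for n in (100, 1000):
    print(f"two groups of {n}: about {min_significant_difference(n, n):.1%}")

# two groups of 100:  about 13.9% (the "just over 13%" cited earlier)
# two groups of 1000: about 4.4%  (the "just over 4%" cited earlier)
```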

MULTIPLE CAUSAL FACTORS

As a veteran teacher with years of experience observing students, I’m convinced that students who attend class regularly generally do better on tests than do those who attend sporadically. But then personal observation can be misleading. Maybe I have just remembered the good test takers who always came to class, since I would like to think my teaching makes some difference. Is there really a causal link between class attendance and the performance of my students? We can find out by performing an experiment. I will teach two courses in the same subject next semester, each containing 100 students. The only difference between the two courses will be that in the first, attendance will be mandatory, while in the second it will be voluntary. All material to be tested will be covered either in the textbook or in lecture notes to be supplied to all students. Course grades will be based on a single, comprehensive final exam given to all students in both courses. Suppose now that we have performed this experiment, and at the end of the term we discover a statistically significant difference between the test scores of the two groups. The experimental group, the group required to attend, scored much higher, on average, than the control group, most of whom took advantage of the attendance policy and rarely appeared in class. To guard against outliers, we have excluded the five highest and lowest scores from each group, and the average difference remains statistically significant.

Despite the care we have taken in designing our experiment, it nonetheless suffers from a number of shortcomings. Perhaps the most obvious is the fact that it involves no control of factors other than attendance which might influence test scores. One such factor, obviously, is the amount of time that each subject studies outside of class. Remember, tests were based solely on material available to all subjects. What if a much higher percentage of the subjects in the experimental group than the control group spent considerable time preparing for the final? If this is the case, we would expect the experimental group to do better on the final but for reasons having little to do with class attendance.

The way to avoid this sort of difficulty is by matching within the experimental and control groups for factors, other than the suspected cause, that may contribute to the level of the effect. Matching involves manipulating the composition of the groups in an attempt to ensure that all factors that may contribute to the effect are equally represented in the two groups. There are several ways of matching. One is simply to make sure that each other contributing factor is equally represented within both groups. This we might accomplish in our experiment by interviewing the students beforehand to determine the average number of hours each studies per week. Presuming we can find an accurate way of getting this information, we can then disqualify students from one or the other of our groups until we have equal numbers of good, average, and poor studiers in both groups. Another way of matching is to eliminate all subjects who exhibit a causal factor other than the one for which we are testing. Suppose we discover that a few students in each group are repeating the course. We might want to remove them altogether from our study.

The final way to match is to include only subjects who exhibit other possible causal factors. We might do this by restricting our study to students, all of whom study roughly the same amount each week. If all of our experimental and control subjects have additional factors which contribute to the effect in question, the factor for which we are testing should increase the level of the effect in the experimental group, provided that it is actually a causal factor. Matching in this last way can be problematic if there is any chance that the effect may be caused by a combination of factors. Thus we may end up with an experiment which suggests that A causes B in Cs when, in point of fact, it is A in combination with some other factor which causes B in Cs.

By matching within our two groups we can frequently account for causal factors other than the factor we are investigating. However, there is a way that unwanted causal factors can creep into an experiment which matching will not prevent. We must be on guard against the possibility that our subjects will themselves determine whether they are experimental or control subjects. Imagine, for example, a student who has enrolled in the course that requires attendance but then hears from a friend about the course that does not. It seems likely that poor students will opt for the course that requires less. Thus, we may find that poor students have a better chance of ending up in the control section than in the experimental section. We could, of course, control for this possibility by making sure students do not know the attendance policy prior to enrolling and by allowing no movement from course to course. Another problem is that poor students in the experimental group, upon hearing of the attendance policy, might drop out, again leaving us with an experimental group not well matched to the control group. In any event, it is worth taking whatever precautions are possible, in designing a causal experiment, to ensure that subjects have no way to influence the composition of the experimental and control groups.

RANDOMIZED, PROSPECTIVE, AND RETROSPECTIVE CAUSAL STUDIES

Randomized Causal Studies. The causal studies we have discussed so far have several things in common. Each began with a set of subjects who, prior to the study, were very much alike. (In the second study—the one about student test performance—matching was used to make sure both groups were similar in makeup.) In particular, none had been exposed to the suspected causal agent. In both, subjects were randomly assigned to experimental and control groups. Only then were experimental subjects exposed to the suspected cause. Randomized studies, as studies of this sort are called, conform nicely to the criteria for a decisive experiment discussed in Chapter 4. Any pronounced differences in levels of effect that emerge over the course of either study would be highly likely if there is a causal link and highly unlikely otherwise.

The great advantage of randomized studies is that they are capable of providing unequivocal evidence for a causal link. But they have several disadvantages as well. First, they tend to be quite expensive, particularly if it is necessary to work with large groups of subjects. Second, unless the suspected effect follows reasonably soon after exposure to the causal agent, randomized studies may take a great deal of time to carry out.

In 1981, a large-scale randomized study was begun to determine whether low doses of aspirin could decrease the risk of myocardial infarction (heart attack). The study was undertaken by a team of investigators from Harvard Medical School and a Harvard teaching hospital. Over 260,000 male U.S. physicians between the ages of 40 and 84 were invited to participate. Of the nearly 60,000 who agreed to participate, about 26,000 were excluded because they reported a history of heart problems. The 33,223 willing and eligible physicians were enrolled in a “run-in” phase in which all were given low-dose aspirin. After 18 weeks, they were sent a questionnaire asking about their health status, any side effects of the aspirin, compliance with study protocols, and their willingness to continue in the trial. About 11,000 changed their minds, reported a reason for exclusion, or did not reliably take their aspirin. The remaining 22,071 physicians were then randomly assigned to receive either aspirin or an aspirin placebo. Follow-up questionnaires were sent after six and twelve months and then yearly. Though the study was scheduled to run until 1995, it was stopped after only six years because it was clear that aspirin had a significant effect on the risk of a first myocardial infarction.

The details of this story attest to the amount of time, effort, and expense that may be required to undertake an ambitious randomized causal study. The selection of candidates and subsequent efforts at matching alone took nearly two years. The study then took another six years to yield decisive results. And the entire project required millions of dollars in funding from four National Institutes of Health grants.

Time and expense are not the only impediments to randomized studies. Many suspected causal links cannot be subjected to randomized testing without putting subjects at undue risk. Do high levels of cholesterol in the blood cause heart disease? Imagine what a randomized experiment might involve. We might begin with a large number of young children. Having divided them at random into two groups, we would train one group to eat and drink lots of fatty, starchy, and generally unhealthy foods of the sort we suspect may be associated with high levels of cholesterol. I’m sure you can see the problem. Not coincidentally, much medical research is carried out on laboratory animals precisely because we tend to have much less hesitation about administering potentially hazardous substances to members of nonhuman species. But there is another approach, one that allows a more humane use of human and animal subjects: work with subjects who have already been exposed to suspected causes and their effects. Such studies fall into two broad categories, called prospective and retrospective studies.

QUICK REVIEW 5.1   Randomized Studies

Population from which experimental and control groups are drawn, not yet exposed to the suspected cause.

Experimental Group: selected at random; all are exposed to the suspected cause.

Control Group: selected at random; none are exposed to the suspected cause.

Prospective Causal Studies. Prospective studies begin with subjects who have already been exposed to the causal agent under investigation. Over the term of the study these experimental subjects are compared to control subjects who have not been exposed to the suspected cause. One of the best-known prospective studies began in 1976 at Harvard Medical School. Some 122,000 nurses, aged 30 to 55, agreed to fill out a questionnaire every two years about diseases and health-related topics including smoking, hormone use, and menopausal status. More than 90% of the respondents still answer a questionnaire every other year, providing information about what they eat, what medicines they take, what illnesses they have had, and whether they drink, smoke, exercise, or take vitamins, among other things. At various points in the study, blood samples have been obtained by sending a request and a few collection supplies to the nurses, something that could never have been done with the general public. Over the years researchers have been able to identify risk factors for diseases such as breast, colon, and lung cancer, diabetes, and heart disease. Among other things, researchers have discovered that nurses who regularly take vitamin E have significantly lower rates of heart disease and that those who have undergone hormone replacement therapy have significantly higher rates of breast cancer. As the study subjects age, researchers hope to be able to determine the extent to which diet and exercise lead to a longer, healthier life.

Even the best of prospective studies run up against one major stumbling block. At the outset of any prospective study, subjects in the experimental group have already been exposed to the suspected causal factor. But, as we have found, most things can be caused by a variety of factors, so it is always possible that other causal factors are responsible for some part of the effect in both the experimental and control groups. For example, in the study above, nurses who took vitamin E had lower rates of heart disease. But it may be that those who take vitamin E tend to take good care of themselves by exercising, watching their weight, and eating a healthy diet, all factors that help protect against heart disease. And this is the problem. By concentrating on a single causal factor in the selection process, we leave open the possibility that whatever difference in levels of effect we observe in our two groups may be due in part to other factors. This, of course, is precisely where prospective studies differ from randomized studies. By randomly dividing subjects into experimental and control groups prior to administering the suspected cause, we greatly decrease the chances that other factors will account for differences in level of effect. In prospective studies it is always possible that other factors will come into play precisely because we begin with subjects who may have been exposed to other causal factors.

Matching can be used to control for extraneous causal factors in prospective studies. Suppose we find, in the long-term nurses study, that about 30% more experimental than control subjects exercise regularly. We can easily subtract some subjects from our experimental group or add some to the control group to achieve similar percentages of this obvious causal factor. It is not an oversimplification to say that the reliability of a prospective study is in direct proportion to the degree to which such matching is successful. Thus, in assessing the results of a prospective study, we need to know what factors have been controlled for via matching. In addition, it is always wise to be on the lookout for other factors that might influence the study’s outcome yet which have not been controlled for. In general, a properly done prospective study can provide some strong indication of a causal link, though not as strong as that provided by a randomized study.

In some respects prospective studies offer advantages over randomized causal studies. For one thing, prospective studies require much less direct manipulation of experimental subjects and thus tend to be easier and less expensive to carry out. Their principal advantage, however, lies in the fact that they can involve very large groups, as in the physicians and nurses studies discussed above. Causal factors often result in differences in level of effect so small that large samples are required to detect them reliably. Moreover, greater size alone increases the chances that the samples will be representative with respect to other causal factors. This is crucial when an effect is associated with several causal factors. If a number of factors cause B in Cs, we increase our chances of accurately representing the levels of these other factors in our two groups as we increase their size. In addition, prospective studies allow us to study potential causal links we would not want to investigate in randomized studies. For example, the nurses study discussed above uncovered a link between hormone replacement therapy and breast cancer. No researcher who suspected such a link would undertake a randomized experiment that would involve exposing women to this potential hazard. A study that involves merely tracking women who are already undergoing replacement therapy is much less objectionable.

Retrospective Causal Studies. Retrospective studies begin with two groups, our familiar experimental and control groups, but the two are composed of subjects who do and do not have the effect in question. The study then involves looking into the subjects’ backgrounds in an attempt to uncover different levels of the potential causal factor in the two groups.

The following retrospective study investigated the effects of lead on children. A government health survey of 4,000 U.S. children between the ages of 4 and 15, done between 1999 and 2002, included 135 children with attention deficit hyperactivity disorder (ADHD). Blood tests were done on all 4,000. Among the 135 children with ADHD, blood lead levels of more than two micrograms per deciliter occurred at a much greater rate than in the other children. The ADHD children were four times more likely to have elevated lead levels in their blood, leading researchers to conclude that exposure to lead may cause ADHD.

QUICK REVIEW 5.2   Prospective Studies

Population from which experimental and control groups are drawn.

Preliminary Experimental Group: all have been exposed to the suspected cause.

Preliminary Control Group: none have been exposed to the suspected cause.

Both groups are then modified so that other potential causal factors are equally represented.

Even the best of retrospective studies can provide only weak evidence for a causal link. This is because, in retrospective studies, it is exceedingly difficult to control for other potential causal factors. Subjects are selected because they either do or do not have the effect in question, so other potential causal factors may automatically be built into our two groups. A kind of backwards matching is sometimes possible in retrospective studies. In the study above, if other sources of increased lead levels can be identified, some attempt can be made to ensure that these other factors are equally represented in the two groups. However, even if the two groups can be configured so that they exhibit similar levels of other suspected causes, we have at most very tentative evidence for the causal link in question. The process of reconfiguring the two groups may leave us with a distorted picture of the extent to which the causal factor under investigation is responsible for the effect. That the two groups now appear to be alike with respect to other causal factors is, after all, largely because they have been contrived to appear that way.

QUICK REVIEW 5.3   Retrospective Studies

Population from which experimental and control groups are drawn.

Experimental Group: all exhibit the suspected effect.

Control Group: none exhibit the suspected effect.

The groups are sometimes modified after all data are collected to equalize the occurrence of other potential causal factors.

The particular study we are discussing suffers from an unusual weakness in the selection process for retrospective studies. Essentially, retrospective studies involve looking for something common in the background of subjects who exhibit an effect: in this case, elevated lead levels in the blood. It is at least a possibility that cause and effect have been reversed! In the ADHD study, researchers noted that children with ADHD are more likely than others to eat old leaded paint chips or inhale leaded paint dust because of their hyperactivity. It may be that ADHD is in part responsible for the increased lead levels.

All we are in a position to conclude, as the result of a retrospective study, is that we have looked into the background of subjects who have a particular effect and we have found that a suspected cause occurs more frequently than in subjects who do not have the effect in question. Whether the effect is due to the suspected cause or whether cause and effect are reversed may be difficult to say even when pains are taken to control for other potential causal factors.

One final limitation of retrospective studies is that they provide no way of estimating the level of difference of the effect being studied. The very design of retrospective studies ensures that 100% of the experimental group, but none of the control group, will have the effect. In the ADHD study, it was reported that ADHD children were “four times more likely” to have elevated lead levels in their blood. No doubt this difference is statistically significant, which tells us the difference is probably not due to chance. But it does not tell us that children with elevated blood lead levels are “four times more likely” to have ADHD. Though the difference is striking, it tells us nothing about the efficacy of the suspected cause. By contrast, a prospective study in which the experimental subjects had levels of the suspected effect four times higher than control subjects would provide much stronger evidence of a causal link. Here, the evidence would be strengthened if other possible causes could be ruled out by matching prior to the onset of the study.

Because of these difficulties, retrospective studies are best regarded as a tool for uncovering potential causal links. The major advantage to retrospective studies, by contrast with randomized and prospective studies, is that they can be carried out quickly and inexpensively since they involve little more than careful analysis of data that is already available. And sometimes alacrity is of the essence. Imagine that there has been a rash of cases of food poisoning in your community. Before much of anything can be done, health authorities need some sense of what might be causing the problem. Interviews with the victims reveal that most had eaten at a particular restaurant within the last few days. This bit of retrospectively gained information may provide just the clue needed to get at the cause or causes of the problem.

READING BETWEEN THE LINES

The results of causal research are reported in specialized scientific journals. Typically an article will include full information about the design of the experiment, the results, and a complete statistical analysis where appropriate. Many medical journals, like The Journal of the American Medical Association and The New England Journal of Medicine, also provide full disclosure of the funding sources of the research. Conclusions will be carefully qualified, and the article will probably contain a brief history of other relevant research. When research uncovers a result that may have an impact on the general public, it will often be reported in the popular media: in newspapers, magazines, on the Internet, and on television. And it is here, in the mass media, that most of us encounter the findings of causal research.

Unfortunately, the popular media tend to do a poor job of reporting the outcomes of causal research. Media reports will often leave out crucial information, no doubt in the name of brevity; a 20- or 30-page journal article usually will be covered in a few paragraphs. Such reports also tend to dispense with the kind of careful qualifications that normally accompany the original write-up of the results. For these reasons, it is important to learn to read between the lines of popular reports if we are to make sense of the research on which they are based.

Here, for example, is the complete text of a newspaper story about an important piece of causal research:

Lithium, which is widely prescribed for manic-depressive disorders, may be the first biologically effective drug treatment for alcoholism, according to studies at St. Luke’s Medical Center. The new evidence indicates that the drug appears to have the unique ability to act on the brain to suppress an alcoholic’s craving for alcohol. The St. Luke’s study involved 84 patients, ranging from 20 to 60 years of age, who had abused alcohol for an average of 17 years. Eighty-eight percent were male. Half the patients were given lithium while the other half took a placebo, a chemically inactive substance. Seventy-five percent of the alcoholics who regularly took their daily lithium pills did not touch a drop of liquor for up to a year and a half during the follow-up phase of the experiment. This abstinence rate is at least 50% higher than that achieved by the best alcohol treatment centers one to five years after treatment. Among the alcoholics who did not take their lithium regularly, only 35% were still abstinent at the end of 18 months. Among those who stopped taking the drug altogether, all had resumed drinking by the end of six months. (Researchers tested the level of lithium in the blood of the subjects to determine if they were taking the drug regularly.)

Just what are we to make of this story and the research it describes? Is lithium effective in the treatment of alcoholism? (Note that the story begins by claiming that lithium “may be” the first effective treatment for alcoholism.) In trying to make sense of an article like this one, it is necessary to try to answer a number of questions, all based on our findings in this chapter:

What is the causal hypothesis at issue?

What kind of causal experiment is undertaken?

What crucial facts and figures are missing from the report?

Given the information you have at your disposal, can you think of any major flaws in the design of the experiment?

Given the information available, what conclusion can be drawn about the causal hypothesis?

Let’s consider again the news article about the lithium study, now in light of our five questions.

What is the causal hypothesis at issue? The hypothesis is that lithium suppresses the alcoholic’s craving for alcohol.

What kind of causal experiment is undertaken? Randomized. Subjects are divided into experimental and control groups prior to the experiment and only the experimental subjects are exposed to the suspected causal agent.

What crucial facts and figures are missing from the report? The passage gives us no information about what happened to the members of the control group. Nor does it tell us the number of subjects in the experimental group who “regularly took their daily lithium pills.” We know that 75% of those who did remained abstinent, but that could be as few as three subjects out of four. All we are told of the remaining members of the experimental group is that 35% of those who took their lithium irregularly remained abstinent and that those who stopped taking the drug altogether all resumed drinking. We are not told how many subjects fall into each of these subgroups. It is possible that the majority of experimental subjects did not remain abstinent. Given the information we have at our disposal, we just cannot say for sure, one way or the other. Though we are given no information about the control group, we are provided with some information against which to assess the results in the experimental group: we are told that the 75% abstinence rate is “at least 50% higher than that achieved by the best alcohol treatment centers one to five years after treatment.” However, we are not told whether the success rate for treatment centers is a percentage of people who entered treatment or of people who completed treatment. If the former, there is a strong possibility that treatment centers have a higher rate of success than that established in the experiment. Once again, we can draw no conclusions, since we are not provided with the key comparative information.

Given the information you have at your disposal, can you think of any major flaws in the design of the experiment? One possible flaw comes to mind. It may be that the subjects who continued to take their medication (lithium or placebo) throughout the entire 18 months of the experiment were more strongly motivated to quit drinking than the other subjects. And this may have influenced the outcome of the experiment. Precautions need to be taken to ensure either that no subjects lacked this motivation or that they were equally represented in experimental and control groups. Here, information about the results of the control group would be helpful. If roughly equal numbers of people dropped out of both groups, we would have some initial reason to think that we had controlled for motivation.

Given the information available, what conclusion can be drawn about the causal hypothesis? We can conclude very little, particularly because we are given no information about what happened to the control group. This is not to say that the experiment itself warrants no conclusion about the possible link between lithium and alcoholism. However, the report of the study with which we have been working presents so little information that we can draw no conclusion.

Media accounts of causal studies may make reference to a possible causal mechanism. The article about lithium and alcoholism does mention one: lithium, it is hypothesized, may act on the brain in some way to suppress an alcoholic’s craving for alcohol. Though in this case the proposed mechanism is vague, a well-understood, well-established mechanism can provide an additional piece of information to suggest that a study is on to something.
