Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It
Paul E. Meehl (1990).
Psychological Inquiry, 1(2), 108-141
https://doi.org/10.1207/s15327965pli0102_1 (pdf, UMN Meehl website)

Karl Popper, smart guy and Meehl's hero

Imre Lakatos, smart guy and Meehl's hero

The very short of it

When a theory’s predictions fail, we have to decide whether to abandon the theory wholesale or to fiddle with some secondary elements in order to save it. How do we decide whether it’s worth the trouble and which parts to change? It’s worth it when the theory has previously made precise and unexpected predictions. Changes are admissible when they concern non-core parts, are derived from the core parts, and result in new predictions that pass new empirical tests. The social sciences have a hard time generating “worthy” theories because their predictions are usually neither precise nor unexpected. This is due to the crud factor and an inappropriate use of significance tests. Social scientists should abandon null hypothesis significance tests of directional predictions (dy/dx > 0), try to generate predictions that are better than directional without being point predictions (e.g. ranges or functional forms), and focus on deriving consequences from numerical agreements arrived at through qualitatively diverse observational avenues.

The author’s own summary

The short version of the argument is on pp. 121-2 (start with last paragraph p. 121, stop before subsection “H0 Testing in Light of the Lakatos-Salmon Principle”):

Let the expression Lakatosian defense designate the strategy outlined by Lakatos in his constructive amendment of Popper, a strategy in which one distinguishes between the hard core of T [the theory that we want to test] and the protective belt. In my notation Lakatos’s protective belt includes the peripheral portions of T, plus the theoretical auxiliaries At, the instrumental auxiliaries Ai, the ceteris paribus clause Cp, the experimental conditions Cn, and finally the observations O1, O2. The Lakatos defense strategy includes the negative heuristic which avoids (he said forbids) directing the arrow of the modus tollens at the hard core. To avoid that without logical contradiction, one directs the arrow at the protective belt.

[…]

The tactics within the Lakatosian defensive strategy may vary with circumstances. As mentioned earlier, we may refuse to admit the falsifying protocol into the corpus, or raise doubts about the instrumental auxiliary, or challenge the ceteris paribus clause, or the theoretical auxiliaries, or finally, as a last ditch maneuver, question the peripheral portions of the substantive theory itself. Nobody has given clear-cut rules for which of these tactics is more rational, and I shall not attempt such a thing. At best, we could hope to formulate rough guidelines, rules of thumb, “friendly advice,” broad principles rather than rules (Dworkin, 1967). […]

When is it rational strategy to conduct a Lakatosian defense? Here we invoke the Lakatos principle. We lay down that it is not a rational policy to go to this much trouble with amendments of T or adjustments of auxiliaries unless the theory already has money in the bank, an impressive track record, and is not showing clear symptoms of a degenerating research program.

How does a theory get money in the bank—how does it earn an impressive track record? We rely on the basic epistemological principle that “If your aim is a causal understanding of the world, do not attribute orderliness to a damn strange coincidence.” […] We apply this maxim to formulate Salmon’s principle: that the way a theory gets money in the bank is by predicting observations that, absent the theory, would constitute damn strange coincidences. […] It does this by achieving a mixture of risky successes (passing strong Popperian tests) and near-misses, either of these being Salmonian damn strange coincidences.

My summary

Here I try to summarize the central argument. Hopefully, you will find it easier to skim the article with this summary under your belt. I omit those parts where Meehl discusses how science of science people should measure the “success” of theories.

The starting paradox is this: In the physical sciences, greater statistical power (better instruments or experimental design, more observations) makes it harder to pass significance tests (it becomes easier to detect differences between the measured and the predicted quantity); in the social sciences, greater statistical power makes it easier to pass significance tests (it becomes easier to reject H0 = “there is no effect”).
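A minimal simulation of the paradox (my own sketch, not from the paper; all numbers, including the crud-sized correlation and the slightly-off point prediction, are made up for illustration). The “soft” theory only predicts that a correlation is nonzero, while the “strong” theory predicts a specific value; more data makes the former sail through its test and the latter fail whenever its prediction is even slightly wrong.

```python
# Sketch of Meehl's power paradox: with more data, a directional theory "passes" its
# significance test more often, while a point-predicting theory fails more often.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_r = 0.10          # small but real "crud" correlation in the population
predicted_mu = 100.0   # point prediction made by a "strong" theory
true_mu = 100.5        # the strong theory is slightly wrong
sigma = 10.0
reps = 500

for n in (50, 500, 5000):
    soft_pass = hard_pass = 0
    for _ in range(reps):
        # Soft-science test: reject H0 "no correlation" and declare the theory supported
        x = rng.normal(size=n)
        y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)
        _, p_corr = stats.pearsonr(x, y)
        soft_pass += p_corr < 0.05
        # Hard-science test: the theory survives only if the data do not deviate
        # detectably from the predicted value
        m = rng.normal(true_mu, sigma, size=n)
        _, p_point = stats.ttest_1samp(m, predicted_mu)
        hard_pass += p_point >= 0.05
    print(f"n={n:5d}  directional theory passes: {soft_pass/reps:.2f}"
          f"  point-predicting theory passes: {hard_pass/reps:.2f}")
```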

This creates the social scientist’s dilemma: either perform the null ritual (pretend that “A causes B” is equivalent to dy/dx > 0; pretend that anyone believes H0: dy/dx = 0; show that p(data|H0)<0.05; claim [implicitly] “this shows that A causes B”) or give up on improving our understanding of the world and get a job increasing ad revenue for billionaires.

The less flippant version of the dilemma is: “everything is correlated with everything” (the crud factor, see below); we do not have theories strong enough to generate predictions beyond dy/dx > 0; crud factor & dy/dx > 0 => many, many nonsense theories will pass significance tests.

Meehl thinks we need not throw out the baby with the bath water and we can distinguish between better and worse theories without using significance tests.

When we conduct an empirical test of a substantive theory, it is not equivalent to deciding whether the sentence “the apple is on the table” is true or false. Theories are more complex than that. Meehl suggests that the logical form of an empirical test of a substantive theory is:

(T + At + Ai + Cp + Cn) -> (O1 -> O2)

In verbal form: If you believe (Theory + theoretical auxiliaries + instrumental auxiliaries + ceteris paribus + experimental conditions) then you must believe (when we observe O1 we will also observe O2). We’ll get to the meaning of the individual terms later. They roughly mean what you think they mean.

If it happens that we observe O1 but we do not observe O2, then we must reject the entire left-hand side. This is modus tollens at work (if you assert P->Q and then observe that Q is false, it follows that P is false). I will use LHS as shorthand for the entire thing within parentheses left of ‘->’.
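In symbols (my rendering, using the notation above): modus tollens plus De Morgan only licenses the conclusion that at least one conjunct of the LHS is false, which is exactly the slack a Lakatosian defense exploits by aiming the arrow at the protective belt rather than at T.

```latex
\[
\underbrace{(T \wedge A_t \wedge A_i \wedge C_p \wedge C_n)}_{\text{LHS}} \rightarrow (O_1 \rightarrow O_2)
\]
\[
\text{LHS} \rightarrow (O_1 \rightarrow O_2), \quad O_1 \wedge \neg O_2
\;\;\therefore\;\; \neg\text{LHS}
\;\equiv\; \neg T \vee \neg A_t \vee \neg A_i \vee \neg C_p \vee \neg C_n
\]
```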

If we observe O1 but not O2, we have falsified the LHS in Popper-speak.

If we observe O1 and O2, we have corroborated the LHS in Popper-speak.

Strictly logically speaking, we have not “proved” or “confirmed” or whatever the LHS. So why do we bother? Because, as practising scientists, our publications make the (implicit) claim that, absent T, there is little reason to expect “if O1 then O2.” This is where the crud factor becomes a problem.

The proposition “if not for T, the statement ‘if O1, then O2’ is unlikely [implausible, nonsensical, whatever]” is more believable the more detailed T, O1, and O2 are. Meehl gives the following example (p. 110). A meteorological theory that predicts that it will rain next April is not very interesting. One that correctly predicts on which five days in April it will rain is more interesting. One which correctly predicts the exact amount it will rain on each of these five days is very interesting.
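To put rough numbers on the rain example (my arithmetic, assuming 30 days in April and exactly five rainy days): the chance of naming the five rain days correctly by luck alone is

```latex
\[
P(\text{correct five days by luck}) = \binom{30}{5}^{-1} = \frac{1}{142506} \approx 7 \times 10^{-6},
\]
```

and correctly predicting the exact amounts on top of that is rarer still. Absent the theory, such a success would be a damn strange coincidence.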

Now, in the real world, we rarely abandon T because the RHS did not pan out. More likely, we put the blame on some of the other elements and adjust them. Most of the time, we even know that one or more of the LHS elements are not, strictly speaking, true. So the question becomes: when is it worth it and appropriate to fiddle with the different parts of LHS->RHS, other than the core parts of T?

To answer this question, we must:

  • identify the core parts of T
  • decide if T is a “progressive” or “degenerate” research program
  • decide which parts of LHS->RHS can be changed
  • decide whether T has enough “money in the bank” to be worth the trouble

How to identify the core parts of T

I don’t know. Meehl only gives two examples from psychology (point 12, p. 112). He seems to suggest that domain experts will know which bits of a theory are essential and which are not. I’m not sure what a good social science example would be.

How to identify a “progressive” or “degenerate” research program

“Progressive” = when our ad hockery leads to new predictions, these new predictions pass new empirical tests, and our ad hockery mobilizes existing elements of our theory instead of unrelated external ones.

“Degenerate” = when an excessive amount or a bad kind of ad hockery is required to save T. Ad hockery is bad when it does not fulfill the three requirements above (new facts, pass tests, related to core T).

See point 11 on p. 111

How to decide what to change and how (good vs. bad ad hockery)

Which parts of LHS->RHS should we change? And how?

First, let’s clear up terminology.

  • At and Ai: theoretical and instrumental auxiliaries. Ai means beliefs/assumptions about our means of control (e.g. administering a treatment and ensuring compliance) and observation (e.g. measuring completed fertility via survey). At means supplemental theories about constructs relevant to T (e.g. humans have preferences and try to act in accordance with them). The demarcation between T, Ai, and At is not always clear.
  • Cp: ceteris paribus (“all else being equal”). In its strongest sense, it implies that there is nothing that influences O1 and O2 that is not contained in T. Obviously, that is not true. What we (implicitly) believe is “nothing else is at work except factors that are totally random and therefore subject to being dealt with by our statistical methods.” And even that might not be true, or we might not really believe it (and even less often state it as an assumption).
  • Cn: experimental conditions. What was the exact protocol? Do we think it was run appropriately? Do we think it was reported honestly?

Most commonly, we adjust Cp and At. Especially if we “tested” T in different contexts or subgroups, and our test worked in some but not all. Basically, we say “oh it seems like there are some systematic [confounders, mediators, whatever] in this particular context/subgroup.” And then we move some bits from Cp into At or T.

One risk is that we introduce new constructs into At as we adjust Cp. Then, we have to explain why these new theories explain our failure in that one context/subgroup but do not change our predictions or results in the other contexts/subgroups. “We do not want to be guilty of gerrymandering the ad hockery we perform on our auxiliaries!” (p. 111)

“Money in the bank” a.k.a. the “track record” principle a.k.a. the Serlin-Lapsley principle

WWE Money in the Bank logo: what we all crave

While our theories are almost certainly incomplete and false in the literal sense, they may be “good enough.” For example, the current most precise and explicit theory of the physical and chemical processes making up our sun will likely turn out to be missing some parts and have some parameters wrong (at some level of precision). Does that mean it’s wrong? Does that mean it’s wrong to say that the sun is a big ball of hot gas, mostly hydrogen? Does that mean it’s no better than the theories that the sun is a glowing gigantic iron cannonball (Anaxagoras) or the chariot of the sun god? (Meehl’s examples)

Good enough means “having enough verisimilitude to warrant continued effort and testing it, amending it, and fiddling in honest ad hockery (not ad hoc of Lakatos’s three forbidden kinds) with the auxiliaries.” (p. 115, Meehl’s formulation)

What the Stephen is verisimilitude?

"Truth that comes from the gut, not books." Stephen Colbert, American philosopher

For the long version, read the section entitled “Excursus: The Concept of Verisimilitude.” The short of it is that we basically never think of a theory as completely true or false. We have an intuitive sense for the “truthiness” of a theory. Meehl gives the example of a newspaper account of a car accident. If the account gets everything right except for the middle name of one of the drivers, we would not call the account “false.” If it gets a driver’s entire name and the name of the street wrong, we might think it poor journalistic craftsmanship but still somewhat true. Similarly, because a theory consists of so many different things (entities, relationships, functional forms, parameters), some of these can be wrong or inaccurate in different ways without us thinking that the theory is entirely without merit.

How does a theory get money in the bank?

The Salmon principle (Wesley Salmon, not the fish): “The main way a theory gets money in the bank is by predicting facts that, absent the theory, would be antecedently improbable.” (p. 115)

The “damn strange coincidence” maxim (a.k.a. Reichenbach’s maxim, a.k.a. Novalis’s maxim): “If your aim is causal understanding of the world, do not adopt a policy of attributing replicable orderliness of observations to a damn strange coincidence.” (p. 117)

The more precise your predictions are, the more money you accumulate and the more we are inclined to count even “near misses” in your favor. See the example above about predicting rainfall in April.

Why are things particularly bad for the social sciences?

Social science theories have a hard time convincing the reader that “if not for T, the statement ‘if O1, then O2’ is unlikely [implausible, nonsensical, whatever].” Basically, social science theories produce few “damn strange coincidences” and, therefore, generate little money in the bank.

Why do they generate so few damn strange coincidences?

Because of the crud factor. Duhduhduhhhhhhh…

Ball of yarn: this is your DAG on crud

The crud factor: “Everything is correlated with everything, more or less.” (p. 123) Meehl thinks that this is obvious to anyone with more than cursory experience with social science research. As an illustration, he talks about his experience with the Minnesota Multiphasic Personality Inventory (MMPI):

The main point is that, when the sample size is sufficiently large to produce accurate estimates of the population values, almost any pair of variables in psychology will be correlated to some extent. Thus, for instance, less than 10% of the items in the MMPI item pool were put into the pool with masculinity–femininity in mind, and the empirically derived Mf scale contains only some of those plus others put into the item pool for other reasons, or without any theoretical considerations. When one samples thousands of individuals, it turns out that only 43 of the 550 items (8%) fail to show a significant difference between males and females. […] when Lykken and I ran chi squares on all possible pairwise combinations of variables, 92% were significant, and 78% were significant at p < 10–6. Looked at another way, the median number of significant relationships between a given variable and all the others was 41 of a possible 44. One finds such oddities as a relationship between which kind of shop courses boys preferred in high school and which of several Lutheran synods they belonged to! (p. 124)

[…]

The crud factor is not a Type I error. […] The problem is methodological, not statistical: there are too many available and plausible explanations of an xy correlation, and, besides, these explanations are not all disjoint but can often collaborate. (p. 125)

You can find the long version of this in Meehl 1990 [2] (“6. Crud factor”).
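A minimal simulation of what the crud factor does to H0 testing (my own sketch, not Meehl’s data; the crud level of 0.08 and the number of variables are made up): give every pair of variables a small but real correlation, and the share of “significant” pairwise correlations climbs toward 100% as the sample grows.

```python
# Everything correlates with everything a little; with enough data, almost every
# pairwise H0 test comes out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k = 20          # number of variables
crud = 0.08     # small but real correlation between every pair

cov = np.full((k, k), crud)   # equicorrelated covariance matrix
np.fill_diagonal(cov, 1.0)

for n in (100, 1000, 10000):
    data = rng.multivariate_normal(np.zeros(k), cov, size=n)
    significant, pairs = 0, 0
    for i in range(k):
        for j in range(i + 1, k):
            _, p = stats.pearsonr(data[:, i], data[:, j])
            significant += p < 0.05
            pairs += 1
    print(f"n={n:6d}: {significant/pairs:.0%} of pairwise correlations significant at .05")
```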

So, what now?

A big chunk of the second half of the article is about how to measure the “track record” of a research program. Meehl admits that his proposed index is rough and tentative. I will ignore that entire topic and try to pull out the bits that might be interesting for social scientists.

A central recommendation by Meehl is to stop bothering with p-values and to focus instead on predictions and how far off they are.

Predictions are a bit of a bogeyman for social scientists. But Meehl highlights that saying “we expect children from poor households to have lower education” is a prediction (albeit a verbal and very broad one). It is equivalent to predicting “the [beta coefficient, correlation coefficient] between poverty and education is significantly different from zero and is negative.” Meehl suggests that one need not make uber-precise point predictions to make progress. He argues that making interval predictions or functional-form predictions and then (hopefully) finding empirical agreement from different measurement approaches would do a lot more than yet another study reporting some β with significance stars. You can find more details in sections “Appraising a Theory: Point and Interval Predictions” and “Appraising a Theory: Function-Form Predictions”.
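As a toy illustration of a prediction that is stronger than directional without being a point prediction (entirely my own example, with hypothetical data): suppose the theory predicts not only that the outcome rises with x but that it rises at a diminishing rate, i.e. a concave functional form.

```python
import numpy as np

# Hypothetical data: x could be years of schooling, y some outcome of interest
x = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
y = np.array([1.1, 1.9, 2.5, 2.9, 3.2, 3.4, 3.5, 3.55])

# The directional prediction only requires a positive slope.
slope = np.polyfit(x, y, 1)[0]

# The functional-form prediction additionally requires diminishing returns,
# i.e. a negative coefficient on the quadratic term.
quad_coef = np.polyfit(x, y, 2)[0]

print(f"slope > 0: {slope > 0}; concave (quadratic coefficient < 0): {quad_coef < 0}")
```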

How do we measure how far off we are? Here is Meehl:

The crucial thing is, I urge, not the standard error, or even (somewhat more helpful) the engineer’s familiar percentage error, but the size of the error in relationship to the Spielraum [range of potential values considered reasonable]. (p. 128)

First, a clarification. P-values are not a measure of how far off we are, at least not the way they’re used in 99% of social science articles. P-values (in null hypothesis significance tests) give you the probability of observing a coefficient (usually some correlation between y and x) at least as large in absolute value as the one in your sample if there were absolutely no relation between the two variables in the population. But nobody actually believes that last part. So the p-value is pretty much useless.

Meehl argues that we should not be interested in how big the error is in relation to the variance of the particular study/sample/context in which it was estimated, but in how big it is in relation to the range of values the parameter could plausibly take [he calls this “Spielraum” because he wants to show that he has read some Austrian logical positivists]. For example, if you measured a distance and you tell us that you were off by 1000 miles, our applause will depend a lot on whether you wanted to measure the distance between two cities, between the earth and the moon, or between the sun and Alpha Centauri. Even before doing any measurement, if your outcome measure goes from -1 to 1 and you predict that it should be > 0, that is a lot less impressive than if you predict that it should be between 0.1 and 0.3. Now imagine that you measure it to be 0.4. In the first scenario, you “succeeded.” In the second scenario, you “failed.” But Meehl thinks that the second theory would deserve a lot more attention than the first because it made a prediction that covered only 0.1 of the space of the outcome variable (as opposed to 0.5) and its error was only 0.05 of that space.
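A minimal sketch of this scoring idea, using the toy numbers from the paragraph above (the function names and the bookkeeping are mine, not Meehl’s):

```python
def relative_error(observed, predicted, spielraum):
    """Absolute error as a fraction of the Spielraum (range of reasonable values)."""
    lo, hi = spielraum
    return abs(observed - predicted) / (hi - lo)

def interval_score(observed, interval, spielraum):
    """Return (riskiness, miss): how much of the Spielraum the predicted interval
    covers, and how far outside it the observation fell, both as fractions of the
    Spielraum. A miss of 0 means the prediction 'hit'."""
    lo, hi = spielraum
    a, b = interval
    riskiness = (b - a) / (hi - lo)
    miss = max(a - observed, observed - b, 0) / (hi - lo)
    return riskiness, miss

spielraum = (-1.0, 1.0)                              # the outcome lives in [-1, 1]
print(interval_score(0.4, (0.0, 1.0), spielraum))    # directional theory: about (0.5, 0.0)
print(interval_score(0.4, (0.1, 0.3), spielraum))    # risky theory: about (0.1, 0.05)
```

The directional theory “hit”, but it had staked out half the Spielraum; the risky theory staked out a tenth of it and missed by only 0.05 of it.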

How to put this into practice? I don’t know. But here are some connections to existing work.

Define your estimand. The connections to the Lundberg et al. paper [3] should be pretty obvious. Figure 2 in this paper and figure 1 in Lundberg et al. express the same idea. Not that this really clarifies things. As we discussed last time, the practical implications of Lundberg et al. are not very clear.

Multiverse analyses. Multiverse analyses were developed, afaik, as a counter to p-hacking and HARKing. But you could also use them as a first step towards defining a plausible range for predictions or a partial estimate of the “crud factor” for some topic.

Regularized effect sizes. I wonder what Meehl thought of Cohen’s d and similar efforts. I didn’t google it and I think it could be cool to think through it before doing so. Since most of these use some measure of the variability in the sample, I am tempted to say that Meehl would consider them slightly beside the point. But he talks about the error, not the effect estimate. If we extend his point, I suspect he would insist on substantive over statistical significance and on the importance of finding some appropriate loss function (e.g. an effect of this size translates into real-world effects of this much life expectancy gained for this amount of investment).

Type S and M errors. Andrew Gelman and John Carlin [4] suggested that when assessing a corpus of studies in light of file drawer problems or replication efforts, or when doing ex ante power calculations for a study, we should not focus on Type I and II errors but on Type S and M errors. S means we get the sign wrong and M means we get the magnitude wrong.
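A minimal simulation of the idea (my own sketch, not Gelman and Carlin’s retrodesign code; the true effect and standard error are invented): for a noisy study, conditioning on statistical significance leaves a non-trivial chance of getting the sign wrong (Type S) and a large exaggeration of the magnitude (Type M).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.1
se = 0.5                                  # a noisy study
z_crit = stats.norm.ppf(0.975)

est = rng.normal(true_effect, se, size=100_000)     # sampling distribution of estimates
significant = np.abs(est) > z_crit * se             # which estimates reach p < .05

power = significant.mean()
type_s = (est[significant] < 0).mean()              # wrong sign among significant results
type_m = np.abs(est[significant]).mean() / true_effect   # expected exaggeration ratio

print(f"power = {power:.2f}, Type S rate = {type_s:.2f}, exaggeration ratio = {type_m:.1f}")
```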

Bonus: Is It Ever Correct to Use Null-Hypothesis Significance Tests?

Actually, yes, there are situations where NHSTs are appropriate. See p. 137. One of the examples: when we are only interested in evaluating a technique, like a new medication or fertilizer. If one shows a 7% “better” effect than another, we only want to know how likely that difference is to arise by chance. If it is unlikely to be chance, we adopt the better technique, never mind what we think of the motivating theory.
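A minimal sketch of that use case (my own numbers, hypothetical plot yields for two fertilizers): all we ask is whether the observed difference could plausibly be chance, and if not, we use the better one.

```python
from scipy import stats

# Hypothetical yields from plots treated with fertilizer A and fertilizer B
yield_a = [51.2, 48.9, 53.1, 50.4, 49.8, 52.6]
yield_b = [54.7, 53.9, 56.1, 52.8, 55.0, 54.3]

diff = sum(yield_b) / len(yield_b) - sum(yield_a) / len(yield_a)
_, p = stats.ttest_ind(yield_b, yield_a)

# If p is small, we adopt fertilizer B, never mind what we think of the theory
# that motivated the comparison.
print(f"mean difference = {diff:.1f}, p = {p:.3f}")
```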

References

[1] Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It
Meehl, P. E. (1990).
Psychological Inquiry, 1(2), 108-141.
https://doi.org/10.1207/s15327965pli0102_1

[2] Why summaries of research on psychological theories are often uninterpretable
Meehl, P. E. (1990).
Psychological Reports, 66(1), 195-244.
https://psycnet.apa.org/doi/10.2466/PR0.66.1.195-244

[3] What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory
Lundberg, I., Johnson, R., and Stewart, B. M. (2021).
American Sociological Review, 86(3), 532-565.
https://doi.org/10.1177/00031224211004187

[4] Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors
Gelman, A. and Carlin, J. (2014).
Perspectives on Psychological Science, 9(6), 641-651.
https://doi.org/10.1177/1745691614551642