Note: this was written as a submission for the Rethinking Capitalism course, which I took as part of the MPA programme at UCL IIPP. By the time I finished writing, I wanted to argue the opposite side, against RCTs in the social sciences, but the issue is so complex that I cannot take either stance wholeheartedly.
This year’s awarding of the (mislabeled) Nobel Prize in Economics to Abhijit Banerjee, Esther Duflo and Michael Kremer has once again put the use of Randomized Controlled Trials (henceforth RCTs) as a tool for policy-makers in the spotlight. In “The poverty of poor economics” (Chelwa and Muller, 2019), the authors leave us with a dire warning against using this practice in development economics, citing concerns about the ethical implications of RCTs, their colonialist and technocratic implications, and the lack of irrefutable evidence of their overall validity, usefulness and superiority as a policy tool.
I will argue that while there are many valid concerns regarding RCTs, the article mentioned above makes a series of straw-man arguments whilst not addressing other valid arguments, such as those of a purely epistemological nature. I will start by offering definitions of concepts such as RCTs and internal and external validity. I will then delve deeper into epistemological questions regarding RCTs. Then I will run through the main arguments of the article under discussion and assess their claims.
I will attempt to provide a more holistic approach, which leaves room for using RCTs along with other evidence-gathering and hypothesis-testing tools, and conclude by accepting that RCTs are neither silver bullets nor gold standards when it comes to proving that a policy intervention is useful.
a. Randomised Controlled Trials
Randomised controlled trials are experiments in which subjects in a sample population are randomly assigned to different groups, which receive different “treatments” or interventions, or no intervention at all (the latter group is known as the control group). The experimenters then look for a difference between the groups in a certain measure. The logic behind them is that because individuals are allocated at random, the groups are “exactly comparable” (Banerjee and Duflo, 2011, p. 14), so we can deduce that any observed difference in the measure of interest after the intervention is the result of that intervention.
They were pioneered in biomedical research, where they have become the “gold standard” for clinical trials (“Randomized controlled trial,” 2019; Cartwright, 2007).
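The logic of random assignment described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `run_trial`, the hidden trait, the sample size and the true effect of 2.0 are all invented for illustration and do not come from any study cited here. It shows that when a coin flip decides who is treated, an unobserved trait ends up almost equally distributed across both groups, so the difference in average outcomes lands close to the effect that was built in.

```python
import random
from statistics import mean


def run_trial(n=20000, true_effect=2.0, seed=0):
    """Simulate one hypothetical RCT with an unobserved trait that drives outcomes."""
    rng = random.Random(seed)
    treated, control = [], []
    hidden_treated, hidden_control = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)             # unobserved trait (assumed), e.g. motivation
        gets_treatment = rng.random() < 0.5  # random assignment: a fair coin flip
        outcome = hidden + rng.gauss(0, 1)   # outcome without any intervention
        if gets_treatment:
            treated.append(outcome + true_effect)
            hidden_treated.append(hidden)
        else:
            control.append(outcome)
            hidden_control.append(hidden)
    return {
        # Difference in mean outcomes: the usual RCT estimate of the effect.
        "estimate": mean(treated) - mean(control),
        # Difference in the hidden trait: should be near zero under randomisation.
        "hidden_gap": mean(hidden_treated) - mean(hidden_control),
    }


result = run_trial()
print(result)
```

With a sample this large, the estimate should sit close to the assumed effect and the gap in the hidden trait close to zero. With small samples the balance is only approximate, which is one reason why a single trial is weaker evidence than the method's reputation suggests.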
RCTs are not the only type of field experiment, but they have gained popularity recently, predominantly because of their alleged strength in increasing the validity of an experiment. There are two main types of validity: internal and external.
b. Internal Validity vs External Validity
In “The Hidden Half”, Michael Blastland gives this brief distinction between internal and external validity:
Internal validity means we think we’re on solid ground within the original terms of a study in its original context. External validity means broadly that this knowledge travels to other contexts.
(Blastland, 2019, p. 101)
He then follows up with “Research is externally valid if its findings apply outside its own setting”, which he calls “generalizing from here to there”.
More formally speaking, a study is internally valid if the data gathered in the experiment can be shown to establish a causal relationship between an intervention and an observed effect. As Nancy Cartwright points out in “Are RCTs the Gold Standard?” (Cartwright, 2007), RCTs are (ideally) high in internal validity (as a result of randomisation), but this comes as a trade-off with external validity, because of the “methodological rigours” that RCTs impose on the trial population.
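The gap between the two kinds of validity can be sketched in a few lines of code. Everything here is an assumption made for illustration: the function `estimate_effect`, the two sites, and the rule that the effect equals 1.0 plus twice a local "context" variable are hypothetical. Each simulated trial is internally valid, because assignment is random, yet the answer obtained at one site does not carry over to the other, since the contexts differ.

```python
import random
from statistics import mean


def estimate_effect(context_mean, n=20000, seed=0):
    """Run one hypothetical randomised trial at a site with a given average context."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        context = rng.gauss(context_mean, 1)  # local condition that moderates the effect
        effect = 1.0 + 2.0 * context          # assumed: the effect depends on context
        baseline = rng.gauss(0, 1)
        if rng.random() < 0.5:                # random assignment
            treated.append(baseline + effect)
        else:
            control.append(baseline)
    return mean(treated) - mean(control)


site_a = estimate_effect(context_mean=0.0, seed=1)  # average effect built in: 1.0
site_b = estimate_effect(context_mean=1.0, seed=2)  # average effect built in: 3.0
print(f"site A estimate: {site_a:.2f}")
print(f"site B estimate: {site_b:.2f}")
```

Both estimates are trustworthy for their own site, but a policy-maker at site B who relied on the site A result would badly underestimate the effect. Nothing in the randomisation itself warns of this.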
III. What can we actually know from RCTs?
Formally, the results of a given RCT are only true for the population enrolled in the study (Cartwright, 2007). The method does not give any assurances regarding external validity. This is repeated in Chelwa and Muller’s article, which adds that “(our rough guess is more than 90%) [of RCTs] have no formal basis for generalization”. The article they link to (Muller, 2015) does not provide any mathematical basis for that estimate.
Cartwright contrasts RCTs and other clinching methods of proving causal claims (i.e. deductive methods “that clinch the conclusion but are narrow in their range of application”) with ethnographic methods and expert judgment, showing that the main advantage of the former is the ease with which we can determine when the results of a study are reliable evidence of the effect we’re looking for (“when the background assumptions are met”). The downside of these methods is that they fall prey to the “weakest link principle”: if one premise in your reasoning is uncertain, the whole edifice of certainty crumbles. “Suppose you have 10 premises, 9 of them almost certain, one dicey. Your conclusion is highly insecure, not 90% probable.” Or, as Blastland (2019, p. 114) puts it: “Half-reliable is simply not reliable, like a partner who tells you he or she is 50% faithful: it is a contradiction in terms.”
Even critics accept that this is not exclusive to RCTs, or to social sciences or policy-making, and that “Most of our knowledge claims, even in our securest branches of science, rest on far more premises than we like to imagine, and are far shakier. This recommends a dramatic degree of epistemic modesty.” (Cartwright, 2011)
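The weakest-link principle quoted above can be made concrete with some simple arithmetic. The credences below (0.99 for each "almost certain" premise, 0.5 for the "dicey" one) are numbers picked purely for illustration, and the calculation assumes that the premises are independent and that the argument needs all of them. Under those assumptions, the support the argument lends its conclusion is the product of the premise probabilities, not their average.

```python
import math

# Hypothetical credences: nine premises "almost certain", one "dicey".
premises = [0.99] * 9 + [0.5]

# Naive reading: nine out of ten premises are solid, so the argument looks strong.
average = sum(premises) / len(premises)

# Weakest-link reading: a deductive argument needs every premise to hold, so
# (assuming independence) its support is the joint probability of all of them.
joint = math.prod(premises)

print(f"average premise credence: {average:.3f}")
print(f"joint probability:        {joint:.3f}")
```

The average comes out above 0.9, while the joint probability falls below one half, lower than the weakest single premise. That is the sense in which the conclusion is "highly insecure, not 90% probable".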
IV. What is (allegedly) wrong with RCTs and does it matter?
Chelwa and Muller’s main arguments against RCTs are the following (paraphrasing):
- The usefulness and credibility of RCTs are unfounded.
- “They inform a missionary complex” – I understand this as a mix of colonialist and technocratic tendencies in the thinking of the proponents of RCTs.
- The policies that spring forth from RCTs are no better than the alternatives.
- Summing up, they pose a threat to the progress of developing countries.
I will attempt to refute each of these claims or show why they are irrelevant.
a. Are RCTs useful and credible?
What makes an experiment, or a whole class of experiments, useful or credible?
I would posit that any experiment that is internally valid and has clearly defined and acknowledged limitations is useful, in the sense that it provides information, albeit for a very specific setting (as discussed above, strictly speaking an experiment informs only on the effects of an intervention on the population that was part of it).
Banerjee and Duflo are highly conscious of the need for internal validity. In “Poor Economics” (Banerjee and Duflo, 2011), the book that popularized their efforts and that Chelwa and Muller’s article references, the authors insist on it. Discussing an experiment on how best to get people in a malaria-infested region to use insecticide-treated bed nets, they say:
“To answer these questions, we would need to observe the behavior of comparable groups of people facing different levels of subsidy. The key word here is "comparable".”
(Banerjee and Duflo, 2011, p. 7)
Complete comparability when it comes to people is impossible, but randomly assigning people to treatment and control groups is the easiest way to approach it. Of course, it is hard to know whether our distribution or our sampling is completely random if we don’t know all the variables that can influence our experiment. I fully acknowledge this limitation, and can offer no solution to it other than iterative experimentation.
Regardless, as we’ve seen above, the internal validity of RCTs is acknowledged as high even by detractors such as Cartwright and Muller. Thus, if we consider any internally valid study to be (at least marginally) useful, we can say that RCTs are useful.
Credibility is ultimately subjective, but I would posit that when it comes to scientific studies, it stems from peer review and, most importantly, from external validity proven by replication studies.
This is, in my opinion, the first straw-man argument that Chelwa and Muller make: if credibility is the concern, and credibility is derived from external validity, they would have a point only if proponents of RCTs were arguing that one perfectly executed RCT is all the evidence needed to advise on a policy in all developing countries.
That is not the case. Once again quoting from “Poor Economics”:
A single experiment does not provide a final answer on whether a program would universally "work". But we can conduct a series of experiments, differing in either the kind of location in which they are conducted or the exact intervention being tested (or both). This allows us to both verify the robustness of our conclusions (Does what works in Kenya also work in Madagascar?) and narrow the set of possible theories that can explain the data. […] The new theory can help us make sense of previous results that may have been puzzling before.
(Banerjee and Duflo, 2011, p. 14)
This passage seems to me to show a high degree of concern with both internal and external validity. Of course, repeating experiments is only informative with regard to external validity if the heterogeneity of the factors involved in the experiment is in observables (Martin, 2009).
In conclusion, if utility is given by internal validity and credibility by external validity (one could make the case that it’s the other way around, but it does not matter), then I think RCTs are indeed both useful and credible, if marginally so.
b. The Missing Missionary Complex
The hollowest accusation in the article is the claim that RCTs “inform a missionary complex”. I have interpreted this to mean a mix of colonialist-paternalistic and technocratic attitudes.
I cannot speak to the true intentions, moral fabric or real-life attitudes of the Nobel Prize winners or other proponents of RCTs, but judging from their articles and their book “Poor Economics”, they seem intellectually humble and not at all morally judgmental: they do not make sweeping claims about human nature, nor do they claim to hold all the answers.
Talking about the limits of their book, “Poor Economics”, they say:
“This book will not tell you whether aid is good or bad but it will say whether particular instances of aid did some good or not. We cannot pronounce on the efficacy of democracy, but we do have something to say about whether democracy could be made more effective in rural Indonesia by changing the way it is organized on the ground and so on.”
In the conclusion section of the book, they also offer this:
“[…] we are very far from knowing everything we can and need to know. This book is, in a sense, just an invitation to look more closely. If we resist the kind of lazy, formulaic thinking that reduces every problem to the same set of general principles; if we listen to poor people themselves and force ourselves to understand the logic of their choices; if we accept the possibility of error and subject every idea, including the most apparently commonsensical ones, to rigorous empirical testing, then we will be able not only to construct a toolbox of effective policies but also to better understand why the poor live the way they do.”
I fail to see a missionary complex in this description of their method. It lacks the morally normative way of thinking about poverty that characterizes what we would call a missionary attitude. It seems accepting of people’s faults and is intellectually inquisitive, eschewing ideology.
c. Are RCT-based Policies Better?
With regard to Chelwa and Muller’s criticism that there is no evidence that RCT-based policies are better than the alternatives, I will admit that there is indeed mixed evidence on the efficacy of evidence-based policy, even in developed countries (Dumont, 2019). There also seems to be disappointingly little use of evidence in policy-making, and calls have been made to make evidence more readily available and to help build capacity within decision-making structures to absorb and interpret the available evidence (Oliver et al., 2014). I would argue, however, that this is par for the course, considering the relatively short time since the use of RCTs in policy was established. The Poverty Action Lab, the pre-eminent institution in evidence-based policy, was set up only in 2003 (“About us | The Abdul Latif Jameel Poverty Action Lab,” n.d.). If we think about the time horizons over which policy approaches mature, I don’t think we should expect evidence-based policy to have reached its zenith yet.
Finally, I would add that blaming the Nobel laureates or their methods for the lack of good policy, or of proof thereof, is similar to blaming Karl Marx for the horrors of the USSR and the Eastern Bloc, or Keynes for stagflation: they provided us with an intellectual tool and a world-view, but it is up to policy-makers themselves to use them as tools for good and to make sure that they are appropriate for their specific use case.
d. A Peril to the Progress of Developing Countries?
The warning that Chelwa and Muller leave us with is that RCTs can endanger the future of developing countries. I find this claim dubious at best. Obviously, blind faith in what is ultimately a data-gathering and reasoning method, such as RCTs, is dangerous, as is any type of Cargo Cult Science (Small, 2009) or ideology.
But RCTs can only become a peril if policy-makers and academics in developing countries let them. That is, if policy-makers absolve themselves of the responsibility of making what are ultimately political decisions, and if academics do not challenge the findings of studies and try to replicate them. If treated appropriately, as a tool for confirming policy hypotheses and nothing more, RCTs can help us come up with better policies, optimize existing policies or discredit bad ones.
V. Is there no alternative?
A purely empiricist and quantitative way of seeing the world is a limited world-view indeed. But I think even in the circles of randomistas, as the proponents of RCTs are sometimes known, you will be hard pressed to find many people who hold that world-view. “Poor Economics” makes the argument not only for RCTs, but also for qualitative evidence (Banerjee and Duflo, 2011, p. 15). The view that RCTs are a purely quantitative approach stems from the availability heuristic (Paluck, 2010).
Hybrid approaches involving both quantitative and qualitative aspects have been described in the literature (ibid), with the role of qualitative data being to uncover the causal explanations in what would otherwise be behavioural black boxes (“qualitative measurement within a field experiment leads to a better understanding of the causal effect, suggests plausible causal explanations, and uncovers new processes that are invisible from a distance” (ibid)), to deepen the quality of data collected in an RCT (by complementing quantitative data with qualitative data or eschewing quantitative data altogether), to help study program implementation, or to help develop Theories of Change and Logic Models (Qualitative Research Methods in Program Evaluation: Considerations for Federal Staff, 2016).
Collecting qualitative data on program effects might not only help gather more information, but it could also help popularise certain findings and policy proposals. As explained in “Experimental Ethnography”:
“They [the audience] may be skeptical that young mothers visited in their homes by registered nurses will be much less likely to abuse their children and that those children will become much less likely to commit crimes when they grow older. These conclusions are hard to accept when they are stated only in terms of numbers. Yet if the numbers are linked to stories about how these program effects actually happened with real human beings, an audience might be more likely to understand and accept the numerical conclusions.”
(Sherman and Strang, 2004)
Envisioning a future for evidence-based policy has led some to acknowledge the huge disconnect not just between research methods, but between researchers and policy-makers (Dumont, 2019; Tseng, 2014), and to argue for a more integrated approach in which policy-makers feed back into the academic results brought in as evidence. This fits in nicely with the qualitative approaches mentioned above and with the worry that focusing too much on narrow outcomes can obscure the “bigger picture” of the systems change that is needed to effect long-lasting and significant change in getting not just individuals, but communities and societies, out of poverty and misery. Other things to take into account in future interventions are spillover effects (Martin, 2018, 2009), which RCTs are notoriously bad at capturing.
Empiricism and quantitative data have helped debunk some of the theoretical models of neoclassical economics, through the work of behavioural economists (Thaler, 2016). You cannot explain the world entirely through these kinds of methods, just as you cannot entirely describe it in theoretical models. RCTs are by no means a holy grail when it comes to informing policy, whether we are talking about development economics or health policy. But I don’t think anyone in their right mind thinks they are (this is another straw-man argument in the article). It would be foolish to rely on RCTs exclusively, but it would be, in my opinion, more foolish to discard them completely. There is so much we just do not know about the world and why people, both individually and in aggregate, behave the way they do that it would seem foolish to rage against what is ultimately a pretty clever way of determining a course of action.
Policy making is not an exact science. It is ultimately a fundamentally moral and political endeavour. But it should not be unscientific or anti-scientific. Evidence should play a role in determining the best course of action for a society, lest we insist on broken policies for the sake of ideology. We have to be careful not to allow the tail to wag the dog, though. As Michael Blastland elegantly puts it: “Evidence cannot supplant normative values; it is a means, not an end.” (Blastland, 2019, p. 232). So let’s use all available means at our disposal and cross-check everything, but let’s not fall prey to the idea that we only have a hammer and everything is a nail.
About us | The Abdul Latif Jameel Poverty Action Lab [WWW Document], n.d. URL https://www.povertyactionlab.org/about-j-pal (accessed 1.5.20).
Banerjee, A.V., Duflo, E., 2011. Poor economics: a radical rethinking of the way to fight global poverty, 1st ed. PublicAffairs, New York.
Blastland, M., 2019. The hidden half: how the world conceals its secrets. Atlantic Books.
Cartwright, N., 2011. Predicting what will happen when we act. What counts for warrant? Preventive Medicine 53, 221–224. https://doi.org/10.1016/j.ypmed.2011.08.011
Cartwright, N., 2007. Are RCTs the Gold Standard? BioSocieties 2, 11–20. https://doi.org/10.1017/S1745855207005029
Chelwa, G., Muller, S., 2019. The poverty of poor economics [WWW Document]. Africa is a Country. URL https://africasacountry.com/2019/10/the-poverty-of-poor-economics (accessed 1.2.20).
Dumont, K., 2019. Reframing Evidence-Based Policy to Align with the Evidence. William T. Grant Foundation.
Martin, R., 2018. Should the Randomistas (Continue to) Rule? - Working Paper 492 [WWW Document]. Center For Global Development. URL https://www.cgdev.org/publication/should-randomistas-continue-rule (accessed 1.5.20).
Martin, R., 2009. Should the Randomistas Rule? The Economists’ Voice 6, 1–5.
Muller, S.M., 2015. Causal Interaction and External Validity: Obstacles to the Policy Relevance of Randomized Evaluations. World Bank Econ Rev 29, S217–S225. https://doi.org/10.1093/wber/lhv027
Oliver, K., Innvar, S., Lorenc, T., Woodman, J., Thomas, J., 2014. A systematic review of barriers to and facilitators of the use of evidence by policymakers. BMC Health Services Research 14, 2. https://doi.org/10.1186/1472-6963-14-2
Paluck, E.L., 2010. The Promising Integration of Qualitative Methods and Field Experiments. The Annals of the American Academy of Political and Social Science 628, 59–71.
Qualitative Research Methods in Program Evaluation: Considerations for Federal Staff, 2016. Office of Data, Analysis, Research & Evaluation, Administration on Children, Youth & Families.
Randomized controlled trial, 2019. Wikipedia.
Sherman, L.W., Strang, H., 2004. Experimental Ethnography: The Marriage of Qualitative and Quantitative Research. The Annals of the American Academy of Political and Social Science 595, 204–222.
Small, M.L., 2009. `How many cases do I need?’: On science and the logic of case selection in field-based research. Ethnography 10, 5–38. https://doi.org/10.1177/1466138108099586
Thaler, R.H., 2016. Misbehaving: the making of behavioural economics, An Allen Lane book. Penguin Books, London.
Tseng, V., 2014. What Might Evidence-based Policy 2.0 Look Like? [WWW Document]. William T. Grant Foundation. URL http://wtgrantfoundation.org/what-might-evidence-based-policy-2-0-look-like (accessed 1.3.20).