How to “hack” the sample size requirements (statistical power of RCTs)?
As we discussed in the previous chapter, collecting data can be challenging for financial, ethical, or time-related reasons. In addition, researchers sometimes want to answer questions that concern a smaller population, which naturally limits the size of the dataset. This chapter will discuss what sample size entails in the context of randomized controlled trials and experimental power, and then explore different randomization techniques for ensuring covariate balance with a small sample size.
As stated in Rubin’s causal model, the fundamental problem of causal inference is that we cannot observe the alternative scenarios simultaneously in real life (Rubin, 1974). In other words, we cannot observe the outcomes of the same individual both with and without the treatment to calculate the true treatment effect. Therefore, in randomized experiments with pure randomization, researchers randomly assign each unit to a control or a treatment group and infer causality by comparing the groups’ outcomes. If the sample size is large enough, the control group is expected to be similar to the treatment group on average. For an unbiased estimate of the average treatment effect, the comparison groups need to be balanced on both observed and unobserved pretreatment characteristics. To infer causality from a comparison of the control and treatment groups, researchers therefore need a sample large enough that randomization produces comparable covariate distributions across groups. The main concern with a small sample size is that randomization might not grant a balanced distribution of covariates for the treatment and control groups.
Researchers can employ hypothesis tests such as Fisher’s exact test to show that their results are not due to mere chance but to the causal mechanism they are investigating.
Fisher’s exact test
Fisher’s exact test is one method to test statistical significance with binary outcomes and smaller sample sizes. It tests the null hypothesis of no treatment effect: the p-value it produces shows how likely results at least as extreme as the observed ones would be under random assignment alone. If, based on the p-value, the researchers fail to reject the null hypothesis, they cannot rule out that the observed average effect size is due to mere chance rather than a causal mechanism (Fisher, 2010).
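To make the mechanics concrete, here is a minimal, standard-library-only sketch of the test on a hypothetical small trial (the 2x2 counts are invented for illustration): it enumerates all tables with the observed margins and sums the probabilities of those no more likely than the observed one under the hypergeometric null.

```python
# A stdlib-only sketch of Fisher's exact test for a hypothetical small RCT
# with binary outcomes; the 2x2 counts below are invented.
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]]: sum the
    probabilities of all tables with the same margins that are no more
    likely than the observed one under the hypergeometric null."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, row1)
    def prob(k):  # P(top-left cell = k) given fixed margins
        return comb(col1, k) * comb(n - col1, row1 - k) / denom
    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

# Treatment: 7 of 8 units recovered; control: 2 of 8 recovered.
p = fisher_exact_p(7, 1, 2, 6)
print(f"two-sided p-value = {p:.4f}")  # → 0.0406
```

Even with only 16 units, the exact test can reject the null at the 5% level here; in practice `scipy.stats.fisher_exact` provides the same computation.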
The random assignment of units to treatment and control groups can be achieved through different randomization methodologies, such as pure randomization, stratification, pairwise matching, and rerandomization, to mitigate possible imbalances on observable covariates. Depending on the experiment design and the available baseline data, employing the right randomization methodology can improve post-randomization covariate balance and experimental power. In the rest of this section, I will focus on how different randomization techniques can improve covariate balance and power for experiments with small sample sizes.
Pure randomization
Pure randomization assigns each unit to a treatment or a control group completely at random. Relying on the central limit theorem, researchers use pure randomization when they have a large sample size; it is therefore not well suited to small sample sizes (Suresh, 2011).
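A quick simulation suggests why pure randomization worries small-sample designs: with only ten units split at random, a baseline covariate can easily end up far from balanced in any single draw. The ages below are invented for illustration.

```python
# A sketch of a single pure randomization of 10 units; the covariate
# (age) is invented. Any one draw can leave the groups quite unbalanced.
import random

random.seed(3)
ages = [22, 25, 31, 34, 40, 47, 52, 58, 63, 70]   # invented baseline covariate
ids = list(range(10))
random.shuffle(ids)
treat, control = ids[:5], ids[5:]

gap = (sum(ages[i] for i in treat) / 5) - (sum(ages[i] for i in control) / 5)
print(f"difference in mean age between groups: {gap:+.1f} years")
```

Repeating this draw many times would show that large mean-age gaps occur with non-trivial probability at this sample size, which is exactly the concern the remaining methodologies address.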
Rerandomization
R. A. Fisher criticized purely randomized experiments by noting that “[m]ost experimenters on carrying out a random assignment of plots will be shocked to find how far from equally the plots distribute themselves” (Fisher, 1992). Even though Fisher strongly argued against the systematic distribution or balanced assignment of the treatment, in his conversations with L. J. Savage he said that he would redraw the treatment assignment if it was obvious from the baseline data that the first randomization did not grant a balanced distribution (Fienberg & Hinkley, 2012). This is the idea behind rerandomization as a methodology: check the covariates after the first randomization and, if the baseline covariates are unbalanced, rerandomize the units. Rerandomization can be employed if the covariate data are available before the experiment has started (Morgan, 2011). Later on, Morgan and Rubin discussed how rerandomization can be used to improve covariate balance while preserving the benefits of randomization for inferring causality (Morgan & Rubin, 2012). They used Figure 1 to illustrate the procedure for implementing rerandomization. The first two steps of the six-step procedure are the most crucial for deciding whether rerandomization is the appropriate methodology in a given scenario. First, pre-treatment covariate data must be available for all the units to evaluate the covariate balance of each randomization. Second, to prevent p-hacking, the researchers should predetermine the balance criteria for accepting or rejecting a randomization. Kempthorne and Tukey similarly suggested specifying an acceptable set of randomizations and choosing randomly from that set to mitigate selection bias (Morgan & Rubin, 2012). Lastly, for academic integrity and transparency, the researchers should report the reasons for discarded randomizations (Rubin, 2008).
There are additional considerations for rerandomizing with a small sample size because, as the sample size decreases, the number of possible randomizations decreases (Morgan, 2011). Therefore, the predetermined balance criteria should be flexible and decided based on the empirical distribution of Mahalanobis distances. The Mahalanobis distance is a scalar measure of multivariate covariate balance. Its distribution can be estimated by simulating numerous randomizations and calculating the Mahalanobis distance for each one. The threshold for the balance criteria can then be set at a chosen percentile of this distribution (Morgan, 2011).
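The percentile-based procedure above can be sketched as follows. The code uses the Morgan and Rubin balance measure M = (n_t · n_c / n) · d′S⁻¹d, where d is the difference in covariate means and S the sample covariance of the covariates; the covariate data, the 20th-percentile cutoff, and the number of simulated draws are all assumptions for illustration.

```python
# A sketch of setting a rerandomization threshold from the empirical
# distribution of Mahalanobis distances; all data below are simulated.
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3                       # 30 units, 3 baseline covariates
X = rng.normal(size=(n, k))        # stand-in for observed pre-treatment data
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(assign):
    """Morgan & Rubin's M for a 0/1 assignment vector."""
    d = X[assign == 1].mean(axis=0) - X[assign == 0].mean(axis=0)
    n_t = assign.sum()
    return (n_t * (n - n_t) / n) * d @ S_inv @ d

def draw():                        # equal-sized groups, drawn at random
    a = np.zeros(n, dtype=int)
    a[rng.choice(n, size=n // 2, replace=False)] = 1
    return a

# Build the empirical distribution of M, then accept the first draw that
# falls below an assumed cutoff (the 20th percentile here).
dists = np.array([mahalanobis(draw()) for _ in range(2000)])
threshold = np.percentile(dists, 20)

assign = draw()
while mahalanobis(assign) > threshold:   # rerandomize until balanced enough
    assign = draw()
print(f"threshold = {threshold:.2f}, accepted M = {mahalanobis(assign):.2f}")
```

Because the cutoff is a percentile of the simulated distribution rather than a fixed number, the same procedure adapts automatically to the smaller set of feasible randomizations in a small sample.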
Another critique of rerandomization is that Fisher’s exact test does not account for it, so its results cannot be interpreted directly for hypothesis testing. However, a full permutation test that incorporates only the accepted randomizations is a feasible alternative for hypothesis testing (Morgan, & Rubin, 2012). Lastly, during the analysis, the variables in the balance criteria should be controlled for, because the treatment assignment is conditional on these variables (Bruhn, & McKenzie, 2009).
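A sketch of such randomization inference: the reference distribution of the test statistic is built only from assignments that satisfy the same balance criterion used in the design. The data, the simple one-covariate balance rule, and the simulated treatment effect are all assumptions for illustration.

```python
# A sketch of a permutation test restricted to accepted randomizations;
# data, balance rule, and effect size below are simulated/assumed.
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.normal(size=n)                      # one baseline covariate

def balanced(a, tol=0.25):                  # assumed balance criterion
    return abs(x[a == 1].mean() - x[a == 0].mean()) < tol

def draw():
    a = np.zeros(n, dtype=int)
    a[rng.choice(n, size=n // 2, replace=False)] = 1
    return a

# Design stage: rerandomize until the criterion holds.
assign = draw()
while not balanced(assign):
    assign = draw()

# Simulated outcomes with a true treatment effect of 1.
y = x + assign * 1.0 + rng.normal(scale=0.5, size=n)
observed = y[assign == 1].mean() - y[assign == 0].mean()

# Analysis stage: permute only over assignments passing the same criterion.
null_stats = []
while len(null_stats) < 1000:
    a = draw()
    if balanced(a):
        null_stats.append(y[a == 1].mean() - y[a == 0].mean())
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"difference in means = {observed:.2f}, p-value = {p_value:.3f}")
```

Permuting over all assignments, including unbalanced ones the design would have rejected, would give an incorrect reference distribution; filtering with `balanced` keeps the analysis consistent with the design.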
Stratification
Stratification is a randomization methodology that mitigates covariate imbalance by stratifying on observed baseline characteristics such as gender, age group, or location. Within each stratum formed by the predetermined characteristics, the experimenter randomly assigns the units to the control or treatment group. When deciding which variables to stratify on, researchers should consider the variables that directly affect the outcome of interest; otherwise, the experiment will be underpowered, meaning it will lack the statistical power to detect the treatment effect (Glennerster, & Takavarasha, 2013). Also, if there is a subgroup analysis that is valuable to investigate, it should be considered when choosing the variables to stratify on (Heard et al., 2017). The number of variables used to form strata cannot be too large, because each additional stratification variable shrinks the number of units in each stratum. Therefore, studies generally use up to four stratification variables. According to a survey conducted among 25 researchers affiliated with the World Bank, the Abdul Latif Jameel Poverty Action Lab, or the Bureau for Research and Economic Analysis of Development (BREAD), 14 of the 15 most recent RCTs used stratification. Among these RCTs, “six used only one variable, typically geographic location; four used two variables (e.g., location and gender); and four used four variables” (Bruhn, & McKenzie, 2009).
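The within-stratum assignment step can be sketched in a few lines; the units, the two stratification variables (gender and location), and the 50/50 split are invented for illustration.

```python
# A minimal sketch of stratified randomization: group units by observed
# baseline characteristics, then randomize within each stratum.
# The units and strata below are invented.
import random
from collections import defaultdict

random.seed(42)
units = [{"id": i, "gender": g, "location": loc}
         for i, (g, loc) in enumerate(
             [("F", "urban"), ("F", "urban"), ("M", "urban"), ("M", "urban"),
              ("F", "rural"), ("F", "rural"), ("M", "rural"), ("M", "rural")])]

strata = defaultdict(list)
for u in units:
    strata[(u["gender"], u["location"])].append(u)

assignment = {}
for members in strata.values():
    random.shuffle(members)
    half = len(members) // 2              # split each stratum 50/50
    for u in members[:half]:
        assignment[u["id"]] = "treatment"
    for u in members[half:]:
        assignment[u["id"]] = "control"
print(assignment)
```

Because assignment happens inside each (gender, location) cell, both groups are guaranteed to contain the same mix of genders and locations, whatever the random draw.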
On another note, studies generally do not specify the number of strata used or how they stratified on continuous variables. “For example, Banerjee et al. (2007) write ‘assignment was stratified by language, pretest score, and gender’. In this case pre-test score is continuous, and it is not clear how it was discretized for stratification purposes” (Bruhn, & McKenzie, 2009). Since the number of stratified variables matters for calculating the standard error of the estimated treatment effect, it is crucial to share this information for criticism, replication, and internal validity. However, only 1 out of 14 studies stated that stratification was taken into account while calculating standard errors (Bruhn, & McKenzie, 2009).
Pairwise matching
In contrast to stratification, pairwise matching can be used to improve the balance on numerous covariates simultaneously. Pairwise matching is a special case of stratified randomized experiments in which each stratum contains only two units: one is assigned to the treatment group and the other to the control group. Pairs are matched on the covariates so that the variation in their potential outcomes will be close to zero (Imbens, & Rubin, 2015). In other words, by matching on these variables, researchers try to find the units that are most similar to each other on covariates and outcome variables so that the true treatment effect can be estimated. Pairs can be matched using different methodologies, such as optimal multivariate matching, a greedy algorithm, or even matching by hand, with the aim of minimizing the Mahalanobis distance across all the chosen covariates (Bruhn, & McKenzie, 2009).
However, it should be acknowledged that as the number of covariates to match on increases, the time required to finalize the computation increases too. This can be a real disadvantage when there is limited computational capacity and a tight time frame for moving on to the other stages of the RCT. On the other hand, its statistical analysis is simpler, and it often achieves better covariate balance than rerandomization (Bruhn, & McKenzie, 2009). Since all the variables matched on should be controlled for during the statistical analysis, there is a trade-off between matching on more variables and preserving degrees of freedom.
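A greedy version of the matching step can be sketched as follows: repeatedly pair the two most similar remaining units by Mahalanobis distance, then flip a coin within each pair. The covariate data are simulated, and a greedy pass is only an approximation to optimal multivariate matching.

```python
# A sketch of greedy pairwise matching on covariates, followed by
# randomization within each pair; all data below are simulated.
import numpy as np

rng = np.random.default_rng(7)
n, k = 10, 2
X = rng.normal(size=(n, k))                 # stand-in baseline covariates
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def dist(i, j):                             # squared Mahalanobis distance
    d = X[i] - X[j]
    return float(d @ S_inv @ d)

unmatched = list(range(n))
pairs = []
while unmatched:
    # Pick the globally closest remaining pair.
    i, j = min(((a, b) for a in unmatched for b in unmatched if a < b),
               key=lambda p: dist(*p))
    pairs.append((i, j))
    unmatched.remove(i)
    unmatched.remove(j)

assignment = {}
for i, j in pairs:                          # coin flip within each pair
    t, c = (i, j) if rng.random() < 0.5 else (j, i)
    assignment[t], assignment[c] = "treatment", "control"
print(pairs, assignment)
```

The pairwise loop costs O(n³) here, which illustrates in miniature why computation time grows quickly as samples and covariate sets get larger; optimal matching algorithms trade extra implementation complexity for better pairings.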
Comparison of different randomization methodologies
Bruhn and McKenzie (2009) use four panel datasets to simulate four different randomization methodologies (pure randomization, rerandomization, stratification, and pairwise randomization) and see how they perform with small sample sizes. To test different sample sizes, they draw subsamples of 30, 100, and 300 from the panel data. To investigate the covariate balance performance of the different methodologies, they compare the baseline means of the outcome variable. Since all the baseline differences are close to zero, they conclude that all methods lead to covariate balance on average.
Overall, based on their simulation results, the method of randomization gains importance as the sample size gets smaller (30 or 100 observations). They concluded that pairwise randomization outperformed stratification and rerandomization in balancing small samples, and that stratification and rerandomization in turn performed better than pure randomization. However, when they simulated with 300 observations, they concluded that the method of randomization is less crucial for balancing the comparison groups. Even so, pairwise randomization still improves the balance on observed variables; thus, given the current evidence, pairwise randomization seems to be the best available methodology for minimizing imbalances in observed covariates.
Resources
Bruhn, M., & McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1(4), 200–232.
Fienberg, S. E., & Hinkley, D. V. (Eds.). (2012). R. A. Fisher: An appreciation (Vol. 1). Springer Science & Business Media.
Fisher, R. A. (1992). The arrangement of field experiments. In Breakthroughs in statistics (pp. 82–91). Springer, New York, NY.
Glennerster, R., & Takavarasha, K. (2013). Running randomized evaluations: A practical guide. Princeton University Press.
Heard, K., O’Toole, E., Naimpally, R., & Bressler, L. (2017). Real world challenges to randomization and their solutions (pp. 16–19). J-PAL North America, Cambridge. Retrieved from https://www.povertyactionlab.org/sites/default/files/research-resources/2017.04.14-Real-World-Challenges-to-Randomization-and-Their-Solutions.pdf
Imbens, G., & Rubin, D. (2015). Pairwise randomized experiments. In Causal inference for statistics, social, and biomedical sciences: An introduction (pp. 219–239). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139025751.011
Morgan, K. L. (2011). Rerandomization to improve covariate balance in randomized experiments. Retrieved from http://www2.stat.duke.edu/~kfl5/Lock2011.pdf
Morgan, K. L., & Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2), 1263–1282. Retrieved from https://projecteuclid.org/journals/annals-of-statistics/volume-40/issue-2/Rerandomization-to-improve-covariate-balance-in-experiments/10.1214/12-AOS1008.full
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701.
Rubin, D. B. (2008). Comment: The design and analysis of gold standard randomized experiments. Journal of the American Statistical Association, 103(484), 1350–1353.
Suresh, K. P. (2011). An overview of randomization techniques: An unbiased assessment of outcome in clinical research. Journal of Human Reproductive Sciences, 4(1), 8.