Administrative Data for Randomized Experiments
During the interviews, the common concerns raised regarding budget were related to the cost of data collection and expert time. Therefore, the first section of this series of articles will focus on some of the tools and methodologies that can be used to lower the cost of data collection without lowering the quality of the data.
The Current Landscape of Administrative Data
Data collection is a significant cost when conducting randomized control trials (RCTs). Most of the time, the data required to answer the question at hand are not available, whether because of sample size, frequency, missing variables, or other gaps (Cole et al., 2020). The list goes on, since each case has its own constraints. However, some data collection methodologies, such as administrative data and phone surveys, can be used more effectively to make RCTs faster and more frugal.
With the increased integration of technology into our daily lives, the scale of data collection has changed dramatically. From GPS coordinates on Foursquare to tweets on Twitter, people are creating data every second. Administrative data is any type of data collected by the government or any other third-party organization as a byproduct of their service (Bjärkefur, 2021). As governments and other parties have adopted more electronic systems to document and track beneficiaries, the variety and quality of administrative data have increased drastically. E-health records, electronic ticketing for transportation, and school enrollment records are some examples of administrative data that can be used for research purposes.
Previously, in-person data collection was the primary way of compiling data for impact evaluation. However, large-scale surveys generally require a higher investment in human resources, intricate sampling designs, and additional methods to raise response and compliance rates. In contrast, administrative data are collected at larger scales as a byproduct of a service provided by the government or third parties (Cole et al., 2020). Decision-makers in governments and companies can use these data sets for research purposes. Various projects have already used administrative data for randomized experiments; several are discussed later in the case studies to illustrate possible applications. Another advantage of administrative data is that they tend to have higher frequency and wide geographical coverage. The higher frequency makes it easier to repeat an analysis routinely at a lower cost (Trias, 2017). Lastly, the use of administrative data for policy research has been expanding over the last decade, both in economics journals and in World Bank Development Impact Evaluation (DIME) publications. This upward trend can be seen in Figures 1 and 2. Even though the graphs are not specific to randomized experiments, the databases are likely to become more accessible as the use of administrative data in impact evaluation increases.
This technological advancement can be turned into an advantage for evidence-based policymaking, because data this nuanced have never been collected or examined before. Since these data are collected regardless of the research question under discussion, researchers do not have to spend resources on data collection. However, administrative data has its own costs: convincing the data holder to share the data, meeting the required ethical guidelines (depersonalization of data, etc.), and adapting the existing data to the current problem. Still, if there is a framework that makes acquiring and adapting the data smoother, administrative data can reduce the cost of RCTs massively by lowering the cost of data collection.
Administrative data have been used for observational studies, but they can also be used in randomized experiments. Three main things can be done to advocate for the use of administrative data in randomized experiments:
- Understanding the value of administrative data is required to incorporate it more effectively into impact evaluation. Raising awareness is highly important to break the stigma around using administrative data instead of survey data for faster and more frugal results. In addition, collaborating with administrative data holders helps researchers identify which questions need to be answered and which interventions can be evaluated using administrative data. For example, since there was no comprehensive database on patient safety in Kenya, the World Health Organization partnered with the Kenyan Ministry of Health to conduct a situation analysis in 2008 (World Health Organization, 2014). In 2010, the WHO outlined the strategic plan for patient safety (World Health Organization, 2010). Later, the Kenyan Ministry of Health partnered with the World Bank’s DIME team to run a randomized experiment on how inspections can be used to improve patient safety. Based on the evidence that inspections increase patient safety by 15 percent, the government is currently scaling up the intervention to the country level (Das, 2021). This is an example of how data systems that are integrated with the government can help to test, monitor, and implement policies at the country level.
- It is also important to incorporate privately collected administrative data to fill data gaps in public policy research. For example, the World Bank’s ieConnect for Impact program aims to generate evidence to inform transport investments (Development Impact Evaluation, 2021). In Kenya specifically, ieConnect digitized the paper incident reports from the National Police Service (NPS) and combined them with administrative data from Uber and Waze. Combining the data enabled the researchers to identify the 200 deadliest crash sites, which can be used in a randomized control trial to test possible infrastructure projects or other types of policy interventions (Legovini & Jones, 2020).
- Lastly, promoting well-designed randomized experiments that use administrative data is required to create a portfolio of useful applications and advocate for administrative data use. Therefore, I will end this chapter with a list of case studies and publicly available administrative databases.
On another note, the Covid-19 pandemic introduced new challenges to different policy-making areas in the last year, all of which required a fast reaction. This unusual scenario encouraged policymakers to search for quicker and more frugal alternatives for impact evaluation experiments in order to identify the best response to the crisis. As a result, the number of studies that use administrative data or analyze its effectiveness has been increasing.
Linking administrative data with survey data
One key point is that administrative data does not have to be the only type of data used for a given evaluation. Treating administrative data as a substitute for survey data does not work in all cases; thus, it is necessary to identify the causal question and to what extent this question can be answered using administrative data. Administrative data can also serve as complementary microdata alongside survey observations.
Groves (2011) refers to administrative data as “organic data,” while he defines data created through surveying with a specific predetermined purpose as “designed data.” He sees “designed data” as the most efficient tool for achieving a high information-to-data ratio, since it is designed to inform researchers on a specific subject. However, he also acknowledges that “organic data” holds potential information that needs to be harvested (Groves, 2011). Therefore, combining designed and organic data is the way forward to achieve the best information-to-data ratio by leveraging data that are already available.
Linking administrative data with survey data can be useful in two scenarios. Firstly, administrative data do not fully capture personal information such as personality traits, intentions, and cognitive skills, which can be relevant to the research question (Arni et al., 2014). For example, in a labor market economics study, researchers can use administrative data to find out how long a person has been unemployed, which programs they took part in, their income, and so on. However, administrative data can fall short if the researchers would like to know about job search behavior, personal preferences, social networks, and psychological factors (Künn, 2015). Thus, if these pieces of information are captured through two different data collection methodologies, they can be combined to study a specific question. For example, IZA projects combined administrative data from the German Federal Employment Agency with survey data to study how individuals move out of unemployment and into work (Arni et al., 2014).
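As a rough illustration of what such a link can look like in practice, the sketch below joins a handful of made-up administrative records to made-up survey responses on a shared identifier. The field names (`person_id`, `months_unemployed`, `applications_per_week`) and all values are hypothetical and are not taken from the IZA data; real linkage usually also involves consent, pseudonymized keys, and imperfect matches.

```python
# Hypothetical administrative records (what an employment agency might hold).
admin_records = [
    {"person_id": "A1", "months_unemployed": 7, "program": "training"},
    {"person_id": "A2", "months_unemployed": 2, "program": "none"},
    {"person_id": "A3", "months_unemployed": 14, "program": "wage_subsidy"},
]

# Hypothetical survey responses (what only the respondent can report).
survey_responses = [
    {"person_id": "A1", "applications_per_week": 5},
    {"person_id": "A3", "applications_per_week": 1},
    {"person_id": "A9", "applications_per_week": 3},
]


def link_records(admin, survey, key="person_id"):
    """Inner-join admin and survey rows on a shared identifier.

    Returns the linked rows plus the admin identifiers that had no survey match,
    so the researcher can see how much of the admin population the survey covers.
    """
    survey_by_key = {row[key]: row for row in survey}
    linked, unmatched_admin = [], []
    for row in admin:
        match = survey_by_key.get(row[key])
        if match is None:
            unmatched_admin.append(row[key])
        else:
            # Merge the two dictionaries into one combined record.
            linked.append({**row, **match})
    return linked, unmatched_admin


linked, unmatched = link_records(admin_records, survey_responses)
for row in linked:
    print(row)
print("admin records without a survey match:", unmatched)
```

Reporting the unmatched identifiers alongside the linked rows is a deliberate choice here: the share of administrative records that cannot be linked is itself useful information about how representative the combined data set is.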
In addition to linking administrative data and survey data, each data collection methodology can be used at a different stage of the experimental design. For example, administrative data can be used to identify the eligible population, as in the case study in Armenia by de Walque and Chukwuma (2020).
During experimental studies, data is generally used to decide who receives the treatment and to analyze the outcomes of an intervention. Given the research question and the available data, the researchers should decide how to leverage administrative data at each stage. In this section, I will highlight different case studies that used administrative data as baseline data and as outcome data.
Administrative data have been used in randomized control trials to answer various questions, such as the long-term effect of early childhood education (Chetty et al., 2011) or how access to health insurance changes human behavior. The two case studies below both come from the health sector.
Firstly, I will start with an example from the nimble evaluations funded by the World Bank’s Strategic Impact Evaluation Fund (SIEF). The third round of SIEF’s open call for proposals focused on nimble evaluations, with the aim of creating a portfolio that can be referred to in the future (SIEF, 2018). Unfortunately, some of the evaluations were stopped due to Covid-19. However, “Armenia: How can we increase screening for non-communicable diseases?”, conducted by Damien de Walque and Adanna Chukwuma, was completed before Covid-19 using a combination of administrative data and home surveys. The main aim of the evaluation was to find out “the impact of demand-side interventions on screenings for hypertension and diabetes.” The study participants were individuals aged between 35 and 68 who had not been screened in the preceding year. The researchers identified this target population using administrative data collected by the e-health system in Armenia. The study used four different demand-side interventions, such as invitations or invitations with pharmacy vouchers, to encourage people to go to screenings. The participants were randomly assigned to one of five study arms. They were contacted twice during the study: first by phone to check their eligibility, and then in person to complete the consent form and receive the encouragement (de Walque et al., 2020). This study is an excellent example because it uses e-health data as the primary data to identify eligible participants and to track how they received the treatment. Using existing infrastructure such as e-health systems is just one of the many examples of admin data’s applicability for impact evaluation.
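The two steps described above, filtering an administrative registry for eligible people and then randomizing them into study arms, can be sketched as follows. This is a minimal illustration and not the study team's actual procedure: the records are synthetic, the field names are hypothetical, and the arm labels assume that the five arms consist of a comparison arm plus the four interventions.

```python
import random

# Assumed arm labels: one comparison arm plus four interventions (hypothetical names).
ARMS = ["control", "intervention_1", "intervention_2", "intervention_3", "intervention_4"]


def eligible(record):
    """Eligibility rule from the text: aged 35 to 68 and not screened in the preceding year."""
    return 35 <= record["age"] <= 68 and not record["screened_last_year"]


def assign_arms(records, arms=ARMS, seed=2020):
    """Shuffle the eligible IDs with a fixed seed, then deal them out round-robin.

    Round-robin dealing after a shuffle keeps arm sizes within one of each other,
    and the fixed seed makes the assignment reproducible for replication.
    """
    rng = random.Random(seed)
    pool = [r["patient_id"] for r in records if eligible(r)]
    rng.shuffle(pool)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(pool)}


# Synthetic stand-in for an e-health registry extract.
records = [
    {"patient_id": f"P{i:03d}", "age": 30 + i % 45, "screened_last_year": i % 4 == 0}
    for i in range(200)
]

assignment = assign_arms(records)
counts = {arm: 0 for arm in ARMS}
for arm in assignment.values():
    counts[arm] += 1
print(len(assignment), "eligible participants assigned")
print(counts)
```

In a real study, the randomization would often be stratified (for example by clinic or region), which this sketch leaves out for brevity.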
Another example of a randomized experiment that uses administrative data in the health field is “Clinical decision support for high-cost imaging: A randomized clinical trial” (Doyle et al., 2019). The study’s main aim was to evaluate whether healthcare costs can be decreased by lowering the number of inappropriate high-cost medical imaging orders. To identify the study participants, the researchers used the database of Aurora Health Care, a large private healthcare provider in Wisconsin and Illinois. For half of the sample, the researchers introduced a software intervention that provided clinical decision support and evaluated the appropriateness of each ordered scan based on the American College of Radiology guidelines. At the end of the 12-month study period, the researchers concluded that the intervention moderately increased the appropriateness of high-cost medical scans. This study is another example of how administrative data alone can be used to conduct randomized control trials, because it relies on the available infrastructure to run the experiment and uses software as a low-cost intervention to minimize costs.
Concerns and possible solutions
As with most real-world applications, the use of admin data has its limitations. In this section, I will share the concerns around using administrative data, and in some cases, I will give possible solutions to mitigate these concerns.
One of the main issues in using administrative data is that these data sets include sensitive information about individuals and, in most scenarios, are not depersonalized. This is problematic both when negotiating with authorities for access and when publishing the data set for replication. Even though it does not apply to every case, requesting group-level data for RCTs can speed up the process significantly compared to requesting individual-level data (Schochet, 2020). Group-level data can be advantageous because individual-level records tend to have usage restrictions to protect individuals’ identities. In some scenarios, group-level data can still answer the causal question while protecting individuals’ rights. For a more detailed review of different methods to address data privacy issues, please refer to Matthews and Harel (2011).
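To make the group-level idea concrete, the sketch below shows how a data holder could release only group averages, and how a researcher could still compare treated and untreated groups with them. It assumes that treatment was assigned at the group level. The data are invented, the group names are hypothetical, and the estimator is a plain difference in group means chosen for illustration; it is not the design-based estimator developed in Schochet (2020), and it does not compute standard errors.

```python
from collections import defaultdict
from statistics import mean

# Invented individual-level rows held by the data owner: (group_id, treated, outcome).
individual_rows = [
    ("school_a", True, 1), ("school_a", True, 1), ("school_a", True, 0), ("school_a", True, 1),
    ("school_b", True, 1), ("school_b", True, 0),
    ("school_c", False, 0), ("school_c", False, 1), ("school_c", False, 0), ("school_c", False, 0),
    ("school_d", False, 0), ("school_d", False, 1),
]


def aggregate(rows):
    """Collapse individual rows into one record per group (what the data owner shares)."""
    outcomes = defaultdict(list)
    treated_flag = {}
    for group_id, treated, outcome in rows:
        outcomes[group_id].append(outcome)
        treated_flag[group_id] = treated
    return [
        {"group_id": g, "treated": treated_flag[g], "n": len(v), "mean_outcome": mean(v)}
        for g, v in sorted(outcomes.items())
    ]


def difference_in_group_means(groups):
    """Average the group means within each arm and take the difference."""
    treated = [g["mean_outcome"] for g in groups if g["treated"]]
    control = [g["mean_outcome"] for g in groups if not g["treated"]]
    return mean(treated) - mean(control)


groups = aggregate(individual_rows)
for g in groups:
    print(g)
print("estimated effect:", difference_in_group_means(groups))
```

Only the output of `aggregate` would need to leave the data holder's systems in this setup, which is the privacy advantage the paragraph above describes.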
Possible Fixed Costs
It is worth noting that using administrative data also has some fixed costs. First and foremost, researchers may still need to pay for data access if government organizations decline to provide it for free. Secondly, since administrative data is not collected for research purposes, the data might need organizing and framing (Trias, 2017). Even though data cleaning is a constant part of impact evaluation, it is important to highlight it in the context of administrative data because it might require a lot of expert time and thus increase data collection-related costs.
Arni, P., Caliendo, M., Künn, S., & Zimmermann, K. F. (2014). The IZA evaluation dataset survey: a scientific use file. IZA Journal of European Labor Studies, 3(1), 1–20. Retrieved from https://izajoels.springeropen.com/articles/10.1186/2193-9012-3-6
Chetty, R. (2012). Time trends in the use of administrative data for empirical research [Slideshow]. Retrieved from http://www.rajchetty.com/chettyfiles/admin_data_trends.pdf
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D. (2011). How does your kindergarten classroom affect your earnings? Evidence from Project STAR. The Quarterly Journal of Economics, 126(4), 1593–1660. Retrieved from http://web.b.ebscohost.com.ccl.idm.oclc.org/ehost/pdfviewer/pdfviewer?vid=1&sid=7cff2524-2224-4076-bfe3-cd65b7400a92%40pdc-v-sessmgr01
Chukwuma, A. (2020, October 27). Personal interview [Personal interview].
Cole, S., Dhaliwal, I., Sautmann, A., & Vilhuber, L. (2020). Introduction. Handbook on Using Administrative Data for Research and Evidence-based Policy. Retrieved from https://admindatahandbook.mit.edu/book/v1.0-rc5/intro.html#ref-groves2011 on 2021–03–19.
Das, J. (2021, February 24). Jishnu Das: Randomized Regulation: The Impact of Minimum Quality Standards on Health Markets [Video file]. Retrieved from https://www.youtube.com/watch?v=eSyPxyzlR6E
Development Impact Evaluation (DIME). (2021). Transport. Retrieved February 20, 2021, from https://www.worldbank.org/en/research/dime/brief/transport
Doyle, J., Abraham, S., Feeney, L., Reimer, S., & Finkelstein, A. (2019). Clinical decision support for high-cost imaging: A randomized clinical trial. PLoS ONE, 14(3), e0213373. Retrieved from https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213373
Fisher, T. (2020, November 3). Personal interview [Personal interview].
Gibreath, D. (2021, January 13). Personal interview [Personal interview].
Groves, R. (2011, May 31). “Designed Data” and “Organic Data” [Web log post]. Retrieved February 28, 2021, from https://www.census.gov/newsroom/blogs/director/2011/05/designed-data-and-organic-data.html
Holla, A. (2021, January 15). Personal interview [Personal interview].
Holloway, K. (2020, October 21). Personal interview [Personal interview].
Künn, S. (2015). The challenges of linking survey and administrative data. IZA World of Labor.
Legovini, A., & Jones, M. R. (2020). Administrative Data in Research at the World Bank: The Case of Development Impact Evaluation (DIME). Handbook on Using Administrative Data for Research and Evidence-based Policy. Retrieved from https://admindatahandbook.mit.edu/book/v1.0-rc5/dime.html#ref-milusheva2020 on 2021-03-20.
Matthews, G. J., & Harel, O. (2011). Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statistics Surveys, 5, 1–29. Retrieved from https://projecteuclid-org.ccl.idm.oclc.org/download/pdfview_1/euclid.ssu/1296828958
McManus, J. (2020, October 21). Personal interview [Personal interview].
Mitchell, H. (2020, October 5). Personal interview [Personal interview].
Naimpally, R. (2020, October 8). Personal interview [Personal interview].
Schochet, P. Z. (2020). Analyzing Grouped Administrative Data for RCTs Using Design-Based Methods. Journal of Educational and Behavioral Statistics, 45(1), 32–57. Retrieved from https://journals-sagepub-com.ccl.idm.oclc.org/doi/10.3102/1076998619855350
Williams, E. (2020, October 27). Personal interview [Personal interview].
World Health Organization. (2010). A framework for national health policies, strategies and plans. Retrieved from https://www.who.int/nationalpolicies/FrameworkNHPSP_final_en.pdf
World Health Organization. (2014). Guide for developing national patient safety policy and strategic plan. Retrieved from https://apps.who.int/iris/bitstream/handle/10665/148352/9789290232070.pdf?sequence=1&isAllowed=y
Künn, S. (2015). The challenges of linking survey and administrative data. IZA World of Labor. Retrieved from https://wol.iza.org/articles/challenges-of-linking-survey-and-administrative-data/map
This is an interactive map that shows different case studies linking survey and administrative data.
J-PAL. (n.d.). Catalog of administrative data sets. Retrieved March 20, 2021, from https://www.povertyactionlab.org/catalog-administrative-data-sets
This is a catalog of admin data sets by J-PAL North America that is available to the public.