Causal Inference Part 9: Regression Discontinuity Design for Causal Inference in Data Science

Rudrendu Paul
4 min read · Jan 29, 2023

RDD as a robust approach for causal inference: its implementation, applications, and limitations

Photo by Boitumelo Phetla on Unsplash

Introduction

In data science, understanding causality is crucial for making accurate predictions and taking effective actions. However, inferring causality from observational data can be a complex and challenging task. There are several limitations and potential sources of bias to take into account when trying to establish causality.

Regression Discontinuity Design (RDD) has become a widely used tool for inferring causality from observational data. It is a form of counterfactual analysis: researchers estimate the causal effect of a treatment by comparing the outcomes of individuals who just meet the cutoff for receiving the treatment with those who just miss it.

In this article, we will explore the basics of RDD, its implementation, applications, and the challenges and best practices for its use in causal inference in data science.

The Basics of Regression Discontinuity Design

RDD is a quasi-experimental design. It applies when a treatment is assigned according to whether a continuous variable crosses a known threshold. Individuals just above and just below that threshold are likely to be similar in every respect except treatment status, so a jump in the outcome at the threshold can be attributed to the treatment.

Assumptions

RDD rests on the assumption that, absent the treatment, the expected outcome varies smoothly with a continuous running variable at the cutoff (the running variable is the variable that defines the cutoff point, such as age, test score, or income). This holds when individuals cannot precisely manipulate their value of the running variable, so that treatment is as good as randomly assigned among those close to the cutoff. The range around the cutoff used for estimation is called the “bandwidth”, which is chosen by the researcher or by a data-driven rule.
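
For a sharp design, in which everyone at or above the cutoff is treated, the comparison of individuals just above and just below the cutoff is commonly written as a difference of two limits. The symbols are notation introduced here for illustration: Y is the outcome, X is the running variable, c is the cutoff, and τ is the effect at the cutoff.

```latex
\tau_{\text{RDD}} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]
```

If the expected outcome would vary smoothly through c in the absence of treatment, this difference can be read as the average treatment effect for individuals at the cutoff. It says nothing directly about individuals far from it.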

RDD is considered a useful alternative when randomized controlled trials are expensive, time-consuming, or infeasible. It is particularly well suited to treatments that are determined by a threshold, such as a pass/fail mark, a cutoff score, or an eligibility criterion. RDD is widely used in fields such as education, health, and labor to estimate the causal effect of different interventions.

Implementing Regression Discontinuity Design

Implementing RDD involves several steps:

  1. The first step is to specify the design by selecting the running variable, the cutoff point, and the bandwidth.
  2. The next step is to estimate the effect of the treatment at the cutoff, typically with local linear regression fitted separately on each side of the cutoff.
  3. Finally, the results are interpreted, and the causal effect is inferred.

The estimation method can vary. The most commonly used methods are local linear regression, kernel regression, and polynomial regression. Each has its own assumptions and limitations, and the appropriate method should be chosen based on the specific research question and data set.

Choosing an appropriate bandwidth is important because it involves a bias-variance trade-off. If the bandwidth is too narrow, few observations remain and the estimate is noisy. If it is too wide, observations far from the cutoff enter the fit, and the estimate can be biased by a misspecified functional form. The choice can be automated with cross-validation or with data-driven optimal bandwidth selectors, such as those of Imbens and Kalyanaraman or of Calonico, Cattaneo, and Titiunik.
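
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `local_linear_rdd`, the triangular kernel, the simulated data, and the bandwidth of 10 are all assumptions made for the example, and the bandwidth is set by hand instead of by an optimal selector.

```python
import numpy as np

def local_linear_rdd(x, y, cutoff, bandwidth):
    """Sharp-RDD estimate: jump in a kernel-weighted linear fit at the cutoff."""
    xc = np.asarray(x, dtype=float) - cutoff      # center the running variable
    y = np.asarray(y, dtype=float)
    keep = np.abs(xc) < bandwidth                 # keep observations inside the window
    xc, y = xc[keep], y[keep]
    treated = (xc >= 0).astype(float)             # sharp design: treated at or above cutoff
    weights = 1.0 - np.abs(xc) / bandwidth        # triangular kernel
    # Columns: intercept, jump at cutoff, slope below, change in slope above
    X = np.column_stack([np.ones_like(xc), treated, xc, treated * xc])
    sw = np.sqrt(weights)                         # weighted least squares via rescaling
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[1]

# Simulated data (hypothetical): a true jump of 2.0 at a cutoff of 50
rng = np.random.default_rng(0)
score = rng.uniform(0, 100, 5000)
outcome = 10 + 0.05 * score + 2.0 * (score >= 50) + rng.normal(0, 1, 5000)

effect = local_linear_rdd(score, outcome, cutoff=50, bandwidth=10)
print(f"Estimated effect at the cutoff: {effect:.2f}")
```

Allowing a separate slope on each side of the cutoff keeps a trend in the running variable from being mistaken for a jump.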

Applications of Regression Discontinuity Design

RDD has been applied in various fields, such as education, health, and labor, to estimate the causal effect of different interventions.

In the field of education, RDD has been used to evaluate the effectiveness of programs such as tutoring, for example by comparing students whose test scores fall just on either side of an eligibility cutoff.

In health, RDD has been used to understand the impact of medical treatments on health outcomes, such as the effects of different drugs on disease progression. RDD has also been applied in the social sciences to estimate the causal effect of policies on human outcomes.

Challenges and Best Practices in Regression Discontinuity Design

Despite its strengths, RDD is not without its challenges:

  1. One of the main challenges arises in fuzzy designs, where crossing the cutoff changes only the probability of treatment. If that probability changes little at the cutoff, the cutoff is a weak instrument for the treatment and the estimate becomes unstable.
  2. Measurement errors can also bias the results, particularly when the running variable or the outcome variable is not perfectly measured.
  3. Spillover effects, boundary problems, and sensitivity to modeling assumptions should also be considered.
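
To make the weak instrument concern concrete, here is a rough sketch of a fuzzy-design estimate. It divides the jump in the outcome at the cutoff by the jump in the treatment rate (a Wald-type ratio). The helper names `jump_at_cutoff` and `fuzzy_rdd`, the uniform kernel, and the simulated take-up rates of 20% and 80% are assumptions chosen for illustration.

```python
import numpy as np

def jump_at_cutoff(x, v, cutoff, bandwidth):
    """Right-side minus left-side linear-fit value of v at the cutoff."""
    xc = np.asarray(x, dtype=float) - cutoff
    v = np.asarray(v, dtype=float)
    left = (xc < 0) & (xc >= -bandwidth)
    right = (xc >= 0) & (xc <= bandwidth)
    # np.polyfit returns [slope, intercept]; the intercept is the fit at the cutoff
    left_fit = np.polyfit(xc[left], v[left], 1)[1]
    right_fit = np.polyfit(xc[right], v[right], 1)[1]
    return right_fit - left_fit

def fuzzy_rdd(x, d, y, cutoff, bandwidth):
    """Ratio of the outcome jump to the treatment-rate jump, plus the latter."""
    first_stage = jump_at_cutoff(x, d, cutoff, bandwidth)    # jump in P(treated)
    reduced_form = jump_at_cutoff(x, y, cutoff, bandwidth)   # jump in the outcome
    return reduced_form / first_stage, first_stage

# Simulated data (hypothetical): take-up rises from 20% to 80% at the cutoff
rng = np.random.default_rng(42)
n = 20000
score = rng.uniform(0, 100, n)
take_up_prob = np.where(score >= 50, 0.8, 0.2)
treated = (rng.uniform(size=n) < take_up_prob).astype(float)
outcome = 5 + 0.03 * score + 2.0 * treated + rng.normal(0, 1, n)

late, first_stage = fuzzy_rdd(score, treated, outcome, cutoff=50, bandwidth=15)
print(f"Jump in treatment rate: {first_stage:.2f}, estimated effect: {late:.2f}")
```

Because the estimate divides by the first-stage jump, a jump close to zero inflates any noise in the numerator. Checking the size of that jump before interpreting the ratio is a sensible first diagnostic.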

To overcome these challenges, it is important to use appropriate methods and best practices when implementing RDD.

For example, sensitivity analysis and robust standard errors can be used to evaluate the robustness of the results to different assumptions and uncertainties. Additionally, multiple imputation or weighting methods can be used to handle measurement errors.
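
As a rough sketch of such a sensitivity check, the code below re-estimates a sharp-design effect at several bandwidths with heteroskedasticity-robust (HC1) standard errors, then repeats the estimate at a placebo cutoff where no jump should appear. The function name `rdd_with_robust_se`, the bandwidth grid, the placebo cutoff of 30, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def rdd_with_robust_se(x, y, cutoff, bandwidth):
    """Local linear sharp-RDD estimate with an HC1 robust standard error."""
    xc = np.asarray(x, dtype=float) - cutoff
    y = np.asarray(y, dtype=float)
    keep = np.abs(xc) < bandwidth
    xc, y = xc[keep], y[keep]
    treated = (xc >= 0).astype(float)
    w = 1.0 - np.abs(xc) / bandwidth                   # triangular kernel weights
    X = np.column_stack([np.ones_like(xc), treated, xc, treated * xc])
    XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    beta = XtWX_inv @ (X.T @ (w * y))                  # weighted least squares
    resid = y - X @ beta
    meat = X.T @ (X * ((w * resid) ** 2)[:, None])     # middle term of the sandwich
    n, k = X.shape
    cov = XtWX_inv @ meat @ XtWX_inv * n / (n - k)     # HC1 small-sample scaling
    return beta[1], np.sqrt(cov[1, 1])

# Simulated data (hypothetical): a true jump of 2.0 at a cutoff of 50
rng = np.random.default_rng(1)
score = rng.uniform(0, 100, 5000)
outcome = 10 + 0.05 * score + 2.0 * (score >= 50) + rng.normal(0, 1, 5000)

results = {}
for h in (5, 10, 20):
    results[h] = rdd_with_robust_se(score, outcome, cutoff=50, bandwidth=h)
    print(f"bandwidth={h:>2}: effect={results[h][0]:.2f}, robust SE={results[h][1]:.2f}")

# Placebo check: no jump is expected at a cutoff where treatment does not change
placebo, placebo_se = rdd_with_robust_se(score, outcome, cutoff=30, bandwidth=10)
print(f"placebo cutoff=30: effect={placebo:.2f}, robust SE={placebo_se:.2f}")
```

Estimates that stay roughly stable across bandwidths, together with a placebo estimate close to zero, lend some support to the design. Large swings suggest that the result depends on modeling choices.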

Another best practice is to be transparent about the methods and assumptions used in the analysis and to report the results and conclusions accordingly. It is also important to pre-register the study design and analysis plan in order to minimize bias.

Furthermore, it is important to understand the underlying causal assumptions that must hold for an RDD to be valid, as well as the trade-offs and limitations of the chosen method.

Conclusion

In this article, we have explored the basics of Regression Discontinuity Design, its implementation, applications, and the challenges and best practices for its use in causal inference in data science.

RDD is a powerful tool for inferring causality from observational data and has many applications in various fields. However, inferring causality from observational data can be complex and challenging, and RDD has its own assumptions and limitations.

By using appropriate methods, carefully considering limitations, and following best practices, researchers can draw valid conclusions and make better predictions and decisions. Used this way, RDD can help estimate causal effects and improve the overall understanding of the underlying mechanisms in the data.

Connect with the Author

If you enjoyed this article and would like to stay connected, feel free to follow me on Medium and connect with me on LinkedIn. I’d love to continue the conversation and hear your thoughts on this topic.

References

  1. https://towardsdatascience.com/establishing-causality-part-3-3e8f8c546f9a
  2. https://en.wikipedia.org/wiki/Regression_discontinuity_design

Written by Rudrendu Paul

Data Science Leader | Ex-PayPal | Ads | Applied AI/ML | MBA | E-commerce | Retail | Judge at Startup Competitions | Reviewer Springer, Elsevier, IEEE | Speaker