Tackling Simpson's paradox in big data using classification & regression trees

Galit Shmueli, Inbal Yahav

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

This work aims to find potential Simpson's paradoxes in Big Data. Simpson's paradox (SP) arises when choosing the level of data aggregation for causal inference. It describes the phenomenon in which the direction of a cause's effect is reversed when examining the aggregate vs. the disaggregates of a sample or population. The practical decision-making dilemma that SP raises is which level of data aggregation presents the right answer. We propose a tree-based approach for detecting SP in data. Classification and regression trees are popular predictive algorithms that capture relationships between an outcome and a set of inputs. They are used for record-level predictions and for variable selection. We introduce a novel usage for a cause-and-effect scenario with potential confounding variables. A tree is used to capture the relationship between the effect and the set of cause and potential confounders. We show that the tree structure determines whether a paradox is possible. The resulting tree graphically displays potential confounders and the confounding direction, allowing researchers or decision makers to identify potential SPs to be further investigated with a causal toolkit. We illustrate our SP detection approach using real data, for both a single confounder and multiple confounders, in a large dataset on kidney transplant waiting time.
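The reversal described in the abstract can be illustrated with a minimal sketch. This is not the authors' tree-based method: it only compares the direction of an effect in the aggregate with its direction inside each level of one candidate confounder. The counts are the widely cited kidney-stone treatment example (not the paper's transplant data), and the names `counts`, `rate`, `effect_sign`, and `detect_reversal` are hypothetical.

```python
# (successes, trials) per (treatment, stratum); illustrative textbook counts,
# not data from the paper.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(pairs):
    """Pooled success rate over a list of (successes, trials) pairs."""
    successes = sum(s for s, _ in pairs)
    trials = sum(n for _, n in pairs)
    return successes / trials

def effect_sign(rate_a, rate_b):
    """+1 if the first rate is higher, -1 if lower, 0 if equal."""
    return (rate_a > rate_b) - (rate_a < rate_b)

def detect_reversal(counts):
    """Compare the aggregate effect direction with each stratum's direction.

    Assumes exactly two treatments and that every (treatment, stratum)
    combination is present in `counts`.
    """
    a, b = sorted({t for t, _ in counts})
    strata = sorted({s for _, s in counts})
    aggregate = effect_sign(
        rate([counts[(a, s)] for s in strata]),
        rate([counts[(b, s)] for s in strata]),
    )
    per_stratum = {
        s: effect_sign(rate([counts[(a, s)]]), rate([counts[(b, s)]]))
        for s in strata
    }
    # Flag a full reversal: every stratum points opposite to the aggregate.
    reversed_everywhere = all(
        sign != 0 and sign == -aggregate for sign in per_stratum.values()
    )
    return aggregate, per_stratum, reversed_everywhere

aggregate, per_stratum, paradox = detect_reversal(counts)
print(aggregate, per_stratum, paradox)
```

Here treatment A has the lower success rate in the aggregate but the higher rate within each stone-size stratum, so the sketch flags a reversal. Which level of aggregation gives the right answer is a causal question that this check alone does not settle.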

Original language: English
Title of host publication: ECIS 2014 Proceedings - 22nd European Conference on Information Systems
Publisher: Association for Information Systems
ISBN (Print): 9780991556700
State: Published - 2014
Event: 22nd European Conference on Information Systems, ECIS 2014 - Tel Aviv, Israel
Duration: 9 Jun 2014 to 11 Jun 2014

Publication series

Name: ECIS 2014 Proceedings - 22nd European Conference on Information Systems

Conference

Conference: 22nd European Conference on Information Systems, ECIS 2014
Country/Territory: Israel
City: Tel Aviv
Period: 9/06/14 to 11/06/14

Keywords

  • Big Data
  • CART
  • Causal Effect
  • Classification and Regression Trees
  • Simpson's Paradox
