| Title: | A Curated Collection of 'Causal Inference' Datasets and Tools |
|---|---|
| Description: | Provides a comprehensive set of datasets and tools for 'causal inference' research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies. Designed to facilitate causal analysis, risk assessment, and advanced statistical modeling, it leverages datasets from packages such as 'causalOT', 'survival', 'causalPAF', 'evident', 'melt', and 'sanon'. The package is inspired by the foundational work of Pearl (2009) <doi:10.1017/CBO9780511803161> on causal inference frameworks. |
| Authors: | Tomás Valderrama [aut, cre] |
| Maintainer: | Tomás Valderrama <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-06-07 07:24:45 UTC |
| Source: | https://github.com/toby-codigos/forcausality |
This dataset, Benzene_df, is a data frame containing indicators of chromosome damage related to benzene exposure, alcohol consumption, and smoking habits. The dataset consists of 78 observations and 5 variables, including age, exposure, and lifestyle factors. Some observations may contain missing values.
data(Benzene_df)
A data frame with 78 observations and 5 variables:
Age of the subject (integer)
Benzene exposure indicator (integer)
Alcohol consumption indicator (integer)
Smoking indicator (numeric)
Chromosome damage measure (numeric)
The dataset name has been kept as 'Benzene_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the evident package version 1.0.4
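As a quick orientation, the following sketch shows one way to load a dataset from the package and compare its dimensions, column classes, and missing-value counts with the description above. `describe_df` is a hypothetical helper written for this illustration, not a function exported by ForCausality, and the final call assumes the package is installed under the name ForCausality (it is skipped otherwise).

```r
# Hypothetical helper (not part of ForCausality): summarise a data frame's
# size, column classes, and per-column missing-value counts.
describe_df <- function(df) {
  stopifnot(is.data.frame(df))
  list(
    n_obs     = nrow(df),
    n_vars    = ncol(df),
    classes   = vapply(df, function(col) class(col)[1], character(1)),
    n_missing = colSums(is.na(df))
  )
}

# Assumption: the package is installed as "ForCausality".
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Benzene_df", package = "ForCausality")
  str(describe_df(Benzene_df))
}
```

The same helper can be applied to any of the other data frames documented here, which is a simple way to confirm whether a given dataset actually contains missing values before modeling.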
This dataset, Cloth_df, is a data frame containing measurements of clothianidin concentration in maize plants under different treatments. The dataset consists of 102 observations and 3 variables, including block identifiers, treatment types, and measured concentrations. Some observations may contain missing values.
data(Cloth_df)
A data frame with 102 observations and 3 variables:
Block identifier (factor)
Treatment type (factor)
Clothianidin concentration (numeric)
The dataset name has been kept as 'Cloth_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the melt package version 1.11.4
This dataset, Colon_df, contains data from a clinical trial of chemotherapy for patients with Stage B/C colon cancer. The dataset includes 1,858 observations and 16 variables, providing information on patient demographics, treatment assignment, disease characteristics, and outcomes. Some observations contain missing values.
data(Colon_df)
A data frame with 1,858 observations and 16 variables:
Patient identifier (numeric)
Study number (numeric)
Treatment group (factor)
Sex of the patient (numeric)
Age of the patient in years (numeric)
Obstruction present (numeric indicator)
Perforation present (numeric indicator)
Adherence to adjacent structures (numeric indicator)
Number of lymph nodes with cancer (numeric)
Patient status (numeric indicator)
Tumor differentiation (numeric)
Extent of local spread (numeric)
Surgical procedure performed (numeric indicator)
At least 4 nodes positive (numeric indicator)
Follow-up time in days (numeric)
Type of event (numeric indicator)
The dataset name has been kept as 'Colon_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the survival package version 3.8-3
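Because the trial randomized patients to treatment, a simple comparison of survival by arm is a natural first analysis. The sketch below is one possible approach, not a method prescribed by the package. It assumes that Colon_df keeps the column names of its source, `survival::colon` (`rx`, `time`, `status`, `etype`), and it falls back to the source data when ForCausality is not installed. `load_colon` is a hypothetical helper.

```r
library(survival)

# Hypothetical loader: use Colon_df when ForCausality is installed, otherwise
# fall back to the source data, survival::colon (assumed to share its columns).
load_colon <- function() {
  if (requireNamespace("ForCausality", quietly = TRUE)) {
    env <- new.env()
    data("Colon_df", package = "ForCausality", envir = env)
    env$Colon_df
  } else {
    survival::colon
  }
}

colon_data <- load_colon()

# The source data hold two rows per patient: etype 1 = recurrence, 2 = death.
deaths <- colon_data[colon_data$etype == 2, ]

# Kaplan-Meier estimate of overall survival by treatment arm (rx).
km_fit <- survfit(Surv(time, status) ~ rx, data = deaths)
print(km_fit)

# Log-rank test for a difference in survival between the arms.
lr_test <- survdiff(Surv(time, status) ~ rx, data = deaths)
print(lr_test)
```

Filtering on `etype` first matters here: fitting on all 1,858 rows would count each patient twice and mix recurrence and death events.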
Provides a comprehensive set of datasets and tools for causal inference research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies.
ForCausality: A Curated Collection of Causal Inference Datasets and Tools
Maintainer: Tomás Valderrama [email protected]
This dataset, Gbsg_df, provides prognostic factors for breast cancer patients from the German Breast Cancer Study Group (GBSG). The dataset includes 686 observations and 11 variables, containing information on patient demographics, tumor characteristics, hormone receptor status, and outcomes. Some observations contain missing values.
data(Gbsg_df)
A data frame with 686 observations and 11 variables:
Patient identifier (integer)
Age at diagnosis (integer)
Menopausal status (integer indicator)
Tumor size in millimeters (integer)
Tumor grade (integer)
Number of positive lymph nodes (integer)
Progesterone receptor level (integer)
Estrogen receptor level (integer)
Hormonal therapy received (integer indicator)
Relapse-free survival time in days (integer)
Patient status (integer indicator)
The dataset name has been kept as 'Gbsg_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the survival package version 3.8-3
This dataset, Lead_df, is a data frame comparing control and exposed groups under different hygiene and exposure levels. The dataset consists of 33 observations and 6 variables, including measures of exposure, hygiene, and calculated differences between groups. Some observations may contain missing values.
data(Lead_df)
A data frame with 33 observations and 6 variables:
Control group count (integer)
Exposed group count (integer)
Exposure level (factor with 3 levels: "high", "low", "medium")
Hygiene level (factor with 3 levels: "good", "mod", "poor")
Combined exposure and hygiene category (factor with 4 levels, e.g. "high.ok", "high.poor", ...)
Difference between control and exposed (integer)
The dataset name has been kept as 'Lead_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the evident package version 1.0.4
This dataset, Mouse_df, provides data from mouse cancer trials used in studies by Royston and Altman. The dataset includes 181 observations and 4 variables, covering information on treatment assignment, survival time, outcome, and mouse identifiers. Some observations contain missing values.
data(Mouse_df)
A data frame with 181 observations and 4 variables:
Treatment group (factor)
Survival time in days (numeric)
Trial outcome (factor)
Mouse identifier (integer)
The dataset name has been kept as 'Mouse_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the survival package version 3.8-3
This dataset, Pain_df, is a data frame containing clinical trial data for chronic pain treatments. The trial compared active treatment versus placebo across different clinical centers and diagnoses. The dataset consists of 193 observations and 4 variables. Some observations may contain missing values.
data(Pain_df)
A data frame with 193 observations and 4 variables:
Treatment group (factor: active vs placebo)
Response outcome (factor)
Clinical trial center (factor)
Diagnosis category (factor)
The dataset name has been kept as 'Pain_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the sanon package version 1.6
This dataset, Periodontal_df, is a data frame containing information on smoking habits, demographics, and periodontal health indicators. The dataset consists of 882 observations and 12 variables, including smoking frequency, socioeconomic indicators, and periodontal measures. Some observations may contain missing values.
data(Periodontal_df)
A data frame with 882 observations and 12 variables:
Sequence identifier (numeric)
Sex indicator (numeric)
Age in years (numeric)
Race indicator for Black participants (numeric)
Education level (ordered factor with 5 levels)
Income measure (numeric)
Cigarettes smoked per day (numeric)
Count of sites with periodontal disease (integer)
Count of sites without periodontal disease (integer)
Percentage of sites with periodontal disease (numeric)
Standardized measure (numeric)
Additional periodontal health indicator (numeric)
The dataset name has been kept as 'Periodontal_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the evident package version 1.0.4
This dataset, Pph_df, provides data from an external control trial of treatments for post-partum hemorrhage. The dataset includes 802 observations and 17 variables, containing information on blood loss, treatment assignment, demographic characteristics, and educational background. Some observations contain missing values.
data(Pph_df)
A data frame with 802 observations and 17 variables:
Cumulative blood loss at 20 minutes (numeric)
Treatment indicator (numeric)
Age of the participant (numeric)
Indicator for no formal education (numeric)
Additional variables related to treatment and outcomes (numeric)
The dataset name has been kept as 'Pph_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the causalOT package version 1.0.2
This dataset, Resp_df, is a data frame containing repeated measurements from a clinical trial on respiratory disorders under two treatment conditions. The dataset records demographic information (center, sex, age), baseline measures, and follow-up measurements across four visits. It consists of 111 observations and 9 variables. Some observations may contain missing values.
data(Resp_df)
A data frame with 111 observations and 9 variables:
Clinical trial center (factor)
Treatment group (character)
Sex of the participant (character)
Age of the participant (integer)
Baseline measurement (integer)
Measurement at visit 1 (integer)
Measurement at visit 2 (integer)
Measurement at visit 3 (integer)
Measurement at visit 4 (integer)
The dataset name has been kept as 'Resp_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the sanon package version 1.6
This dataset, Rotterdam_df, provides prognostic factors for breast cancer patients used in the studies of Royston and Altman. The dataset includes 2,982 observations and 15 variables, covering patient demographics, tumor characteristics, treatments, and outcomes. Some observations contain missing values.
data(Rotterdam_df)
A data frame with 2,982 observations and 15 variables:
Patient identifier (integer)
Year of surgery (integer)
Age at diagnosis (integer)
Menopausal status (integer indicator)
Tumor size category (factor)
Tumor grade (integer)
Number of positive lymph nodes (integer)
Progesterone receptor level (integer)
Estrogen receptor level (integer)
Hormonal therapy received (integer indicator)
Chemotherapy received (integer indicator)
Relapse-free survival time in days (numeric)
Recurrence indicator (integer)
Time to death in days (numeric)
Death indicator (integer)
The dataset name has been kept as 'Rotterdam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the survival package version 3.8-3
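Unlike a randomized trial, treatment in this cohort was not assigned at random, so any treatment comparison needs covariate adjustment. The sketch below fits a Cox model for time to death with hormonal therapy as the exposure, adjusting for the recorded prognostic factors. It is an illustration under stated assumptions rather than a recommended analysis: it assumes Rotterdam_df keeps the column names of its source, `survival::rotterdam` (`dtime`, `death`, `hormon`, `age`, `meno`, `size`, `grade`, `nodes`), and it falls back to the source data when ForCausality is not installed. `load_rotterdam` is a hypothetical helper.

```r
library(survival)

# Hypothetical loader: use Rotterdam_df when ForCausality is installed,
# otherwise fall back to survival::rotterdam (assumed to share its columns).
load_rotterdam <- function() {
  if (requireNamespace("ForCausality", quietly = TRUE)) {
    env <- new.env()
    data("Rotterdam_df", package = "ForCausality", envir = env)
    env$Rotterdam_df
  } else {
    survival::rotterdam
  }
}

rotterdam_data <- load_rotterdam()

# Cox proportional hazards model for time to death, with hormonal therapy as
# the exposure and the recorded prognostic factors as adjustment covariates.
cox_fit <- coxph(
  Surv(dtime, death) ~ hormon + age + meno + size + grade + nodes,
  data = rotterdam_data
)

# Adjusted log hazard ratio for hormonal therapy and its standard error.
print(summary(cox_fit)$coefficients["hormon", ])
```

The `hormon` coefficient has a causal reading only if the adjustment set captures all confounding of treatment and survival, which this sketch does not attempt to verify.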
This dataset, Sebor_df, is a data frame containing clinical trial data on seborrheic dermatitis, comparing test and placebo treatments. It records participant center, treatment assignment, dermatitis scores across three assessments, and severity indicators at the same points. The dataset consists of 167 observations and 8 variables. Some observations may contain missing values.
data(Sebor_df)
A data frame with 167 observations and 8 variables:
Clinical trial center (factor)
Treatment group: test or placebo (character)
Dermatitis score at assessment 1 (integer)
Dermatitis score at assessment 2 (integer)
Dermatitis score at assessment 3 (integer)
Severity indicator at assessment 1 (integer)
Severity indicator at assessment 2 (integer)
Severity indicator at assessment 3 (integer)
The dataset name has been kept as 'Sebor_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the sanon package version 1.6
This dataset, Skin_df, is a data frame containing clinical trial data on skin conditions, comparing responses under placebo and test treatments. It includes participant center, treatment assignment, disease stage, and responses across three assessments. The dataset consists of 172 observations and 6 variables. Some observations may contain missing values.
data(Skin_df)
A data frame with 172 observations and 6 variables:
Clinical trial center (factor)
Treatment group: placebo or test (factor)
Disease stage (integer)
Response at assessment 1 (integer)
Response at assessment 2 (integer)
Response at assessment 3 (integer)
The dataset name has been kept as 'Skin_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the sanon package version 1.6
This dataset, SmokeH_df, is a data frame containing information on smoking, homocysteine levels, demographics, and socioeconomic indicators. The dataset consists of 2,475 observations and 15 variables, including biomarkers, smoking-related measures, age, education, and poverty ratio. Some observations contain missing values.
data(SmokeH_df)
A data frame with 2,475 observations and 15 variables:
Participant identifier (integer)
Homocysteine level (numeric)
Z score indicator (integer)
Sex indicator (integer, 1 = female, 0 = male)
Age in years (integer)
Education level (integer code)
Poverty ratio (numeric)
Body mass index (numeric)
Cotinine level (numeric)
Smoking type indicator (integer)
Smoking type (character string)
Age category (integer code)
Education category (integer code)
BMI category (integer code)
Poverty category (logical)
The dataset name has been kept as 'SmokeH_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the evident package version 1.0.4
This dataset, Stroke_df, contains fictional case-control data for ischemic stroke, including exposures, risk factors, and confounders. The dataset includes 16,623 observations and 21 variables, covering demographic details, lifestyle factors, biomarkers, and comorbidities. Some observations contain missing values.
data(Stroke_df)
A data frame with 16,623 observations and 21 variables:
Geographic region (factor)
Case indicator for ischemic stroke (numeric)
Sex of the participant (integer)
Age of the participant (integer)
Hypertension or blood pressure measure (numeric)
Smoking status (factor)
Perceived stress indicator (factor)
Waist-to-hip ratio tertiles (factor)
Physical activity indicator (factor)
Weekly alcohol consumption frequency (factor)
Diabetes / HbA1c category (factor)
Cardiac risk factor category (factor)
Alternative Healthy Eating Index tertiles (factor)
ApoB/ApoA ratio tertiles (factor)
Sub-education level (factor)
Mother’s education level (factor)
Father’s education level (factor)
Sub-hypertension indicator (factor)
Waist-to-hip ratio (numeric)
ApoB/ApoA continuous ratio (numeric)
Sample weights (numeric)
The dataset name has been kept as 'Stroke_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the causalPAF package version 1.2.5
This dataset, Thiam_df, is a data frame containing information on thiamethoxam applications and crop yield measurements in squash plants. The dataset consists of 165 observations and 11 variables, including treatment types, plant variety, replication, fruit counts, yield measures, and defoliation indicators. Some observations may contain missing values.
data(Thiam_df)
A data frame with 165 observations and 11 variables:
Treatment type (factor)
Plant variety (factor)
Replication block (factor)
Number of fruits (numeric)
Average fruit mass (numeric)
Total fruit mass (numeric)
Crop yield (numeric)
Pollinator visit count (numeric)
Foliage measure (numeric)
Squash vine borer damage (numeric)
Defoliation percentage (numeric)
The dataset name has been kept as 'Thiam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the melt package version 1.11.4
This dataset, Udca_df, contains data from a clinical trial of ursodeoxycholic acid (UDCA). The dataset includes 1,360 observations and 8 variables, covering treatment assignment, disease stage, bilirubin levels, risk scores, follow-up time, and outcomes. Some observations contain missing values.
data(Udca_df)
A data frame with 1,360 observations and 8 variables:
Patient identifier (integer)
Treatment group (integer)
Disease stage (integer)
Bilirubin level (numeric)
Calculated risk score (numeric)
Follow-up time in days (numeric)
Patient status indicator (numeric)
Endpoint description (character)
The dataset name has been kept as 'Udca_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Data taken from the survival package version 3.8-3