MCMC Framework for Learning Bayesian Decision Trees: MATLAB Implementation and Benchmarking
# Bayesian Decision Trees MATLAB Package
## Description
This MATLAB package provides tools to benchmark and compare **Bayesian Decision Trees (BDT)** and **Random Forests (RF)** for decision-making under uncertainty caused by missing data. The package, designed for researchers and practitioners in machine learning and statistics, supports experiments on benchmark datasets with configurable missing data types and levels. The main entry point, `experiment_planner.m`, orchestrates data loading, missing data simulation, k-fold cross-validation, and performance reporting. This package is ideal for applications in fields like finance, healthcare, ecology, and agriculture, where robust predictions despite incomplete data are critical.
## Features
- **Experiment Setup**: Run single or group experiments comparing BDT and RF on datasets with missing data.
- **Datasets**: Includes synthetic (XOR3) and real-world datasets (HEART, CREDIT, LIQ).
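- **Missing Data Simulation**: Supports four missingness settings: NA (no simulated missingness, run at level 0), MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random), at levels from 0% to 50%.
- **Performance Metrics**: Evaluates models using accuracy, F1 score, true positive rate (TPR), true negative rate (TNR), and entropy.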
- **Cross-Validation**: Implements k-fold cross-validation for robust model evaluation.
- **Reporting**: Generates JSON reports and statistical summaries of model performance.
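
The missingness types listed above differ in what drives the masking. The following is a minimal sketch on toy data, not the package's `simulate_missing_data.m`; all variable names are hypothetical, and the deterministic "mask the top values" rule is an assumed, deliberately extreme way to make the MAR and MNAR dependence visible.

```matlab
rng(1);                          % fixed seed so the sketch is repeatable
n = 1000; lev = 0.1;             % lev plays the role of p.mis_lev
X = randn(n, 2);                 % toy data: two continuous features
k = round(lev * n);              % number of entries to mask in column 1

% MCAR: masked rows are chosen uniformly at random, independent of the data
X_mcar = X;
X_mcar(randperm(n, k), 1) = NaN;

% MAR: masking of column 1 depends only on the fully observed column 2
[~, order2] = sort(X(:, 2), 'descend');
X_mar = X;
X_mar(order2(1:k), 1) = NaN;

% MNAR: masking of column 1 depends on the values of column 1 itself
[~, order1] = sort(X(:, 1), 'descend');
X_mnar = X;
X_mnar(order1(1:k), 1) = NaN;

% Fraction of missing entries in column 1 under each mechanism
frac = [mean(isnan(X_mcar(:, 1))), mean(isnan(X_mar(:, 1))), mean(isnan(X_mnar(:, 1)))];
```

All three masks remove the same share of values; what changes is which values disappear, which is why MNAR typically distorts the observed distribution the most.
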
## Installation
1. **Prerequisites**:
   - MATLAB with the Statistics and Machine Learning Toolbox (required for `TreeBagger`, which RF uses).
   - Helper functions (e.g., `settings_of_methods.m`, `rf_cross_validation.m`, `bdt_cross_validation.m`, `simulate_missing_data.m`, `cv_data_folds.m`, `performance_metrics.m`, `save_BDT_report.m`, `save_RF_report.m`, `report_performance_tests.m`) must be on the MATLAB path.
   - Data files, placed in the `data/` directory: `heart_failure_clinical_records_dataset.csv` (UCI Heart Failure), `default_of_credit_card_clients.xls` (UCI Credit Card Default), `data_company.csv` (custom company dataset).
   - JSON file: `prop_ratio.json`, containing the BDT proposal ratios (`R1`, `R2`).
2. **Steps**:
   - Download `Bayesian_Decision_Trees.zip` from Zenodo and unpack it.
   - Add the repository folder to your MATLAB path:
     ```matlab
     addpath('/path/to/repository');
     ```
   - Verify that the data files and `prop_ratio.json` are in the correct directory, or update the paths in `load_data` (in `experiment_planner.m`).
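
A quick way to confirm the setup before running anything is to check that a few of the files named above are visible to MATLAB. This is a hedged sketch: the list `needed` is only a sample of the prerequisites, and the check assumes the files are either on the path or in the current folder.

```matlab
% File names are taken from the prerequisites list; extend as needed.
needed = {'experiment_planner.m', 'settings_of_methods.m', ...
          'simulate_missing_data.m', 'prop_ratio.json'};

% exist(name, 'file') returns 2 when name is a file MATLAB can locate
found = cellfun(@(f) exist(f, 'file') == 2, needed);

if all(found)
    disp('All listed files were found.');
else
    fprintf('Not found: %s\n', strjoin(needed(~found), ', '));
end

% TreeBagger ships with the Statistics and Machine Learning Toolbox
hasTreeBagger = exist('TreeBagger', 'file') > 0;
fprintf('TreeBagger available: %d\n', hasTreeBagger);
```
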
## Usage
The `experiment_planner.m` function is the primary interface for running experiments in two modes:
- **Single Experiment**: Compares BDT and RF on a single dataset with specified missingness parameters.
- **Group Experiment**: Benchmarks BDT and RF across multiple datasets, missingness types, and levels.
### Single Experiment
1. Open `experiment_planner.m` and configure the `p` struct:
```matlab
p.bench_index = 1; % 0: XOR3, 1: HEART, 2: CREDIT, 3: LIQ
p.mis_type = 3; % 0: NA, 1: MCAR, 2: MAR, 3: MNAR
p.mis_lev = 0.1; % 10% missingness
p.nf = 7; % 7-fold cross-validation
p.group = 0; % Single experiment mode
```
2. Run:
```matlab
experiment_planner();
```
3. **Output**:
   - JSON reports (`reports/report_BDT_000.json`, `reports/report_RF_000.json`).
   - Statistical test results.
### Group Experiment
1. Set `p.group = 1` in `experiment_planner.m`.
2. Run:
```matlab
experiment_planner();
```
3. **Output**:
   - JSON reports for each combination of dataset, missingness type (NA, MCAR, MAR, MNAR), and missingness level (0.1 and 0.25; NA is run only at level 0).
   - Statistical comparisons.
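
To see how many runs the group mode implies, the combinations described above can be enumerated. This sketch assumes that all four datasets are included and that NA is run once per dataset at level 0; the actual loop in `experiment_planner.m` may be organized differently.

```matlab
datasets = 0:3;                 % 0: XOR3, 1: HEART, 2: CREDIT, 3: LIQ
levels   = [0.1 0.25];          % levels used for MCAR, MAR, MNAR
grid     = zeros(0, 3);         % columns: bench_index, mis_type, mis_lev

for b = datasets
    grid(end+1, :) = [b, 0, 0];            % NA: one run at level 0
    for t = 1:3                            % 1: MCAR, 2: MAR, 3: MNAR
        for lev = levels
            grid(end+1, :) = [b, t, lev];
        end
    end
end

fprintf('%d runs\n', size(grid, 1));       % prints "28 runs"
```

Under these assumptions each dataset contributes 1 + 3 × 2 = 7 configurations, each of which is then cross-validated for both BDT and RF.
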
### Summarizing Results
Run:
```matlab
summary_of_cross_validation_benchmarking();
```
This generates `summary_of_cross_validation_benchmarking.txt`, which compares BDT and RF performance (accuracy, F1, entropy) and flags differences that are statistically significant at p < 0.05.
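
The README does not say which statistical test the summary uses. As an illustration only, a paired t-test over per-fold scores (via `ttest` from the Statistics and Machine Learning Toolbox) is one common choice. The accuracy values below are placeholders, not results produced by this package.

```matlab
% Placeholder per-fold accuracies for illustration only (7 folds)
acc_bdt = [0.81 0.79 0.84 0.80 0.83 0.78 0.82];
acc_rf  = [0.78 0.77 0.80 0.79 0.80 0.76 0.79];

% Paired t-test: both models are evaluated on the same folds
[h, pval] = ttest(acc_bdt, acc_rf, 'Alpha', 0.05);

fprintf('mean difference = %.3f, p = %.4f, significant = %d\n', ...
        mean(acc_bdt - acc_rf), pval, h);
```

Pairing matters here because fold difficulty varies; comparing the per-fold differences removes that shared variation.
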
## Data
- **XOR3**: Synthetic dataset (1000 samples, 3 features: X1, X2 for XOR, X3 dummy).
- **HEART**: UCI Heart Failure Clinical Records (CSV).
- **CREDIT**: UCI Default of Credit Card Clients (Excel).
- **LIQ**: Custom company dataset (CSV, semicolon-separated).
Update `load_data` in `experiment_planner.m` if data paths differ.
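
For orientation, a dataset matching the XOR3 description above can be generated in a few lines. This is a sketch under the assumption of binary inputs; the package's own generator may use continuous features or a different encoding.

```matlab
rng(0);                                   % fixed seed for repeatability
n = 1000;
X = double(rand(n, 3) > 0.5);             % assumed binary inputs; X(:, 3) is the dummy feature
y = double(xor(X(:, 1), X(:, 2)));        % label depends on X1 and X2 only
```

Because `y` ignores `X(:, 3)`, a well-behaved model should assign that feature little or no importance.
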
## Configuration
The `p` struct in `experiment_planner.m` controls:
- `p.bench_index`: Dataset (0: XOR3, 1: HEART, 2: CREDIT, 3: LIQ).
- `p.mis_type`: Missingness (0: NA, 1: MCAR, 2: MAR, 3: MNAR).
- `p.mis_lev`: Missingness level (0–0.5).
- `p.nf`: Number of cross-validation folds.
- `p.group`: Mode (0: single, 1: group).
- Method settings (in `settings_of_methods.m`):
  - **BDT**: MCMC parameters (`nb` burn-in, `np` post-burn-in, `Pr` proposal probabilities).
  - **RF**: TreeBagger parameters (`nTrees`, `br` bootstrap ratio, `vr` variable sampling ratio).
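
As a rough guide to what the RF settings control, the sketch below passes them to `TreeBagger` on toy data. Mapping `br` to `'InBagFraction'` and `vr` to `'NumPredictorsToSample'` is an assumption about how `rf_cross_validation.m` uses them, and the parameter values are hypothetical.

```matlab
rng(0);
X = rand(200, 4);                          % toy predictors
y = double(X(:, 1) > 0.5);                 % toy binary labels
X(rand(size(X)) < 0.1) = NaN;              % mask about 10% of entries at random

nTrees = 20; br = 0.8; vr = 0.5;           % hypothetical settings

B = TreeBagger(nTrees, X, y, ...
    'Method', 'classification', ...
    'InBagFraction', br, ...                                       % share of rows per bootstrap sample
    'NumPredictorsToSample', max(1, round(vr * size(X, 2))));      % predictors tried per split

[labels, scores] = predict(B, X(1:5, :));  % labels: cell array of class names; scores: class probabilities
```
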
## Outputs
- **JSON Reports**: Stored in `reports/` as `report_BDT_XXX.json` and `report_RF_XXX.json`, detailing metrics per fold.
- **Summary File**: Statistical comparisons in `summary_of_cross_validation_benchmarking.txt`.
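
The JSON reports can be loaded back into MATLAB for further analysis. Since the report schema is not documented here, this sketch only lists the top-level fields rather than assuming any field names.

```matlab
% File name taken from the Single Experiment output; adjust the index as needed
reportFile = fullfile('reports', 'report_BDT_000.json');

if isfile(reportFile)
    report = jsondecode(fileread(reportFile));
    if isstruct(report)
        disp(fieldnames(report));          % inspect the structure before relying on any field
    else
        disp(class(report));               % JSON arrays decode to non-struct types
    end
else
    warning('Report not found: %s', reportFile);
end
```
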
## Troubleshooting
- **File Not Found**: Check data file paths and `prop_ratio.json`.
- **Missing Functions**: Ensure all helper functions are in the MATLAB path.
- **TreeBagger Errors**: Verify Statistics and Machine Learning Toolbox installation.
- **Memory Issues**: Large datasets or high `nTrees`, `nb`, or `np` values can exhaust memory; try reducing these values first.
## Extending the Package
- **New Datasets**: Add cases to `load_data` and update `settings_of_methods.m`.
- **New Metrics**: Modify `rf_cross_validation.m` and `bdt_cross_validation.m`, then update `summary_of_cross_validation_benchmarking.m`.
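
As an example of a metric that could be added, balanced accuracy follows directly from the confusion counts behind TPR and TNR. The labels below are toy values, and where such code would sit inside the cross-validation functions is left open.

```matlab
yTrue = [1 1 1 0 0 0 1 0];                 % toy ground-truth labels
yPred = [1 0 1 0 0 1 1 0];                 % toy predictions

TP = sum(yPred == 1 & yTrue == 1);         % true positives
TN = sum(yPred == 0 & yTrue == 0);         % true negatives
FP = sum(yPred == 1 & yTrue == 0);         % false positives
FN = sum(yPred == 0 & yTrue == 1);         % false negatives

TPR = TP / (TP + FN);                      % sensitivity
TNR = TN / (TN + FP);                      % specificity
balancedAcc = (TPR + TNR) / 2;             % 0.75 for these toy labels
```
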
## License
Licensed under the MIT License. See `LICENSE` file for details.
## Contact
For issues or contributions, open a Zenodo issue or email vitaly.schetinin@gmail.com.