FIDELITY AND UTILITY OF SYNTHETIC TABULAR HEALTH DATA: A CROSS-PARADIGM BENCHMARKING OF A CORRELATION-PRESERVING STATISTICAL PIPELINE AND A CONDITIONAL GENERATIVE ADVERSARIAL NETWORK

Ishrat Fatima; Najwa Liaqat; Mahnoor Saeed

Keywords:

Synthetic data generation; generative adversarial networks; CTGAN; statistical simulation; tabular health data; Modgo; correlation preservation; downstream utility; preprocessing sensitivity

Abstract

Background: Synthetic tabular health data generation has become an important methodology for statistical simulation, privacy preservation, and methodological validation. Classical statistical approaches offer interpretable, correlation-preserving synthesis through explicit parametric modeling, whereas generative adversarial networks (GANs) provide a flexible, non-parametric alternative that learns the full joint distribution directly from empirical data. Despite the rapidly expanding use of both paradigms, a rigorous empirical comparison tailored to the specific challenges of mixed-type health datasets with small-to-moderate sample sizes is still lacking.

Aims & Objectives: This study evaluates whether a GAN-based approach (CTGAN) offers demonstrable advantages over a modern classical statistical pipeline for synthetic health data generation, and under which practical conditions each method excels.

Methodology: Six publicly available tabular health datasets were selected, encompassing continuous, binary, and time-to-event outcomes with mixed continuous and categorical predictors. The statistical pipeline comprised the Modgo method for covariate matrix simulation, followed by parametric model-based outcome generation (linear, logistic, or Cox models). The CTGAN architecture was trained for 1000 epochs. Synthetic data quality was assessed through quantitative distributional similarity metrics (moments, Kullback–Leibler divergence, Kolmogorov–Smirnov statistic, Jensen–Shannon distance, Wasserstein distance), categorical concordance, pairwise correlation recovery, and downstream predictive utility measured via mean squared error, balanced accuracy, and concordance index. Sensitivity to variable scaling and to the number of training epochs was systematically examined.

Results & Findings: Both methods successfully reproduced univariate distributions and joint dependence structures across all datasets. The statistical approach recovered pairwise linear correlations with higher fidelity, attributable to its explicit estimation of the correlation matrix. Consequently, it yielded higher prediction performance for binary classification (balanced accuracy 0.85 vs. 0.78) and survival analysis (C-index 0.71 vs. 0.65). The CTGAN approach achieved lower mean squared error in linear regression (1.02 vs. 2.82) and was markedly robust to feature scaling, whereas the statistical method failed catastrophically without prior standardization. CTGAN-generated densities occasionally displayed discretization artifacts, particularly in small samples. Both methods performed comparably in categorical distribution recovery and overall downstream utility.

Conclusions: Classical statistical and GAN-based synthetic data generation methods are both viable tools for health simulation studies, each with distinct operational strengths. The statistical pipeline is recommended when precise reproduction of known correlation structures or parametric outcome relationships is critical, provided that rigorous data preprocessing is undertaken. The CTGAN approach is preferable for datasets with complex multimodality or heavy tails, or when minimal preprocessing and automated implementation are desired. These findings support the principled integration of AI-based generators into the statistical simulation repertoire and underscore the importance of context-dependent method selection. Future research should prioritize the incorporation of privacy preservation mechanisms and the development of hybrid architectures that couple the interpretability of statistical models with the distributional flexibility of deep generative networks.
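
As an illustration of the fidelity metrics named in the Methodology, the following is a minimal sketch, assuming NumPy and SciPy are available. It is not the study's implementation: the function names (`univariate_fidelity`, `correlation_gap`), the bin count, the smoothing constant, and the toy data are all assumptions introduced here for demonstration.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon


def univariate_fidelity(real, synth, bins=20):
    """Compare one continuous variable between real and synthetic samples."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Shared bin edges so both histograms live on the same support.
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    # Small smoothing constant (an assumption) avoids log(0) in the KL term.
    eps = 1e-9
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return {
        "ks": ks_2samp(real, synth).statistic,
        "wasserstein": wasserstein_distance(real, synth),
        "js_distance": jensenshannon(p, q, base=2),
        "kl": float(np.sum(p * np.log(p / q))),
    }


def correlation_gap(real, synth):
    """Mean absolute difference between off-diagonal Pearson correlations."""
    r = np.corrcoef(real, rowvar=False)
    s = np.corrcoef(synth, rowvar=False)
    off_diagonal = ~np.eye(r.shape[0], dtype=bool)
    return float(np.mean(np.abs(r - s)[off_diagonal]))


# Toy data (hypothetical): a faithful and an unfaithful "synthetic" sample.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]
real = rng.multivariate_normal([0, 0], cov, size=500)
good = rng.multivariate_normal([0, 0], cov, size=500)
bad = rng.normal(size=(500, 2)) + 1.0  # shifted mean, no correlation

good_scores = univariate_fidelity(real[:, 0], good[:, 0])
bad_scores = univariate_fidelity(real[:, 0], bad[:, 0])
good_gap = correlation_gap(real, good)
bad_gap = correlation_gap(real, bad)
```

In this sketch, lower values indicate closer agreement with the real data on every metric. The histogram-based Kullback–Leibler and Jensen–Shannon values depend on the chosen binning, so comparisons are only meaningful when the same bins are used for every generator.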