Balancing Inferential Integrity and Disclosure Risk: A Mixture Modeling Approach

In the context of survey sampling, Rubin (1993) proposed releasing multiply imputed synthetic datasets in which the sensitive target values are replaced by values drawn from posterior predictive distributions under proper imputation models. However, information loss due to incorrect model specification can weaken or invalidate inferences obtained from the synthetic data. We discuss a new masking framework based on data augmentation as a potential remedy. The new framework guarantees valid inferences from the synthetic datasets and allows data users to obtain their desired data utility while satisfying disclosure requirements. It can be extended through mixture modeling and combined with other existing methods to accommodate different levels of disclosure protection. We demonstrate through simulations and an illustrative example that the new framework outperforms the classical multiple imputation approach in preserving data utility while providing good disclosure protection.

Session

Advances in model-based clustering of complex data

Date and Time