Design and Implementation of a Platform for Privacy Assessment of Synthetic Data Generation with Generative AI

State: Assigned to Karoline Siarsky
Published: 2024-03-04

In recent years, the fields of machine learning and deep learning have experienced significant advancements, profoundly impacting domains such as natural language processing, computer vision, and recommendation systems. However, the machine intelligence empowered by deep learning relies heavily on extensive datasets for training, a process in which the algorithm learns from vast amounts of information [1]. Obtaining sufficiently large amounts of high-quality data is not always easy. Domain-specific issues such as data scarcity and the prevalence of missing or incomplete information further compound these challenges, compromising the accuracy and robustness of machine learning (ML) algorithms and data-driven methodologies. In addition, these datasets often contain sensitive and personal information, posing a serious risk to individuals' privacy. To tackle these issues, synthetic data generation has gained much attention as a data augmentation and privacy-preserving technique under the advances of generative AI [2,3]. In principle, synthetic data generation involves training a generator on real data and then drawing new samples from the trained generator.
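As a minimal illustration of this two-step pipeline (not the method of any cited work), the sketch below stands in a trivial per-column Gaussian model for the generator: "training" estimates the model parameters from real records, and "generation" samples fresh synthetic records from the fitted model. All function names here are illustrative.

```python
import numpy as np

def fit_gaussian_generator(real_data: np.ndarray):
    """'Train' a deliberately simple generator: an independent per-column
    Gaussian fitted to the real tabular data. Real systems would use a
    GAN, VAE, or diffusion model instead."""
    mu = real_data.mean(axis=0)
    sigma = real_data.std(axis=0)
    return mu, sigma

def generate_synthetic(mu, sigma, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw new synthetic records from the fitted generator."""
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma, size=(n_samples, len(mu)))

# Step 1: train the generator on real data (here, fake "height/weight" data).
real = np.random.default_rng(42).normal([170.0, 70.0], [10.0, 15.0], size=(500, 2))
mu, sigma = fit_gaussian_generator(real)

# Step 2: sample a synthetic dataset of the same size from the trained generator.
synthetic = generate_synthetic(mu, sigma, n_samples=500)
```

The synthetic records follow the learned distribution rather than copying individual real rows, which is the property the privacy metrics discussed below try to verify.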


However, how to properly measure the privacy preservation level of synthetic datasets is still a new and challenging problem [3]. Some studies propose distance-based metrics to assess the privacy protection level of synthetic datasets [4], while others measure privacy from an attacker's point of view by conducting attacks on vulnerable synthetic records [5]. Despite these efforts, the absence of a standardized technical framework remains a significant obstacle to accurately quantifying the privacy preservation level of synthetic datasets. This thesis/project aims to review the existing privacy measurements and their limitations, propose novel privacy assessment metrics for synthetic data generated with generative AI, and develop a platform/framework for privacy assessment. Master's and bachelor's students with an interest and experience in machine learning, privacy computing, and software development are encouraged to apply; the topic can also be adjusted and split according to different kinds of work (MA, MP, or BA).
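To make the distance-based idea concrete, here is a minimal sketch of one commonly used metric of this family, distance to closest record (DCR): for each synthetic row, the Euclidean distance to its nearest real row. Very small distances flag synthetic records that may be near-copies of real individuals. This is a generic illustration under simplifying assumptions (purely numeric data, Euclidean distance), not the specific metric of [4].

```python
import numpy as np

def distance_to_closest_record(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """For each synthetic record, return the Euclidean distance to its
    nearest real record. Near-zero values indicate potential memorization
    of real individuals, i.e. a privacy risk signal."""
    # Pairwise distance matrix of shape (n_synthetic, n_real) via broadcasting.
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1)

real = np.array([[0.0, 0.0], [1.0, 1.0]])
synthetic = np.array([[0.0, 0.0], [5.0, 5.0]])
dcr = distance_to_closest_record(synthetic, real)
# dcr[0] == 0.0: this synthetic row is an exact copy of a real record (high risk)
# dcr[1] ≈ 5.657: far from every real record (lower re-identification risk)
```

Attack-based evaluations [5] take the complementary view: instead of measuring distances, they try to infer or reconstruct real records from the synthetic data and report the attacker's success rate.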

[1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no. 7553 (2015): 436-444.

[2] Hernandez, Mikel, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. "Synthetic data generation for tabular health records: A systematic review." Neurocomputing 493 (2022): 28-45.

[3] Boudewijn, Alexander Theodorus Petrus, Andrea Filippo Ferraris, Daniele Panfilo, Vanessa Cocca, Sabrina Zinutti, Karel De Schepper, and Carlo Rossi Chauvenet. "Privacy Measurements in Tabular Synthetic Data: State of the Art and Future Research Directions." In NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI. 2023.

[4] Zhao, Zilong, Aditya Kunar, Robert Birke, and Lydia Y. Chen. "CTAB-GAN: Effective table data synthesizing." In Asian Conference on Machine Learning, pp. 97-112. PMLR, 2021.

[5] Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. "The secret sharer: Evaluating and testing unintended memorization in neural networks." In 28th USENIX Security Symposium (USENIX Security 19), pp. 267-284. 2019.

Supervisors: Weijie Niu, Dr. Alberto Huertas
10% Literature study, 30% Design, 50% Implementation, 10% Documentation
Machine learning; Python
