
Synthetic Data: Balancing Utility and Privacy
In a digital age where the demand for open data continues to grow, balancing data accessibility with individual privacy is more important than ever. At the Luxembourg National Data Service (LNDS), we worked with our partners at STATEC, Luxembourg’s national statistics office, to explore how valuable data, like census data, can be shared without putting individual privacy at risk. The solution? Synthetic data.
We have successfully implemented a methodology to generate synthetic versions of the country’s 2021 census data. The project sets a strong precedent for how sensitive data can be reused safely for the public good. In this article, we take you through our approach, key challenges, and what we have learned along the way.
Why Synthetic Data Matters
With over 643,000 residents counted, a roughly 25% demographic increase since 2011, and a clearly aging population, the 2021 population census provides rich insights into Luxembourg’s social and economic structure. Its scale and diversity made it an ideal test case for exploring how synthetic data can be generated, validated, and used safely.
At the same time, the challenge is clear: how can highly sensitive personal data, such as census information, be shared in a way that supports research and innovation, without breaching confidentiality laws such as the EU’s General Data Protection Regulation (GDPR)? Synthetic data offers a practical solution when it meets both legal standards and scientific needs.
One key application of the synthetic data is enabling researchers to conduct exploratory data analysis and develop initial hypotheses while they await access to the original data. This was the primary objective of our project. Beyond this, synthetic data also supports broader use cases such as the public release of microdata, technology testing, and educational purposes.
In this project, we focused on key variables such as age, sex, employment status, education, occupation, industry, and language. These factors help illustrate how people live and work in Luxembourg and support planning in areas like healthcare and education. By demonstrating that synthetic data performs well for these variables, the project paves the way for applying the same methodology to additional datasets in the future, based on specific use cases.
What is Synthetic Data?
Synthetic data is not real, but it is not random either. It is designed to look and behave like real data, just without containing actual personal information. In this project, synthetic data allows researchers and policymakers to gain some insights into Luxembourg’s population without putting personal information at risk.
According to the formal definition by the European Data Protection Supervisor (EDPS), “Synthetic data is artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data.”
Let’s take a closer look at how synthetic data works in practice.
How We Created the Synthetic Data
Creating synthetic data means generating new, artificial records that mimic the patterns and relationships found in real data, without using any actual personal information. To do this, different methods can be used, ranging from basic statistical models to advanced deep learning techniques. Each method offers a different balance between realism, complexity, and privacy. In this project, several approaches were studied before selecting the one that best fit the needs of Luxembourg’s census data.
Evaluating methods
Before selecting the final approach, we evaluated several state-of-the-art methods for synthetic data generation, such as:
- Simulated Data Methods: Use basic statistics to create data, but are too simple to reflect real-life patterns.
- Generative Adversarial Networks (GANs): Powerful deep learning models that can create highly realistic data, but are hard to interpret, test, and control, which makes them risky for privacy-sensitive data. They can also be computationally expensive.
- Hierarchical Bayesian Models: Good for nested data like households, but computationally expensive for big datasets like a full census. They also require expert knowledge for each synthetic data generation run, which makes them costly to scale up.
- Fully Conditional Specification (FCS): Flexible and widely used. Works well with various types of data and offers a good balance between usefulness, simplicity, and privacy while giving the data holder full control over the generated data – which is why it was chosen.
The chosen approach: Fully Conditional Specification with Synthpop
After evaluating several methods, we selected Fully Conditional Specification (FCS) as the core technique for synthesising data. This method builds synthetic data step by step, modelling each variable based on those already created, and is particularly effective for datasets containing a mix of categorical and numerical features. National statistical offices such as Statistics Canada, Statistics New Zealand, and the UK’s Office for National Statistics have applied it through Synthpop.
To put this method into practice, we used Synthpop, a widely respected open-source tool for creating synthetic data. Synthpop relies on Classification and Regression Trees (CART) to model relationships between the variables of interest while maintaining data structure and ensuring realism. To better understand the process, let us assume that we are interested in a dataset containing age group, sex, and occupation. The illustration below shows a simplified tree for synthesising job status once age and sex have been generated:

The detailed technical report is expected to be published by STATEC soon.
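To make the step-by-step idea more concrete, here is a minimal Python sketch of sequential synthesis. It is not the Synthpop implementation (Synthpop is an R package and fits CART models). As a simplified stand-in for a tree, the sketch draws each value from the observed records that share the same values for the variables already synthesised. The function name, the toy records, and the variable order are illustrative assumptions and are not taken from the project.

```python
import random
from collections import defaultdict


def synthesise_sequentially(records, order, seed=0):
    """Build synthetic records one variable at a time (hypothetical helper).

    Each variable is sampled conditionally on the variables that come
    before it in `order`. Exact-match grouping stands in for CART leaves.
    """
    rng = random.Random(seed)
    synthetic = [{} for _ in records]
    for position, variable in enumerate(order):
        predictors = order[:position]
        # Pool the observed values of `variable` by observed predictor values.
        donors = defaultdict(list)
        for record in records:
            key = tuple(record[p] for p in predictors)
            donors[key].append(record[variable])
        # Fallback pool in case a predictor combination has no donors.
        marginal = [record[variable] for record in records]
        for row in synthetic:
            key = tuple(row[p] for p in predictors)
            pool = donors.get(key) or marginal
            row[variable] = rng.choice(pool)
    return synthetic


# Toy, made-up records; not census data.
observed_people = [
    {"age_group": "15-29", "sex": "F", "job_status": "student"},
    {"age_group": "15-29", "sex": "M", "job_status": "employed"},
    {"age_group": "30-64", "sex": "F", "job_status": "employed"},
    {"age_group": "30-64", "sex": "M", "job_status": "employed"},
    {"age_group": "30-64", "sex": "M", "job_status": "unemployed"},
    {"age_group": "65+", "sex": "F", "job_status": "retired"},
]
synthetic_people = synthesise_sequentially(
    observed_people, ["age_group", "sex", "job_status"], seed=42
)
```

In this toy version, age group is drawn first, sex is drawn given age group, and job status is drawn given both. Because the grouping is an exact match, every synthetic record reuses a combination that already occurs in the observed data. A tree with larger leaves pools more records together, which is one reason model choice and tuning matter.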
Collaboration: Trusted Data Access and Technical Implementation
At LNDS, we made this initiative possible together with our partners at STATEC by building a strong, trust-based collaboration that combined legal, technical, and methodological expertise.
The STATEC team provided secure access to a carefully prepared and pseudonymised subset of the 2021 census data, enabling raw data extraction under strict privacy controls. Our team at LNDS then designed and implemented the synthetic data generation process within a safe environment on STATEC’s premises, ensuring that all steps were aligned with GDPR requirements and statistical best practices.
“Our collaboration with LNDS has been instrumental in initiating our exploration of synthetic data, allowing us to pursue innovative and secure approaches to data sharing.”
Claude Lamboray, Méthodologie Conseiller at STATEC, Luxembourg’s national statistics office
The two teams jointly established clear legal, technical, and procedural safeguards to guarantee transparency, reproducibility, and data protection. The Synthpop tool was further customised to include validation rules and tuning parameters tailored to preserve key data patterns while reducing disclosure risk. This collaboration demonstrates how trust-based governance and technical rigour can enable responsible data reuse.
How do we protect privacy?
Privacy risks such as singling out, linkability, and inference were carefully assessed and mitigated throughout the process. To minimise these risks, we removed or grouped unique or rare records, for example, by using age ranges instead of exact values. Additionally, we introduced randomness to ensure that the synthetic data does not simply copy real individuals. We also made sure the data cannot be linked back to individuals, even in cases where an attacker might already know some personal details, such as a person’s age and sex.
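As a rough illustration of the grouping step, the Python sketch below converts exact ages into ranges and lists combinations of attributes that occur fewer than a chosen number of times. The band width, the top-coding age, the threshold k, and both function names are assumptions made for this example. The actual rules applied to the census data are not described here.

```python
from collections import Counter


def to_age_band(age, width=10, top=80):
    """Map an exact age to a range label (hypothetical band settings)."""
    if age >= top:
        return f"{top}+"  # top-code the oldest ages into one open band
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def rare_combinations(records, keys, k=3):
    """Return attribute combinations seen fewer than k times."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Toy, made-up records; not census data.
people = [
    {"age": 34, "sex": "F"},
    {"age": 36, "sex": "F"},
    {"age": 38, "sex": "F"},
    {"age": 85, "sex": "M"},
]
banded = [{"age_band": to_age_band(p["age"]), "sex": p["sex"]} for p in people]
flagged = rare_combinations(banded, ["age_band", "sex"], k=3)
```

Here the three women in their thirties fall into one band and are no longer distinguishable by age, while the single record in the oldest band is flagged as rare and would need further grouping or removal.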
Testing results: does it really work?
We ran utility tests to check whether the synthetic data resembles the real data. We also conducted privacy risk tests to confirm that disclosure risk remains minimal.
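One simple example of a utility check is to compare how a variable is distributed in the real and in the synthetic data. The Python sketch below computes the total variation distance between two sets of category values, where 0 means identical proportions and 1 means no overlap. This metric and the function names are chosen for illustration only and are not necessarily the tests used in the project.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(original, synthetic):
    """Half the sum of absolute differences between category shares."""
    p = proportions(original)
    q = proportions(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy, made-up values; not census data.
real_status = ["employed", "employed", "retired", "retired"]
synthetic_status = ["employed", "retired", "retired", "retired"]
distance = total_variation_distance(real_status, synthetic_status)
```

A check like this covers one variable at a time. Relationships between variables would need additional comparisons, for example on cross-tabulations.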
The testing confirmed that no unique records from the original data were copied and that the risk of identifying individuals was kept to a minimum. The approach also met all three primary privacy risk criteria set by EU guidance – singling out, linkability, and inference – confirming GDPR alignment.
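The check for copied unique records can be sketched in a few lines of Python. The function below finds combinations that occur exactly once in the original data and returns any synthetic records that reproduce them. The function name, the choice of key variables, and the toy records are assumptions for illustration. The project's actual test procedure may differ.

```python
from collections import Counter


def replicated_uniques(original, synthetic, keys):
    """Synthetic records that match a combination unique in the original."""

    def combo(record):
        return tuple(record[key] for key in keys)

    original_counts = Counter(combo(record) for record in original)
    unique_in_original = {c for c, n in original_counts.items() if n == 1}
    return [record for record in synthetic if combo(record) in unique_in_original]


# Toy, made-up records; not census data.
original_rows = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "80+", "sex": "M"},  # unique in the original
]
synthetic_rows = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "80+", "sex": "M"},  # reproduces the unique record
]
copied = replicated_uniques(original_rows, synthetic_rows, ["age_band", "sex"])
```

An empty result corresponds to the outcome reported above, where no unique original record appears in the synthetic data.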
These results show that synthetic census data can provide strong privacy protection while still supporting meaningful statistical analysis. For purposes like data exploration, model testing, or visualisation, the synthetic data proved to be a safe and reliable alternative to accessing real microdata. It is worth noting that final decisions should still be based on the real data.
Real-World Impact
The success of this project gives our partners at STATEC a clear path forward for sharing high-quality synthetic data with researchers, public institutions, and policymakers without compromising individual privacy. For sectors like healthcare, education, housing, or employment, access to such data can directly support better planning by allowing analyses to be prepared on synthetic data, with the evidence-based analysis behind final decisions carried out on the real data at a later stage.
Thanks to the flexibility of the chosen method, the approach can also be scaled to cover additional census variables or even applied to entirely different datasets in the future. This opens the door to more robust and secure data-sharing practices in Luxembourg and sets an example that other national institutions may follow.
By proving that synthetic data can retain real value while protecting privacy, this project represents not just a technical achievement but a model for responsible and practical data reuse in the public interest.
A Path Forward for Ethical Data Reuse
This initiative lays a strong foundation for ethical and secure data reuse in Luxembourg. It demonstrates that synthetic data can deliver both privacy protection and analytical value, enabling a future where high-quality data supports innovation, research, and public policy without compromising personal rights.
Now is the time to build on this momentum – expanding the use of synthetic data, encouraging cross-sector collaboration, and continuing to unlock its potential for the public good. If you’re ready to explore new ways to work with data, start a project with us.