This synthetic data must meet two requirements. First, it must resemble the original data statistically, closely enough to be realistic and to keep problems engaging for data scientists. Second, it must resemble the original data formally and structurally, so that any software written on top of it can be reused.
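To make the two requirements concrete, here is a minimal sketch in Python. It is not the method of any particular tool: it assumes a small made-up table (`real`), models each column independently (a Gaussian for numeric columns, resampling for categorical ones), and therefore ignores correlations between columns. The helper names `fit_column`, `sample_column`, and `synthesize` are hypothetical.

```python
import random
import statistics

def fit_column(values):
    """Summarise one column: mean/stdev for numbers, raw values for categories."""
    if all(isinstance(v, (int, float)) for v in values):
        return ("numeric", statistics.mean(values), statistics.stdev(values), type(values[0]))
    return ("categorical", values)

def sample_column(model, rng):
    """Draw one synthetic value from a fitted column model."""
    if model[0] == "numeric":
        _, mu, sigma, kind = model
        draw = rng.gauss(mu, sigma)
        # Keep the original type so downstream code sees the same structure.
        return int(round(draw)) if kind is int else draw
    # Resampling observed categories preserves their relative frequencies.
    return rng.choice(model[1])

def synthesize(rows, n, seed=0):
    """Generate n synthetic rows with the same columns as the input rows."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    models = {c: fit_column([r[c] for r in rows]) for c in columns}
    return [{c: sample_column(models[c], rng) for c in columns} for _ in range(n)]

# Made-up example table (an assumption for illustration only).
real = [
    {"age": 34, "income": 52000.0, "plan": "basic"},
    {"age": 45, "income": 61000.0, "plan": "premium"},
    {"age": 29, "income": 48000.0, "plan": "basic"},
    {"age": 52, "income": 75000.0, "plan": "premium"},
    {"age": 38, "income": 58000.0, "plan": "basic"},
]
synthetic = synthesize(real, n=1000)

# Structural resemblance: same column names, same order, same value types.
same_schema = all(
    list(row.keys()) == list(real[0].keys())
    and all(type(row[c]) is type(real[0][c]) for c in row)
    for row in synthetic
)

# Statistical resemblance: compare a summary statistic of real and synthetic data.
real_mean_age = statistics.mean(r["age"] for r in real)
synth_mean_age = statistics.mean(r["age"] for r in synthetic)
```

The schema check stands in for the second requirement (existing software keeps working), while the comparison of means stands in for the first. A realistic generator would also need to preserve relationships between columns and between tables.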
Use Cases
- Synthetic data techniques can produce the large volumes of data that data-hungry machine learning algorithms need.
- Synthetic data generation can also produce the data needed to stress test a system.
- Synthetic data can be generated to alter existing biases in data, for example to remove discrimination present in the original data.
- Synthetic data can be used to impute missing information in existing data.
- Synthetic data can enable data sharing without violating data protection legislation. In this way, organizations can share insights and so assist scientific reasoning.
- When most people work with the synthetic data rather than the real data, data security increases.
- Synthetic data can exploit current trends in data, thereby supporting forecasting.
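Two of the use cases above, imputing missing values and changing biases, can be sketched in a few lines of Python. This is a deliberately simple illustration under stated assumptions: the records are made up, the helpers `impute_missing` and `rebalance` are hypothetical, and both work by resampling observed values rather than by drawing from a fitted generative model, as a real synthetic data tool would.

```python
import random
from collections import Counter

def impute_missing(rows, column, rng):
    """Fill None entries by drawing from the observed values of the column."""
    observed = [r[column] for r in rows if r[column] is not None]
    return [
        {**r, column: rng.choice(observed)} if r[column] is None else dict(r)
        for r in rows
    ]

def rebalance(rows, label, rng):
    """Add resampled rows until every class is as frequent as the largest one."""
    counts = Counter(r[label] for r in rows)
    target = max(counts.values())
    balanced = list(rows)
    for cls, count in counts.items():
        pool = [r for r in rows if r[label] == cls]
        balanced.extend(dict(rng.choice(pool)) for _ in range(target - count))
    return balanced

rng = random.Random(42)

# Made-up records: one missing score, and group "b" is under-represented.
records = [
    {"score": 0.9, "group": "a"},
    {"score": None, "group": "a"},
    {"score": 0.7, "group": "a"},
    {"score": 0.4, "group": "b"},
]

filled = impute_missing(records, "score", rng)
balanced = rebalance(filled, "group", rng)
group_counts = Counter(r["group"] for r in balanced)
```

After these two steps no record has a missing score and both groups appear equally often. Duplicating rows is the crudest way to rebalance; generating genuinely new rows for the minority group is where a synthetic data generator adds value.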
Reference List
- Patki, N., Wedge, R., & Veeramachaneni, K. (2016, October). The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 399-410). IEEE.
- https://www.computer.org/csdl/magazine/so/2023/05/10273815/1R6sOyTc8r6