Nicolai Baldin, founder and CEO of Synthesized.io, talks about the future of sharing data and the power of synthetic data.
- Synthetic data is a new solution for data sharing needs that’s secure and easier to regulate.
- Synthetic data solves the privacy issues that arise with anonymization, where even anonymized data could reveal private information.
- Bias is a serious issue with real-world data sets, but synthetic data can correct for such bias because it is generated as a new data set rather than copied from existing records.
Data is an essential tool for companies and organizations, but simply using whatever data you can find or collect from users or consumers of a certain product or service isn’t as easy as it sounds.
Many regulations control data usage and sharing to protect individuals’ personal information. In fact, according to the United Nations Conference on Trade and Development, 128 out of 194 countries have put legislation in place to secure the protection of data and privacy. So how might we strike a balance?
Synthetic data, says Nicolai Baldin, founder and CEO of Synthesized.io, a company that uses machine learning to provide high-quality, clean data, can fill that gap.
In an episode of Tech on Reg, Nicolai discusses the difference between synthetic data and anonymized data, the security of synthetic data and how it is able to correct for bias.
Essentially, synthetic data maintains the “true essence of data,” mimicking real-world data minus the bias and without exposing real-world data to potential privacy risks.
What is synthetic data?
Synthetic data is data on steroids, Nicolai says.
“Synthetic data understands how data should smell, it understands how data should look,” he says. “It creates a completely new simulated set of data with the same smell, the same taste, and as opposed to neutralizing it, it amplifies it, which is extremely important for development and testing purposes.”
Companies need data for development and testing, yet they struggle to maintain appropriate levels of security and compliance under the myriad regulations on security and privacy.
To work, synthetic data should resemble real-world data and perform like it, meaning it must have the same mathematical and statistical properties.
“Synthetic data is a completely new set of data which never existed before, but it has the same taste, the same smell — sometimes an even better smell — than the original data,” says Nicolai. “Unlike other solutions to the problem of data sharing, such as data anonymization, which essentially neutralizes the smell, the taste, of data, truly synthetic data enables us to expose the right value, the right properties, of data for development and research purposes.”
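To make the idea of "same mathematical and statistical properties" concrete, here is a minimal Python sketch. It is not the Synthesized.io method: it assumes a simple two-column data set, fits a bivariate Gaussian to it, and draws brand-new records from that fit. The records in `REAL` and the names `fit`, `sample` and `pearson` are invented for illustration.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age, annual income in thousands). Invented values.
REAL = [(23, 31.0), (35, 52.5), (41, 60.0), (29, 40.2), (52, 75.3),
        (47, 66.1), (33, 45.9), (60, 82.4), (38, 58.7), (26, 35.5)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(records):
    """Summarize the data by its means, standard deviations and correlation."""
    xs, ys = zip(*records)
    return {
        "mean_x": statistics.fmean(xs), "mean_y": statistics.fmean(ys),
        "sd_x": statistics.stdev(xs), "sd_y": statistics.stdev(ys),
        "rho": pearson(xs, ys),
    }


def sample(model, n, rng):
    """Draw n new records from a bivariate Gaussian with the fitted statistics."""
    residual = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        y = model["mean_y"] + model["sd_y"] * (model["rho"] * z1 + residual * z2)
        records.append((x, y))
    return records


if __name__ == "__main__":
    model = fit(REAL)
    synthetic = sample(model, 5000, random.Random(0))
    print("real:     ", model)
    print("synthetic:", fit(synthetic))
```

None of the generated rows is a copy of a row in `REAL`, yet refitting the model to the synthetic rows gives means, spreads and a correlation close to the originals. Production systems use far richer models, but the acceptance test is the same in spirit: compare the statistics of the synthetic set with those of the source.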
Nicolai talks about the “true essence of data” and how we might maintain this value when not using real-world data sets.
Synthesized.io views the data as a container that holds certain information, and that information is the true essence. The growth of datasets, however, is not the same as the growth of information that data carries. The goal is to pinpoint that essence and then share it with the proper infrastructure.
The Synthesized.io platform provides patterns of data that resemble real-world data, allowing companies to query for the pattern they are looking for. Machine learning automates the process of securely sharing data, so data scientists spend less time on monotonous compliance work and more time on analysis.
“Data scientists, [machine learning] engineers, test engineers, data users in general, end up spending most of their time making sure our data is ready for analysis instead of doing analysis,” Nicolai says. “And I felt that work should definitely, and can be, automated so that data users can start innovating much, much faster, in a safe but also efficient manner.”
Balancing data sharing with consumer privacy
While companies in fields like insurance, healthcare and financial services need high-quality data to provide better services for individuals, those individuals want their data protected.
This is the core mission of synthetic data.
“We believe that the privacy of personal data is a fundamental right,” says Nicolai. “And there is a way to solve the problem of sharing some information for development and testing purposes in a compliant manner by means of synthesized data science.”
Synthetic data solves the problems posed by anonymized data, which has proven less secure than originally believed because anonymized records can still be used to identify individuals.
Whereas synthetic data copies the essence of the original data, anonymization neutralizes it by stripping a dataset of sensitive attributes such as age, gender and ID information.
“The problem is that even with that information deleted or obfuscated — well destroyed — it’s possible to link back the anonymized records with original records by combining those anonymized records with some publicly available information,” Nicolai says.
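The linkage risk Nicolai describes can be sketched in a few lines of Python. Everything here is hypothetical: the records, the names, and the assumption that ZIP code and occupation survive anonymization and also appear in a public directory. The function `link` is an illustrative helper, not part of any real tool.

```python
from collections import defaultdict

# Hypothetical anonymized records: names and IDs removed, other attributes kept.
ANONYMIZED = [
    {"zip": "02139", "occupation": "nurse", "diagnosis": "asthma"},
    {"zip": "02139", "occupation": "teacher", "diagnosis": "diabetes"},
    {"zip": "94110", "occupation": "teacher", "diagnosis": "migraine"},
]

# Hypothetical publicly available directory with names attached.
PUBLIC = [
    {"name": "A. Example", "zip": "02139", "occupation": "nurse"},
    {"name": "B. Sample", "zip": "02139", "occupation": "teacher"},
    {"name": "C. Placeholder", "zip": "02139", "occupation": "teacher"},
    {"name": "D. Demo", "zip": "94110", "occupation": "teacher"},
]


def link(anonymized, public, keys=("zip", "occupation")):
    """Re-identify anonymized records whose key attributes match exactly one person."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match exposes the individual
            matches.append((candidates[0], record["diagnosis"]))
    return matches


if __name__ == "__main__":
    for name, diagnosis in link(ANONYMIZED, PUBLIC):
        print(name, "->", diagnosis)
```

In this toy data, two of the three anonymized records match exactly one person in the directory and are re-identified. The third is protected only because two people happen to share the same ZIP code and occupation.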
Anonymized data also lacks the quality of both real data and synthetic data. Without the values that have been stripped away, it has lost its true essence.
Correcting bias in data sets
Although the security of synthetic data isn’t guaranteed, it is greater than that of anonymized data and can be controlled from the start. Nicolai’s company ensures the solutions it provides are secure and compliant with data governance and data regulations.
Another advantage of synthetic data is correcting for bias. We know that bias exists in data sets and can cause serious problems when the data is used without awareness of that bias. Latent bias and selection bias are both examples of how biased data sets lead to biased outcomes.
“Our system is able to create completely fair and balanced streams of data and ensure that decisions made by systems which are trained and tested with the data are fair,” he says.
While anonymization only masks already biased data, producing a lower-quality version of the same data, synthetic data addresses the problem at its core.
“Companies try to solve the symptoms, whereas the actual problem is in the data, because that data is not only fed to that credit scoring solution, it is also fed to many other models but also services within the business,” says Nicolai.
“We need to ensure that we solve the problem at its core. This is what we do at Synthesized.io, by ensuring that the bias is eliminated from the data pipelines by creating simulated synthesized data streams to ensure that the decisions made using those data streams are fair and of high quality.”
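As a rough illustration of what a "balanced" data stream means, the Python sketch below rebalances a small, invented set of credit decisions so that every group has the same approval rate. It uses plain stratified resampling, a much simpler technique than the generative approach described in the episode. The data and the names `approval_rates` and `rebalance` are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical credit decisions as (group, approved) pairs. Invented values:
# group A is approved 80% of the time, group B only 30%.
BIASED = [("A", 1)] * 8 + [("A", 0)] * 2 + [("B", 1)] * 3 + [("B", 0)] * 7


def approval_rates(records):
    """Return the share of approved records for each group."""
    totals, approved = Counter(), Counter()
    for group, outcome in records:
        totals[group] += 1
        approved[group] += outcome
    return {group: approved[group] / totals[group] for group in totals}


def rebalance(records, per_cell, rng):
    """Resample so every (group, outcome) combination has per_cell records.

    Assumes each combination occurs at least once in the input.
    """
    cells = defaultdict(list)
    for group, outcome in records:
        cells[(group, outcome)].append((group, outcome))

    balanced = []
    for cell in sorted(cells):
        # Sampling with replacement lets small cells grow to the target size.
        balanced.extend(rng.choices(cells[cell], k=per_cell))
    return balanced


if __name__ == "__main__":
    print("before:", approval_rates(BIASED))
    print("after: ", approval_rates(rebalance(BIASED, 5, random.Random(0))))
```

A model trained on the rebalanced records no longer sees group membership correlated with approval. Real records carry many more attributes, which is where generating new synthetic rows, rather than repeating existing ones, becomes useful.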
This article is based on an episode of Tech on Reg, a podcast that explores all things at the intersection of law, technology and highly regulated industries. Be sure to subscribe for future episodes.