Do Algorithms Dream of Synthetic Data?

February 14, 2022


Data is critical to innovation in financial services, from testing new underwriting algorithms to sifting through government filings for sell signals to training fraud detection models. But real-world data – and particularly consumer data – can be hard to come by, even for incumbent institutions, because data may be siloed across the organization with access restricted by privacy regulations. It is especially hard for startups to gain access to the data they require cost-effectively and in sufficient quantities.

The use of synthetic data is growing in situations where data availability is limited for whatever reason. Synthetic data can be generated quickly and at scale, so its use should speed up innovation and help get products to market and insights to managers more quickly.

Synthetic data sets can be created based on real customer data so companies can train models and still comply with privacy protection regulations such as GDPR and CCPA. Using synthetic data also enables collaboration and partnership where privacy regulations might prevent the sharing of real data, whether inside or outside the organization.

MOSTLY AI uses a neural network to identify statistical patterns, structures, and variations in the original data which can be reproduced with a high degree of fidelity to the original data and without personally identifying information. The promise is that this synthetic data will enable innovation at lower cost and with greater speed.

Founded in 2017 by data scientists Michael Platzer, Klaudius Kalcher, and Roland Boubela, Vienna-based MOSTLY AI creates structured synthetic data for use in the development of AI and analytics projects in the banking and insurance fields. The firm recently raised a $25 million Series B round of funding led by UK-based Molten Ventures with participation from existing investors Earlybird and 42CAP, and new investor Citi Ventures. With the money, MOSTLY AI plans to accelerate growth in Europe and in the U.S. and will ramp up hiring worldwide. The firm currently has 35 employees.

Tobias Hann is CEO.

Q.     Tobi, what is the nature of synthetic data? What does “synthetic data” actually mean?

A.     The term has been around for a while. Synthetic means “made up” or “artificial”. When we say “synthetic data” we are talking about machine-learning-generated synthetic data, and that may be a little different from the way the term was previously interpreted.

Q.     What is the difference between synthetic data and anonymized data?

A.     Companies today collect a lot of data on who people are and what they have done. When you anonymize data sets, you delete part of the data, you mask part of the data, and you add random noise. You are trying to blur the data a little bit to make it harder to reidentify an individual. 

There are two downsides to this. One, you lose data quality and granularity. The second problem is that there is still the risk of reidentification because you are always working on a specific data record that was there in the real data and that same specific record is still there in the anonymized data set.

Privacy-preserving synthetic data generated via machine learning algorithms has a different approach. You don’t look at one individual record and modify it in a certain way. You look at the full data set and you train a machine learning model with that data set. It’s an unsupervised machine learning process and the goal is to learn as much as possible about the input data – all the statistical patterns, the correlations, time dependencies, and so forth. The result of that machine learning process is basically a statistical model that represents the input data. Then you use that statistical model to create a virtual data set that represents the original data set, but you cannot link any given record in the synthetic data set back to any one individual record in the original data set. You can only link it back to the sum of the data.
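The fit-then-sample idea can be illustrated with a deliberately naive sketch. Everything here is hypothetical – a tiny invented (age, income) table, and simple Gaussian marginals plus one correlation standing in for the neural-network generators described above:

```python
import random
import statistics

# Toy "real" table of (age, income) pairs; all values are invented.
real = [(23, 31000), (35, 52000), (41, 61000), (29, 40000),
        (52, 75000), (47, 68000), (33, 48000), (38, 57000)]
ages = [a for a, _ in real]
incomes = [i for _, i in real]

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

# "Train" the model: here just marginal means/stddevs plus one
# correlation. A real generator learns far richer structure.
mu_a, sd_a = statistics.mean(ages), statistics.stdev(ages)
mu_i, sd_i = statistics.mean(incomes), statistics.stdev(incomes)
rho = pearson(ages, incomes)

def sample_synthetic(n, seed=1):
    """Draw brand-new records from the fitted model; no output row is
    a masked or perturbed copy of any input row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mu_a + sd_a * z1
        # Correlated second dimension (2x2 Cholesky factor by hand).
        income = mu_i + sd_i * (rho * z1 + max(0.0, 1 - rho**2) ** 0.5 * z2)
        rows.append((round(age), round(income)))
    return rows
```

Because every synthetic row is drawn from the fitted model rather than derived from a specific original record, a row can only be linked back to the sum of the data – which is the key difference from anonymization.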

Q.     What is the primary use case for synthetic data today? Is it training machine learning models?

A.     That’s one of the main use cases we see, yes. The second big use case we see is around software application testing. You don’t want to use production data for testing. Synthetic data is a very good alternative.

Q.     Why isn’t this something that giant financial institutions, which spend hundreds of millions annually on IT, can do themselves?

A.     We know there are a couple of large institutions that have started to investigate synthetic data, but it is a difficult technological challenge. The way we use machine learning was just not possible a couple of years ago. The machine learning frameworks were not there, the compute power was not there, the cloud was not there. It’s a non-trivial thing to do. For a lot of the clients we work with and the prospects we speak with, it’s not their core expertise. 

Q.     How big is the market for synthetic data?

A.     There is no reliable estimate of how big the market is today for structured synthetic data, which is the field we’re in. There is also the field of unstructured synthetic data, which by itself is very interesting and quite large as well.

Our strong belief is that in the next couple of years every organization of a certain size will start to use structured synthetic data in their data stack. We think the potential is huge.

Q.     What are the limitations, in practice, of synthetic data?

A.     Not all synthetic data is equal. We’ve seen big differences when it comes to the two main aspects of synthetic data. One is the data quality: How accurately does the synthetic data represent the real data? The other is privacy: How strong are the privacy guarantees that you can get?

There are some open-source tools for creating synthetic data and there are many competitors. The quality of the synthetic data, the accuracy, varies widely between solutions and vendors. 

We produce very high-quality synthetic data, especially when it comes to behavior data which has a time component to it. In that respect, our product is far ahead of the competition. With our platform, there’s not a lot that can’t be done with synthetic data. Whatever you would do with the real data, you can do with synthetic data. For other solutions, that will probably not be the case because the accuracy will not be there. The synthetic data would not be as representative as it would need to be.

Q.     How do you demonstrate that accuracy? How do you prove that the synthetic data you create is as complex as the underlying data it aims to replicate?

A.     In two ways. For every data set our platform creates, the client automatically gets a quality assurance report with a number of checks and statistics comparing the real data to the synthetic data that was generated, and we calculate an overall accuracy metric that gives you an assessment of the quality. That's the first indication. 
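As a rough illustration of what such a report might compute – the function name and the 0–100 scoring below are invented, and a real report compares far more than means and standard deviations:

```python
import statistics

def qa_report(real_cols, synth_cols):
    """Compare per-column mean and stddev between real and synthetic
    columns and roll the differences up into a single 0-100 score.
    A toy stand-in for a full QA report, which would also compare
    distributions, correlations, and time dependencies."""
    scores = {}
    for name, r in real_cols.items():
        s = synth_cols[name]
        # Relative errors of the two summary statistics (guard against
        # zero denominators).
        mean_err = abs(statistics.mean(r) - statistics.mean(s)) / (abs(statistics.mean(r)) or 1)
        std_err = abs(statistics.stdev(r) - statistics.stdev(s)) / (statistics.stdev(r) or 1)
        scores[name] = round(max(0.0, 100 * (1 - (mean_err + std_err) / 2)), 1)
    scores["overall"] = round(sum(scores.values()) / len(scores), 1)
    return scores
```

Called with two dicts mapping column names to lists of values, it returns a per-column score plus the overall metric; identical inputs score 100.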

The second way is that when companies first engage with synthetic data, they typically run the downstream tasks they would have done with the real data on the synthetic data instead, and then compare the results. For example, training a machine learning model: they train the model on the real data, train the same model on the synthetic data, and the outcome tells them about the quality of the data. What they find is that the results are very comparable. 
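This "train on real, train on synthetic, compare on a holdout" evaluation can be mimicked end to end with a toy one-feature classifier. The generator below stands in for both the real data and a high-fidelity synthetic copy; all names and numbers are made up:

```python
import random

def make_data(n, rng):
    """Toy labeled data: feature ~ N(0,1) for class 0, N(2,1) for class 1."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.5
        data.append((rng.gauss(2.0 if label else 0.0, 1.0), int(label)))
    return data

def train_threshold(data):
    """'Train' a one-feature classifier: threshold midway between class means."""
    m0 = sum(x for x, y in data if y == 0) / max(1, sum(1 for _, y in data if y == 0))
    m1 = sum(x for x, y in data if y == 1) / max(1, sum(1 for _, y in data if y == 1))
    return (m0 + m1) / 2

def accuracy(threshold, data):
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

rng = random.Random(42)
real_train = make_data(2000, rng)
synth_train = make_data(2000, rng)   # stands in for a faithful synthetic copy
holdout = make_data(2000, rng)       # held-out real data for evaluation

acc_real = accuracy(train_threshold(real_train), holdout)
acc_synth = accuracy(train_threshold(synth_train), holdout)
```

If the synthetic data is representative, the two holdout accuracies come out nearly identical, which is exactly the comparison described above.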

Q.     How do you price synthetic data?

A.     There’s currently no standard for pricing synthetic data. It’s not like cloud services, where you pay for compute time or storage. Obviously, how much you use the software tools and how much synthetic data you create, how many users have access to the platform can all be factors, but pricing is currently tailored to each situation.  

Q.     Are you planning to tackle unstructured data? Video, audio, images?

A.     It’s not our current strategy. We are the leader when it comes to structured data. We’ve been doing it the longest. We have great expertise and a great product. Unstructured data is fascinating, so I wouldn’t want to rule it out, but we’re not working on it currently.

Q.     How are you distributing the MOSTLY AI platform?

A.     Our clients typically have very sensitive data sets that they want to synthesize, so our distribution model is typically on-prem deployment or deployment in the customer's private cloud. Our software ships in containers and is very easy to install and set up. It's not an infrastructure that requires them to upload data to us.

Q.     If you don’t take possession of sensitive client data, why get ISO certification?

A.     You’re right – because it’s an on-prem deployment, our solution typically works totally autonomously, does not require internet connectivity, and the data doesn’t get transferred to us. It stays within the secure computing environment of the client. In that sense, we’re not a data processor from a legal perspective.

So why are we ISO certified and also SOC 2 certified? Because we are working with very large organizations – large banks and insurance companies – and they of course want to be able to trust that we are a serious vendor, and those certifications show that our processes and our security standards are really up to speed.

Q.     Do I need to be a data scientist or have a masters in statistics to use the MOSTLY AI platform?

A.     Anyone can use it. It’s really simple. If you’re able to work with a web interface, you can create synthetic data. We have a public demo online and anyone can sign up to try it. It’s very easy. If you have smaller data sets, you can just drag and drop and off you go.

Q.     If synthetic data is based on the statistical properties of the real data, isn’t it necessarily backwards looking, and so doesn’t it retain all the biases of the real data?

A.     That’s absolutely correct. Since the data that our platform can create is so representative of the input data, any bias that was in the input data will be present in the synthetic data. It’s not a bug, it’s a feature. We want to produce highly representative synthetic data. 

Yet, at the same time, more and more of our clients and prospects are interested in the opportunity to correct for biases because if you use biased data sets to train machine learning models, it can result in biased models and sub-optimal business outcomes.

One of the opportunities of synthetic data is to correct for these biases by defining beforehand what are the variables, or what is the one key variable, that you want to de-bias. First, you need to define what is a bias, and that’s more of a business decision.

If you think about the concept of fairness, it’s actually very complex. If you want to create an unbiased – fair – data set, you as the organization must define what it is that you want to de-bias. What is the fairness you want to create? Is it equal income, or something else?

Define that and then you can create synthetic data by adding a constraint that corrects for these biases. The beauty of it is that not only is the one relationship or variable causing the bias corrected for; any proxy variables in the data set that represent hidden bias are corrected for as well. But all the other statistical properties, all the other correlations, remain intact. What you’re really getting is a data set that is de-biased, that is fair, but that still contains all the other statistical properties of the real data.
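One generic way to picture such a constraint – this is not MOSTLY AI's proprietary method, just a simple post-hoc rejection-sampling sketch with hypothetical field names – is to resample records until the positive-outcome rate is equal across groups (statistical parity):

```python
import random

def debias_by_parity(records, group_key, outcome_key, seed=0):
    """Resample records so the positive-outcome rate is equal across
    groups (statistical parity). A hypothetical illustration; a real
    generator would enforce the constraint during generation rather
    than by post-hoc thinning."""
    rng = random.Random(seed)
    # Count positives and totals per group.
    stats = {}
    for r in records:
        pos, n = stats.get(r[group_key], (0, 0))
        stats[r[group_key]] = (pos + r[outcome_key], n + 1)
    target = min(pos / n for pos, n in stats.values())  # lowest group's rate
    out = []
    for r in records:
        pos, n = stats[r[group_key]]
        rate = pos / n
        if r[outcome_key] and rate > target:
            # Thin positives to match the target odds target/(1-target),
            # which brings this group's rate down to exactly `target`.
            keep = (target / (1 - target)) * ((1 - rate) / rate)
            if rng.random() < keep:
                out.append(r)
        else:
            out.append(r)
    return out
```

Thinning only the over-represented positives keeps the rest of each record untouched, mirroring the point above: the constrained variable is corrected while other properties of the data survive.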

Q.     Are you using proprietary bias-mitigating algorithms specific to financial services, open-source algorithms, or a combination?

A.     We have a proprietary approach to that in our platform and we’ve published some extensive blog posts on the topic because we find it so interesting.

Q.     What are the limitations of your ability to correct bias with synthetic data? Must you know a bias exists to correct for it?

A.     The limitation we touched upon in the beginning is that you have to define what is fair in advance. That’s not up to us. You cannot have a data set and say, “make it fair”, and just click a button and remove all bias.

That’s something more organizations are starting to think about: do they have biases in their existing data? You need to analyze the data. If there are hidden biases, again, there is no magic way to remove them. It doesn’t work like that.

Q.     Does the use of synthetic data make it harder for data scientists to explain how their models work?

A.     Absolutely not. I think quite the opposite is true. In the past, those models were black boxes, and no one could really explain why they behaved the way they did. We believe synthetic data will play an even more important role in the future of explainable and trustworthy AI, because it allows organizations to create new data sets they haven’t seen before, which can be used to benchmark and test those models – to describe and explain them, as you put it.

To explain models, we think the best thing to do is to show a few examples – feeding data into the model and then showing what happens. You can create synthetic data that represents outliers and other cases you may not have thought of when you first created your models in order to stress test them and see what is happening.

Q.     What steps should a company take in terms of governance, in terms of data strategy, before engaging with synthetic data? Before they embark on a synthetic data journey, do they need to have a certain level of data maturity?

A.     It’s a great question. Yes, a certain level of data maturity is required before you engage with synthetic data. If you are still trying to figure out what kinds of data you have as a company, and where that data is stored or hidden, then thinking about synthetic data might be a bit of a challenge. 

What we’ve seen is that companies engaging with synthetic data today are typically the leading companies when it comes to data strategy. They have, for example, invested in data infrastructure, they know what kind of data assets they have, and they’ve moved to the cloud, which makes it easier to work with synthetic data. So, yes, a certain level of maturity is required.

Q.     Can synthetic data play a role for a financial institution that is contemplating the movement of production data to the cloud?

A.     Absolutely. One of the problems or challenges is that if you are moving to the cloud for the first time, you are going to have a lot of questions and concerns about privacy, so you probably don’t want to move straight away with production data. You might want to test drive it first. We’ve seen companies use synthetic data to do that because it’s much richer than dummy data created by simple rules. So you use synthetic data to test drive your cloud infrastructure and those types of things which can help with adoption.

Q.     How do you win over the data governance and compliance folks at prospective clients?

A.     We have to create trust. Individuals who are working on the compliance side of things want to understand how it works and if the data is really anonymous.

To help create that trust, we’ve invested in external validations and assessments of our solutions. We have an extensive legal and technical assessment we can share with prospects that shows that the data we are creating really is fully anonymous and complies with the relevant privacy regulations.

The other thing that happens is a proof of concept where our software is installed and the client is working with actual data. They will look at the data and the quality. They will determine if the performance of their downstream tasks is the same with the synthetic data as it is with the real data. That creates trust, once they see that. 

The second part consists of in-built privacy tests. For every data set that you create, you receive a privacy report as well that shows that the data is fully anonymous; and if it’s not, it shows what you have to do to correct for that. This also starts to create trust.

Q.     What differences do you see between the U.S. and Europe in interest in MOSTLY AI’s products? Does the need to comply with GDPR make European banks and insurance companies keener to use synthetic data than those in the U.S.? Is CCPA driving interest in the U.S.?

A.     When we started MOSTLY AI five years ago, GDPR wasn’t there yet. But yes, we definitely see that regulation helps increase awareness around the topic. Yet at the same time, we also see that more and more organizations without any regulatory pressure are starting to increase their focus on privacy because they see that it can be a competitive advantage, right? It’s something customers care about. If you are taking privacy seriously, you have a competitive edge. And it’s really just the right thing to do.      

Q.     Access to data is frequently a stumbling block for fintech startups. Where did you turn to access that data before clients came on board?

A.     You’re right – it’s often a problem for startups in the fintech space and in other areas. If you don’t have access to data, it’s hard to come up with models, or in our case, model architectures in order to create synthetic data. We were quite lucky to find a bank in Europe that was quite innovative and really wanted to partner with us on this topic. They gave us access to raw banking data to work on our product and develop parts of our platform.

Q.     Have you looked at ways to make synthetic data available to other startups, in partnership with incumbent firms? I’d bet some incumbents would like to see what other early-stage fintechs would do with that data.

A.     We are a software provider. We offer a software tool that allows organizations to create synthetic data based on the existing data that they have. As mentioned before, the tool runs autonomously within the computing environment of the client. What the clients do with the data and who they share it with is not our responsibility. There’s no kind of marketplace for synthetic data. 

However, what we see is that one of the use cases that large organizations are interested in is exactly what you are talking about. It’s sharing synthetic data with third parties – with startups, with vendors, with external consultants, with universities, and so forth. We are certainly seeing that starting to take place. It’s a great use case for synthetic data. With our technology, we are hopefully enabling an ecosystem for others, as well.

Q.     What is the fintech startup scene like in Vienna? What are its strengths and weaknesses? I noticed that your current round did not include any Austrian VCs. 

A.     One of our early, seed-stage investors is an Austrian VC, but as the rounds get larger, scale-ups begin to look beyond Austria. This round was led by Molten Ventures, a UK-based firm.

The startup scene in Vienna is growing. There’s still room for more growth but in fintech we have a company called Bitpanda which is very successful and one of the unicorns we have in Austria. There are other cool companies emerging, some in the space of crypto investing.

Vienna is a lovely city so I can recommend looking here if you are setting up a business.

Q.     Are you able to find all the engineering talent you need in Vienna? And the marketing and product talent, etc., or are you looking globally?

A.     We now have a team of about 35 people. Out of the 35, ten work remotely across Europe. We are moving in the direction of becoming a remote-first company and attracting talent from all over Europe and beyond. Corona has certainly accelerated that.

Is it possible to find talent in Vienna? Yes, it is, but finding the talent you need is a challenge no matter where you are located. You have to differentiate yourself as a company to be attractive.

Q.     You’re growing very rapidly. What are the biggest challenges to maintaining or accelerating that growth? What aspect of it is most difficult to manage?

A.     Recruiting is one. We want to make sure we are hiring people with the right skills but also the right mindset and a cultural fit for the company. We allow time for candidates to get to know us but also for us to get to know them.

The second thing, as we move more into a remote setting, is maintaining the company culture – making sure that communication is alive and that our values are being lived. 

Q.     What are your most urgent hiring needs?

A.     We’re really hiring across the board – from software engineers to data scientists to salespeople to product managers to marketers. We have about 20 openings across all those different areas and all of them are important to us right now. We’ll be adding more new positions throughout the year as we grow.

# # #