Entopy makes a breakthrough in the generation of synthetic tabular datasets to overcome data accessibility challenges.

Data availability or accessibility remains a critical challenge in the delivery of data-driven or Artificial Intelligence (AI) based solutions. This challenge is particularly prominent in ecosystem environments where data sensitivities prevent the sharing of data or in emerging environments where infrastructure and systems are in the process of being implemented.

At Entopy, we have been progressing research to address some of the challenges associated with data accessibility. We have developed methods to produce accurate synthetic tabular datasets, enabling us to expand datasets with synthetic data and overcome the ‘lack of data’ problem that can prevent successful implementation of our AI-enabled Digital Twin software.

This article, written by Toby Mills and Shahin Shemshian, discusses Entopy’s recent progress in developing an AI model capable of generating highly accurate synthetic tabular datasets, helping to overcome data availability challenges within multistakeholder, real-world operational environments, maximising the efficacy of its AI-enabled Digital Twin software and reducing the time needed to deploy it.

*Entopy’s research and development in this area is contributing to an academic research paper, expected to be submitted in early 2025 in partnership with the University of Essex. For this reason, this blog will not go into specific details or share actual results.

The challenge

Data is critical to driving the envisaged wave of transformation discussed by industry and government. But in many cases, a lack of data (or unavailability of data) prevents solutions from being mobilised. This is particularly prominent where multiple organisations are involved in an operation or ecosystem and data must therefore be contributed by many independent stakeholders, a dynamic true of most real-world operational environments where AI and Digital Twin technologies have the potential to deliver the most profound impacts.

Entopy has developed technology that enables data to be shared between organisations whilst ensuring the privacy and security of that data. Its software has been used in large ecosystem contexts with highly sensitive data being shared in real time to support the automation of processes and the identification of key events/alerts.

Entopy’s AI micromodel technology delivers the same capabilities (amongst others) when deriving predictive and simulative intelligence across multistakeholder, real-world operational environments. It leverages a distributed network of smaller but more focused AI models that integrate into an overall Digital Twin, with the outputs from many models orchestrated together with real-time data to deliver highly effective intelligence across many contexts.

However, it is in delivering predictive intelligence that we have identified additional, significant barriers to data availability. Effective probabilistic models need a lot of data, which means large amounts of historical data must be shared. In this context, the challenge is not just ensuring that permissions on shared data can be controlled at a granular level (again, amongst other things) but also that enough data is available and that only relevant data is requested. To keep things simple, we have listed a few of the challenges we have seen below (these are from an Entopy perspective, based on what we are seeing and what affects us the most; there will likely be others):

  • Not enough data exists: In cases where there is a seasonal component to operational activity, a minimum of 12 months of data history is required. For some organisations, the systematic capture of data into useful formats is a relatively new practice, which means we often see organisations that want to ‘do AI’ but simply do not have the data to support it. The same is obviously true for new infrastructure, new operations and so on.
  • Systems are changing: For many organisations, the upgrading and modernisation of systems is happening at pace, meaning new systems are being implemented across the organisation(s). Issues can arise when the new systems replace the old: these can be relatively trivial, such as formatting differences, or more material, such as when the new systems generate radically different datasets.
  • It’s too sensitive: In ecosystem contexts, data will be required from many organisations. Sharing data about a specific process in real time might be acceptable, but sharing a 3-year historical dataset covering a particular part of an organisation’s operation is a completely different kettle of fish. This is very much lifting the bonnet, and a huge amount of trust is required that this historical information will not be used maliciously or in bad faith.
  • Time, time, time: The amount of time it takes to put the necessary agreements in place and build mutual trust is extensive. As data-driven and AI solutions enter the arena of ecosystems and critical infrastructure (which is where they will have the most notable impact), these challenges are set to persist. And often, seeing is believing: it is the belief that value will be realised that ultimately drives the sharing of any data at all.

Generating accurate synthetic tabular data with Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) have long been used to generate synthetic data. They are typically used to generate images or text and have been widely successful. The idea is simple: two Artificial Neural Networks (ANNs) compete, with one generating ‘fake’ data (the Generator) and the other being fed both real and fake data and learning to tell them apart (the Discriminator, sometimes called the Detector). As the models compete, the Generator gets better, producing ever more convincing fake data in its attempt to beat the Discriminator, until the fake data is indistinguishable from the real data.
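
For readers less familiar with the setup, the sketch below shows the adversarial training loop in its most basic form, using PyTorch. It is purely illustrative and is not Entopy’s model; the network sizes, learning rates and the randomly generated ‘real’ dataset are placeholders.

```python
# Minimal, illustrative GAN training loop in PyTorch.
# Generic sketch of the adversarial setup described above, not Entopy's model;
# network sizes, hyperparameters and the stand-in dataset are placeholders.
import torch
import torch.nn as nn

n_features = 8   # width of one (normalised) tabular record -- assumed
noise_dim = 16   # size of the random input fed to the Generator

generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),   # probability that the input is real
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_data = torch.randn(1024, n_features)  # stand-in for a real, normalised dataset

for step in range(1000):
    batch = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, noise_dim))

    # Train the Discriminator: real records -> 1, generated records -> 0
    d_loss = loss_fn(discriminator(batch), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the Generator: try to get its fakes labelled as real
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

In practice, tabular data needs additional handling for mixed numeric and categorical columns, which is exactly what motivates the tabular-specific approaches discussed next.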

In the context of the challenge Entopy is trying to overcome, we wanted to use GANs to generate synthetic data, but instead of generating text or images, we wanted to generate tabular data to help train and improve our operational AI models.

There is a BIG difference. The accuracy of images and text is subjective: there is no right or wrong, only the audience’s perception. With tabular data, quality is objective. Bad tabular data fed into an operational AI micromodel will lead to a very badly performing operational AI micromodel. In partnership with the University of Essex, Entopy’s research has focused on progressing concepts in the domain of synthetic data generation using GANs to deliver effective methods for generating accurate synthetic tabular data.
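
Entopy’s own model is not described in this post. As a point of reference only, the open-source CTGAN library implements a comparable published approach to GAN-based tabular synthesis; the snippet below shows its general workflow under assumed column names and file paths, and should be read as indicative rather than a description of Entopy’s method.

```python
# Illustrative only: the open-source CTGAN library (https://github.com/sdv-dev/CTGAN)
# as one published approach to GAN-based tabular synthesis. This is not Entopy's model.
import pandas as pd
from ctgan import CTGAN

# Hypothetical operational dataset with mixed numeric and categorical columns
real_df = pd.read_csv("operations_history.csv")      # assumed file name
discrete_columns = ["site", "shift", "event_type"]   # assumed categorical columns

model = CTGAN(epochs=300)
model.fit(real_df, discrete_columns)

# Expand the dataset with synthetic rows (here, three times the original size)
synthetic_df = model.sample(len(real_df) * 3)
```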

A specific business need was presented that supported the research activity. A customer wanted to achieve highly accurate predictive intelligence across a specific aspect of its multi-faceted operation (which involves several stakeholders contributing within the overall ecosystem). The area of the operation had a seasonal aspect, but due to various system upgrades, only 12 months of historical operational data were available.

Using the actual data, Entopy was able to achieve a modest AI micromodel performance of ~70%, but it was clear that more data was needed to improve model performance and to ensure confidence in the operational AI micromodels deployed.

The result of Entopy’s research is a set of models capable of delivering accurate synthetic tabular data by learning from the real data available to them. Statistically analysing the patterns in both the real and generated datasets shows that the two exhibit the same characteristics, and training different ML models on each dataset yields very similar performance. The generated data must also be plausible with respect to reality: for instance, if you are generating data about car speeds, the generated values must be positive and bounded. Together, these checks show how ‘real’ the generated data is.
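
To make those checks concrete, the sketch below shows three common validation steps for synthetic tabular data: a per-column distribution comparison, a ‘train on each dataset, test on held-out real data’ comparison, and a simple domain plausibility rule. It is a minimal illustration, not Entopy’s evaluation pipeline; the column names, target and thresholds are assumptions.

```python
# Minimal validation sketch for synthetic tabular data (inputs are pandas DataFrames).
# Not Entopy's evaluation pipeline; column names and limits are assumptions.
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def compare_columns(real_df, synth_df, numeric_cols):
    """Two-sample KS statistic per numeric column: lower means more similar distributions."""
    return {c: ks_2samp(real_df[c], synth_df[c]).statistic for c in numeric_cols}

def train_real_vs_synthetic(real_df, synth_df, target):
    """Train the same model on real and on synthetic data; evaluate both on held-out real data."""
    train, test = train_test_split(real_df, test_size=0.3, random_state=0)
    scores = {}
    for name, df in [("real", train), ("synthetic", synth_df)]:
        model = RandomForestRegressor(random_state=0)
        model.fit(df.drop(columns=[target]), df[target])
        scores[name] = r2_score(test[target], model.predict(test.drop(columns=[target])))
    return scores  # similar scores suggest the synthetic data preserves the real signal

def plausibility_check(synth_df):
    """Domain rule example: generated speeds must be non-negative and physically bounded."""
    return synth_df["speed_kmh"].between(0, 200).all()  # "speed_kmh" is a hypothetical column
```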

The model was used to expand the original dataset several times over, and the expanded dataset was used to train operational AI micromodels, achieving much-improved model accuracy.

Further research to overcome known challenges

Alongside the GAN research, Entopy has been progressing research into understanding the feature importance of target datasets. This work advances concepts in reinforcement learning, with the primary aim of accelerating the development and deployment of operational AI micromodels.

Entopy has developed effective reinforcement learning algorithms that are deployed today within the operational AI micromodel context, providing predictive intelligence for certain problem types within target environments. This research, however, looks to use reinforcement learning as a ‘back-end’ tool, helping Entopy’s delivery teams mobilise effective operational AI models more quickly by shortening the problem/evaluation process through automation.

Looking forward, the ability to effectively understand datasets and identify feature importance could also be a useful mechanism for overcoming certain sensitivity challenges associated with AI-enabled Digital Twin deployments in multistakeholder environments.
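
Entopy’s reinforcement-learning formulation is not described here. As a much simpler, standard illustration of what identifying feature importance can look like in code, the sketch below uses permutation importance: it measures how much a trained model’s held-out performance degrades when each feature is shuffled. The dataset, target column and model choice are all assumptions made for the example.

```python
# Standard permutation-importance check (scikit-learn), shown as a simple stand-in
# for the feature-importance analysis discussed above; not Entopy's RL-based method.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical, fully numeric operational dataset; "throughput" is an assumed target column
df = pd.read_csv("operations_history.csv")
X, y = df.drop(columns=["throughput"]), df["throughput"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Rank features by how much shuffling each one hurts held-out performance
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```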

What does this mean for Entopy and its customers?

This breakthrough means that Entopy can deploy its software in areas where others cannot, overcoming data availability challenges by generating highly accurate synthetic tabular data to support the mobilisation of effective operational AI micromodels.

This is a prominent challenge in Entopy’s strategic focus area of critical infrastructure, where a mix of new and legacy systems, an accelerating ambition to upgrade those legacy systems and a distinct lack of available data combine with ecosystems of partners that bring high sensitivity around data and the sharing thereof.

Furthermore, given the global pressures on critical infrastructure, changes to increase capacity are inevitable. Synthetic data will be a key tool in simulating the impact of these future changes on operations and on the wider ecosystem.