Data availability or accessibility remains a critical challenge in the delivery of data-driven or Artificial Intelligence (AI) based solutions. This challenge is particularly prominent in ecosystem environments where data sensitivities prevent the sharing of data or in emerging environments where infrastructure and systems are in the process of being implemented.
At Entopy, we have been researching ways to address some of the challenges associated with data accessibility. We have developed methods to produce accurate synthetic tabular datasets, which let us expand real datasets with synthetic data and overcome the ‘lack of data’ problem that can prevent successful implementation of our AI-enabled Digital Twin software.
This article, written by Toby Mills and Shahin Shemshian, discusses Entopy’s recent progress in developing an AI model capable of generating highly accurate synthetic tabular datasets. The model helps to overcome data availability challenges within multistakeholder, real-world operational environments, maximising the efficacy of Entopy’s AI-enabled Digital Twin software and reducing the time needed to deploy it.
*Entopy’s research and development in this area is contributing to an Academic Research Paper that is expected to be put forward in early 2025 in partnership with the University of Essex. For this reason, this blog will not go into specific details or share actual results.
The challenge
Data is critical to driving the envisaged wave of transformation discussed by industry and government. But in many cases, a lack of data (or unavailability of data) prevents solutions from being mobilised. This is particularly prominent where multiple organisations are involved in an operation or ecosystem and data must therefore be contributed by many independent stakeholders. This dynamic is true of most real-world operational environments where AI and Digital Twin technologies have the potential to deliver the most profound impacts.
Entopy has developed technology that enables data to be shared between organisations whilst ensuring the privacy and security of that data. Its software has been used in large, ecosystem contexts with highly sensitive data being shared in real-time to support the automation of processes and the identification of key events/alerts.
Entopy’s AI micromodel technology delivers the same capabilities (amongst others) when deriving predictive and simulative intelligence across multistakeholder, real-world operational environments. It leverages a distributed network of smaller but more focused AI models that integrate into an overall Digital Twin, where the outputs from many models are orchestrated together with real-time data to deliver highly effective intelligence across many contexts.
However, it is in the context of delivering predictive intelligence that we have identified additional significant barriers to data availability. Effective probabilistic models need a lot of data, so large amounts of historical data must be shared. In this context, the task is not just to ensure that permissions on shared data can be controlled at a granular level (again, amongst other things) but also to ensure that enough data is available and that only relevant data is requested. To make this simple, we have listed a few of the challenges we have seen below (these are from an Entopy perspective based on what we are seeing and what affects us the most; there will likely be others):
Generating accurate synthetic tabular data with Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GANs) have been used to generate synthetic data since they were introduced in 2014. They are typically used to generate images or text and have been widely successful. The idea is simple: two Artificial Neural Networks (ANNs) compete, with one generating ‘fake’ data (the Generator) and the other being fed both real and fake data and trying to tell them apart (the Discriminator). As the models compete, the Generator improves, producing more convincing fake data in its attempt to beat the Discriminator, up to the point where the fake data is indistinguishable from the real data.
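For readers who want the formal version, the competition described above is commonly written as the minimax objective from the original GAN formulation (Goodfellow et al., 2014). This is a reference sketch only, and the symbols are assumptions introduced for illustration: G is the Generator, D is the network that separates real from fake, p_data is the distribution of the real data and p_z is the random noise fed to the Generator. It is not a statement of the exact loss used in Entopy’s model.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

D is trained to push V up by scoring real samples close to 1 and generated samples close to 0, while G is trained to push V down by producing samples that D scores close to 1.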
In the context of the challenge Entopy is trying to overcome, we wanted to use a GAN to generate synthetic data, but instead of generating text or images, we wanted to generate tabular data to help train and improve our operational AI models.
There is a BIG difference. The accuracy of images and text is subjective. There’s no right or wrong, only the audience’s perception. But tabular data is binary: it is either accurate or it is not. Bad tabular data fed into an operational AI micromodel will lead to a poorly performing operational AI micromodel. In partnership with the University of Essex, Entopy’s research has focused on progressing concepts in the domain of synthetic data generation using GANs to deliver effective methods for generating accurate synthetic tabular data.
A specific business need was presented which supported the research activity. A customer wanted to achieve highly accurate predictive intelligence across a specific aspect of its multi-faceted operation (which involves several stakeholders contributing within the overall ecosystem). The area of the operation had a seasonal aspect but, due to various system upgrades, there were only 12 months of historical operational data available, so each season appeared only once in the data.
Using the actual data, Entopy was able to achieve a modest AI micromodel performance of ~70% but it was clear that to achieve better model performance and ensure confidence in the operational AI micromodels deployed, more data was needed.
The results from Entopy’s research are models capable of delivering accurate synthetic tabular data by learning from the real data available to them. We assess how ‘real’ the generated data is in three ways. First, by statistically analysing and comparing the patterns in the real and generated datasets, we can conclude that both datasets show the same characteristics. Second, training different ML models on each dataset shows very similar performance. Third, the generated data must be plausible in reality. For instance, if you’re generating data about car speeds, the generated values must be non-negative and fall below a realistic upper limit.
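The statistical and plausibility checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Entopy’s actual validation code: the car-speed columns are simulated, the 0 to 130 km/h bounds are made up for the example, and the helper names (`ks_statistic`, `within_bounds`, `summarise`) are hypothetical. The two-sample Kolmogorov-Smirnov statistic is swapped in here as one common way to compare two distributions.

```python
import random
import statistics
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = no overlap in range)."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def within_bounds(values, low, high):
    """Plausibility check: every value lies in [low, high]."""
    return all(low <= v <= high for v in values)


def summarise(values):
    """Mean and sample standard deviation of one column."""
    return statistics.fmean(values), statistics.stdev(values)


# Hypothetical car speeds in km/h. Both columns are simulated here;
# in practice the second one would come from the trained generator.
rng = random.Random(0)
real_speeds = [min(max(rng.gauss(60, 10), 0), 130) for _ in range(500)]
synthetic_speeds = [min(max(rng.gauss(60, 10), 0), 130) for _ in range(500)]

print("real mean/std:", summarise(real_speeds))
print("synthetic mean/std:", summarise(synthetic_speeds))
print("KS statistic:", ks_statistic(real_speeds, synthetic_speeds))
print("plausible:", within_bounds(synthetic_speeds, 0, 130))
```

A small KS statistic together with similar summary statistics suggests the two columns follow the same distribution, while the bounds check catches values that could never occur in reality, such as a negative speed.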
The model was used to expand the original dataset to several times its original size, and the expanded dataset was used to train operational AI micromodels, achieving much-improved model accuracy.
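One way to check whether expanding a dataset actually helps is to train a model once on the real rows alone and once on the real plus synthetic rows, then score both on held-out real data that neither model saw during training. The sketch below illustrates that protocol only. Everything in it is an assumption made for the example: the data is simulated, the ‘synthetic’ rows are a stand-in for generator output, and the one-feature nearest-centroid classifier is a deliberately simple substitute for an operational AI micromodel.

```python
import random
import statistics


def make_rows(rng, n):
    """Hypothetical (feature, label) rows; label 1 has a higher feature mean."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def fit_centroids(rows):
    """'Train' a nearest-centroid model: one mean feature value per label."""
    return {
        label: statistics.fmean([x for x, y in rows if y == label])
        for label in (0, 1)
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, rows):
    """Share of rows whose predicted label matches the true label."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


rng = random.Random(42)
real_train = make_rows(rng, 20)    # small real training set
real_test = make_rows(rng, 500)    # held-out real rows, never used for training
synthetic = make_rows(rng, 200)    # stand-in for rows produced by a generator

acc_real = accuracy(fit_centroids(real_train), real_test)
acc_augmented = accuracy(fit_centroids(real_train + synthetic), real_test)
print(f"real only: {acc_real:.2f}, real + synthetic: {acc_augmented:.2f}")
```

Scoring on real held-out rows matters: a model evaluated on synthetic rows could look accurate while having learned only the generator’s quirks.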
Further research to overcome known challenges
Alongside the GAN research, Entopy has been progressing research to understand the feature importance of target datasets. This research applies concepts from reinforcement learning, with the primary aim of accelerating the development and deployment of operational AI micromodels.
Entopy has developed effective reinforcement learning algorithms that are deployed today within the operational AI micromodel context, providing predictive intelligence for certain problem types within target environments. However, this research looks to use reinforcement learning as a ‘back-end’ tool, helping Entopy’s delivery teams mobilise effective operational AI models more quickly by automating parts of the problem evaluation process.
Looking forward, the ability to effectively understand datasets and identify feature importance could also be a useful mechanism to overcome certain sensitivity challenges associated with AI-enabled Digital Twin deployments in multistakeholder environments.
What does this mean for Entopy and its customers?
This breakthrough innovation means that Entopy can deploy its software in areas where others can’t. By generating highly accurate synthetic tabular data, Entopy can overcome data availability challenges and support the mobilisation of highly effective operational AI micromodels.
This is a prominent challenge in Entopy’s strategic focus area of critical infrastructure, where there is a mix of new and legacy systems, an accelerating ambition to upgrade legacy systems, and a distinct lack of available data. These environments also involve ecosystems of partners, which creates high sensitivity around data and the sharing thereof.
Furthermore, given the global pressures on critical infrastructure, changes to increase capacity are inevitable. Synthetic data will be a key tool in simulating the impact of future changes on operations and on the wider ecosystem.