Anomalystream: European Data Incubator
Anomalystream: European Data Incubator
From May to October 2020 we were involved in the European Data Incubator Open Call, an European initiative that launches datathons from time to time. Our challenge was meteo-related and proposed by UBIMET: detecting changes in near-real-time in forecasting climate datasets.
To tackle it, we developed Anomalystream. A solution based on Apache Kafka, that enables to identify anomalies in near-real-time, processing over 150 MB per minute, monitoring more than 300 parameters. The idea behind it was to develop a system that can process a high volume of data and detect anomalies that are not limited to climate data. Why? Because a wide range of sectors are facing this same problem: Industry 4.0, smart cities, Internet of Things... All these industries produce high volume of spatio-temporal datasets, in real time or near real-time, and usually these data are not fully quality controlled. Therefore, having a solution to identify anomalies in a flexible mmaner is of great advantage to all of them.
Key features
The challenge
Description
Anomaly detection is of interest in many industry areas. Thus, novel solutions on how to detect changes in large datasets is a challenge that requires efficient methods. While the problem can easily be transferred into other domains, this specific challenge focuses on weather forecast data.
Data
The challenge had two sample datasets, corresponding to 2 forecast model runs, each containing 320 parameters for 66 forecast hours. The full dataset to be analysed contained 2 historic runs per day with a forecast horizon of 66 hours for up to 2.5 years (~13 TB/year).
Expected outcomes
- Identify: (1) datasets, (2) affected parameters and (3) time-step from which a deviation in their parameter characteristics from previous datasets (anomalies or step-changes) can be observed.
- The system needs to be able to detect changes in near-real-time to inform subsequent systems of issues to prevent uncontrolled propagation. A live system will need to be able to process at least 150 MB/min with 320 parameters.
- UBIMET was encouraging solutions that make use of scalable parallel algorithms and infrastructure.
The solution: Anomalystream
AnomalyStream counts with several main components:
- Data Adaptor
- Message producer API
- Anomaly Detection Engine
- Anomaly Notifier
This structure enables the solution to process over 10 GB of data per minute on a single node.
Data Adaptor
Since Industry 4.0, Internet of Things, Climate Data providers and other industries use different formats, the Data Adaptor currently supports multiple scientific data formats: GRIB, NetCDF, HDF5 and GeoTIFF.
Message producer API
Once the data have been adapted to a digestible format, the API chunks it into different messages, to allow a parallel treatment, to ensure a fast processing and avoid bottlenecks.
Anomaly Detection Engine
The ADE incorporates several funcions:
- Rule and histogram based methods:
- Empirical outliers
- Known valid ranges
- Trend and frequency changes
- ML supervised methods (LSTM)
- ML unsupervised dynamic methods
The system also counts with an Anomaly Trainer, to produce Anomaly Detection supervised models.