Many fields require anomaly detection, including cybersecurity, networking, finance, healthcare, and more. It is a way to spot data that differ from previous observations, such as deviations from an expected probability distribution or a change in the shape and amplitude of a signal in a time series.
Network anomalies
The term anomaly describes a sudden, short-lived deviation from normal network operation. Anomalies may be deliberately caused by malicious intruders, such as a denial-of-service attack in an IP network, or they may be unintentional, such as a change in packet transfer rate (throughput). In either case, quick detection is essential for prompt action.
Because network monitoring devices collect data at high rates, an effective anomaly detection system must extract relevant information from high-dimensional, noisy data. Network statistics exhibit a wide variety of anomalies that manifest in different ways, which makes developing general rules or models difficult.
Model-based algorithms are also difficult to transfer between applications, and even small changes in network traffic or in the underlying physical phenomena become problematic. Non-parametric algorithms based on machine learning principles can learn the nature of normal measurements and adapt as the structure of “normality” changes. For detecting network anomalies, rigid models do not work well; machine learning (ML) algorithms that learn user behavior perform better.
While anomalies can be found in many fields, the ways to address them are very similar.
Detecting anomalies is challenging because the boundary between normal and abnormal behavior is rarely precise and evolves over time. Moreover, anomalies are rare events, so labelled anomalous examples are far scarcer than normal ones. As a result, semi-supervised or unsupervised learning is used more frequently than supervised learning.
Semi-supervised and unsupervised anomaly detection learning
Semi-supervised anomaly detection assumes that the training dataset contains only normal-class examples. The model learns the normal behavior of a system and, during testing, detects anomalies as deviations from it.
An unsupervised learning technique assumes that outliers (data points that differ significantly from others) constitute a very small fraction of the total data.
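To make the semi-supervised setting concrete, here is a minimal sketch using scikit-learn's One-Class SVM: the model is fitted on normal data only and then flags test points that fall outside the learned region. The synthetic data and the `nu` parameter are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Semi-supervised setting: the training set contains normal behavior only.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Test set: mostly normal points plus a few obvious outliers.
X_test = np.vstack([rng.normal(size=(95, 2)),
                    rng.uniform(low=5, high=8, size=(5, 2))])

# nu bounds the fraction of training points treated as outliers (assumed 5%).
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(X_train)

# predict() returns +1 for inliers and -1 for anomalies.
labels = model.predict(X_test)
print("flagged anomalies:", np.where(labels == -1)[0])
```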
There are three main types of anomaly:
- Point anomalies. A data point that differs remarkably from the rest of the data points in the dataset under consideration (illustrated in the sketch after this list).
- Contextual anomalies. A data point is anomalous only in relation to the location, time or other contextual attributes of the points around it. For example, it is reasonable for household water consumption to be much higher at night during a weekend than at midday on a workday, so the same reading can be normal in one context and anomalous in another.
- Collective anomalies. A group of points is treated as an anomaly as a whole, even though the individual points in the group look normal. Collective anomalies can only be detected in datasets where the data points are related in some way, i.e., sequential, spatial or graph data.
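As a quick illustration of point anomalies, the sketch below flags values whose z-score exceeds a threshold. The series, the injected spike, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# A well-behaved series with one injected point anomaly.
series = rng.normal(loc=10.0, scale=0.5, size=200)
series[120] = 25.0  # the point anomaly

# Flag points more than 3 standard deviations from the mean.
z_scores = (series - series.mean()) / series.std()
print("point anomalies at indices:", np.where(np.abs(z_scores) > 3)[0])
```

Contextual and collective anomalies require more machinery, since the detector must also model the context (e.g., time of day) or the relationship between points (e.g., the sequence).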
Machine Learning approaches for finding anomalies
Several Machine Learning approaches exist for finding anomalies. Common anomaly detection algorithms include Isolation Forests, Local Outlier Factors (LOF), and clustering algorithms.
These methods are all unsupervised learning algorithms, which do not require labelled data to identify anomalies. Rather, they analyze the structure of the data itself.
In addition, they all use distance or dissimilarity between data points to identify anomalies. While each Machine Learning method has its own unique approach, they share these commonalities and can be combined to improve anomaly detection accuracy.
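Here is a minimal sketch of the first two approaches (assuming scikit-learn; the data and contamination rate are illustrative), running both detectors on the same data and combining their verdicts:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)

# 2-D data: a dense normal cluster plus a few scattered outliers.
X = np.vstack([rng.normal(size=(300, 2)),
               rng.uniform(low=-6, high=6, size=(10, 2))])

# Isolation Forest isolates anomalies with random axis-aligned splits;
# points that are easy to isolate get high anomaly scores.
iso_labels = IsolationForest(contamination=0.03, random_state=0).fit_predict(X)

# LOF compares each point's local density to that of its neighbours.
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.03).fit_predict(X)

# Both return +1 for inliers and -1 for outliers; combining the two,
# we keep only the points that both detectors flag.
both = np.where((iso_labels == -1) & (lof_labels == -1))[0]
print("flagged by both detectors:", both)
```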
Deep learning techniques
Anomalies are such a pressing issue that tech giants such as Google, Amazon and Facebook have developed their own detection algorithms to find them. Amazon's Robust Random Cut Forest (RRCF) is an improved version of the Isolation Forest algorithm. Another approach, used at Google and Facebook, is time-series forecasting (prediction): a model predicts upcoming values, and points with large prediction errors are flagged as anomalies.
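To make the forecasting idea concrete, here is a deliberately simple sketch: a rolling-mean forecast stands in for the far more sophisticated production models, and points with unusually large prediction errors are flagged. The series, window, and threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic series with weekly seasonality and one injected spike.
t = np.arange(365)
series = 10 + np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.3, size=365)
series[150] += 6.0

# Naive forecast: predict each point as the mean of the previous 14 points.
window = 14
forecast = np.array([series[i - window:i].mean()
                     for i in range(window, len(series))])
residuals = series[window:] - forecast

# Flag points whose forecast error exceeds 4 residual standard deviations.
anomalies = np.where(np.abs(residuals) > 4 * residuals.std())[0] + window
print("anomalous time steps:", anomalies)
```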
Anomaly detection in time series data has become popular with deep learning. Variational Autoencoders (VAEs) are one of the most advanced techniques.
Essentially, VAEs encode high-dimensional data into a lower-dimensional latent space and then decode it back into the original space (like watching a 3D movie on a 2D TV). For anomaly detection, a VAE is trained on normal time series data; data points it reconstructs poorly, i.e., that deviate significantly from the learned normal distribution, are flagged as anomalies.
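Below is a minimal PyTorch sketch of this idea for fixed-length windows of a univariate series. The architecture sizes, the training signal, and the use of reconstruction error as the anomaly score are all illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
WIN = 32  # window length (assumption for the example)

class VAE(nn.Module):
    def __init__(self, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(WIN, 16), nn.ReLU())
        self.to_mu = nn.Linear(16, latent_dim)       # latent mean
        self.to_logvar = nn.Linear(16, latent_dim)   # latent log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, WIN))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

# Train on windows of normal data only (a noisy sine wave).
t = torch.arange(2000, dtype=torch.float32)
windows = (torch.sin(0.2 * t) + 0.1 * torch.randn(2000)).unfold(0, WIN, WIN)

vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for epoch in range(200):
    recon, mu, logvar = vae(windows)
    recon_loss = ((recon - windows) ** 2).sum(dim=1).mean()
    # KL term keeps the latent posterior close to a unit Gaussian.
    kl = (-0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1)).mean()
    loss = recon_loss + kl
    opt.zero_grad(); loss.backward(); opt.step()

# Score a new window: a high reconstruction error suggests an anomaly.
shifted = torch.sin(0.2 * torch.arange(WIN, dtype=torch.float32)) + 2.0
with torch.no_grad():
    recon, _, _ = vae(shifted.unsqueeze(0))
    print("reconstruction error:", ((recon - shifted) ** 2).mean().item())
```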
Generative Adversarial Networks (GANs) are another advanced deep learning technique for anomaly detection. In a GAN, two neural networks are trained against each other: a generator that produces fake data, and a discriminator that differentiates fake data from real data. A GAN can be trained to generate synthetic time series data based on normal data, after which real data points that deviate significantly from the synthetic data are flagged as anomalies.
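One simple way to put this into practice, sketched below, is to train a small GAN on normal windows and then read the discriminator's output as a rough anomaly score; published methods such as AnoGAN go further and also search the generator's latent space. All sizes, data, and the scoring shortcut are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
WIN, LATENT = 32, 8  # window and noise sizes (assumptions)

# Generator: maps random noise to a fake window of the series.
G = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, WIN))
# Discriminator: outputs the probability that a window is real.
D = nn.Sequential(nn.Linear(WIN, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

# Normal training windows: a noisy sine wave.
t = torch.arange(2000, dtype=torch.float32)
windows = (torch.sin(0.2 * t) + 0.1 * torch.randn(2000)).unfold(0, WIN, WIN)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
real_y = torch.ones(len(windows), 1)
fake_y = torch.zeros(len(windows), 1)

for step in range(500):
    # Discriminator step: separate real windows from generated ones.
    fake = G(torch.randn(len(windows), LATENT))
    d_loss = bce(D(windows), real_y) + bce(D(fake.detach()), fake_y)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to fool the discriminator.
    g_loss = bce(D(fake), real_y)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Rough scoring: windows unlike the training data tend to receive a
# lower "real" probability (this simple scorer is not calibrated).
with torch.no_grad():
    shifted = torch.sin(0.2 * torch.arange(WIN, dtype=torch.float32)) + 2.0
    print("normal window score:   ", D(windows[:1]).item())
    print("anomalous window score:", D(shifted.unsqueeze(0)).item())
```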
Article derived from Dr. Yosef Yehuda (Yossi) Kuttner, Ph.D. for RAD