ML Engine#
When deployed as part of a Plixer One Enterprise environment, the Plixer ML Engine applies anomaly and threat detection techniques to the network data collected by Scrutinizer.
Note
To learn more about Plixer One Enterprise licensing options, contact Plixer Technical Support.
This configuration guide introduces the capabilities of the ML Engine and explains how to manage the settings that control its functions and behavior.
Overview#
Once deployed and configured, the engine is able to ingest flow data through Scrutinizer and apply multiple machine learning techniques to identify potentially problematic activity on the network.
The Plixer ML Engine has several key functions that enable intelligent, multi-layered anomaly and threat detection in a Plixer One Enterprise deployment:
Comprehensive network behavior modeling: Leveraging the large volumes of flow data collected by Scrutinizer, the engine is capable of building behavioral models encompassing network activity at any scale. It can then learn to recognize deviations and suspicious activity, such as data accumulation/exfiltration, tunneling, and lateral movement, that may indicate an attack on the network.
Accessible behavioral insights for network assets: After being alerted to anomalous behavior, network and security teams can drill down into the associated hosts, IP address groups, and/or exporter interfaces to better understand the details of their involvement in the reported detection.
Highly configurable ML modeling: The ML Engine monitors network activity based on user-customizable dimensions and inclusion/exclusion rules. Consistently repeated traffic patterns, asset/group importance, and data seasonality are all taken into consideration as well, resulting in models that are uniquely tailored to each environment.
ML-based malware detection: Using pre-trained classification models, the engine is able to recognize generic activity patterns that are associated with common classes of malware, including command and control, remote access trojans, and exploit kits. This adds another layer of protection to further reduce risk and mean time to resolution (MTTR) when threats are detected.
Continuous observation and learning: As it ingests additional flow data, the ML Engine updates its behavior models based on a schedule that defines weekdays, weeknights, and weekends to account for changes in legitimate activity patterns and improve recognition of advanced threats that attempt to disguise their behavior.
Managing inclusion and exclusion rules#
To ensure that its behavior models represent only relevant network activity, the Plixer ML Engine can be uniquely tailored to its environment using custom rules defining inclusions and exclusions for its functions. These rules can be managed from the Admin > Alarm Monitor > ML Rules view of the Scrutinizer web interface.
Inclusion rules#
An inclusion rule defines either a network address (hosts/subnets) or exporter interface as a network data source for the ML Engine. Each rule also includes a sensitivity setting (see below) that is applied to the asset specified.
Malware detection, which uses pre-trained classification models to recognize generic malware behaviors, can also be enabled for individual inclusions.
Inclusion sensitivity#
An inclusion’s sensitivity setting can be used to tune the engine’s tolerance for behavioral deviations for the host/subnet or exporter interface. Lowering the sensitivity setting for an asset will cause even minor deviations to be reported as detections, resulting in a higher volume of alarms. Conversely, increasing the sensitivity will allow for greater deviation, which translates to fewer detections reported.
When defining inclusions, the sensitivity setting should be left at its default value. After a period of 7 days (recommended), if too many unwarranted detection alarms are triggered, the sensitivity can be increased to the next level.
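The relationship between sensitivity and detection volume described above can be sketched as follows. This is an illustrative model only, not product code; the sensitivity levels and tolerance values are assumptions chosen to show the direction of the relationship (higher sensitivity tolerates larger deviations, producing fewer detections).

```python
# Hypothetical mapping from sensitivity level to the maximum tolerated
# deviation from an asset's modeled baseline (fraction of baseline).
# These values are illustrative assumptions, not the engine's actual ones.
TOLERANCE_BY_SENSITIVITY = {1: 0.10, 2: 0.25, 3: 0.50}

def is_detection(observed: float, baseline: float, sensitivity: int) -> bool:
    """Report a detection when the observed value deviates from the
    baseline by more than the tolerance for the given sensitivity level."""
    tolerance = TOLERANCE_BY_SENSITIVITY[sensitivity]
    deviation = abs(observed - baseline) / baseline
    return deviation > tolerance

# At the lowest sensitivity, a 20% deviation is reported as a detection...
print(is_detection(observed=120, baseline=100, sensitivity=1))  # True
# ...while a higher sensitivity tolerates the same deviation.
print(is_detection(observed=120, baseline=100, sensitivity=3))  # False
```

This is why raising the sensitivity one level at a time, after observing the alarm volume for about a week, is the recommended tuning approach.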
Exclusion rules#
Exclusion rules can be used to ignore one or more ML-driven detections for traffic originating from a specified source and/or bound for a specified destination.
If expected traffic or activity triggers alarms, one or more exclusion rules should be created to exempt the source and/or destination addresses from the detections being reported.
Recommendations#
Inclusion/exclusion rule recommendations
As part of the ML Engine’s initialization, inclusion rules are automatically created for the twenty most suitable network assets (hosts and exporters/interfaces) based on its default dimension definitions. If necessary, additional rules should be created to cover all assets associated with critical/sensitive network activity (“crown jewel” assets) and hard-to-monitor traffic (e.g., IoT devices, operational technology, etc.).
The following resources are examples of network assets that are highly recommended for inclusion:
AD servers
DB servers
DNS servers
DHCP servers
Web servers
Source code repositories
Object repositories
FTP servers
If there are assets whose typical behavior is being reported as anomalous/suspicious, exclusion rules should be defined to exempt the traffic from superfluous detections.
Managing dimensions#
The Plixer ML Engine’s feature dimension list defines the protocols and ports to be observed on the network assets defined by its inclusion/exclusion rules. The engine uses these dimensions to build the behavior models that drive asset behavior insights and deliver anomaly and threat alerts via the Scrutinizer alarm monitor.
The default configuration for the ML Engine includes recommended dimension definitions, which are used to automatically select suitable data sources as inclusions. After the engine is deployed and set up, dimensions can be managed from the Admin > Alarm Monitor > ML Dimensions view of the Scrutinizer web interface.
Dimension configuration#
An ML dimension is defined by the following parameters:
Inclusion/asset type the dimension applies to (host/subnet or exporter interface)
Template field to use for grouping (sourceipaddress or destinationipaddress; host/subnet dimensions only)
Aggregation method to use (octetdeltacount or packetdeltacount)
Traffic port used
Note
A feature dimension is only observed for traffic associated with the type of inclusion (host/subnet or exporter interface) it was defined for.
Dimensions can be configured to apply to all or only internal traffic matching the definition. They can also be disabled and re-enabled as necessary.
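The parameters above can be pictured as a small record per dimension. The sketch below is an illustrative data model only, not the engine's actual schema; the field names and validation are assumptions based on the parameter list in this section.

```python
# Illustrative data model for an ML dimension (assumed field names, not the
# engine's actual schema), covering the parameters described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MLDimension:
    asset_type: str    # "host_subnet" or "exporter_interface"
    aggregation: str   # "octetdeltacount" or "packetdeltacount"
    port: int          # traffic port observed
    grouping_field: Optional[str] = None  # "sourceipaddress" or
                                          # "destinationipaddress"
    internal_only: bool = False           # apply to internal traffic only
    enabled: bool = True                  # dimensions can be disabled/re-enabled

    def __post_init__(self):
        if self.aggregation not in ("octetdeltacount", "packetdeltacount"):
            raise ValueError("unsupported aggregation method")
        # Grouping fields apply to host/subnet dimensions only.
        if self.grouping_field and self.asset_type != "host_subnet":
            raise ValueError("grouping field is host/subnet only")

# Example: a host/subnet dimension watching DNS traffic by source address.
dns_dim = MLDimension("host_subnet", "octetdeltacount", 53, "sourceipaddress")
```

The validation mirrors the note above: a grouping field only makes sense for host/subnet dimensions, since exporter-interface dimensions are not grouped by address.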
Recommendations#
Dimension recommendations
Once deployed, the ML Engine defaults to Plixer’s recommended dimension definitions, which are based on the traffic in typical enterprise environments.
These default definitions should be reviewed and, if necessary, additional dimensions should be defined to monitor critical network services that are most often the target of attacks, such as:
Authentication - Kerberos, NTLM
Domain services - LDAP, DNS, DHCP
File sharing services - SMB, NFS, CIFS
Remote connectivity - SSH, Telnet, RDP, VNC, FTP
Email protocols - SMTP, POP3
Inter-process communication - ICMP
Application protocols - HTTP, HTTPS
Others - DB services, third-party APIs (especially those that connect to the Internet)
Global ML settings#
The global ML settings under Admin > Settings can be used to configure parameters for certain ML functions and behaviors across all engines in an environment.
The default values for these settings are recommended for new ML Engine deployments but can be adjusted later as described in the sections below.
AD Users#
The Plixer ML Engine can also ingest user activity data and access logs, alerting users to anomalous behavior through user and entity behavior analytics (UEBA) detections.
UEBA alerts for Active Directory users can be enabled by adding the credentials for a Microsoft Azure account that is configured to store AD user sign-in logs under Admin > Settings > ML AD Users.
Alerts#
There are three categories of alert settings that can be adjusted under Admin > Settings > ML Alerts:
Microsoft Office 365 alerts
These sensitivity values adjust the magnitude of deviation from typical behavior that will trigger the corresponding alerts. A higher value allows for greater deviation, resulting in fewer alerts for the corresponding activity.
Logon Sensitivity: Unusual volumes of Office 365 login events
Unique Source Sensitivity: Traffic coming from unusual numbers of unique hosts
Unique Location Sensitivity: Traffic coming from unusual numbers of unique locations
Like inclusion sensitivities, these values should only be adjusted after assessing the accuracy of alarms/detections.
System vitals alerts
These thresholds control alerts and other actions related to high utilization of the ML Engine’s resources.
CPU/RAM/Disk Alert Threshold: Percentages at which a high utilization alert for the corresponding resource is triggered
Disk Reclaim Threshold: Disk utilization percentage at which the ML Engine will attempt to delete old indexes from Elasticsearch
Initially, these thresholds should be left at their default values. If alarms are triggered, run an ML Engine CPU, ML Engine Memory, and/or ML Engine Storage report to assess whether threshold(s) need to be increased (for temporary spikes) or additional resources should be allocated to the engine (for sustained high utilization).
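The threshold behavior described above can be summarized in a short sketch. This is assumed logic for illustration only, not product code, and the threshold values used are examples rather than the shipped defaults.

```python
# Illustrative sketch of the system vitals threshold logic described above.
# Threshold values are example assumptions, not the shipped defaults.
ALERT_THRESHOLDS = {"cpu": 90, "ram": 90, "disk": 85}  # percent utilization
DISK_RECLAIM_THRESHOLD = 80  # percent; triggers deletion of old ES indexes

def vitals_actions(cpu: float, ram: float, disk: float) -> list:
    """Return the actions implied by the current utilization percentages."""
    actions = []
    for name, pct in (("cpu", cpu), ("ram", ram), ("disk", disk)):
        if pct >= ALERT_THRESHOLDS[name]:
            actions.append(f"alert:{name}")
    if disk >= DISK_RECLAIM_THRESHOLD:
        actions.append("reclaim:delete-old-indexes")
    return actions

# CPU over its alert threshold; disk below its alert threshold but above
# the reclaim threshold, so old indexes would be cleaned up.
print(vitals_actions(cpu=95, ram=50, disk=82))
# ['alert:cpu', 'reclaim:delete-old-indexes']
```

Note that the reclaim threshold sits below the disk alert threshold, so index cleanup can occur before a disk alarm is ever raised.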
Kafka lag thresholds
These thresholds manage the amount of latency tolerated by the Kafka engine before the corresponding lag alert is triggered.
Kafka Netflow Lag Threshold: Alerts for flow ingestion latency
Kafka K-means Lag Threshold: Alerts for prediction latency
Kafka Alerts Lag Threshold: Alerts triggered by automated process reconnaissance
Kafka Training Data Lag Threshold: Alerts for behavior modeling latency
Kafka UEBA Lag Threshold: Alerts for user and entity behavior analytics (UEBA) data latency
If alarms are triggered, run an ML Engine Kafka Lag report to determine whether there is a need to scale up the engine’s resources.
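For background, a Kafka "lag" figure measures how far a consumer group trails the newest messages: for each partition, lag is the log end offset minus the group's committed offset. The sketch below shows that general Kafka arithmetic; it is not Plixer-specific code, and the threshold value is a hypothetical example.

```python
# General Kafka semantics (not Plixer-specific): per-partition lag is the
# log end offset minus the consumer group's committed offset, summed over
# all partitions of the topic.
def total_consumer_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Sum per-partition lag for a consumer group."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets
    )

# Two partitions, with the consumer trailing by 150 + 40 = 190 messages.
lag = total_consumer_lag({0: 1000, 1: 500}, {0: 850, 1: 460})
exceeds = lag > 100  # hypothetical lag threshold for illustration
print(lag, exceeds)  # 190 True
```

A persistently growing lag, rather than a momentary spike, is the usual sign that the engine's resources need to be scaled up.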
Data limits#
The ML Engine’s data limit settings manage the maximum numbers of behavior models and hosts used for network/user activity patterns and prediction. The initial values set are based on the engine’s default resource configuration, but they can be adjusted under Admin > Settings > ML Data Limits.
If there are alarms associated with these limits, the engine may need to be provisioned with additional resources to sustain the current volume of inclusions.
Note
To check the utilization for the current model limit, run an ML Engine Model Count report.
Training schedule#
The settings under Admin > Settings > ML Training Schedule determine the seasonality applied when the ML Engine ingests traffic data, allowing it to distinguish between network activity during and outside of an organization’s hours of operation.
The engine defaults to business hours of 8 am to 6 pm, from Monday to Friday. These settings can be changed after deployment if necessary.
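The weekday/weeknight/weekend seasonality mentioned earlier can be illustrated with a simple bucketing function. This is assumed logic for illustration, not the engine's implementation, using the default business hours of 8 am to 6 pm, Monday through Friday.

```python
# Illustrative sketch (assumed logic, not product code): bucketing a
# timestamp into the weekday/weeknight/weekend seasonality classes using
# the default business hours of 8 am to 6 pm, Monday through Friday.
from datetime import datetime

def seasonality_bucket(ts: datetime,
                       start_hour: int = 8, end_hour: int = 18) -> str:
    if ts.weekday() >= 5:              # Saturday (5) or Sunday (6)
        return "weekend"
    if start_hour <= ts.hour < end_hour:
        return "weekday"               # business hours, Mon-Fri
    return "weeknight"                 # outside business hours, Mon-Fri

print(seasonality_bucket(datetime(2024, 6, 5, 10)))  # Wed 10 am -> weekday
print(seasonality_bucket(datetime(2024, 6, 5, 22)))  # Wed 10 pm -> weeknight
print(seasonality_bucket(datetime(2024, 6, 8, 10)))  # Saturday  -> weekend
```

Keeping these buckets separate lets the engine model, for example, a nightly backup job as normal weeknight traffic without loosening its weekday baselines.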
ML cluster settings#
The ML Engine is built on Kubernetes, which deploys scalable pods to handle various tasks. Most services within the engine consume data from Kafka, which acts as the system’s backbone for message passing, both from Scrutinizer and between internal components.
Kafka supports consumer groups, which allow multiple pods to share workloads efficiently, making it easy to scale services horizontally by increasing the number of replicas. This supports high-throughput processing and flexible resource allocation across services like data ingestion and model training.
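The reason adding replicas spreads the workload is Kafka's partition assignment: each partition of a topic is consumed by exactly one member of a consumer group. The sketch below models a simple round-robin assignment to show the effect; it illustrates general Kafka behavior, not Plixer's code, and real Kafka clients use their own configurable assignment strategies.

```python
# General Kafka behavior sketch (not Plixer-specific): partitions of a topic
# are divided among consumer-group members, so adding pod replicas (up to
# the partition count) spreads the ingestion workload. Round-robin shown
# here for illustration; real clients use configurable assignors.
def assign_partitions(partitions: list, members: list) -> dict:
    """Round-robin partition assignment across consumer-group members."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

# Six partitions shared by three ingestion replicas: two partitions each.
print(assign_partitions(list(range(6)), ["pod-0", "pod-1", "pod-2"]))
# {'pod-0': [0, 3], 'pod-1': [1, 4], 'pod-2': [2, 5]}
```

This also explains the scaling ceiling: once the replica count exceeds the partition count, additional pods sit idle.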
To ensure optimal performance based on the scale of the deployment and the volume of data processed, the engine management page can be used to register and manage ML Engine deployments and configure various settings for individual engines.
Engine settings are accessed via the configuration tray, which is divided into the sections below.
Settings#
The Settings secondary tray contains the following options for tuning resource allocations for specific engine tasks/services:
Ingestion Replica Count
This setting defines the number of data-ingestion pods running in the cluster. Adjusting this value helps scale the ingestion throughput and ensures SLAs are met.
These pods are responsible for consuming netflow data from Kafka topics, processing and aggregating flow records for both security (PSI) and network (PNI) monitoring, storing the processed data into Elasticsearch indices, and handling classification data for supervised ML models. The ingestion service runs continuously, processing data in one-minute intervals and logging heartbeats to ensure operational visibility.
Train Anomaly Detection Replica Count
This setting specifies how many pods are deployed to train ML models for the anomaly detection service.
This service uses techniques such as silhouette analysis and overfit deviation detection with configurable thresholds and limits. Once trained, the models are published to Kafka topics for use by downstream services. Scaling this service allows for faster and more resilient model training, particularly in environments with large or complex datasets.
Ingestion CPU & Memory (Min/Max)
These settings define resource limits allocated to the data ingestion service.
These resource limits ensure that each ingestion pod has the CPU and memory it needs to handle high-throughput data processing tasks. The ingestion service performs complex transformations, maintains multiple in-memory maps for real-time analytics, and conducts bulk insert operations into Elasticsearch. Typically, memory allocations in the range of 1 GB to 2 GB are required to support the various data structures used during processing.
Elasticsearch Memory and CPU (Min/Max)
These settings define resources allocated to the Elasticsearch cluster pods managed via the ECK (Elastic Cloud on Kubernetes) operator.
Since Elasticsearch is used to store and index all processed flow and classification data, and must support real-time search queries for machine learning and security operations, it is essential that sufficient resources are provided. The Java heap is configured with 8 GB (using -Xms8g -Xmx8g), and a total memory allocation of approximately 12 GB is recommended to provide additional headroom for OS and Elasticsearch operations. Similarly, minimum and maximum CPU allocations help maintain consistent indexing and query performance.
The Kibana UI can also be deployed alongside Elasticsearch by toggling on the Enable Kibana option.
Collectors#
Collectors selected here will be used as data sources for ingestion by the current engine.
DGL IP Groups#
IP groups added to the Deep Graph Learning inclusion list will be monitored by the engine to identify anomalous interactions between hosts.