Predicting disk needs

How much disk space do I need to keep n days of data?

The disk space required to store a day of data is a function of:

  • how many exporters are sending data;
  • what data templates each exporter is sending;
  • the cardinality of the data in each exporter/template.

To help you understand how much disk space is needed, Plixer Scrutinizer reports how much disk space is currently being used and predicts future needs based on your current settings. Retention settings can be configured, and these details viewed, on the Admin > Settings > Data History page. At the top of this interface are the desired data retention settings. Below the settings is the method by which data is being aggregated:

  • Every flow sent to Plixer Scrutinizer is stored in its original form in one-minute (1m) buckets. The bucket is the minute the flow was exported, provided the exporter’s clock is within one minute of the Plixer Scrutinizer collector clock. If the clocks differ by more than that, the bucket is the minute the flow was collected.
  • 1m records are “rolled up,” or aggregated, into progressively longer intervals to allow fast long-term trending: 1m -> 5m -> 30m -> 2h -> 12h. Each rollup keeps only the top N conversations, ranked by bytes.
  • Traditional rollups: Every element in the original flow template will be in the higher interval templates. This takes more disk space, and for some elements, the higher interval data has little value.
  • Summary and Forensic (SAF) rollups: Any template with the required information elements will be aggregated into a new template definition containing only common elements (srcIP, dstIP, bytes, packets, etc.). This allows for all common reports to be run (for example, country, IP Group, and AS by IP are based on the src/dst IPs) while storing data more efficiently.
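
The bucketing and rollup behavior described above can be illustrated with a minimal Python sketch. It is not Plixer Scrutinizer's implementation: the function names, the tuple layout (bucket, src, dst, bytes, packets), and the choice to treat "within one minute" as a difference of at most 60 seconds are all assumptions made for illustration.

```python
from collections import defaultdict

def bucket_minute(export_ts, collect_ts):
    """Pick the 1m bucket (epoch seconds floored to the minute).

    Assumption: export time is trusted only when the two clocks
    agree to within 60 seconds; otherwise collection time is used.
    """
    ts = export_ts if abs(export_ts - collect_ts) <= 60 else collect_ts
    return ts - ts % 60

def roll_up(records, interval_s, top_n):
    """Aggregate (bucket, src, dst, bytes, packets) rows into wider
    buckets, keeping the top_n conversations by bytes per bucket."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, src, dst, nbytes, packets in records:
        wide = bucket - bucket % interval_s  # start of the wider bucket
        totals[(wide, src, dst)][0] += nbytes
        totals[(wide, src, dst)][1] += packets

    by_bucket = defaultdict(list)
    for (wide, src, dst), (nbytes, packets) in totals.items():
        by_bucket[wide].append((wide, src, dst, nbytes, packets))

    out = []
    for wide in sorted(by_bucket):
        # Rank conversations by bytes and drop everything past top_n.
        rows = sorted(by_bucket[wide], key=lambda r: r[3], reverse=True)
        out.extend(rows[:top_n])
    return out
```

Calling `roll_up(one_minute_rows, 300, top_n)` would model the 1m -> 5m step; feeding its output back in with an interval of 1800 would model 5m -> 30m, and so on. Because each step discards conversations outside the top N, the longer intervals hold fewer rows than the intervals they were built from.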

Next is a table displaying how much disk space is being used for each data interval. Finally, there is a table showing how much disk space would be needed based on the current settings and the previously collected data.
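
As a rough illustration of how such a prediction can be derived, the sketch below scales the observed usage of each interval to its configured retention. The linear extrapolation, the function name, and the dictionary layout are assumptions; the product's actual calculation may differ.

```python
def project_disk(used_bytes, observed_days, retention_days):
    """Project disk needs per interval by linear extrapolation.

    used_bytes:     interval -> bytes currently stored
    observed_days:  interval -> days of data those bytes cover
    retention_days: interval -> days of data the settings ask for
    All three are hypothetical inputs keyed by interval name.
    """
    projected = {
        interval: used_bytes[interval] / observed_days[interval]
        * retention_days[interval]
        for interval in used_bytes
    }
    return projected, sum(projected.values())
```

For instance, if the 1m tables hold 400 GB covering 4 days and the retention setting is 10 days, this estimate comes to 1000 GB for that interval. A linear estimate like this only holds while the number of exporters, their templates, and the cardinality of their data stay roughly constant.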

What if the disks can’t support the settings?

Disks have a 10% free-space threshold. When free space drops below that threshold, 1m and 5m historical tables are trimmed until free space is back above it. For example, if a single exporter had been sending data for years and 50 more exporters were then added, that single exporter’s history is trimmed until all exporters have the same amount of history.
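
The trimming behavior can be modeled with a short sketch. It assumes one simple policy that reproduces the outcome described above: always drop the oldest day from whichever exporter currently has the longest history. The function name, the data layout, and the policy itself are illustrative assumptions, not the product's actual algorithm.

```python
def trim_history(history, disk_total, disk_used, min_free=0.10):
    """Trim oldest days until free space is at or above min_free.

    history: exporter name -> list of per-day sizes in bytes,
             oldest first (hypothetical layout).
    Returns the trimmed history and the new disk usage.
    """
    history = {name: list(days) for name, days in history.items()}
    while (disk_total - disk_used) / disk_total < min_free:
        # The exporter with the most days of history is trimmed first.
        name = max(history, key=lambda k: len(history[k]))
        if not history[name]:
            break  # nothing left to trim
        disk_used -= history[name].pop(0)
    return history, disk_used
```

Under this policy, a long-lived exporter loses its oldest days first, and trimming only reaches the newer exporters once every exporter has the same amount of history.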

Note

If all flows are between the same two IPs, the data can be stored much more efficiently than if each flow is a unique pair of IPs.
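
A small sketch shows why cardinality matters once flows are aggregated. It assumes that aggregation keys each row on the interval and the src/dst pair, which is a simplification; the function name and the 5-minute default are illustrative.

```python
def stored_rows(flows, interval_s=300):
    """Count rows left after aggregating (ts, src, dst) flows by
    interval and IP pair. Each distinct key needs its own row."""
    return len({(ts - ts % interval_s, src, dst) for ts, src, dst in flows})

# 60 flows between the same two IPs collapse into a single row.
same_pair = [(i, "10.0.0.1", "10.0.0.2") for i in range(60)]

# 60 flows from 60 different sources each need their own row.
unique_pairs = [(i, f"10.0.0.{i}", "10.0.1.1") for i in range(60)]
```

Here `stored_rows(same_pair)` is 1 while `stored_rows(unique_pairs)` is 60, even though both inputs contain the same number of flows.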