Smart Filtering, Smart Sampling and Smart Scaling

When it comes to the best strategy for hunting, detection and security data analysis, what’s actually “smart”?

In security data analysis, hunting and AI-driven automated detection, the quality of your results depends heavily on the quality of your data: ultimately, the security value delivered depends on the quality and accuracy of the data we collect, store and process. However, we often find ourselves having to process more data than we or our systems can handle.

I’d like to discuss a few strategies for handling the data and the advantages and disadvantages of each approach:

  1. Sampling - analyzing a statistically significant subset of the data.
  2. Filtering - removing data that we deem unimportant or repetitive.
  3. Scaling up - finding tools and technologies that will allow us to process all data in an effective way.

Of course, we would love to process all data all the time, but time and cost are often limiting factors. Sampling can be an effective way to learn about the statistical nature of the data, but it's not very useful if you're looking for a needle in a haystack, or if you require access to specific data.
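To make the needle-in-a-haystack point concrete, here is a minimal back-of-the-envelope sketch in Python. The sampling rate and event counts are made-up, illustrative numbers, not measurements from any real deployment:

```python
# Rough sketch: probability that uniform sampling captures a rare event.
# Sampling rate and event counts are illustrative assumptions.

def p_capture(sample_rate: float, occurrences: int) -> float:
    """Chance that at least one of `occurrences` events lands in the sample."""
    return 1.0 - (1.0 - sample_rate) ** occurrences

if __name__ == "__main__":
    sample_rate = 0.01  # keep 1% of events
    for occurrences in (1, 5, 50):
        print(f"{occurrences} malicious event(s) at a 1% sample rate -> "
              f"{p_capture(sample_rate, occurrences):.1%} chance any are captured")
    # 1 event  -> ~1.0%
    # 5 events -> ~4.9%
    # 50 events -> ~39.5%
```

An intrusion that touches only a handful of events is overwhelmingly likely to be absent from the sample entirely, and you have no way of knowing what you missed.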

Filtering is a good strategy when you have high certainty that your filtering methods are reliable, do not change the statistics of the data collected and can be guaranteed to retain all important data.

These strategies can each be applied to many domains, but in some domains there are special considerations that limit the efficacy of particular methods. In security there's the added difficulty of working against a smart adversary. When considering adversarial manipulation, data integrity becomes paramount. Am I missing data because of adversary behavior, or because of expected loss? Did an adversary learn my filtering strategy and is simply taking advantage of it to remain undetected?

Let's consider a filtering approach. Say that we limit the collection of network data to 100 connections for every process. On the surface, this sounds reasonable. The average number of connections per process is much lower on most endpoints, so you can expect to filter very little of the data. In reality, however, data patterns in computerized environments follow an aggressive power law distribution, not a linear or even a normal distribution.

For example, your word processor will open very few connections - sometimes zero, sometimes one or two to an update site, and very rarely more than that. Your browser, however, opens dozens of connections when loading a single website and will easily have thousands of connections per day.

This problem gets much worse when you look at servers: the process that performs the primary function of the server - your Oracle database, your mail server, etc. - will receive 99.9999% of all inbound communication.

Because of this behavior, any type of cap-based filtering will remove the vast majority of the data.
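To illustrate how hard such a cap bites, here is a short Python sketch. The process mix and connection counts are hypothetical figures chosen for illustration, not real telemetry and not how any particular product works:

```python
# Illustrative sketch: what a "keep at most 100 connections per process" cap
# retains on a server whose primary process handles nearly all inbound traffic.
# All figures below are hypothetical.

CAP = 100

# (process name, connections observed in a day) - made-up numbers
processes = (
    [("word_processor", 2)] * 50      # utilities: a connection or two each
    + [("update_agent", 20)] * 30     # assorted agents and updaters
    + [("browser", 5_000)] * 3        # browsers: thousands per day
    + [("oracle_db", 2_000_000)]      # the server's primary process
)

total = sum(count for _, count in processes)
kept = sum(min(count, CAP) for _, count in processes)

print(f"total connections:  {total:,}")
print(f"kept under the cap: {kept:,}")
print(f"fraction retained:  {kept / total:.2%}")
# total connections:  2,015,700
# kept under the cap: 1,100
# fraction retained:  0.05%
```

Nearly everything the cap discards belongs to the one process you most need visibility into, which is exactly the power law behavior described above.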

No matter which strategy we choose to limit data collection, we will always hit these limitations.

The problem, of course, is that data that isn't collected or transmitted to our servers cannot provide visibility, cannot be used for hunts and cannot be used for server-side detections.

Even when we try to pick a strategy that optimizes for the potentially malicious, we always end up running into these issues:

  1. The malicious uncertainty principle - you can never be 100% sure that you're right about something not being malicious.
  2. The power law - whatever the limiting mechanism, it will always run into power law behavior that forces arbitrary filtering.
  3. Adversarial behavior - when an adversary understands how you limit your data collection they can use that to circumvent your protection.
  4. Breaking up forensic integrity - you will always have missing data, and because of the power law nature of the data you will be missing a lot of it, which will prevent you from having the forensic trace critical to incident response.
  5. Unverifiable data loss - when data is missing, you can never know whether that is acceptable. You can never be sure if it's by design, by error or by malicious behavior.

Some vendors talk about “smart filtering.” The problems above illustrate that there’s nothing about filtering that is smart. It’s not designed to reduce “noise.” It’s merely a strategy to overcome technological limitations of the server-side systems and save on the cost of the solution. However, this comes at a significant cost to the integrity of the data, the quality of detection, and the security value provided by the system.

Because of these issues we at Cybereason chose strategy number 3: to collect everything, process everything, keep everything and give all the data to hunters. The challenge with this strategy is, of course, time and cost. You need to process the data fast enough, and you need to ensure that the system is efficient enough to process all that data. Moreover, you need to quickly correlate subsets of differential data, which requires a new algorithm.

At Cybereason, we chose the most difficult strategy - scaling up to process all data. We chose it because we arrived at the same conclusion again and again; namely, when you apply arbitrary / smart / statistical filtering, you will inevitably introduce blindness to your system.

And hackers will exploit it, either deliberately, by understanding how you made your decision, or by accident, because you can never have 100% certainty about which particular piece of data can be completely ignored. We created new technology to ensure that data can be processed quickly and efficiently. For us, the truly smart approach to filtering data is not to filter it at all.



About the Author

Yossi Naar

Yossi Naar, Chief Visionary Officer and Co-Founder, is an accomplished software developer and architect. During his 20 years of industry experience, Yossi has designed and built many products, from cutting-edge security platforms for the defense industry to big data platforms for the AdTech / digital marketing industry as well as the Cybereason in-memory graph engine.
