Inside Effective EDR Evaluation Testing

Periodically, I receive requests from customers asking for explanations of why this or that particular technique doesn’t generate a Malop™ in the Cybereason Defense Platform. Such questions illustrate that there is still a great deal of education to be done on the nature of EDR across much of the security industry.

This is a statement of fact without judgement. Many of my fellow analysts seem unable to conceive of a product that works in a way beyond their technical comprehension. As an example, I received a request from someone asking about the following command and - given that the command could be used suspiciously - why didn’t the system generate a Malop or at least mark the command as suspicious?

PowerShell: Command sleep(60)

It is a fine question, as the command parameter can be used in malicious code. In isolation, however, the parameter itself isn’t terribly useful as any kind of indicator of malicious behavior. Yet, the question itself exposes several underlying issues in our industry. 

First, many analysts can’t seem to get away from the basic signature concept: the idea that attackers can be identified by some minuscule, granular detail, as in an episode of CSI, and that if you could just see all of these Indicators of Compromise (IOCs) and correlate enough of them, you could find the bad guys.

Second, many analysts think that they understand technologies that they have never seen. Sometimes they can, but even then, the underlying nuances of these technologies can escape them. 

Finally - and this lands squarely in the hands of the security giants who preceded us - they don’t trust things that they don’t understand. In all honesty, who would want to trust any security vendor after years of believing what we were being told by the likes of Symantec, McAfee, and Trend Micro: that they could secure our networks, only to have them compromised again and again and again? When something fails, people often forget the caveats in the license agreement and walk away bitter, regardless of how great it seemed before the bad thing happened.

It is normal, given the history of the industry, that people do not want to work with a tool that they have not thoroughly tested. Unfortunately, such testing, especially with EDR, is expensive if undertaken in earnest and requires specialized skills - red-teaming skills. Amateur testers, however, will use whatever tools they believe are sufficiently sophisticated to execute their evaluations, believing that the capabilities of the tool’s author will make up for their own lack of expertise.

These “tests” often take the form of scripts that execute different activities in isolation, similar to the command mentioned above. The results of these tests often disappoint the user and lead to a great deal of anxiety, and sometimes to anger and frustration. But it’s not the EDR industry that is to blame here. EDR vendors, rightly, keep as tight a grip on behavioral analysis as they can, trying to make certain that their systems provide value without being rife with false positives that make them unwieldy and, in some cases, unusable.

Below we go into some of the methodology of EDR and we analyze one such “testing” tool in its entirety to demonstrate the flaws in the use of the tools and explain why the results are unreliable when pitted against a modern EDR system.

EDR Methodology

In a great many of my interactions with analysts, I encounter some interesting preconceived notions about how EDR works. So, let’s look under the covers, in a generalized sense, at how EDR functions by considering what it takes to build one, conceptually at least.

We’ll start with our goals and we’ll use the KISS methodology (keep it simple, stupid). First and foremost, we have to, in the words of Colonial Space Marine Corporal Hicks, “Count on them getting into the complex.” We have to assume that all of our defense in depth techniques of ages past will fail. 

The attackers will skirt the firewall by using outbound communications that are permitted. They will use either fileless malware or novel malware to skirt antivirus defenses. They will potentially live off the land by using known administrative tools for efforts like recon and privilege escalation. In general, we have to assume that they will be successful at initial intrusion, and that we need to be able to detect this and as much follow-on activity as possible. With that in mind, let’s look at some options.

The first and simplest option is to create a list of techniques which we will flag as known to be malicious. The MITRE ATT&CK Framework has a pretty comprehensive list, so it seems a good choice. We’ll skip the reconnaissance category to start with, because our EDR is supposed to tell us when bad guys are in the systems, not when they are passing by looking at our servers. Instead, we’ll focus on initial access first, because that’s where attackers have their first real footprint in the environment.

Immediately, though, we run into a problem with technique T1078, Valid Accounts: the use of legitimate credentials, for example to access external remote services. Geez, we can’t alert on that, as it would create an alert for every login to a remote service. Depending on who the customer is, that could be thousands of additional alerts per day. Okay, so maybe that’s a bad one to use. We could skip it, but login sessions are important and collecting that data is extremely helpful to analysts, so we’ll collect it and see if it’s useful later with other data for alerting or, at the very least, building context.

And now we’ve covered the first major concept in EDR: Event Correlation. Not every piece of data is worthy of building an alert, regardless of what you might think. MITRE is a generalized mapping framework and does not distinguish between what is normal and what is, to use a little jargon, “detectable.” This concept of detectability will become vital as we move forward. 
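The difference between collecting an event and alerting on it can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular EDR stores its data: the `Event` and `TelemetryStore` types, the field names, and the host and user names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    host: str
    kind: str    # hypothetical event kinds, e.g. "remote_login"
    detail: str


class TelemetryStore:
    """Keeps every event per host so later detections can use it as context."""

    def __init__(self):
        self._by_host = defaultdict(list)
        self.alerts = []  # storing an event never adds anything here

    def collect(self, event: Event) -> None:
        self._by_host[event.host].append(event)

    def context_for(self, host: str, kind: str) -> list:
        return [e for e in self._by_host[host] if e.kind == kind]


store = TelemetryStore()
for user in ("alice", "bob", "svc-backup"):
    store.collect(Event("vpn-gw-01", "remote_login", user))

logins = store.context_for("vpn-gw-01", "remote_login")
print(len(logins), "logins collected,", len(store.alerts), "alerts raised")
```

Every login is retained and can be queried later as context, but none of them produces an alert on its own.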

Okay, so we’ve decided that T1078 is not a good candidate for an alert by itself. Let’s look at sub-technique T1078.001, Valid Accounts: Default Accounts. That sounds good, but wait, now we’ll need a list of all of the default credentials for all of our customers’ equipment, and some of that equipment is made by competitors. I suppose I could put a resource on going through every piece of commonly used, externally facing network equipment and programming those credentials into our systems?

Wait, what is that you say? There are thousands of different possibilities here. Now we begin to expose why using “atomic” techniques (i.e., each technique as an individual entity) is not terribly useful most of the time. Few techniques in the MITRE ATT&CK framework make for good detections without extensive correlation. This is because MITRE defines each technique in a very general way, in plain English. How to identify the technique is left to the vendor’s imagination. MITRE gives examples, sometimes, but those examples are so specific that they miss a tremendous amount of behavior. It requires quite the balancing act to build something useful this way. It’s not a great approach anyway, because most of these techniques don’t become apparently malicious until taken in context.

Wait, maybe if I combine different techniques into a single detection, I can get a good one? That’s great, but do you want to tag every behavior that matches with a property that says it “could be” this or that MITRE technique? Probably not. Why? Because later, when you start using those tags to build detections, you will find you have painted yourself into a corner in terms of storage space, processing power, or endpoint cycles, all of which are finite.

Let’s take technique T1059.001, Command and Scripting Interpreter: PowerShell. Let’s say I couple it with the use of a privileged account, say the Windows LocalSystem account. That’s probably still not good enough, because every PowerShell script set up as a scheduled task will generate an alert, and lots of administrators and software vendors do that. Okay, so let’s add in downloads from the internet. Unfortunately, many vendors do that too. Remember, we’re not trying to alert every time a piece of software is patched by PowerShell.

Now we begin to see how hard it is to build a clean alert. But, we’re still determined, so we’re going to add one more value to our event chain so that we can get something that is at least reliable: PowerShell run by a privileged account to download a payload from an external IP address that is listed as malicious by VirusTotal. 
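As a rough sketch, the three-condition rule described above might look like the following in Python. The `ProcessEvent` type, its fields, and the `should_alert` function are hypothetical names for illustration, and the local `KNOWN_BAD_IPS` set is only a stand-in for a real reputation lookup such as the VirusTotal check mentioned above. The addresses come from reserved documentation ranges.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a local set stands in for an external reputation service.
KNOWN_BAD_IPS = {"203.0.113.66"}
PRIVILEGED_ACCOUNTS = {"NT AUTHORITY\\SYSTEM"}


@dataclass(frozen=True)
class ProcessEvent:
    image: str                       # executable name
    account: str                     # account the process runs as
    remote_ip: Optional[str] = None  # download source, if any


def should_alert(ev: ProcessEvent) -> bool:
    """Alert only when all three conditions hold for the same event."""
    is_powershell = ev.image.lower() in {"powershell.exe", "pwsh.exe"}
    is_privileged = ev.account in PRIVILEGED_ACCOUNTS
    downloads_from_bad_ip = ev.remote_ip in KNOWN_BAD_IPS
    return is_powershell and is_privileged and downloads_from_bad_ip


# A scheduled task, a vendor patch download, and a download from a flagged IP.
scheduled_task = ProcessEvent("powershell.exe", "NT AUTHORITY\\SYSTEM")
vendor_patch = ProcessEvent("powershell.exe", "NT AUTHORITY\\SYSTEM", "198.51.100.10")
malicious = ProcessEvent("powershell.exe", "NT AUTHORITY\\SYSTEM", "203.0.113.66")

for name, ev in [("scheduled_task", scheduled_task),
                 ("vendor_patch", vendor_patch),
                 ("malicious", malicious)]:
    print(name, should_alert(ev))
```

Only the last event fires. Dropping any one of the three conditions brings back the noisy cases from the preceding paragraphs, which is the trade-off being described.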

Whew. We have our first detection. It’s far from perfect, but it’s better than where we started. Great, now which MITRE technique are we tagging this with? I guess we’ll raise this detection and tag it with several. This constitutes abuse of a privileged account, abuse of PowerShell scripting, malicious download, and malicious network access. Cool. But now we can see why many of the automated scripting “test” tools suddenly give us unreliable data. We designed our detection based on the available information but we had to tag things as suspicious after correlation.

We don’t need to go any further in building our EDR to understand two things: first, this is really hard to do, and second, what seems simple at first blush is much more complicated than it appears. I bet many of us didn’t know that most of our suspicion tagging happens after one or more layers of correlation. That means that something executed in isolation may never get a MITRE ATT&CK tag. With that in mind, we’re going to move on to an example from a testing script.
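To make the point about tagging after correlation concrete, here is a minimal Python sketch. It assumes raw events arrive as dictionaries keyed by a process ID; the field names, the `correlate_and_tag` function, and the choice of technique IDs attached to the detection are illustrative assumptions, not a description of any product’s pipeline.

```python
from collections import defaultdict

BAD_IPS = {"203.0.113.66"}  # hypothetical reputation data (documentation range)


def correlate_and_tag(events):
    """Group raw events by process; tag a process only if every condition holds."""
    by_pid = defaultdict(list)
    for ev in events:
        by_pid[ev["pid"]].append(ev)

    tags = {}
    for pid, evs in by_pid.items():
        starts = [e for e in evs if e["type"] == "process_start"]
        conns = [e for e in evs if e["type"] == "network"]
        is_powershell = any(e["image"] == "powershell.exe" for e in starts)
        is_privileged = any(e["account"] == "SYSTEM" for e in starts)
        bad_download = any(e["remote_ip"] in BAD_IPS for e in conns)
        if is_powershell and is_privileged and bad_download:
            # Illustrative mapping: valid accounts, PowerShell, ingress tool transfer.
            tags[pid] = ["T1078", "T1059.001", "T1105"]
        else:
            tags[pid] = []
    return tags


events = [
    # An isolated command, like the sleep(60) example: nothing to correlate with.
    {"pid": 100, "type": "process_start", "image": "powershell.exe", "account": "jdoe"},
    # A privileged PowerShell process that also pulls from a flagged address.
    {"pid": 200, "type": "process_start", "image": "powershell.exe", "account": "SYSTEM"},
    {"pid": 200, "type": "network", "remote_ip": "203.0.113.66"},
]

print(correlate_and_tag(events))
```

The isolated command ends up with an empty tag list even though it ran PowerShell, while the correlated chain receives all of its tags at once.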

Testing an EDR Solution

We’ll take the OP7IC EDR Testing Script found on GitHub. First, let me compliment the author on a well-thought-out and very well-structured script that executes various techniques that could be alert-worthy given enough surrounding behavior to look at. This person really understands the nature of the techniques. However, for various safety reasons that the author lays out, none of these techniques are actually used as part of malicious behavior. Some of them are alert-worthy and most EDRs will alert on them, but many are not. Remember our single PowerShell example, and how complicated it was just to get to something tight enough to alert on but general enough to be useful for identifying potentially malicious behavior.

Let’s look at how OP7IC’s script tests this capability initially. The first execution, shown below, creates a transfer job using bitsadmin. Well, large corporations - our target customers - do this all the time. So this isn’t alert-worthy, even when executed, as required, by an administrative account. Our rule isn’t going to fire here, because it’s bitsadmin, not PowerShell:

start "" cmd /c bitsadmin.exe /transfer "JobName" https://raw.githubusercontent.com/op7ic/EDR-Testing-Script/master/Payloads/CradleTest.txt "%cd%\Default_File_Path.ps1"

Next it executes this:

start "" cmd /c PowerShell -c "Start-BitsTransfer -Priority foreground -Source https://raw.githubusercontent.com/op7ic/EDR-Testing-Script/master/Payloads/CradleTest.txt -Destination Default_File_Path.ps1"

This execution starts the BITS transfer via PowerShell. Now we’re getting closer to our rule, but there are some problems. First, GitHub is not malicious - at worst, it’s a neutral host. Second, the IP address changes constantly. Alright, so how can we make sure this gets picked up accurately? Well, the script it downloads and executes is a simple script that contains the word Mimikatz. So, we can inspect the script and see if it contains Mimikatz.

We suppose that can work, except it’s a really bad way to detect Mimikatz. There are literally thousands of distinct payloads that execute Mimikatz without ever having the word Mimikatz in them. So, to detect credential theft through Mimikatz, we need to look for modification of, or reads from, the memory space of the LSASS process. So, we build that capability into our tool too. Now we’re getting somewhere.
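The gap between the two approaches can be illustrated with a short Python sketch. It assumes the endpoint sensor already emits process-access events as dictionaries with `source`, `target`, and `access` fields; those field names, both function names, and the sample process and script names are hypothetical.

```python
def mentions_mimikatz(script_text: str) -> bool:
    """Naive signature: flags any script that merely contains the word."""
    return "mimikatz" in script_text.lower()


def reads_lsass_memory(access_events, allowed=frozenset()):
    """Behavioral check: return the processes that read LSASS memory.

    `allowed` is a hypothetical allow-list of processes expected to do so.
    """
    return [
        e["source"]
        for e in access_events
        if e["target"].lower() == "lsass.exe"
        and e["access"] == "read_memory"
        and e["source"] not in allowed
    ]


# Case 1: a harmless script that only prints the word and touches nothing.
harmless_script = 'Write-Host "mimikatz"'
harmless_events = []

# Case 2: a renamed payload with no telltale string that does read LSASS memory.
renamed_script = "Invoke-Updater"
renamed_events = [
    {"source": "updater.exe", "target": "lsass.exe", "access": "read_memory"},
]

print(mentions_mimikatz(harmless_script), reads_lsass_memory(harmless_events))
print(mentions_mimikatz(renamed_script), reads_lsass_memory(renamed_events))
```

The string check flags the harmless script and misses the renamed payload. The behavioral check does the opposite, which is the reason for preferring it.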

So, if we now have the means to detect Mimikatz reliably at the endpoint via its actual behavior, it doesn’t make much sense to inspect every PowerShell script for the word “mimikatz.” So, as a vendor, I’m going to make an executive decision that consuming endpoint cycles to inspect PowerShell scripts for the word Mimikatz isn’t worth my time. Some vendors had this as an early detection, just to give us something while they were building real Windows credential theft detection capability, and that detection may still be there. But if I’m building today, why would I bother?

And that is how these “testing scripts” fail: If the script doesn’t actually execute something malicious like Mimikatz from a remote location, should you expect an alert? No.

I could go through all of the different executions in the OP7IC testing script to show which ones are legitimate tests and which ones would need changes to become so. But the point of this post isn’t to call out this particular author, who for all intents and purposes seems very intelligent, capable and competent. The problem isn’t with the author or the script itself anyway. It is with the approach. Trying to mimic malicious behavior without actually mimicking malicious behavior is a fine line that, to this author’s knowledge, no one has trodden well.

If a vendor appears to perform poorly using the aforementioned script, it really doesn’t tell us whether their detections are good or bad. If a vendor performs well, it doesn’t tell us that either. So, is the testing script useful in its current iteration as an EDR testing tool? Not really. The only thing it’s any good for is generating telemetry so you can see if a specific EDR tool logged the activity. So what is someone to do? What we need is an external, non-vendor-related group that will test these tools for us and provide us with detailed results we can consume.

MITRE ATT&CK Evaluations

As it turns out, there is a group that tests EDR systems for capability against attackers using the MITRE ATT&CK framework. That would be MITRE. They are an industry-independent organization that invites vendors to bring their best game as they run a full emulation of a specific attacker’s techniques.

To date, they have performed three such evaluations and the results speak for themselves. So, it’s a pretty good measure of the detective capability that any EDR might have. The most recent evaluation as of this writing included prevention capability as well, so it covers endpoint prevention stacks included with EDR. Best of all, we don’t have to spend a penny to get the results, because they are publicly available (you can see details on how Cybereason performed in the latest MITRE ATT&CK evaluation here).

Red Team Testing

Alternatively, you could have an organization run a full attack simulation against a set of systems loaded with your EDR of choice or repeatedly against systems loaded with different EDR tools. This typically gives you a good perspective on whether or not they will alert effectively. That assumes that the red team you are using is unbiased and competent. 

In the end, it’s clear that using EDR testing tools found on GitHub or lying about the internet isn’t the best approach. They generally lack third-party validation, operate with an “atomic” approach, and don’t actually mirror real attack techniques. There are better ways, and it’s often best for us to leave testing to the experts.

Aoibh Wood
About the Author

Aoibh Wood, Senior Director of Partner Service Delivery, has 30+ years of technology industry experience. During the last 14 years, she has provided forensics and incident response support for US Federal and state organizations as well as Fortune and Global 100 companies.
