February 21, 2018 | 4 minute read
Malware developers regularly introduce deception tactics to avoid being caught by security programs. Some of these techniques easily bypass standard anti-malware protection. In fact, in customer environments Cybereason has observed thousands of malicious file executions masquerading as popular programs such as Adobe PDF Reader, MS Word and Chrome. Using familiar icons is meant to deceive users into thinking that the file is legitimate and safe to open.
A novice masquerading technique would be to just use a popular icon and not alter it. But many vendors track usage of their more popular programs and flag unfamiliar programs that use their icons. To avoid getting caught, the person generating the fake file slightly changes the popular icon that’s being used. The altered icon looks nearly identical to the program’s real icon, but because the icon is part of the file, the file carrying the payload now produces a different hash. That different hash lets the malware evade the simple blacklisting and whitelisting rules used by standard antivirus programs.
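To see why a tiny icon change is enough to defeat a hash list, consider the minimal sketch below. The byte string is a made-up stand-in for icon data (not a real icon format), and `known_hashes` is a hypothetical list of hashes; SHA-256 is used only as an example hash function.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for icon bytes: 16 x 16 grayscale pixels.
original_icon = bytes([200] * 256)

# Change a single pixel by one brightness level.
tweaked = bytearray(original_icon)
tweaked[0] = 201
altered_icon = bytes(tweaked)

# Hypothetical list that only knows the hash of the original icon.
known_hashes = {sha256_hex(original_icon)}

print(sha256_hex(original_icon) in known_hashes)  # original is recognized
print(sha256_hex(altered_icon) in known_hashes)   # altered copy is not
```

The two byte strings differ in one position, yet their digests are unrelated, so any rule keyed on an exact hash no longer matches.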
The original icons are altered in many different ways. The most common methods are single pixel value change, random noise introduction, slight color changes, angular translation and element resize. Here are five examples of an altered PDF icon. The original icon is at the far left.
Turns out that different classes of threats tend to masquerade as different programs, as the pie charts below show. The malicious files were divided into the categories of ransomware, malware, unwanted and hack tools. Each poses a different level of risk. The following pie charts break down the threat class distribution on different types of application icons:
It’s interesting to note that while multiple types of malicious files use PDF and Chrome as cover, ransomware may try to masquerade as Windows figures (which is not even a program icon).
Most likely, icons are selected based on the attack vector and what icon a user is most likely to trust and click on. Overall, the icon is a key element that the malware creators use to fool users into executing a file.
Masquerading as a different file entity is another interesting method that adversaries use to maximize the chances of getting a person to activate the malware, even if the file is transferred by other methods, such as a flash drive. Disguising malware as a folder is a common practice and it’s quite smart. People typically think that looking through an unknown folder is harmless, and since most users have their view settings set up to show just the file name and not the file extension, nothing reveals that the “folder” is actually an executable file. In this case, opening what appears to be an unknown folder to explore its contents can lead to disastrous results.
The detection process can be divided into two stages. The first deals with detecting icons that masquerade as popular programs, and the second decides whether the file is malicious. Each stage utilizes a different machine learning classification algorithm as a “decision maker.”
During the first stage, the “visual essence” of the icon is extracted and compared to icons of popular applications. A computer vision algorithm generates a multidimensional quantitative metric, which the machine learning algorithm then uses during the classification process.
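The post does not name the computer vision algorithm, so the following is only an illustrative stand-in for the idea of turning an icon into a numeric vector. It assumes a square grayscale image given as nested lists and uses block-averaged brightness as the feature; the function name `block_mean_features` and the sample images are hypothetical.

```python
import math
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values in 0..255

def block_mean_features(image: Image, grid: int = 4) -> List[float]:
    """Split a square image into grid x grid blocks and return the mean
    brightness of each block, scaled to the range 0..1."""
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("expected a square image whose side is divisible by grid")
    step = size // grid
    features = []
    for by in range(grid):
        for bx in range(grid):
            total = 0
            for y in range(by * step, (by + 1) * step):
                for x in range(bx * step, (bx + 1) * step):
                    total += image[y][x]
            features.append(total / (step * step * 255))
    return features

# Hypothetical 8 x 8 "icons": left half dark, right half bright.
original = [[0] * 4 + [255] * 4 for _ in range(8)]
altered = [row[:] for row in original]
altered[0][0] = 40  # single pixel value change
inverted = [[255 - p for p in row] for row in original]

f_original = block_mean_features(original)
f_altered = block_mean_features(altered)
f_inverted = block_mean_features(inverted)

print(math.dist(f_original, f_altered))   # small: visually near-identical
print(math.dist(f_original, f_inverted))  # large: a different picture
```

Unlike a file hash, a vector like this moves only slightly when one pixel changes, which is what makes a distance-based comparison against known icons possible.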
Several machine learning algorithms, including KNN, SVM, K-means and Random Forest, were considered for the task of associating the icon with a specific application. Two key metrics were used in the benchmarking process: precision (the share of icons attributed to an application that really resemble it) and recall (the share of icons resembling an application that were actually caught). The two metrics tend to trade off against each other, so our goal was to maximize recall while keeping precision high to avoid false positives.
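For reference, the standard definitions of the two metrics can be written as a short sketch. The counts in the example are invented for illustration and are not results from the benchmark.

```python
from typing import Tuple

def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical counts: 90 correct matches, 10 false alarms, 30 misses.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(precision, recall)
```

Loosening the matching rule usually turns some misses into matches (higher recall) at the price of more false alarms (lower precision), which is the trade-off described above.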
Multiple tests were performed using different lab-produced and real-world data sets. Eventually, KNN (K nearest neighbors) was chosen for its ability to perform well on moderate-scale datasets of a non-linear nature and for its good decision explainability.
The following diagram is a visual demonstration of the classifier concept, projected onto a two-dimensional plane. Every node represents a unique icon figure and every edge represents the connection between a node and its nearest neighbor.
The three clusters represent icons of three different applications (PDF, Chrome and generic installer). The K nearest neighbors in terms of Euclidean distance will vote for the new sample label, given that their distance is smaller than a predefined threshold.
A new sample is positioned in the n-dimensional space according to its coordinate values (the standalone red circle). Let’s assume that the three closest neighbors to the red sample are close enough to vote. In this case, there is one vote for PDF and two votes for Chrome, one of which comes from the neighbor closest to the new sample. Therefore, the new sample will be labeled as Chrome.
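The voting rule described above can be sketched in a few lines. This is a simplified illustration, not the production classifier: the 2-D coordinates, the value of `max_distance` and the tie-break in favor of the closest voter are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

def knn_label(sample: Sequence[float],
              labeled_points: List[Tuple[Sequence[float], str]],
              k: int = 3,
              max_distance: float = 1.0) -> Optional[str]:
    """Label a sample by majority vote among its k nearest neighbors,
    counting only neighbors closer than max_distance.
    Returns None when no neighbor is close enough to vote."""
    by_distance = sorted(
        ((math.dist(sample, vector), label) for vector, label in labeled_points),
        key=lambda pair: pair[0],
    )
    voters = [(d, label) for d, label in by_distance[:k] if d < max_distance]
    if not voters:
        return None
    votes = Counter(label for _, label in voters)
    top = max(votes.values())
    # Assumed tie-break: among the top-voted labels, pick the closest voter's.
    for _, label in voters:
        if votes[label] == top:
            return label
    return None

# Hypothetical 2-D projections of known icons.
training = [
    ((0.0, 1.0), "PDF"), ((1.0, 1.0), "PDF"), ((3.0, 2.5), "PDF"),
    ((4.0, 4.0), "Chrome"), ((5.0, 4.0), "Chrome"), ((5.0, 5.0), "Chrome"),
    ((9.0, 1.0), "Installer"), ((10.0, 0.0), "Installer"),
]

print(knn_label((3.5, 3.5), training, k=3, max_distance=2.0))    # two Chrome votes, one PDF
print(knn_label((20.0, 20.0), training, k=3, max_distance=2.0))  # nothing close enough
```

Returning `None` when no neighbor is within the threshold keeps icons that resemble no popular application out of the second stage, and the list of voters doubles as an explanation of each decision.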
Once an icon is flagged as “masquerading”, more file information is gathered and fed into the classifier in the second stage, which then predicts whether the file is malicious. This nested classification architecture yielded a high-performance yet simple and explainable prediction process.
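The nested structure can be sketched as a small pipeline in which the second classifier runs only for files whose icon was matched in the first stage. Everything concrete here is hypothetical: the `Verdict` type, the toy stand-in classifiers and the file attributes (`is_executable`, `signed`) are placeholders, since the post does not list which file information the second stage uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

@dataclass
class Verdict:
    impersonated_app: Optional[str]  # None when the icon matched nothing
    malicious: Optional[bool]        # None when stage two was not run

def classify_file(icon_features: Sequence[float],
                  file_info: Dict[str, object],
                  match_icon: Callable[[Sequence[float]], Optional[str]],
                  is_malicious: Callable[[Dict[str, object], str], bool]) -> Verdict:
    """Stage one maps the icon to a popular application (or None);
    stage two runs only for matched icons and judges the file itself."""
    app = match_icon(icon_features)
    if app is None:
        return Verdict(None, None)
    return Verdict(app, is_malicious(file_info, app))

# Toy stand-ins for the two trained classifiers (assumptions, not real models).
def toy_match_icon(features: Sequence[float]) -> Optional[str]:
    return "PDF" if features and features[0] > 0.5 else None

def toy_is_malicious(info: Dict[str, object], app: str) -> bool:
    return bool(info.get("is_executable")) and not bool(info.get("signed"))

suspicious = classify_file([0.9], {"is_executable": True, "signed": False},
                           toy_match_icon, toy_is_malicious)
unrelated = classify_file([0.1], {"is_executable": True, "signed": False},
                          toy_match_icon, toy_is_malicious)
print(suspicious)
print(unrelated)
```

Keeping the stages separate means each decision can be explained on its own terms: which application the icon resembles, and why the file behind it was judged malicious or benign.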
The danger of phishing attacks is already well known. Phishing becomes an even more significant threat when you add the prospect of malicious files that masquerade as trusted programs. Slightly altering the icons of well-known applications is an easy and effective way to ensure a phishing campaign’s success.
However, advanced computer vision algorithms, combined with machine learning algorithms, can tell if a seemingly legitimate file is really malicious based on the icon’s behavior and block any attack. Dealing with today’s and tomorrow’s threats requires advanced behavioral malware detection. Standard antivirus programs, while still useful in some cases, can’t detect this type of threat.
Antivirus software checks the signatures and hashes of files and compares them to blacklists and whitelists. If the file is unknown, it isn’t blocked. These days, attackers usually generate a new sample of malware so that the file’s hash is new and not blacklisted (even slightly changing the icon, which is a part of the file, can do the trick). This means that hackers can easily bypass antivirus programs. Behavioral detection methods, by contrast, don’t use familiar external indicators like signatures or hashes. Instead, algorithms are used to determine if an unknown file is malicious.
Data Science @ Cybereason Innovation Labs.