Uncovering Unknown Threats With Human-Readable Machine Learning

Dr. Marco Balduzzi, Senior Researcher, Forward-Looking Threat Research Team

Aided by machine learning, we analyzed data on 3 million software downloads from hundreds of thousands of internet-connected machines. In the previous blog posts of this three-part series, we explored key aspects of software downloads in the wild. We looked into the major domains from which different malware categories were downloaded and discussed which client applications were most often targeted by malware infection. We also looked at code signing abuse and examined certification authorities whose certificates were used to sign malicious code. In this blog post, we discuss how we developed a human-readable machine learning system that is able to determine whether a downloaded file is benign or malicious in nature.

The development of this actionable intelligent system stemmed from the question: How can we make our knowledge about global software download events actionable? More specifically, how can we use such information to do a better job at detecting the threats posed by the large amounts of new malicious software circulating on a daily basis?

In this last installment of this blog series, we will answer such questions and give a summary of what we did with the information we’ve obtained. Our research paper titled Exploring the Long Tail of (Malicious) Software Downloads provides a more comprehensive look into how we’ve gathered and analyzed our software downloads data.

Exploration: Majority of downloaded files are still unknown

We begin with a simple observation: 83% of the downloads that we observed in the wild were unknown. This means that the downloaded files were undetected, i.e., labeled as neither benign nor malicious.

Keep in mind the following considerations:

This figure is limited to the data set that we used for our research. The first post in this series contains the details.
This figure also reflects our best effort in labeling the download data. We made use of internal proprietary systems as well as publicly available services.

Given the nature of our data set, an important observation is that most of these files have very low prevalence: taken individually, each file is downloaded by only a few machines. One might therefore think that these files are uninteresting and that it is understandable that they remain unknown.

However, if we count machines rather than files, we find that 69% of the entire machine population downloaded one or more unknown files. Had these files been malware, hundreds of thousands of machines would have been infected.
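The difference between per-file prevalence and per-machine exposure can be illustrated with a minimal sketch. The event tuples, machine IDs, file names, and function names below are made up for illustration and are not taken from our data set or tooling.

```python
from collections import defaultdict

# Hypothetical download events: (machine_id, file_id, label).
# These values are illustrative only, not real measurements.
events = [
    ("m1", "f1", "unknown"),
    ("m1", "f2", "benign"),
    ("m2", "f3", "unknown"),
    ("m3", "f2", "benign"),
    ("m4", "f4", "malicious"),
    ("m4", "f5", "unknown"),
]


def file_prevalence(events):
    """Count the distinct machines that downloaded each file."""
    machines_per_file = defaultdict(set)
    for machine, file_id, _label in events:
        machines_per_file[file_id].add(machine)
    return {f: len(machines) for f, machines in machines_per_file.items()}


def share_of_machines_with_unknown(events):
    """Fraction of machines that downloaded at least one unknown file."""
    all_machines = {machine for machine, _f, _label in events}
    exposed = {machine for machine, _f, label in events if label == "unknown"}
    return len(exposed) / len(all_machines)


prevalence = file_prevalence(events)
exposure = share_of_machines_with_unknown(events)
```

In this toy data, every unknown file is seen on a single machine, yet three of the four machines downloaded at least one of them. Low prevalence per file does not imply low exposure across the machine population.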

This raises important concerns about the actual effectiveness of large-scale, real-world deployments of malware detection and classification systems, and about their ability to defend internet-connected machines from emerging threats, especially as many of these threats appear to remain undetected.

Detection: From observation to automatic detection to reduce unknown files

The goal of our research was to reduce the number of unknown downloads, given their substantial volume.

We did that by condensing the observations drawn from our study into an actionable intelligent system. This system “ingests” these observations (for example, observations on malicious signers) and automatically produces a detection rule for each one. According to our experimental results, these rules are immediately applicable and have very high detection rates. Each rule combines several pieces of information about a download and looks, for example, like the following:

IF (the file’s signer is “Apps Installer S.L.” AND its downloading process’s signer is “Microsoft Windows” AND the file’s certification authority (CA) is “thawte code signing CA – g2”) → MALICIOUS

The pieces of information consumed by the system, i.e., features, are as follows:

Signer, CA, and packer of the downloaded file
Signer, CA, and packer of the downloading process
Class of the downloading process (browser, Windows, Java, etc.)
Popularity of the download domain
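To make the rule format concrete, here is a minimal sketch of how such a rule could be represented and matched against a download. The `Rule` class, the `classify` function, and the feature keys (`file_signer`, `process_signer`, `file_ca`, and so on) are hypothetical names chosen for illustration. The first-match evaluation order and the UNKNOWN fallback are assumptions too, not a description of our production system.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    conditions: dict  # feature name -> value the download must have
    verdict: str      # "MALICIOUS" or "BENIGN"

    def matches(self, download: dict) -> bool:
        # A rule fires only if every one of its conditions holds.
        return all(download.get(name) == value
                   for name, value in self.conditions.items())


def classify(download, rules):
    """Return the verdict of the first matching rule (assumed ordering)."""
    for rule in rules:
        if rule.matches(download):
            return rule.verdict
    return "UNKNOWN"  # no rule applies, so the file stays unlabeled


# The example rule from the text, using hypothetical feature keys.
rules = [
    Rule(
        conditions={
            "file_signer": "Apps Installer S.L.",
            "process_signer": "Microsoft Windows",
            "file_ca": "thawte code signing CA – g2",
        },
        verdict="MALICIOUS",
    ),
]

# Two made-up downloads: one that satisfies the rule, one that does not.
suspicious = {
    "file_signer": "Apps Installer S.L.",
    "process_signer": "Microsoft Windows",
    "file_ca": "thawte code signing CA – g2",
    "process_class": "Windows",
}
other = {"file_signer": "Example Corp", "process_class": "browser"}

verdict_suspicious = classify(suspicious, rules)
verdict_other = classify(other, rules)
```

Because each rule is a plain conjunction of feature values, an analyst can read directly why a file was flagged.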

This system generated an average of 1,500[1] novel detection rules per month, which reduced the number of unknown downloads by 28%.[2] The machines that downloaded these files amounted to 31%[3] of the total population, which means that our system helped protect almost a third of all machines from new malware infections.

System details: A human-readable system that keeps false positives at bay

Given the importance machine learning has gained in the security industry, we think it’s necessary to share a few words to discuss the internal workings of our system. We designed our system with two main goals in mind:

Generating detection rules that are human readable. For us, being able to explain why a certain piece of software is either benign or malicious is important. Customers and users in general are increasingly interested in knowing how they have been targeted, that is, in the context around the infection rather than the infection itself.
Common machine learning algorithms, like support vector machines (SVMs) and neural networks, suffer from “un-interpretability,” which makes their results difficult to analyze, observe, or understand. To overcome this limitation, we used the PART rule learning algorithm to derive a set of human-readable classification rules based on the features listed above (downloaded software, downloading process, and download domain).

Keeping the number of false positives (errors) as low as possible. This aspect is very important in cybersecurity operations, where thousands of unknown and new software downloads (and potential threats) are observed per day.
To do that, we used only a subset of all the rules generated by the PART algorithm, keeping only the rules whose training error rate does not exceed a configurable maximum error threshold τ. For example, for a one-month training window T_tr and by choosing the rules that have no training error (τ=0.0%), 1,148 rules out of 1,680 rules were selected.
The following table reports statistics on the rules extracted during different training windows T_tr:

T_tr	Overall no. of rules	τ	Selected rules	No. of benign rules	No. of malicious rules
Feb	1,766	0.0%	1,020	889	131
Feb	1,766	0.1%	1,031	894	137
Mar	1,680	0.0%	1,148	970	178
Mar	1,680	0.1%	1,162	976	186
Apr	1,272	0.0%	1,054	872	182
Apr	1,272	0.1%	1,070	875	195
May	1,476	0.0%	974	791	183
May	1,476	0.1%	986	793	193
Jun	944	0.0%	740	577	163
Jun	944	0.1%	753	585	168
Jul	1,376	0.0%	937	755	182
Jul	1,376	0.1%	953	763	190
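The threshold-based selection can be sketched as follows. This is an illustrative example only: the `LearnedRule` class, its fields, and the sample rules are hypothetical. It also assumes that a rule's error rate is the fraction of the training samples it covers that it misclassifies, and that a rule is kept when its error rate is at or below τ, so that τ=0.0% keeps exactly the rules with no training error.

```python
from dataclasses import dataclass


@dataclass
class LearnedRule:
    description: str
    verdict: str
    covered: int  # training samples matched by the rule
    errors: int   # matched samples that the rule misclassified

    @property
    def error_rate(self) -> float:
        # Treat a rule that covers nothing as maximally unreliable.
        return self.errors / self.covered if self.covered else 1.0


def select_rules(rules, tau):
    """Keep only the rules whose training error rate is at most tau."""
    return [rule for rule in rules if rule.error_rate <= tau]


# Made-up rules with made-up training statistics.
learned = [
    LearnedRule("signer A AND CA B", "MALICIOUS", covered=1000, errors=0),
    LearnedRule("packer C AND browser", "BENIGN", covered=1000, errors=1),
    LearnedRule("unpopular domain", "MALICIOUS", covered=200, errors=5),
]

strict = select_rules(learned, tau=0.0)     # tau = 0.0%
relaxed = select_rules(learned, tau=0.001)  # tau = 0.1%
```

Raising τ from 0.0% to 0.1% admits a few more rules at the cost of tolerating some training error, which matches the small increases in the table above. For the March window, selecting 1,148 of 1,680 rules at τ=0.0% means keeping roughly 68% of the learned rules.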

[1], [2], [3] Average values, computed over seven months’ worth of data.

Through this blog series, we sought to elaborate on our work on software downloads and its potential application to cybersecurity solutions. We started by looking at how malware campaigns are operated – both technically and economically – and how they affect organizations. We also looked at the phenomenon of code signing abuse and how criminals misuse it in the underground. In this concluding piece, we saw how domain expertise can be made actionable as a way of protecting our customers from the threats posed by the large amount of new and undetected malicious software circulating in the wild. Through a system of classification that uses machine learning technology to analyze unknown files, we can determine whether they are benign or malicious in nature. This human-readable machine learning system, as well as other pertinent findings on large-scale global download events, is discussed in more detail in our research paper titled Exploring the Long Tail of (Malicious) Software Downloads.
