Uncovering Unknown Threats With Human-Readable Machine Learning
Dr. Marco Balduzzi, Senior Researcher, Forward-Looking Threat Research Team
Aided by machine learning, we analyzed data on 3 million software downloads from hundreds of thousands of internet-connected machines. In our previous blog posts for this three-part series, we explored key aspects of software downloads in the wild. We looked into the major domains from which different malware categories were downloaded and discussed which client applications were most often targeted by malware infection. We also looked at code signing abuse and examined certification authorities whose certificates were used to sign malicious code. In this blog post, we will discuss how we developed a human-readable machine learning system that is able to determine whether a downloaded file is benign or malicious in nature.
The development of this actionable intelligent system stemmed from the question: How can we make our knowledge about global software download events actionable? More specifically, how can we use such information to do a better job at detecting the threats posed by the large amounts of new malicious software circulating on a daily basis?
In this last installment of this blog series, we will answer such questions and give a summary of what we did with the information we’ve obtained. Our research paper titled Exploring the Long Tail of (Malicious) Software Downloads provides a more comprehensive look into how we’ve gathered and analyzed our software downloads data.
Exploration: Majority of downloaded files are still unknown
We begin with a simple observation: 83% of downloads that we observed in the wild were unknown. This means that the downloaded files were undetected, i.e., labeled neither benign nor malicious.
Keep in mind the following considerations:
- This is limited to the data set that we used for our research. Our first blog post contains the details.
- This is limited to our best effort in labeling the download data. We made use of internal proprietary systems as well as publicly available services.
An important observation about our data set is that most of these files have very low prevalence: considered individually, each file is downloaded by only a few machines. One might therefore think that these files are uninteresting and that it is understandable for them to remain unknown.
However, if we consider the number of machines, we find that 69% of the entire machine population downloaded one or more unknown files: If these had been malware, hundreds of thousands of machines would have been infected.
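The difference between per-file prevalence and per-machine exposure can be illustrated with a minimal sketch. The download events below are toy data made up for illustration, not taken from our data set, and the function names are hypothetical.

```python
from collections import defaultdict


def file_prevalence(events):
    """Map each file to the number of distinct machines that downloaded it."""
    machines_per_file = defaultdict(set)
    for machine, file_id in events:
        machines_per_file[file_id].add(machine)
    return {f: len(m) for f, m in machines_per_file.items()}


def machine_exposure(events, unknown_files):
    """Fraction of machines that downloaded at least one unknown file."""
    all_machines = {machine for machine, _ in events}
    exposed = {machine for machine, file_id in events if file_id in unknown_files}
    return len(exposed) / len(all_machines)


# Toy data (hypothetical): (machine, file) download events.
events = [
    ("m1", "known_a"), ("m2", "known_a"), ("m3", "known_a"), ("m4", "known_a"),
    ("m1", "unk_1"), ("m2", "unk_2"), ("m3", "unk_3"),
]
unknown_files = {"unk_1", "unk_2", "unk_3"}

prevalence = file_prevalence(events)                  # each unknown file: 1 machine
exposure = machine_exposure(events, unknown_files)    # 3 of 4 machines exposed
```

In this toy example every unknown file reaches a single machine, yet three out of four machines downloaded at least one unknown file. Low prevalence per file therefore does not imply low exposure across the machine population.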
Of course, this raises important concerns about the actual effectiveness of large-scale, real-world deployments of malware detection and classification systems, and about their ability to defend internet-connected machines from new threats, especially since many of these threats appear to remain undetected.
Detection: From observation to automatic detection to reduce unknown files
The goal of our research was to reduce the number of unknown downloads, given their substantial volume.
We did that by condensing the observations drawn from our study into an actionable intelligent system. This system “ingests” these observations (for example, observations on malicious signers) and automatically produces detection rules for each one. These rules are immediately applicable and have very high detection rates — at least according to our experimental results. A rule is therefore a combination of information and will look, for example, like the following:
IF (the file’s signer is “Apps Installer S.L.” AND its downloading process’s signer is “Microsoft Windows” AND the file’s certification authority (CA) is “thawte code signing CA – g2”) → MALICIOUS
The pieces of information consumed by the system, i.e., features, are as follows:
- Signer, CA, and packer of the downloaded file
- Signer, CA, and packer of the downloading process
- Class of the downloading process (browser, Windows, Java, etc.)
- Popularity of the download domain
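As a rough illustration of how such a rule could be represented and applied, here is a minimal sketch. The feature names (`file_signer`, `process_signer`, `file_ca`, and so on), the `Rule` class, and the `classify` helper are hypothetical and chosen for this example; they are not the internal format of our system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Tuple of (feature_name, required_value) pairs; all must hold (logical AND).
    conditions: tuple
    label: str

    def matches(self, download: dict) -> bool:
        return all(download.get(name) == value for name, value in self.conditions)


def classify(download, rules, default="unknown"):
    """Return the label of the first matching rule, or `default` if none match."""
    for rule in rules:
        if rule.matches(download):
            return rule.label
    return default


# The example rule from the text, expressed with hypothetical feature names.
example_rule = Rule(
    conditions=(
        ("file_signer", "Apps Installer S.L."),
        ("process_signer", "Microsoft Windows"),
        ("file_ca", "thawte code signing CA – g2"),
    ),
    label="malicious",
)

# A made-up download event described by the same kind of features.
download = {
    "file_signer": "Apps Installer S.L.",
    "file_ca": "thawte code signing CA – g2",
    "process_signer": "Microsoft Windows",
    "process_class": "windows",
    "domain_popularity": "low",
}

verdict = classify(download, [example_rule])
```

Because each rule is just a conjunction of feature values, an analyst can read the verdict directly off the conditions that fired. A download that matches no rule keeps the `unknown` label rather than being forced into either class.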
This system generated 1,500[1] novel detection rules per month, which reduced the number of unknown downloads by 28%.[2] The machines that downloaded these files amounted to 31%[3] of the total population, so our system proved to be an essential tool in protecting almost a third of all machines from new malware infections.
System details: A human-readable system that keeps false positives at bay
Given the importance machine learning has gained in the security industry, we think it’s necessary to share a few words to discuss the internal workings of our system. We designed our system with two main goals in mind:
- Generating detection rules that are human readable. For us, being able to explain why a certain piece of software is either benign or malicious is important. In fact, customers and users in general are more and more interested in knowing how they have been targeted, that is, the context around the infection rather than the infection itself. Common machine-learning algorithms, like support vector machines (SVMs) and neural networks, suffer from “un-interpretability,” which makes the results difficult to analyze, observe, or understand. To overcome this limitation, we used the PART rule learning algorithm to derive a set of human-readable classification rules based on the features listed above (downloaded software, downloading process, and download domain).
- Keeping the number of false positives (errors) as low as possible. This aspect is very important in cybersecurity operations where thousands of unknown and new software downloads (and potential threats) are observed per day. To do that, we used only a subset of all the rules generated by the PART algorithm, i.e., we included only the rules with error rates no greater than a maximum (configurable) error threshold τ. For example, for a one-month training window Ttr and by choosing the rules that have no training error (τ=0.0%), 1,148 rules out of 1,680 rules were selected.
The following table reports statistics about the rules extracted during different training windows Ttr:
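| Ttr | Overall no. of rules | τ | Selected rules | No. of benign rules | No. of malicious rules |
|-----|----------------------|------|----------------|---------------------|------------------------|
| Feb | 1,766 | 0.0% | 1,020 | 889 | 131 |
|     |       | 0.1% | 1,031 | 894 | 137 |
| Mar | 1,680 | 0.0% | 1,148 | 970 | 178 |
|     |       | 0.1% | 1,162 | 976 | 186 |
| Apr | 1,272 | 0.0% | 1,054 | 872 | 182 |
|     |       | 0.1% | 1,070 | 875 | 195 |
| May | 1,476 | 0.0% | 974 | 791 | 183 |
|     |       | 0.1% | 986 | 793 | 193 |
| Jun | 944 | 0.0% | 740 | 577 | 163 |
|     |     | 0.1% | 753 | 585 | 168 |
| Jul | 1,376 | 0.0% | 937 | 755 | 182 |
|     |       | 0.1% | 953 | 763 | 190 |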
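The rule selection step can be sketched as a simple filter on each rule's training error. This is a minimal illustration under stated assumptions: the `LearnedRule` class, its fields, and the toy counts are hypothetical, and the sketch covers only the thresholding step, not the PART algorithm that learns the rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnedRule:
    name: str
    label: str
    covered: int        # training samples matched by the rule
    misclassified: int  # matched samples whose true label differs from `label`

    @property
    def error_rate(self) -> float:
        # A rule that covers nothing is treated as maximally unreliable.
        return self.misclassified / self.covered if self.covered else 1.0


def select_rules(rules, tau):
    """Keep rules whose training error does not exceed the threshold tau.

    The comparison is inclusive so that tau = 0.0 keeps exactly the rules
    with no training error.
    """
    return [r for r in rules if r.error_rate <= tau]


# Toy rules (hypothetical counts, for illustration only).
rules = [
    LearnedRule("r1", "benign", covered=500, misclassified=0),      # 0.00% error
    LearnedRule("r2", "malicious", covered=2000, misclassified=1),  # 0.05% error
    LearnedRule("r3", "malicious", covered=100, misclassified=5),   # 5.00% error
]

strict = select_rules(rules, tau=0.0)     # tau = 0.0%
relaxed = select_rules(rules, tau=0.001)  # tau = 0.1%
```

Raising τ from 0.0% to 0.1% admits a few more rules at the cost of tolerating a small training error, which mirrors the modest increase in selected rules between the two τ rows for each month in the table.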
[1] Average number. Averaged based on seven months’ worth of data.
[2] Average number. Averaged based on seven months’ worth of data.
[3] Average number. Averaged based on seven months’ worth of data.
Through this blog series, we sought to elaborate on our work on software downloads and its potential application to cybersecurity solutions. We started by looking at how malware campaigns are operated – both technically and economically – and how they affect organizations. We also looked at the phenomenon of code signing abuse and how criminals misuse it in the underground. In this concluding piece, we saw how domain expertise can be made actionable as a way of protecting our customers from the threats posed by the large amount of new and undetected malicious software circulating in the wild. Through a system of classification that uses machine learning technology to analyze unknown files, we can determine whether they are benign or malicious in nature. This human-readable machine learning system, as well as other pertinent findings on large-scale global download events, is discussed in more detail in our research paper titled Exploring the Long Tail of (Malicious) Software Downloads.