Malware Classification with ‘Graph Hash,’ Applied to the Orca Cyberespionage Campaign

By Chia Ching Fang and Shih-Hao Weng (Threat Researchers)

In malware research, threat hunting and sharing of threat intelligence, such as exchanging indicators of compromise (IoCs) in the form of hashes (e.g., MD5s, SHA256s), are common industry practices and helpful for information security professionals. Researchers, for instance, would typically search for malware samples on VirusTotal using hashes. However, hashes have some characteristics that could limit researchers trying to do file or threat correlation, such as the one-to-one relationship between a file and its hash. To overcome this, other hashing techniques, methodologies, and tools have been proposed, such as ssdeep, sdhash, imphash, and even our own Trend Micro Locality Sensitive Hashing (TLSH) — and they can indeed help researchers find and identify the similarities between binary files. These approaches use binary as a point of view.

Alternatively, researchers use other methodologies, such as using graphs as a point of view, to compare an executable file’s characteristics with others. For example, zynamics’ BinDiff does this by taking and viewing a bigger picture of an executable to see the similarities or differences between two executable files. This approach processes two files at the same time.

Our research, which we’ve named “Graph Hash,” builds on the advantages of these two approaches by calculating the hash of executable files using a graph view, which would help in classifying malware more consistently and efficiently. Our research aims to provide a viable approach to malware classification, which, in turn, can help in the sharing of actionable threat intelligence beyond simple checksums, such as MD5s and secure hash algorithm (SHA) families.

We applied Graph Hash to the Orca cyberespionage campaign to demonstrate this method’s viability.

Graph Hash: Viewing Executables Through Call Graphs

Graph Hash uses a call graph to view executables. A call graph, essentially the executable’s workflow, is generated by creating a vertex graph for each of the executable’s functions and then connecting the vertices of the function based on the existence of a call that cross-references one function to another. Figure 1 is an example of a call graph.

Figure 1. An example of a call graph plotted via Interactive Disassembler Pro (IDA Pro)

As shown in the figure above, executables that have the same workflow will share the same call graph. This, in a nutshell, is how we use a call graph to classify executables. After creating the call graph, we convert it into what we call a “call graph pattern” (CGP). CGP is a binary-based data format presented through its corresponding call graph. With CGP, we can get a checksum, carry out fuzzy hashing, and perform search or similarity analysis.

How Call Graphs Work

A call graph is composed of vertices and edges, where vertices represent an executable’s functions and edges indicate the relationship between vertices (e.g., function A will call function B). For instance, we can take two properties of functions and define the “values” of these functions. The vertex can be added with color to better distinguish call graphs more easily.

Figure 2. An example of a vertex

Let’s take Figure 2 as an example. The value is 16 bits (binary), with 0–7 storing the address block. We can then do the following to acquire the address block:

  1. Divide the executable address information into 16 blocks from top to bottom (this means that there is block 0 to block 15).
  2. Given a start address of a function belonging to block 0, we assign the function’s address block as 0, and so on.

In an 8-15-bit address block that stores functions, there are certain types. Note that the function types are based on BinExport (a sub-project of BinDiff):

  • Type 0 — Regular function with full disassembly and isn’t a library or imported function.
  • Type 1 — A well-known library function.
  • Type 2 — An imported function from a dynamic link library.
  • Type 3 — A thunk function that forwards its work via an unconditional jump.
  • Type 4 — An invalid function.

Furthermore, there is a special function among a call graph’s functions that we named “root,” which is the function that an executable starts with. With these, we could use vertices and edges to describe a call graph.

The next step involves doing a call graph traversal, whose build up is what we call CGP. The traversal algorithm we use is Depth-First Search.

There is one concept that we want to emphasize here, which is the “reuse” characteristic of functions. It means that one function in an executable might be called more than once. A call graph view, for example, will go through this kind of vertex and its descendant more than once. We want to keep this kind of path in CGP, however, in order to reduce duplication, improve efficiency, and avoid having to go through all paths from visited ancestor to descendant. We also keep the visited ancestor in the CGP simply to present the reuse relationship.

Figure 3. An example of simple call graph

Figure 3 is an example of a simple call graph. To further demonstrate, the following is an example of how values were assigned to the vertices, root, and edges:

  • Vertices: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  • Root Vertex: {5}
  • Edges: {1, 9} {2, 0} {5, 9} {5, 6} {6, 1} {8, 3} {8, 4} {9, 7} {9, 8} {9, 2}

In this example, the graph hashing will be “59783420619” and not “5978342061978342”. With this, we can have a CGP to present the related call graph. With CGP, we could calculate the graph checksum and perform fuzzy hashing. In an MD5 of a CFG, for example, it can be labeled “Graph MD5,” while an ssdeep of CFG can be labeled “Graph ssdeep.”

Applying Graph Hash to the Orca Cyberespionage Campaign

To demonstrate Graph Hash, we applied it to the Orca cyberespionage campaign, which we have been tracking since 2011 (and disclosed in 2017). The table below summarizes our findings based on 322 samples of executables that the group purportedly used against their targeted organizations. We were also able to classify 10 malware families based on their tokens, comprised of the communications protocols or command-and-control (C&C) servers that the malware used. We used Graph Hash to calculate the grouping/similarity rates of samples related to Orca’s attacks. The results are in the table below.

Grouping/Similarity Rate of the Samples Groups
Graph MD5 81% (260/322) 71
Graph ssdeep 85% (274/322) 67
File ssdeep 66%   (211/322) 62
By Malware Handler 100% (322/322) 10

Table 1. Graph Hash grouping of samples related to Orca campaign

Of course, Graph Hash has its limitations. For example, this approach heavily relies on the use of IDA Pro, and that CGP’s ability to recognize malware packer routines is limited. We are, however, still fine-tuning Graph Hash and extending its capabilities to ELF and Mach-O files, as well as frameworks or tools for reverse engineering.

Graph Hash was demonstrated in the Hack in the Box GSEC Conference 2019 during the “What Species of Fish Is This? Malware Classification with Graph Hash” session on August 29, 2019 at InterContinental Singapore.

The post Malware Classification with ‘Graph Hash,’ Applied to the Orca Cyberespionage Campaign appeared first on .

Read more: Malware Classification with ‘Graph Hash,’ Applied to the Orca Cyberespionage Campaign

Story added 6. September 2019, content source with full text you can find at link above.