DefPloreX: A Machine-Learning Toolkit for Large-scale eCrime Forensics
By Marco Balduzzi and Federico Maggi
The security industry as a whole loves collecting data, and researchers are no different. With more data, they commonly become more confident in their statements about a threat. However, large volumes of data require more processing resources, as extracting meaningful and useful information from highly unstructured data is particularly difficult. As a result, manual data analysis is often the only choice, forcing security professionals like investigators, penetration testers, reverse engineers, and analysts to process data through tedious and repetitive operations.
We have created a flexible toolkit based on open-source libraries for efficiently analyzing millions of defaced web pages. It can also be used on web pages planted as a result of an attack in general. Called DefPloreX (a play on words from “Defacement eXplorer”), our tool uses a combination of machine-learning and visualization techniques to turn unstructured data into meaningful high-level descriptions. Real-time information on incidents, breaches, attacks, and vulnerabilities is processed and condensed into browsable objects that are suitable for efficient large-scale e-crime forensics and investigations.
DefPloreX ingests plain, flat tabular files (e.g., CSV files) containing metadata records of the web incidents under analysis (e.g., URLs), explores their resources with headless browsers, extracts features from the defaced web pages, and stores the resulting data in an Elastic index. The distributed headless browsers, as well as any large-scale data-processing operation, are coordinated via Celery, the de facto standard for distributed task coordination. Using a multitude of Python-based data-analysis techniques and tools, DefPloreX creates offline “views” of the data, allowing easy pivoting and exploration.
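To make the ingestion step concrete, here is a minimal sketch (not the DefPloreX code) of reading incident records from a CSV file and grouping them into batches that could each be handed to a worker. The column names `url`, `attacker`, and `date`, the helper names, and the batch size are all assumptions made for illustration. The sketch uses only the Python standard library and leaves out the actual Celery dispatch and Elastic indexing.

```python
import csv
import io
from itertools import islice


def read_incidents(csv_file):
    """Yield one dict per incident row; rows without a URL are skipped."""
    for row in csv.DictReader(csv_file):
        url = (row.get("url") or "").strip()
        if url:
            yield {
                "url": url,
                "attacker": (row.get("attacker") or "").strip(),
                "date": (row.get("date") or "").strip(),
            }


def batches(records, size):
    """Group records into lists of at most `size` items (one list per task)."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample data; a real run would open a file instead.
SAMPLE = """url,attacker,date
http://example.fr/index.html,crew-a,2015-01-10
,crew-b,2015-01-11
http://example.org/,crew-a,2015-01-12
http://example.net/x.php,crew-c,2015-02-01
"""

incident_batches = list(batches(read_incidents(io.StringIO(SAMPLE)), 2))
for batch in incident_batches:
    print([record["url"] for record in batch])
```

Because both helpers are generators, the file is streamed rather than loaded at once, which is in line with the memory-frugal design described later in this post.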
The most interesting aspect of DefPloreX is that it automatically groups similar defaced pages into clusters and organizes web incidents into campaigns. Requiring only one pass over the data, the clustering technique we use is intrinsically parallel and not memory bound. DefPloreX offers text- and web-based UIs, which can be queried using a simple language for investigations and forensics. Since it’s based on Elasticsearch, the data DefPloreX produces can be easily integrated with other systems.
Here is an example of how an analyst could use DefPloreX to investigate a campaign called “Operation France” (with “#opfrance” being the Twitter hashtag associated with it). This campaign is run predominantly by online Muslim activists with the goal of supporting radical Islamism.
As Figure 1 shows, this campaign hit 1,313 websites over a period of four years (2013-2016), mainly targeting French domain names (Figure 2). DefPloreX reveals the actors’ composition and the defacement templates used in the attacks (Figure 3). Some of these actors explicitly support attacks against France carried out by radical Islamists (e.g., acts of terrorism) (Figure 4).
Figures 1-4. Investigation example for campaign Operation France (#opfrance)
DefPloreX supports the analyst in the following operations:
- importing and exporting generic data into and from an Elastic index
- enriching the index with various attributes
- visiting web pages in an automated, parallel fashion to extract numerical and visual features that capture the structure of the HTML page and its appearance when rendered
- post-processing the numerical and visual features to extract a compact representation that describes each web page (we call this representation a “bucket”)
- using the compact representation to pivot the original web pages, grouping them into clusters of similar pages
- performing generic browsing and querying of the Elastic index.
The following diagram shows the architecture of DefPloreX:
Figure 5. Overview of DefPloreX capabilities
From each web page, we wanted to collect two sides of the same story: a “static” view of the page (e.g., non-interpreted resources, scripts, text) and a “dynamic” view of the same page (e.g., a rendered page with DOM modifications and so on). The full version of DefPloreX can extract URLs, e-mail addresses, social-network nicknames and handles, hashtags, images, file metadata, summarized text, and other information. This data captures the main characteristics of any defaced web page.
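As a rough illustration of the kind of extraction described above, the following sketch pulls URLs, e-mail addresses, hashtags, and handles out of a page's text with regular expressions. The patterns are deliberately simple assumptions, not the rules the full DefPloreX uses, and they will miss or over-match some real-world cases.

```python
import re

# Simplified, assumed patterns for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HASHTAG_RE = re.compile(r"(?<![\w&])#([A-Za-z]\w*)")
HANDLE_RE = re.compile(r"(?<![\w.])@([A-Za-z_]\w*)")


def extract_indicators(text):
    """Return de-duplicated, sorted indicators found in `text`."""
    urls = URL_RE.findall(text)
    emails = EMAIL_RE.findall(text)
    # Blank out URLs and e-mails first, so that URL fragments (#top) and
    # mail domains (@example.com) are not mistaken for hashtags or handles.
    rest = EMAIL_RE.sub(" ", URL_RE.sub(" ", text))
    return {
        "urls": sorted(set(urls)),
        "emails": sorted(set(emails)),
        "hashtags": sorted({h.lower() for h in HASHTAG_RE.findall(rest)}),
        "handles": sorted({h.lower() for h in HANDLE_RE.findall(rest)}),
    }


# Hypothetical defacement message.
SAMPLE_TEXT = (
    "Hacked by crew_a #OpFrance contact: crew@example.com "
    "follow @crew_a see http://example.org/mirror#top"
)

indicators = extract_indicators(SAMPLE_TEXT)
print(indicators)
```

Lower-casing hashtags and handles is a design choice in this sketch: it lets “#OpFrance” and “#opfrance” count as the same indicator when records are later grouped.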
Figure 6. Collection of data from URLs
We approached the problem of finding groups of related defaced web pages (e.g., hacktivism campaigns) as a typical data-mining problem. We assume that there are recurring and similar characteristics among these pages that we can capture and use as clustering features. For example, we assume that the same attacker will reuse the same web snippets (albeit with minimal variations) within the same campaign. We capture this and other aspects by extracting numerical and categorical features from the data we obtained by analyzing each page (static and dynamic view).
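The sketch below shows one way numerical features could be derived from the static view of a page, using only Python's built-in HTML parser. The specific features (tag, image, script, and link counts, plus the amount of visible text) and all names are assumptions chosen for illustration; the features DefPloreX actually computes are the ones summarized in Figure 7.

```python
from collections import Counter
from html.parser import HTMLParser


class StaticFeatureParser(HTMLParser):
    """Counts a few structural properties of an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.links = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a" and any(k == "href" and v for k, v in attrs):
            self.links += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count only visible, non-whitespace characters.
        if not self._skip_depth:
            self.text_chars += sum(1 for ch in data if not ch.isspace())


def extract_features(html):
    """Return a flat dict of numerical features for one page."""
    parser = StaticFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "n_tags": sum(parser.tags.values()),
        "n_images": parser.tags["img"],
        "n_scripts": parser.tags["script"],
        "n_links": parser.links,
        "n_text_chars": parser.text_chars,
    }


# Hypothetical defaced page.
SAMPLE_HTML = (
    '<html><body bgcolor="black"><h1>Hacked by crew_a</h1>'
    '<img src="a.png"><img src="b.png"><script>var x = 1;</script>'
    '<a href="http://example.org">mirror</a></body></html>'
)

features = extract_features(SAMPLE_HTML)
print(features)
```

Two pages built from the same template with small variations would yield nearly identical feature values, which is the property the clustering relies on.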
Figure 7. Features obtained from each URL
DefPloreX also sports a feature called “data bucketing,” which we use to derive a compact representation of each record. This compact representation is then used to enable fast clustering. In our case, a record is an instance of a defaced page, but this method can be applied to other domains. When applied to numeric features, this bucketing functionality represents a real number (of any range) by using only a limited set of categorical values (i.e., low, medium, high).
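A minimal sketch of numeric bucketing follows. It assumes three equally populated buckets separated by the tertiles of the observed values; the actual cut points used by DefPloreX are not stated here, so treat the thresholds and function names as illustrative. The cut points are computed locally with Python's `statistics` module rather than through a percentile query on the index.

```python
from statistics import quantiles


def fit_cutpoints(values):
    """Return the two tertile cut points of the observed values."""
    low_cut, high_cut = quantiles(values, n=3, method="inclusive")
    return low_cut, high_cut


def bucket(value, cutpoints):
    """Map a real number to one of three categorical labels."""
    low_cut, high_cut = cutpoints
    if value <= low_cut:
        return "low"
    if value <= high_cut:
        return "medium"
    return "high"


# Hypothetical number of images found on ten defaced pages.
image_counts = [1, 2, 3, 4, 5, 6, 7, 40, 120, 900]

cutpoints = fit_cutpoints(image_counts)
labels = [bucket(v, cutpoints) for v in image_counts]
print(cutpoints, labels)
```

Using percentiles instead of fixed thresholds keeps the buckets meaningful for skewed features: the outlier with 900 images simply lands in “high” without stretching the other two ranges.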
Elasticsearch natively supports the statistics primitives (e.g., percentiles) required to perform this transformation from numerical values to categorical values. When applied to features that are already categorical (e.g., the character encoding used in a web page), this bucketing functionality maps each encoding scheme (e.g., “windows-1250,” “iso-*”) to the geographical region in which that encoding is typically used (e.g., European, Cyrillic, Greek). The same can be done for spoken languages, TLDs, and so on.
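The following sketch illustrates categorical bucketing and how the resulting compact representation can drive a single-pass grouping. The encoding-to-region table is a small, incomplete example written for this illustration, and grouping records by an exact match on their bucket tuple is an assumed simplification, not necessarily the clustering DefPloreX performs.

```python
from collections import defaultdict

# Illustrative, incomplete mapping from character encoding to region.
ENCODING_REGION = {
    "windows-1250": "european",
    "iso-8859-1": "european",
    "iso-8859-2": "european",
    "windows-1251": "cyrillic",
    "iso-8859-5": "cyrillic",
    "koi8-r": "cyrillic",
    "windows-1253": "greek",
    "iso-8859-7": "greek",
}


def encoding_bucket(encoding):
    """Map an encoding name to a coarse region label."""
    return ENCODING_REGION.get(encoding.strip().lower(), "other")


def cluster_by_bucket(records, fields):
    """Group record URLs by their tuple of bucket values in one pass."""
    clusters = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in fields)
        clusters[key].append(record["url"])
    return dict(clusters)


# Hypothetical records whose numeric features are already bucketed.
records = [
    {"url": "http://a.example", "encoding": "Windows-1251", "images": "low"},
    {"url": "http://b.example", "encoding": "KOI8-R", "images": "low"},
    {"url": "http://c.example", "encoding": "iso-8859-7", "images": "high"},
]
for record in records:
    record["region"] = encoding_bucket(record["encoding"])

clusters = cluster_by_bucket(records, ["region", "images"])
print(clusters)
```

Each record is looked at exactly once and its key depends on nothing but the record itself, so the key computation can be spread across workers and merged afterwards.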
The web-based UI is built with React and backed by a lightweight REST API written in Flask. It is essentially a “spreadsheet on steroids,” in the sense that smart pagination allows it to scale to an arbitrary number of records. The main task of the web-based UI is browsing through clusters and records. For example, to spot a web-defacement campaign coordinated by the same (small) group of cyber-criminals, we would query DefPloreX to display clusters with at most ten attackers and inspect the timeline of each cluster to spot periodic patterns or spikes in activity, revealing coordinated attacks.
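The query described above can be approximated offline in a few lines. This sketch assumes clusters are available as a dict mapping a cluster ID to its records, each with hypothetical `attacker` and ISO-formatted `date` fields; it does not use the DefPloreX query language or the REST API.

```python
from collections import Counter


def small_group_clusters(clusters, max_attackers=10):
    """Keep clusters whose records involve at most `max_attackers` actors."""
    return {
        cluster_id: recs
        for cluster_id, recs in clusters.items()
        if len({r["attacker"] for r in recs}) <= max_attackers
    }


def monthly_timeline(records):
    """Count records per month, assuming dates look like YYYY-MM-DD."""
    counts = Counter(r["date"][:7] for r in records)
    return dict(sorted(counts.items()))


# Hypothetical clusters: c1 has 2 distinct attackers, c2 has 11.
sample_clusters = {
    "c1": [
        {"attacker": "crew-a", "date": "2015-01-10"},
        {"attacker": "crew-b", "date": "2015-01-12"},
        {"attacker": "crew-a", "date": "2015-02-01"},
    ],
    "c2": [
        {"attacker": f"actor-{i}", "date": "2016-03-01"} for i in range(11)
    ],
}

selected = small_group_clusters(sample_clusters)
timelines = {cid: monthly_timeline(recs) for cid, recs in selected.items()}
print(timelines)
```

A sudden jump in one month's count for a small-group cluster is the kind of spike an analyst would then inspect record by record.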
In all of its operations, DefPloreX keeps the amount of memory used to the bare minimum without sacrificing performance. DefPloreX works well on a simple laptop but can scale up when more computing resources are available.
Figures 8-11. Example of DefPloreX usage
Following our talk at this year’s Black Hat USA Arsenal in Las Vegas on July 27, we released part of DefPloreX under the FreeBSD License on GitHub. The released toolkit consists of a framework library for large-scale computation on Elasticsearch records. A copy of our presentation may be found here.