BotGraph: Large Scale Spamming Botnet Detection

Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum, “BotGraph: Large Scale Spamming Botnet Detection,” in Proceedings of NSDI ’09, April 2009.

Summary

Analyzing large volumes of logged data to identify abnormal patterns is one of the biggest and most frequently faced challenges in the network security community. This paper presents BotGraph, a mechanism based on random graph theory that detects botnet spamming attacks on major web email providers from millions of log entries, and describes its implementation as a data-parallel system. The authors posit that even though bot-users (fake email accounts created by botnets) are hard to detect individually, they exhibit telltale aggregate behavior (e.g., they share IP addresses when they log in and send emails). BotGraph detects this abnormal sharing of IP addresses among bot-users to separate them from human users.

BotGraph has two major components: aggressive sign-up detection, which identifies sudden increases in signup activity from the same IP address, and stealthy bot detection, which detects both the sharing of one IP address by many bot-users and the use of many IP addresses by a single bot-user by constructing a user-user graph. The authors treat multiple IP addresses from the same AS as one shared IP address. It is also assumed that legitimate groups of normal users do not normally share the same set of IP addresses across different ASes.
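The sign-up detector is a simple EWMA (exponentially weighted moving average) change-detection scheme over per-IP signup counts. Here is a minimal sketch of that idea; the per-window record layout, smoothing factor, and alert threshold are illustrative assumptions, not the paper’s tuned values:

```python
def ewma_signup_alerts(signup_counts, alpha=0.5, threshold=20):
    """Flag IPs whose signup rate suddenly exceeds the smoothed history.

    signup_counts: dict mapping IP -> list of per-window signup counts.
    alpha and threshold are illustrative values, not the paper's.
    """
    suspicious = []
    for ip, counts in signup_counts.items():
        if not counts:
            continue
        ewma = counts[0]
        for count in counts[1:]:
            # Alert when the current window far exceeds the smoothed history.
            if count - ewma > threshold:
                suspicious.append(ip)
                break
            ewma = alpha * count + (1 - alpha) * ewma
    return suspicious
```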

In order to identify bot-user groups, the authors create a user-user graph where each user is a vertex and each edge between two vertices carries a weight based on a similarity metric between the two users (essentially, how much login-IP history they share). The authors’ assumption is that bot-users will form a giant connected component in the graph (since they share IP addresses), collectively distinguishing them from normal users, who form much smaller connected components. They show that there exists a threshold on edge weights below which large components suddenly appear, and they prove, using random graph theory, that if there is IP address sharing, then the giant component will emerge with high probability.
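As a concrete sketch of the construction, the following builds the weighted user-user graph from login records (collapsing IPs to ASes, following the paper’s convention) and extracts connected components above a weight threshold with a union-find pass. The data layout and helper names are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def build_user_user_edges(user_logins):
    """Compute edge weights for the user-user graph.

    user_logins: dict mapping user -> set of AS numbers seen at login
    (IPs within one AS are collapsed, following the paper).
    Edge weight = number of distinct ASes two users both logged in from.
    """
    as_to_users = defaultdict(set)
    for user, ases in user_logins.items():
        for asn in ases:
            as_to_users[asn].add(user)
    weights = defaultdict(int)
    for users in as_to_users.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return weights

def connected_components(nodes, weights, w_min):
    """Union-find over edges whose weight is at least w_min."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (u, v), w in weights.items():
        if w >= w_min:
            parent[find(u)] = find(v)
    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return list(groups.values())
```

With a low threshold (e.g., `w_min=2`), bot-users collapse into one giant component; raising `w_min` fragments it, and that is exactly the knob the recursive algorithm below turns.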

The detection algorithm works iteratively: it starts with a small threshold value to create a large component, then recursively increases the threshold to extract connected sub-components (until the size of the connected components falls below some constant). The resulting output is a hierarchical tree. The algorithm then uses statistical measurements (histograms) to separate bot-user group components from their legitimate counterparts (this part of the algorithm is questionable!). After the pruning stage, BotGraph traverses the hierarchical tree once more to consolidate bot-user group information.
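A compact sketch of the recursive extraction, reusing the `connected_components` helper above (the starting threshold of 2 and the size cutoff of 100 are illustrative, not necessarily the paper’s exact parameters):

```python
def grow_tree(nodes, weights, w=2, max_size=100):
    """Recursively raise the edge-weight threshold to split a component
    into sub-components, yielding the hierarchical tree.

    nodes: users in the current component; weights: full edge-weight map.
    Terminates because w grows past the maximum edge weight, at which
    point every sub-component is a singleton.
    """
    tree = {"users": nodes, "threshold": w, "children": []}
    if len(nodes) <= max_size:
        return tree  # small enough: stop splitting
    node_set = set(nodes)
    sub_weights = {e: wt for e, wt in weights.items()
                   if e[0] in node_set and e[1] in node_set}
    for comp in connected_components(nodes, sub_weights, w + 1):
        if len(comp) > 1:  # singletons fall out as the threshold rises
            tree["children"].append(grow_tree(comp, weights, w + 1, max_size))
    return tree

# Usage sketch: build one tree per component of the initial coarse graph.
# trees = [grow_tree(c, weights) for c in connected_components(users, weights, 2)]
```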

The difficulty in applying BotGraph arises first from constructing the graph, which can contain millions of vertices, and then from processing that huge graph to actually run the detection algorithm. The authors propose two data-parallel approaches: one using the currently popular MapReduce model directly, and an extension of it that performs selective filtering and adds two more interfaces beyond the basic Map and Reduce. The extended version proves to be the more resource-efficient, scalable, and faster of the two.
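To make the first (basic) approach concrete, here is a single-process imitation of the IP-partitioned dataflow: the map stage keys login records by IP, and the reduce stage emits partial edge weights for every user pair seen under the same IP. The authors run this at scale on Dryad/DryadLINQ; the record layout here is an assumption:

```python
from collections import defaultdict
from itertools import combinations

def map_stage(login_records):
    """Key each (user, ip) login record by its (AS-collapsed) IP."""
    for user, ip in login_records:
        yield ip, user

def reduce_stage(grouped):
    """For each IP, emit a weight-1 contribution per user pair sharing it."""
    edge_weights = defaultdict(int)
    for users in grouped.values():
        for u, v in combinations(sorted(users), 2):
            edge_weights[(u, v)] += 1
    return edge_weights

def run(login_records):
    grouped = defaultdict(set)  # shuffle: group mapped records by IP key
    for ip, user in map_stage(login_records):
        grouped[ip].add(user)
    return reduce_stage(grouped)
```

The weakness of this layout is the pairwise expansion under busy IPs (e.g., large proxies), which blows up the intermediate data; the extended approach partitions records by user ID and selectively filters unneeded records before generating edges, which is why it scales better.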

Evaluation on the data-parallel systems shows that BotGraph could identify a large number of previously unknown bot-user accounts with a low false positive rate.

Critique

The whole concept rests on the assumption that botnet operators are “stupid.” The criteria used in the paper are pretty straightforward, and it shouldn’t be too hard for botnet operators to figure them out and work around them.

The MapReduce part in the middle of the paper, which takes up four pages, doesn’t seem that new; they could have left it out, or at least shortened it and saved everyone some time. After all, MapReduce was created to handle terabytes of data; making it work on 500 GB should not be that awe-inspiring. To give the authors credit, they are most probably the first to use it in this context.

One more thing: I wish someone would publish what GMail does to filter spam, because I can bet they are pretty darn good at it and far ahead of the other web mail providers.
