Category Archives: Reviews

BotGraph: Large Scale Spamming Botnet Detection

Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, “BotGraph: Large Scale Spamming Botnet Detection,” NSDI’09, (April 2009). [PDF]

Summary

Analyzing large volumes of logged data to identify abnormal patterns is one of the biggest and most frequently faced challenges in the network security community. This paper presents BotGraph, a mechanism grounded in random graph theory that detects botnet spamming attacks on major web email providers from millions of log entries, and describes its implementation as a data-parallel system. The authors posit that even though individually detecting bot-users (false email accounts/users created by botnets) is difficult, they do exhibit aggregate behavior (e.g., they share IP addresses when they log in and send emails). BotGraph detects abnormal sharing of IP addresses among bot-users to separate them from human-users.

BotGraph has two major components: aggressive sign-up detection, which identifies sudden increases in sign-up activity from the same IP address, and stealthy bot detection, which detects the sharing of one IP address by many bot-users as well as the sharing of many IP addresses by a single bot-user by creating a user-user graph. In this paper, the authors consider multiple IP addresses from the same AS as one shared IP address. It is also assumed that legitimate groups of normal users do not use the same set of IP addresses from different ASes.

In order to identify bot-user groups, the authors create a user-user graph where each user is a vertex and each link between two vertices carries a weight based on a similarity metric between the two vertices. The authors' assumption is that the bot-users will create a giant connected component in the BotGraph (since they share IP addresses) that collectively distinguishes them from normal users (who form much smaller connected components). They show that there exists a threshold on edge weights below which large components suddenly appear. Using random graph theory, it is proven that if there is IP address sharing, then a giant component will appear with high probability.
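To make the construction concrete, here is a minimal sketch (mine, not the authors' code) that builds the user-user graph from a toy login log of (user, AS) pairs, weights each edge by the number of shared ASes, and extracts connected components with a union-find; all names and thresholds are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def connected_components(users, logins, weight_threshold):
    """Group users whose edge weight (number of shared ASes) >= weight_threshold.

    users  -- iterable of user ids
    logins -- iterable of (user, asn) pairs, a toy stand-in for the login log
    """
    # Map each AS to the users that logged in from it.
    by_asn = defaultdict(set)
    for user, asn in logins:
        by_asn[asn].add(user)

    # Edge weight between two users = number of ASes they share.
    weight = defaultdict(int)
    for group in by_asn.values():
        for u, v in combinations(sorted(group), 2):
            weight[(u, v)] += 1

    # Union-find over edges that meet the threshold.
    parent = {u: u for u in users}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for (u, v), w in weight.items():
        if w >= weight_threshold:
            parent[find(u)] = find(v)

    comps = defaultdict(set)
    for u in users:
        comps[find(u)].add(u)
    return list(comps.values())
```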

The detection algorithm works iteratively: it starts with a small threshold value to create a large component, and then recursively increases the threshold to extract connected sub-components (until the size of the connected components becomes smaller than some constant). The resulting output is a hierarchical tree. The algorithm then uses some statistical measurements (using histograms) to separate bot-user group components from their real counterparts (this part of the algorithm is questionable!). After the pruning stage, BotGraph goes through another phase of traversing the hierarchical tree to consolidate bot-user group information.
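A hedged sketch of the recursive extraction follows, reusing a component-finding helper like connected_components() above via a callback; the histogram-based pruning is omitted, and min_size/max_threshold are made-up knobs, not values from the paper.

```python
def build_hierarchy(users, components_at, threshold, min_size=10, max_threshold=10):
    """Recursively raise the edge-weight threshold to split a large component
    into tighter sub-components, producing a tree of candidate bot-user groups.

    components_at(users, t) must return the connected components of the
    user-user graph restricted to `users`, keeping only edges of weight >= t,
    e.g.:  lambda us, t: connected_components(us, logins, t)
    """
    node = {"threshold": threshold, "users": set(users), "children": []}
    if len(users) <= min_size or threshold >= max_threshold:
        return node                          # small enough (or weights exhausted): stop
    for comp in components_at(users, threshold + 1):
        if len(comp) > min_size:             # prune tiny components (likely normal users)
            node["children"].append(
                build_hierarchy(comp, components_at, threshold + 1,
                                min_size, max_threshold))
    return node
```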

The difficulty in applying BotGraph lies first in constructing the graph, which can contain millions of vertices, and then in processing that huge graph to actually run the detection algorithm. The authors propose two approaches: one using the currently popular data-parallel processing framework MapReduce, and an extension using selective filtering that adds two more interfaces beyond the basic Map and Reduce. The extended version turns out to be the more resource-efficient, scalable, and faster of the two.

Evaluation using the data-parallel systems shows that BotGraph can identify a large number of previously unknown bot-users and bot-user accounts with low false positive rates.

Critique

The whole concept seems to rest on the assumption that botnet operators are "stupid". The criteria used in the paper are pretty straightforward, and it shouldn't be too hard for botnet operators to figure them out and work around them.

The MapReduce part in the middle of the paper, which takes four pages, does not seem SO new; the authors could have left it out or at least shortened it and saved everyone some time. After all, MapReduce was created to handle terabytes of data; making it work for 500 GB should not be that awe-inspiring. To give the authors credit, they are most probably the first to use it in this context.

One more thing: I wish someone would publish what GMail does to filter spam, because I can bet they are pretty darn good at what they do and far ahead of other web mail providers.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

R. Gummadi, H. Balakrishnan, P. Maniatis, S. Ratnasamy, “Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks,” NSDI’09, (April 2009). [PDF]

Summary

In recent years, botnets have become the major originators of email spam, DDoS attacks, and click fraud on advertisement-based web sites. This paper argues that separating human-generated traffic from botnet-generated activity can improve the reliability of various web-based services against botnet attacks. However, identifying human-generated traffic in the absence of strong unique identities is challenging. In this paper, the authors propose NAB (Not-A-Bot), a system to approximately identify and certify human-generated activity in a non-intrusive way.

NAB Architecture

NAB consists of an attester, a small piece of trusted software that runs locally on a host (isolated from the untrusted OS) and generates attestations corresponding to requests from applications, and an external verifier that validates these attestations at the remote site. Four main requirements drive the design of the NAB architecture (attester and verifier):

  1. Attestations must be generated automatically in response to human requests.
  2. Attestations must not be transferable from the client on which they are generated to attest traffic originating from another client.
  3. NAB must benefit users that deploy it without hurting those that do not.
  4. NAB must preserve the existing privacy and anonymity semantics of applications.

Requirements 1 and 2 are met by the attester, and requirements 3 and 4 are ensured by the verifier.

Attester

The attester runs on a trusted computing base (TCB), which is implemented by taking advantage of the Trusted Platform Module (TPM) available in most modern systems. The authors use the TPM to create a trusted path between physical input devices and the human-activity attester.

The attester's sole purpose is to create attestations – when asked for by an application – for legitimate human activity. The authors use a simple t-δ attester, where an attestation is created if there has been any input activity in the last δ time units. Even though user activity can be forged/harvested in this simple approach, the authors argue that a botnet will be limited by the frequency of human activity, which reduces the rate of attacks.
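Roughly, the t-δ idea boils down to something like the following sketch; the HMAC is a stand-in for the TPM-backed signing the paper actually relies on, and the key, δ value, and function names are invented for illustration.

```python
import hashlib
import hmac
import time

DELTA_SECONDS = 1.0                 # illustrative value of the attester's δ
ATTESTER_KEY = b"tpm-sealed-key"    # stand-in for a key protected by the TPM

last_input_time = 0.0               # updated by the trusted input-device path

def on_keyboard_or_mouse_event():
    """Called from the trusted input path whenever the human acts."""
    global last_input_time
    last_input_time = time.time()

def request_attestation(responder: str, content_hash: bytes):
    """Return an attestation bound to (responder, content) only if there was
    human input within the last δ seconds; otherwise refuse."""
    if time.time() - last_input_time > DELTA_SECONDS:
        return None
    msg = b"|".join([responder.encode(), content_hash])
    tag = hmac.new(ATTESTER_KEY, msg, hashlib.sha256).hexdigest()
    return {"responder": responder, "content": content_hash, "tag": tag}
```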

NAB generates responder-specific, content-specific, and, where appropriate, challenger-specific attestations and employs existing cryptographic methods to secure them. It also ensures that attestations cannot be double-spent and cannot be misused by botnets (although botnets can harvest attestations within a very limited time window).

Verifier

The verifier is co-located with the server processing requests. When invoked, the verifier is passed both the attestation and the request. Based on this information (plus the cryptographic checks described in the paper), the verifier checks the validity of the attestation. The authors also discuss application-specific verification policies for spam, DDoS, and click fraud.
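A matching sketch of what the verifier might check, under the same toy assumptions as the attester sketch above (shared-key HMAC instead of the public-key verification NAB actually uses, and a simple seen-set standing in for double-spend prevention):

```python
import hashlib
import hmac

ATTESTER_KEY = b"tpm-sealed-key"    # toy shared key; NAB verifies a TPM-rooted signature
seen_tags = set()                   # crude double-spend prevention, illustrative only

def verify(attestation, responder: str, content_hash: bytes) -> bool:
    if attestation is None:
        return False                # unattested traffic: deprioritize, don't drop
    # Recompute the binding over the request the verifier actually received.
    msg = b"|".join([responder.encode(), content_hash])
    expected = hmac.new(ATTESTER_KEY, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, attestation["tag"]):
        return False                # attestation not bound to this request
    if attestation["tag"] in seen_tags:
        return False                # already spent
    seen_tags.add(attestation["tag"])
    return True
```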

Evaluation Results

Major results from the evaluation are:

  • The TCB can be really small (about 500 SLOC)
  • The attester's CPU cost is roughly 10^7 instructions per attestation
  • For simple applications, fewer than 250 SLOC of changes are enough to make them NAB-enabled
  • In the worst case, NAB can suppress 92% of spam, 89% of non-human (possibly DDoS) activity, and 87% of automated clicks without false positives
  • The verifier can withstand a DDoS attack by 100,000 bots and can handle more than 1,000 requests/second

Discussion

The authors have done a good job of discussing several alternatives in multiple cases instead of sticking to their chosen one, which provides insights into botnet behavior and explains why some solutions might not work. Their solution still has some loopholes, but it will probably work to some extent. The requirement of changing all the applications could be a turn-off; however, the fact that NAB does not discriminate against unattested traffic makes that problem go away and allows incremental deployment.

One problem that I can think of right now is that the applications are not in the TCB, and the requests for attestations do not seem to go through a trusted channel. What are the fallback options if a malicious entity corrupts or removes attestations just after they leave the trusted base of the attester? Also, the verifiers do not seem to be in a TCB. What if one is compromised?

Scalable Application Layer Multicast

S. Banerjee, B. Bhattacharjee, C. Kommareddy, “Scalable Application Layer Multicast,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Deployment of network layer multicast requires support from the underlying infrastructure. Like many other proposals/protocols with similar requirements, it faced the same fate: no one deployed it. This paper presents a scalable application layer multicast protocol, NICE, designed for low-bandwidth data streaming applications with large receiver sets. Data packets in NICE are replicated at end hosts, which makes it technically an overlay, and so it does not need support from the network providers.

The NICE protocol arranges the set of end hosts into a hierarchy and maintains that hierarchy through different events. The members at the very top of the hierarchy maintain (soft) state about O(log N) other members. Logically, each member keeps detailed state about members that are near it in the hierarchy (nearness is computed from end-to-end latency) and only limited knowledge about the rest of the group. Hosts in each layer are partitioned into clusters of size between k and 3k-1, where k is a constant. Each cluster has a leader, the graph-theoretic center of the cluster. The leaders of the clusters in layer Li are the members of layer Li+1, and so on. The following properties hold for the distribution of hosts across layers:

  • A host belongs to a single cluster at any layer.
  • If a host is present in layer Li, it must be the cluster leader in each of the lower layers (Li-1…L0).
  • If a host is not present in layer Li, it cannot be present in any higher layers (Li+1…).
  • Each cluster is size bounded between k and 3k-1.
  • There are at most log_k N layers, and the highest layer has only a single member.

There are separate control and data planes in NICE. The neighbors on the control topology exchange periodic soft state refreshes and form a clique in each layer. The data topology in each layer is a source-specific tree.

NICE assumes a special host, the rendezvous point (RP), which is known a priori to all members. When a new host wants to join a multicast group, it contacts the RP. The RP responds by returning the members of the highest layer. The joining node then contacts the closest member, which refers it to the members of its cluster in the next lower layer (the closest member is present there because it is a leader in that lower layer). This process continues iteratively until the joining node finds its appropriate position in the lowest layer. The message overhead of the join process is O(k log N) query-response pairs. When a node leaves, it removes itself from all layers and new leaders are chosen. Periodic heartbeat messages keep the state information fresh and refine (split/merge) the hierarchy when necessary.
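The join procedure is easy to picture as a descent through the layers; here is a toy sketch (not the paper's pseudocode), where lower_cluster() stands in for the query/response exchange with a cluster leader and rtt() fakes a latency probe.

```python
import random

def rtt(a, b):
    """Stand-in for an end-to-end latency probe between hosts a and b."""
    random.seed(hash((min(a, b), max(a, b))))
    return random.uniform(10, 200)      # milliseconds, purely illustrative

def nice_join(new_host, top_layer_members, lower_cluster):
    """Toy sketch of the NICE join: starting from the members of the highest
    layer (obtained from the RP), repeatedly probe the candidates, pick the
    closest one, and descend into its cluster one layer below, until the
    lowest layer is reached.

    lower_cluster(member) stands in for asking a cluster leader for the
    members of its cluster in the next lower layer; it returns [] once
    there is no layer below.
    """
    candidates = list(top_layer_members)
    while True:
        closest = min(candidates, key=lambda m: rtt(new_host, m))
        below = lower_cluster(closest)
        if not below:
            return candidates           # the layer-0 cluster the new host joins
        candidates = below              # continue the descent one layer down
```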

The authors compare the performance of NICE against the NARADA protocol (possibly the best available at the time) along the following dimensions: quality of the data path (using stretch and stress measures), recovery from host failure, and control traffic overhead. NICE is similar to NARADA in terms of data-path stretch and failure recovery, but it imposes lower stress on routers and links. The most important result is that NICE has orders of magnitude lower control overhead for groups of size > 32, which makes it much more scalable. In addition to simulations, the paper also presents some smaller-scale wide-area experiments whose results are consistent with the simulations.

Critique

The idea presented in this paper is neat and straightforward, but at the same time obvious. By creating hierarchies a lot of problems can be solved, but such solutions come at the expense of fault tolerance. The authors simply assume that the RP won't fail, or that if it fails it will recover quickly, but they avoid the details of what might happen if things go wrong. In short, there is no clear notion of reliability.

NICE is just a glorified overlay. The reasons why application layer multicast should work have also turned out to be not-so-correct. Network operators do not like protocols that try to bypass them and mess with their traffic engineering; eventually, they end up blocking or throttling those protocols. The same might happen to NICE, if anyone were using it. The paper presents no evidence that discusses this aspect of NICE-underlay interaction.

Scalable Reliable Multicast

S. Floyd, V. Jacobson, S. McCanne, C-G Liu, L. Zhang, “A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing,” ACM SIGCOMM Conference, (August 1995). [PDF]

Summary

Unlike unicast transmission, where the requirements for reliable, sequenced data delivery are fairly general, different multicast applications have different requirements for reliability. The authors argue that designing a one-size-fits-all multicast protocol – one that simultaneously meets the functionality, scalability, and efficiency requirements of all applications – is not possible. Instead, they propose Scalable Reliable Multicast (SRM), a framework for supporting diverse multicast requirements in different applications by integrating application semantics (similar to Application Level Framing (ALF)) and session semantics (similar to Light-Weight Sessions (LWS)).

In order to ensure reliability, some party must take responsibility for loss detection and recovery. In TCP, the sender is that responsible party. In a multicast scenario, however, the sender would face an ACK implosion from all the receivers in the multicast group. In addition, the sender would have to keep track of the changing set of active receivers and their corresponding windows; RTT estimation algorithms as well as concepts like the congestion window also require rethinking. The authors argue that receiver-based reliability, where responsibility is distributed among the many receivers instead of being piled on the single sender, is a better fit for multicast. Another design choice is to name data in application data units (ADUs) instead of using generic protocol-specific naming; ADUs keep application semantics intact and provide a level of indirection that can be exploited for anonymity purposes.

SRM is discussed using a distributed whiteboard (wb) application. The whiteboard separates drawings into pages; each member can create a page, and any member can manipulate any page. Each member, as well as each page, has a globally unique identifier. Names and identifiers are persistent and refer to the same entity over time. The authors also assume the presence of IP multicast datagram delivery and do not separate senders from receivers. Also, wb has no requirement for ordered data delivery because its operations are idempotent and can be performed in any order.

Whenever wb generates new data, it is multicast to the group. Each member of the group individually keeps track of its losses (detected by gaps in sequence numbers) and retransmissions. Each member sends low-rate, periodic session messages to the group announcing the highest sequence number it has received; the same messages also carry timestamps that are used to estimate distances between members. When a receiver detects missing data, it waits a random amount of time (using a request timer), a function of its distance from the source, before sending a repair request. Repair requests and retransmissions are multicast to the whole group, which suppresses repair-request implosion (from other receivers that also missed the same data). If a host receives a request for the same data before its own request timer goes off, it does an exponential backoff and resets its request timer. Any host with the requested data can reply, but it too waits a random amount of time (using a repair timer) based on its distance from the requester; as before, this prevents response implosion. If a host receives the repair before its repair timer goes off, it cancels the timer.
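The timer logic is the heart of the scheme. The sketch below follows the paper's description of uniformly chosen request/repair timer intervals scaled by distance; the constants C1, C2, D1, D2 are set to arbitrary illustrative values rather than the paper's tuned ones.

```python
import random

# Illustrative constants; the paper derives and adaptively tunes these.
C1, C2 = 2.0, 2.0   # request-timer parameters
D1, D2 = 1.0, 1.0   # repair-timer parameters

def request_timer(distance_to_source):
    """Back-off before multicasting a repair request: uniform in
    [C1*d, (C1+C2)*d], so closer receivers tend to ask first and their
    multicast request suppresses everyone else's duplicate requests."""
    d = distance_to_source
    return random.uniform(C1 * d, (C1 + C2) * d)

def repair_timer(distance_to_requester):
    """Back-off before answering a repair request, again scaled by distance
    so that a nearby holder of the data tends to answer first."""
    d = distance_to_requester
    return random.uniform(D1 * d, (D1 + D2) * d)
```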

The authors analyze the framework using three example topologies: chains, which illustrate deterministic suppression; stars, which explore probabilistic suppression; and bounded-degree trees, which combine both suppression mechanisms. Based on this analysis, they come up with parameters and observations that are best described in the paper. In short, by tuning those parameters the framework can be adapted to different application requirements.

Finally, SRM is extended to support adaptive adjustment of random timer algorithms based on the past behavior of request/repair algorithms. Simulations show that the adaptive algorithms quickly reduce the average number of repairs with little penalty in additional delay, and they are effective in controlling the number of duplicates over a range of scenarios.

Comments

The ideas presented in the paper are simple but intuitively effective. The best part is the analysis of the framework using three different example topologies. The biggest complaint about this paper, as well as about some others from the same time period, is the complete absence of axes or legends in the graphs, which is very annoying for readers. Another observation is the extensive use of simulations.

There seems to be a lot of similarities between SRM and opportunistic routing protocols in the wireless networks. I wonder whether and which aspects of the opportunistic protocols can be used in a multicast scenario similar to SRM or vice versa.

X-Trace: A Pervasive Network Tracing Framework

R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, “X-Trace: A Pervasive Network Tracing Framework,” NSDI’07, (April 2007). [PDF]

Summary

Diagnosing problems in modern distributed and networked systems is notoriously hard due to the involvement of multiple administrative domains, application and network protocols, and the lack of determinism and reproducibility. X-Trace is a cross-layer, cross-application tracing framework designed to reconstruct the behavior and workflow of a system, with the aim of understanding the causal paths in network protocols.

Design Principles

X-Trace is guided by the following three design principles:

  1. The trace request is sent in-band to capture the actual condition of the datapath.
  2. The collected trace data is sent out-of-band to avoid the network failures and other errors that the trace is actually trying to report.
  3. The entity that requests tracing is decoupled from the entities that receive the trace data, to preserve the privacy of different administrative domains (ADs) on the datapath and to give them the flexibility to decide whether or not to share tracing information with other domains.

Basic Workflow

A user or operator invokes X-Trace (transparently or not) when initiating an application task by inserting X-Trace metadata, with a unique task identifier, into the resulting request. The X-Trace metadata is replicated into each layer and contains the task identifier, an optional TreeInfo field (ParentID, OpID, EdgeType) used later to reconstruct the overall task tree, and another optional destination field. This metadata is propagated down the network stack (from the application to the network layer; the link layer can also be supported using a shim header) using the pushDown() primitive, and recursively among the subsequent requests on other nodes resulting from the original task using the pushNext() primitive. Eventually, all nodes and layers relevant to a task carry X-Trace tracing information.
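Conceptually, the metadata and its two propagation primitives can be sketched as follows. This is a toy model, not the actual X-Trace library; the field and function names follow the paper's description, everything else is illustrative.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class XTraceMetadata:
    task_id: str      # shared by every operation belonging to the same task
    parent_id: str    # TreeInfo: the operation that caused this one
    op_id: str        # TreeInfo: this operation's own id
    edge_type: str    # TreeInfo: "DOWN" (lower layer) or "NEXT" (next hop)

def new_op_id() -> str:
    return os.urandom(4).hex()

def push_down(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the layer below on the same node:
    same task, current operation becomes the parent."""
    return XTraceMetadata(md.task_id, md.op_id, new_op_id(), "DOWN")

def push_next(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate metadata to the next hop / recursive request at the same
    layer: again, same task, current operation becomes the parent."""
    return XTraceMetadata(md.task_id, md.op_id, new_op_id(), "NEXT")
```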

When a node sees X-Trace metadata in a message at a particular layer, it generates a report, which is later used to reconstruct the datapath. Reports are identified by the relevant TaskID and stored in a local database within an AD; as a result, the traces for each segment of the datapath are stored in the respective ADs. To build the complete end-to-end trace and the resulting task tree from all the reports, one must be able to collect all the trace segments.

Use Cases

This paper presents experiences of using X-Trace in three different scenarios: a simple web request and its accompanying recursive DNS queries, a web hosting site, and an overlay network (i3). In all three cases the authors show that X-Trace can reconstruct the actual data, execution, and propagation paths and represent them in a task tree (when there are no errors). Whenever something goes wrong – at any layer of any node – X-Trace can pinpoint it. The system is more precise than others because X-Trace works with individual components at much deeper levels and thus has more information for deciding what went wrong, unlike its counterparts, which work on larger components and fail to diagnose fine-grained details.

Discussion

The overall idea of the paper is pretty straightforward, but implementing and executing it is a complicated and huge undertaking. The authors have done a great job of building such a large framework and making it work.

However, there are many concerns regarding its deployability, complexity, performance, overhead, security, and privacy. The good thing is that the authors are aware of these problems; they discuss each of them in the paper and propose workarounds for some, giving the impression that X-Trace might fall into the category of "eventually deployable" instead of "trophy project".

NetFPGA: A Tool for Network Research and Education

G. Watson, N. McKeown, M. Casado, “NetFPGA: A Tool for Network Research and Education,” 2nd Workshop on Architecture Research using FPGA Platforms (WARFP), (February, 2006). [PDF]

Summary

Building custom hardware for conducting network experiments, as well as for educational purposes, is complex and expensive. At the other end of the spectrum, software routers like Click provide an easy development interface but fail to achieve hardware-comparable performance. NetFPGA takes the middle ground by providing an easily programmable hardware platform for developing network devices (e.g., NICs, switches, routers) with reasonable performance and flexibility.

User interaction with NetFPGA was initially designed to be remote; users would design, simulate, and debug their intended hardware logic in Verilog, upload it through a simple web interface to configure FPGAs on remote devices, and then download and observe the results. The initial NetFPGA boards did not have an on-board CPU; controller software ran on remote computers connected through the Internet, and all communication with the board was via its Ethernet ports. NetFPGA was later redesigned as a PCI board and an on-board CPU was added.

The authors also mention the use of NetFPGA in designing RCP-capable routers and in implementing shunting to enable a high-speed layer-7 intrusion detection and prevention system (IDS), by providing a controllable bridge that the IDS can use to manage traffic.

A Policy-aware Switching Layer for Data Centers

D. Joseph, A. Tavakoli, I. Stoica, “A Policy-Aware Switching Layer for Data Centers,” ACM SIGCOMM Conference, (August 2008). [PDF]

Summary

Middlebox (e.g., firewalls, load balancers) management in existing data center networks is inflexible and requires manual configuration by system administrators, which results in frequent misconfigurations. This paper presents a policy-aware switching layer (PLayer) at layer 2 that addresses the hardships of data center middlebox deployment using two design principles – "separate policy from reachability" and "take middleboxes off the physical paths" – to provide correctness of traffic flow, flexibility of middlebox placement/configuration, and efficiency by minimizing unnecessary middlebox traversals.

Basic Workflow

The PLayer consists of enhanced layer-2 switches (pswitches). Unmodified middleboxes are plugged into a pswitch in the same way servers are plugged into regular switches. Pswitches forward frames according to policies specified by the network administrator. Policies define the sequence of middleboxes to be traversed by different traffic and take the form [Start Location, Traffic Selector]→Sequence, where the left side defines the applicable traffic (frames arriving from the Start Location whose 5-tuples match the Traffic Selector) and the right side specifies the sequence of middlebox types to be traversed. Note that policies are expressed in terms of middlebox types rather than individual instances; instance selection is handled separately, as described below.

Policies are centrally stored in a policy controller and, after a static analysis that discards detectably buggy policies, automatically converted into rules. Rules are stored in pswitches' rule tables and take the form [Previous Hop, Traffic Selector] : Next Hop. A centralized middlebox controller also monitors middlebox liveness and informs pswitches about middlebox churn. Whenever a pswitch receives a frame, it looks up the appropriate rule in its rule table and forwards the frame to the specified Next Hop. Pswitches use a flow-direction-agnostic consistent hash of a frame's 5-tuple to select the same middlebox instance for all frames in both the forward and reverse directions of a flow.
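The direction-agnostic instance selection is easy to sketch. The paper uses consistent hashing; the toy version below uses rendezvous (highest-random-weight) hashing instead, which has the same property that both directions of a flow map to the same live instance. Instance names are plain strings here purely for illustration.

```python
import hashlib

def canonical_5tuple(src_ip, src_port, dst_ip, dst_port, proto):
    """Order the endpoints so a flow and its reverse produce the same key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    return (lo, hi, proto)

def pick_instance(five_tuple, instances):
    """Select a middlebox instance of the required type for this flow.
    Rendezvous hashing: each live instance gets a deterministic score for the
    flow; the highest-scoring one wins, in both flow directions."""
    key = repr(canonical_5tuple(*five_tuple)).encode()
    return max(instances,
               key=lambda inst: hashlib.sha1(key + inst.encode()).hexdigest())

# Example: both directions of the same flow pick the same firewall instance.
fw_instances = ["fw-1", "fw-2", "fw-3"]
fwd = ("10.0.0.1", 34567, "10.0.1.2", 80, "tcp")
rev = ("10.0.1.2", 80, "10.0.0.1", 34567, "tcp")
assert pick_instance(fwd, fw_instances) == pick_instance(rev, fw_instances)
```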

Implementation Details

  • Forwarding infrastructure: A pswitch consists of two independent, decoupled parts: the Switch Core, which does the forwarding, and the Policy Core, which redirects frames to the middleboxes dictated by policies. The Policy Core itself consists of multiple modules: RuleTable, InP, OutP, and FailDetect. The Switch Core appears as a regular Ethernet switch to the Policy Core, while the Policy Core appears as a multi-interface device to the Switch Core. Frames redirected by the Policy Core are encapsulated inside an Ethernet-II frame to preserve the original MAC address, which is required for correctness.
  • Supporting unmodified middleboxes: To support middleboxes without any modification, pswitches
    1. ensure that only relevant frames in standard Ethernet format reach middleboxes and servers,
    2. use non-intrusive techniques (e.g., stateless in-line device in the worst case) to identify the previous hop, and
    3. support a wide variety of middlebox addressing requirements.
  • Supporting non-transparent middleboxes/stateful pswitches: To support middleboxes that change frame headers or content (e.g., load balancers), the concept of per-segment policies is introduced. Middlebox instance selection is supported by consistent hashing and policy hints, and requires stateful pswitches in more complex scenarios. A stateful pswitch keeps track of forward and reverse flows so that the same middlebox instances are selected in both directions even if frame headers are modified.
  • Churn behavior: The PLayer can face different kinds of churn: network, policy, and middlebox. It employs a 2-stage, versioned policy and middlebox information dissemination mechanism, intermediate middlebox types, and stateful pswitches to handle the different scenarios. The authors show that, no matter what, policies are never violated.

Performance Evaluation and Analysis

PLayer was evaluated on the DETER testbed using a handful of regular machines, and the evaluation mostly focuses on showing that it works. However, such a small-scale experiment fails to capture its effectiveness in large data center scenarios. Microbenchmarks show an 18% decrease in TCP throughput and a 16% increase in latency for basic pswitches, and a 60% drop in throughput and a 100% increase in latency with middlebox offloading. Most of the time is spent on rule lookup and frame encapsulation/decapsulation.

The authors have also provided a semi-formal analysis of PLayer to show guaranteed correctness and consistency at the expense of availability.

Critique

While interesting, this paper suffers from severe performance setbacks; in particular, the high increase in latency will most likely make deployment impossible. Even though the authors argue that the latency introduced by offloading inside data centers would be negligible, without real data it is hard to rely on that assurance.

As the authors point out, there are several limitations in terms of indirect paths, policy specification mechanisms, effects of packet misclassification, incorrect physical wiring of middleboxes, unsupported policies, and more complex switches. However, the prospect of manageability gains is promising enough to consider those challenges in the future.

Several special cases rely on stateful pswitches, which are not very practical at least in the existing networking landscape.

One last thing: there are a lot of similarities between PLayer and OpenFlow from a high-level view. I wonder whether OpenFlow (which appeared after PLayer) was influenced by it, and how much.

Internet Indirection Infrastructure

I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, “Internet Indirection Infrastructure,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Internet Indirection Infrastructure (i3) is an overlay network that offers a rendezvous-based communication abstraction to decouple the act of sending from the act of receiving in order to support diverse communication services (e.g., multicast, anycast, mobility). Instead of sending packets directly to their destinations over the IP network, senders in i3 send packets to intermediate i3 nodes (using logical identifiers); receivers interested in receiving packets addressed to particular identifiers must explicitly express their interest by inserting triggers. In their simplest form, packets are pairs (id, data), where id is an m-bit identifier and data is the payload, and triggers are pairs (id, addr), where id is the trigger identifier and addr is the receiver's address.

Extensions to the Basic Model

  1. Inexact identifier matching: Instead of exact-matching all m bits to find the appropriate triggers for a packet, an exact-match threshold k (k < m) is defined; a match exists when a trigger identifier matches a packet identifier in at least its first k bits and is also the longest prefix match. By choosing and exploiting k, i3 provides the flexibility to support anycast, locality- and capacity-aware i3 server selection, and hot-spot avoidance. To facilitate inexact matching, all ids agreeing in their first k bits must reside on the same server (a small matching sketch follows this list).
  2. Stack of identifiers: Instead of using scalar identifiers, the authors propose using identifier stacks, which are vectors of identifiers or addresses. In this case, packets carry (idstack, data) and triggers are represented by (id, idstack). This generalized form allows a source to send a packet to a series of identifiers (similar to source routing). Similarly, a trigger can send a packet to another intermediate identifier instead of directly sending it to the receiver. A packet is forwarded over the i3 overlay until one of its identifiers in the idstack matches a trigger id; in that case that matching identifier is replaced by the trigger’s idstack and the process goes on. Unmatched identifiers are popped off without any change to the packet’s idstack, and if no trigger can be found for a packet, it is eventually dropped.
  3. Private triggers: i3 makes a distinction between public triggers, which are known to everyone for end-to-end communication, and private triggers, which are temporary triggers between two end points used for secure communication. Private triggers are established using public triggers.
  4. Potpourri: In addition to the aforementioned extensions/generalizations, the authors also propose a hierarchy of triggers to support large-scale multicast, backup triggers for robustness, and caching and randomization of trigger identifiers for routing efficiency.
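To make the inexact matching from point 1 concrete, here is a small sketch of longest-prefix matching with an exact-match threshold k; the identifiers are plain integers, and m = 256, k = 128 are used purely as illustrative values.

```python
def common_prefix_len(a: int, b: int, m: int = 256) -> int:
    """Length of the common bit prefix of two m-bit identifiers."""
    diff = a ^ b
    return m if diff == 0 else m - diff.bit_length()

def match_trigger(packet_id: int, triggers, m: int = 256, k: int = 128):
    """Return the (trigger_id, addr_or_stack) pair whose id shares the longest
    prefix with packet_id, provided at least the first k bits match; else None."""
    best, best_len = None, -1
    for trig_id, addr_or_stack in triggers:
        p = common_prefix_len(packet_id, trig_id, m)
        if p >= k and p > best_len:
            best, best_len = (trig_id, addr_or_stack), p
    return best
```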

Implementation Issues

The performance of i3 depends on the underlying overlay network. The authors implemented i3 on top of Chord for its robustness, scalability, lookup efficiency, stability, and self-organization. Each i3 node/server is assigned a unique identifier in the Chord ring, and Chord lookups are used to find the identifiers on a packet's identifier stack while forwarding it. How to efficiently perform exact or inexact trigger matching is not very well defined.

The authors spend some time discussing the security concerns of using i3 (e.g., eavesdropping, trigger hijacking, DoS attacks). The solutions are mostly simple in theory but are expected to lower performance.

Evaluation

Latency stretch (the ratio of inter-node latency using i3 to inter-node latency using IP) is the main performance metric used in evaluating i3. Simulations show that i3 does not perform as well as plain IP due to the random placement of servers; caching and proximity-aware placement heuristics significantly improve that performance. The overheads of trigger insertion and packet forwarding are found to be manageable, albeit not too impressive.

Critique

The simple abstraction presented in this paper manages to generalize multiple communication patterns that plague the current Internet. In addition, i3 is a great hash of many existing ideas, which IMO is a good thing for any systems or networking paper. However, i3's performance and security aspects are not so simple, and even with the proposed increasingly complex tricks they might still fall short in real-life scenarios.

The evaluation section is probably the weakest part of the paper. A comparison against basic IP multicast is a big missing piece. In addition, much of the clever stuff was not or could not be implemented/simulated, which fails to remove doubts about its applicability.

DNS Performance and the Effectiveness of Caching

J. Jung, E. Sit, H. Balakrishnan, “DNS Performance and the Effectiveness of Caching,” IEEE/ACM Transactions on Networking, V. 10, N. 5, (October 2002). [PDF]

Summary

Hierarchical design and aggressive caching are considered the two main reasons behind the scalability of DNS. Both factors seek to reduce the load on the root servers at the top of the name space hierarchy; in addition, caching tries to limit client-perceived delays and wide-area network bandwidth usage. This paper analyzes three network traces collected from two different sites during different time periods to study the effectiveness of these two factors.

The Data

This study is based on three separate traces (two from MIT and one from KAIST). The authors collected both the DNS traffic and its driving workload, i.e., outgoing DNS queries and the corresponding incoming answers (within a 60 s window) as well as the SYN, FIN, and RST packets of outgoing TCP connections. DNS lookups related to outgoing non-TCP traffic and lookups triggered by incoming TCP traffic are excluded; they account for about 5% and 10% of overall lookups, respectively, and are considered negligible.

Observations

By analyzing the collected data and performing trace-based simulations, the authors made the following observations:

  • The distribution of name popularity follows Zipf's law, meaning a small fraction of the names are queried a large fraction of the time. As a result, popular names can be cached effectively even with small TTL values, whereas unpopular names gain little from caching even with large TTL values.
  • Sharing a cache among a group of clients yields limited gains in hit rate once the group size exceeds roughly 20-25 clients; this, too, follows from the Zipfian popularity distribution and the TTLs (see the toy simulation after this list).
  • Client-perceived latency is adversely affected by the number of referrals, and caching NS records to reduce the number of referrals would decrease latency as well as load on the root servers.
  • The distribution of names causing negative responses is also heavy-tailed; as a result, the hit rate of negative caching is limited as well.
  • Many DNS servers retry too persistently and overload the network: 63% of the traced DNS query packets were generated by lookups that obtained no answer.
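The cache-sharing observation is easy to illustrate in spirit with a toy simulation (mine, not the authors' trace-driven one): with Zipf-distributed lookups and an idealized shared cache (infinite capacity, TTLs ignored), the marginal hit-rate gain from adding clients flattens quickly. All parameters below are made up for illustration.

```python
import random

def zipf_sample(n_names, n_lookups, alpha=1.0):
    """Draw lookups from a Zipf-like popularity distribution over n_names."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_names + 1)]
    return random.choices(range(n_names), weights=weights, k=n_lookups)

def shared_cache_hit_rate(group_size, lookups_per_client=1000, n_names=10000):
    """Hit rate of one shared cache serving `group_size` clients:
    a lookup hits if any client in the group has asked for that name before."""
    cache, hits, total = set(), 0, 0
    for _ in range(group_size):
        for name in zipf_sample(n_names, lookups_per_client):
            total += 1
            if name in cache:
                hits += 1
            else:
                cache.add(name)
    return hits / total

# The marginal benefit of adding clients to the shared cache flattens quickly.
for g in (1, 5, 10, 25, 50):
    print(g, round(shared_cache_hit_rate(g), 3))
```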

Critique

The authors have done a great job of collecting and analyzing a large amount of data to provide insights that run counter to common beliefs. However, there are some issues with the collected data. It is not clear why they completely ignored UDP traffic. Several observations are based on how things changed over the 12 months between the two MIT traces, which effectively draws conclusions from just two data points. Another issue is that they saw something and explained it in a way that supports conclusions counter-intuitive to previous observations by others; it is not clear whether they would have seen the same thing had they used other people's data. In short, the collected data might not be varied enough to draw conclusions about the whole Internet.

Development of the Domain Name System

P. Mockapetris, K. Dunlap, “Development of the Domain Name System,” ACM SIGCOMM Conference, 1988. [ACM]

Summary

The Domain Name System (DNS) was first proposed (circa 1982) to address the problems of the then-existing HOSTS.TXT-based host-name-to-address translation mechanism, which suffered from scalability and distribution issues along with other problems common to any centralized system. This paper (circa 1988) gives an overview of the experience of using DNS since its inception.

DNS Design

The DNS design was motivated by several requirements: scalability, performance, decentralization, interoperability, and support for various topologies and OSes. Based on these goals, DNS was designed as a hierarchical name space backed by a distributed, lean database that primarily maps names to addresses and performs a few other mapping tasks. The internal name space is a variable-depth tree in which each node has an associated label; the domain name of a node is the concatenation of all the labels on the path from that node to the root. Data for each name in DNS is organized as a set of resource records (RRs), each carrying a pre-defined type and class field followed by application data.

The DNS tree is divided into zones controlled by different organizations. Each zone can be a contiguous section of the tree, though it is typically a simple subtree. Administrative authority flows hierarchically from top to bottom among zones. The active components of DNS are name servers, which store information and answer queries, and resolvers, which interface with client programs and locate the appropriate name servers. A particular name server can support any number of zones, which need not be contiguous, and the name server need not be part of the zones it serves. Answers originating from zone data are marked as authoritative. DNS resolvers and combined name server/resolver programs also cache responses for later use; each RR contains a TTL field that controls how long it may be cached.
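A minimal sketch of the two pieces described above, a resource record and a TTL-respecting resolver cache; the types and fields are simplified for illustration and do not mirror any particular implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class ResourceRecord:
    name: str        # e.g. "example.com."
    rr_type: str     # e.g. "A", "MX", "NS"
    rr_class: str    # almost always "IN"
    ttl: int         # seconds this record may be cached
    rdata: str       # application data, e.g. an IP address

class ResolverCache:
    """A resolver-side cache keyed by (name, type); entries expire after TTL."""
    def __init__(self):
        self._entries = {}

    def put(self, rr: ResourceRecord):
        self._entries[(rr.name, rr.rr_type)] = (rr, time.time() + rr.ttl)

    def get(self, name, rr_type):
        entry = self._entries.get((name, rr_type))
        if entry is None:
            return None
        rr, expires_at = entry
        if time.time() > expires_at:
            del self._entries[(name, rr_type)]   # expired: must re-query
            return None
        return rr
```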

Experiences

  • At the time of writing this paper, most queries could be classified into four categories: all information (25-40%), host name to address mappings (30-40%), address to host mappings (10-15%), and MX lookups (10%).
  • Response times experienced by clients varied from 500ms to 5s.
  • The authors also observed that negative responses made up around 10-50% of responses and hence should be cached just like positive responses.
  • TTLs are often misused and misinterpreted by system administrators.

Comments

This paper discusses the basics of DNS and some general system design/development principles. The most important thing to notice is the lack of concern about security (similar to many other papers from that era). Otherwise, despite many problems, DNS has been sort of successful: it has managed to survive the boom of Internet sites and users and is still working.