All posts by Mosharaf

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

S. Nedevschi, J. Chandrashekar, J. Liu, B. Nordman, S. Ratnasamy, N. Taft, “Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems,” NSDI’09, (April 2009). [PDF]

Summary

This paper argues that putting networked end-systems into low-power sleep modes, instead of keeping them in higher-power idle states, can result in significant energy savings. However, a sleeping device loses its network presence and cannot run scheduled tasks. Two main approaches address these issues: first, Wake-on-LAN (WoL) mechanisms wake machines up on the arrival of specific packets; and second, a proxy handles packets on behalf of the sleeping machine unless it is absolutely necessary to wake it up. The authors study the pros and cons of the two approaches based on data collected from 250 enterprise machines and present a proxy architecture with a narrow API to support different idle states.

From their measurement data, the authors found that the machines under study are active for only 10% of the time and idle for about 50% of the time on average; only a small fraction are ever put to sleep. This observation suggests an enormous opportunity for saving energy by exploiting the sleep states. The authors also notice that there is always a steady flow of packets, which makes simple wake-on-packet (WoP) schemes infeasible.

In the proxy-based solution, packets destined for a sleeping host are intercepted by its proxy. The proxy decides whether to ignore/drop a packet, respond on the host’s behalf, or wake up the machine it is representing. To make judicious decisions, the authors identified different classes of packets (e.g., incoming/outgoing, unicast/broadcast/multicast) and found that broadcast and multicast traffic are largely responsible for poor sleep (handling them would allow 80% more sleep time in home environments and 50% more in office environments). They deconstructed the different classes of traffic and ended up with some ground rules about what a proxy should do when it sees different types of packets.

Idle-time traffic is differentiated along two dimensions. The first classifies traffic based on the need to proxy the protocol in question into three categories: don’t-wake protocols, don’t-ignore protocols, and policy-dependent protocols. The second captures the complexity of decision making: ignorable (can be dropped), handled via a mechanical response, or requiring specialized processing. The end result is a firewall-like architecture in which the power-proxy table consists of a list of rules, each rule being a <trigger, action, timeout> tuple. For incoming packets, the proxy matches the rules and performs the necessary actions. The authors also built a simple Click-based prototype of their proposed solution.
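
To make the firewall-like matching concrete, here is a minimal Python sketch of a power-proxy rule table built from <trigger, action, timeout> tuples; the action names, packet fields, and example rules are illustrative assumptions on my part, not the paper’s actual rule set or API.

```python
import time

# Illustrative actions a power proxy might take; the names are assumptions,
# not the paper's actual API.
DROP, RESPOND, WAKE = "drop", "respond", "wake"

class Rule:
    def __init__(self, trigger, action, timeout=None):
        self.trigger = trigger      # predicate over a packet (dict of header fields)
        self.action = action        # DROP, RESPOND, or WAKE
        self.expires = time.time() + timeout if timeout else None

    def matches(self, pkt):
        live = self.expires is None or time.time() < self.expires
        return live and self.trigger(pkt)

class PowerProxy:
    def __init__(self, rules, default=WAKE):
        self.rules = rules          # ordered, firewall-style: first match wins
        self.default = default      # conservatively wake the host for unknown traffic

    def handle(self, pkt):
        for rule in self.rules:
            if rule.matches(pkt):
                return rule.action
        return self.default

# Example rules loosely mirroring the paper's traffic classes (values made up):
rules = [
    Rule(lambda p: p.get("proto") == "ARP", RESPOND),          # mechanical response
    Rule(lambda p: p.get("dst_type") == "multicast", DROP),    # don't-wake traffic
    Rule(lambda p: p.get("dst_port") == 22, WAKE),             # incoming SSH: wake host
]
proxy = PowerProxy(rules)
print(proxy.handle({"proto": "ARP"}))           # -> "respond"
print(proxy.handle({"dst_port": 22}))           # -> "wake"
```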

Comments

This is yet another very good measurement paper. Especially, the study of various types of traffic in different network environments is an eye-opener.

The solution presented in the paper is very similar to a firewall; an intelligent guess would be that a lot of the firewall literature might come in handy in solving the complexities of rule matching.

Cutting the Electric Bill for Internet-Scale Systems

A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, B. Maggs, “Cutting the Electric Bill for Internet-Scale Systems,” ACM SIGCOMM Conference, (August 2009). [PDF]

Summary

Large organizations like Google, Microsoft, and Yahoo! consume tens of millions of dollars worth of electricity annually. The traditional approach of reducing energy costs by reducing energy consumption has not been very successful. The authors of this paper propose a new method based on two key observations: first, electricity prices exhibit both temporal and geographical variation; and second, large distributed systems already incorporate request routing and replication. Based on these observations, they posit that cost-aware routing of computation to low-price electricity zones can save a significant amount in energy expenses per annum without increasing other costs (e.g., bandwidth). It should be noted that the paper deals with reducing energy cost, not energy consumption.

Electricity is generally produced far from where it is consumed and then transmitted to consumer locations. The whole process is regulated by Regional Transmission Organizations, or RTOs (there are eight reliability regions in the US). RTOs use auction mechanisms to match buyers with sellers in multiple parallel wholesale markets, such as day-ahead, hour-ahead, and real-time markets. This paper is concerned only with the real-time market and depends on its variation over time and location. Through an empirical study of average prices from January 2006 to April 2009, the authors found significant uncorrelated variation of prices across time, geographic locations, and wholesale markets. These variations are due to differences in the sources of electricity and to time differences between regions (peak hours occur at different absolute times).

Routing computations to different, possibly farther, geographical locations can increase client-perceived latency and bandwidth cost. Based on their study of Akamai data, the authors found that for Akamai-like large corporations such problems are manageable, i.e., they do have some impact but the total operating cost still decreases. In the end, they provide a simple strategy to reduce energy costs by moving computations to the minimum-priced location within a certain radius of the original location.
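
A minimal sketch of this strategy might look like the following; the site names, coordinates, and prices are made-up placeholders, and a real request router would also have to respect latency and bandwidth constraints.

```python
import math

# Hypothetical replicas: (name, latitude, longitude, current real-time price in $/MWh).
SITES = [
    ("virginia",   38.9,  -77.0, 41.0),
    ("chicago",    41.9,  -87.6, 35.5),
    ("california", 37.4, -122.1, 55.2),
]

def distance_km(lat1, lon1, lat2, lon2):
    # Great-circle (haversine) distance between two points, in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def pick_site(default_site, radius_km=1500.0):
    """Among sites within radius_km of the default replica, pick the one with the
    lowest current electricity price; otherwise stay at the default."""
    _, d_lat, d_lon, _ = default_site
    eligible = [s for s in SITES
                if distance_km(d_lat, d_lon, s[1], s[2]) <= radius_km]
    return min(eligible or [default_site], key=lambda s: s[3])[0]

print(pick_site(SITES[0]))   # -> "chicago" while its price stays the lowest nearby
```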

The results presented in this paper depend heavily on the energy proportionality of data centers/clusters, i.e., on energy consumption being proportional to load. Unfortunately, that is not the case with today’s hardware, but the authors hope it will be in the future. Even so, they did find significant temporal and geographic variation in prices to exploit, and through trace-based simulation they showed that 40% savings can be achieved in the ideal energy-proportional scenario. However, on real hardware the total savings drop to only ~5-6%.

Critique

The authors have presented an excellent overview of the contemporary electricity economy, and they have obviously done a great job in shedding new light on a (sort of) well-established problem.

While the paper is excellent in its observations and measurements, the solution is straightforward and relies on several simplifying assumptions. One can imagine a multi-variable optimization problem here that considers energy, bandwidth, and other costs together with multiple constraints on latency, bandwidth, etc. On the other hand, such optimization problems are really hard to solve in real time, and people may well end up using the straightforward approach in the end.

The choice of 1500 km as the radius seemed pretty arbitrary. The authors did try to justify the number by pointing to significant jumps in cost and distance at that value, but I did not find it very convincing.

Also, one might ask: if such measures do save cost, why is no one using them? The likely reason is that the overhead of practically implementing something like this outweighs its savings. But this is only the first step in this direction; there might be ways to find a practical solution in the future.

Also, it seemed to me that the paper does not have much networking or communication content per se. Anyhow, it was an interesting read.

BotGraph: Large Scale Spamming Botnet Detection

Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, “BotGraph: Large Scale Spamming Botnet Detection,” NSDI’09, (April 2009). [PDF]

Summary

Analyzing large volumes of logged data to identify abnormal patterns is one of the biggest and most frequently faced challenges in the network security community. This paper presents BotGraph, a mechanism based on random graph theory for detecting botnet spamming attacks on major web email providers from millions of log entries, and describes its implementation as a data-parallel system. The authors posit that even though individually detecting bot-users (fake email accounts/users created by botnets) is difficult, they do exhibit aggregate behavior (e.g., they share IP addresses when they log in and send emails). BotGraph detects abnormal sharing of IP addresses among bot-users to separate them from human users.

BotGraph has two major components: aggressive sign-up detection, which identifies sudden increases in sign-up activity from the same IP address, and stealthy bot detection, which detects the sharing of one IP address by many bot-users as well as the sharing of many IP addresses by a single bot-user by creating a user-user graph. In this paper, the authors treat multiple IP addresses from the same AS as one shared IP address. It is also assumed that legitimate groups of normal users do not normally use the same set of IP addresses across different ASes.

In order to identify bot-user groups, the authors create a user-user graph where each user is a vertex and each edge between two vertices carries a weight based on a similarity metric between them. The authors’ assumption is that the bot-users will form a giant connected component in the graph (since they share IP addresses), collectively distinguishing them from normal users (who form much smaller connected components). They show that there exists a threshold on edge weights below which large components suddenly appear. Using random graph theory, it is proven that if there is IP address sharing then the giant component will appear with high probability.

The detection algorithm works iteratively, starting with a small threshold value to create a large component, and then recursively increasing the threshold to extract connected sub-components (until the sizes of the connected components fall below some constant). The result is a hierarchical tree. The algorithm then uses some statistical measurements (using histograms) to separate bot-user group components from their real counterparts (this part of the algorithm is questionable!). After this pruning stage, BotGraph traverses the hierarchical tree once more to consolidate bot-user group information.
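
The threshold-raising idea can be sketched in a few lines of Python; the similarity metric (number of shared ASes), the minimum component size, and the toy data below are illustrative assumptions rather than BotGraph’s exact parameters.

```python
from collections import defaultdict
from itertools import combinations

def build_user_graph(logins):
    """logins: {user: set of ASes it logged in from}. Edge weight between two users
    is the number of ASes they share (one of several possible similarity metrics)."""
    by_asn = defaultdict(set)
    for user, asns in logins.items():
        for asn in asns:
            by_asn[asn].add(user)
    weights = defaultdict(int)
    for users in by_asn.values():
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return weights

def components(users, weights, threshold):
    """Connected components of the subgraph restricted to `users`, keeping only
    edges whose weight is at least `threshold` (simple union-find)."""
    parent = {u: u for u in users}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (u, v), w in weights.items():
        if w >= threshold and u in parent and v in parent:
            parent[find(u)] = find(v)
    groups = defaultdict(set)
    for u in users:
        groups[find(u)].add(u)
    return list(groups.values())

def extract_groups(users, weights, threshold, min_size=3):
    """Recursively raise the edge-weight threshold, keeping components that stay
    large; the nesting of the returned dicts forms the hierarchical tree."""
    tree = []
    for comp in components(users, weights, threshold):
        if len(comp) >= min_size:
            tree.append({"threshold": threshold, "users": comp,
                         "children": extract_groups(comp, weights, threshold + 1, min_size)})
    return tree

logins = {"u1": {1, 2}, "u2": {1, 2}, "u3": {1, 2}, "u4": {7}}   # toy data
w = build_user_graph(logins)
print(extract_groups(set(logins), w, threshold=1))
```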

The difficulty in applying BotGraph lies first in constructing the graph, which can contain millions of vertices, and then in processing this huge graph to actually run the detection algorithm. The authors propose two approaches using the currently popular data-parallel processing framework MapReduce: a straightforward implementation and an extension using selective filtering that adds two more interfaces beyond the basic Map and Reduce. The extended version is found to be the more resource-efficient, scalable, and faster of the two.

Evaluation using the data-parallel systems shows that BotGraph can identify a large number of previously unknown bot-users and bot-user accounts with low false positive rates.

Critique

The whole concept seems to rest on the assumption that botnet operators are “stupid”. The criteria used in the paper are pretty straightforward, and it shouldn’t be too hard for botnet operators to figure them out and work around them.

The MapReduce part in the middle of the paper, which takes four pages, is not all that new; the authors could have left it out or at least shortened it and saved everyone some time. After all, MapReduce was created to handle terabytes of data; making it work for 500 GB of data should not be that awe-inspiring. To give the authors credit, they are most probably the first to use it in this context.

One more thing: I wish someone would publish what GMail does to filter spam, because I can bet they are pretty darn good at what they are doing and far ahead of other web mail providers.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

R. Gummadi, H. Balakrishnan, P. Maniatis, S. Ratnasamy, “Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks,” NSDI’09, (April 2009). [PDF]

Summary

In recent years, botnets have become the major originators of email spams, DDoS attacks, and click-frauds on advertisement-based web sites. This paper argues that separating human-generated traffic from botnet-generated activities can improve reliability of various web-based services against botnet attacks. But identifying human-generated traffic in the absence of strong unique identities can be challenging. In this paper, the authors propose NAB (Not-A-Bot), a system to approximately identify and certify human-generated activity in a non-intrusive way.

NAB Architecture

NAB consists of an attester, a small piece of trusted software that runs locally on a host (isolated from the untrusted OS) and generates attestations corresponding to requests from applications, and an external verifier that validates these attestations at the remote site. Four main requirements drive the design of the NAB architecture (attester and verifier):

  1. Attestations must be generated automatically in response to human requests.
  2. Attestations must not be transferable from the client on which they are generated to attest traffic originating from another client.
  3. NAB must benefit users that deploy it without hurting those that do not.
  4. NAB must preserve the existing privacy and anonymity semantics of applications.

Requirements 1 and 2 are met by the attester, while requirements 3 and 4 are ensured by the verifier.

Attester

The attester runs on a trusted computing base (TCB), implemented by taking advantage of the Trusted Platform Module (TPM) available in most modern systems. The authors use the TPM to create a trusted path between physical input devices and the human activity attester.

The attester’s sole purpose is to create attestations – when asked for by an application – for legitimate human activity. The authors use a simple t-δ attester, where an attestation is created only if there has been input activity in the last δ time units. Even though this simple approach leaves some room for forging/harvesting user activity, the authors argue that a botnet will be rate-limited by the frequency of human activity, which will decrease the number of attacks.
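
Here is a rough sketch of the t-δ idea; the HMAC-based signing and the key handling are stand-ins for NAB’s actual TPM-backed attestation scheme, and the message format is entirely made up.

```python
import hashlib
import hmac
import time

class TDeltaAttester:
    """Sketch of a t-delta attester: attest a request only if some keyboard/mouse
    activity was seen within the last `delta` seconds. The keying and message
    format are placeholders, not NAB's actual TPM-based design."""

    def __init__(self, delta, key):
        self.delta = delta
        self.key = key            # in NAB this secret would stay inside the TCB
        self.last_input = 0.0

    def on_input_event(self):
        # Invoked from the trusted path attached to physical input devices.
        self.last_input = time.time()

    def attest(self, request_bytes):
        if time.time() - self.last_input > self.delta:
            return None           # no recent human activity: refuse to attest
        msg = request_bytes + str(int(time.time())).encode()
        return hmac.new(self.key, msg, hashlib.sha256).hexdigest()

attester = TDeltaAttester(delta=1.0, key=b"demo-key")
attester.on_input_event()
print(attester.attest(b"POST /send_mail ..."))   # a MAC while the activity is fresh
time.sleep(1.2)
print(attester.attest(b"POST /send_mail ..."))   # None once delta has expired
```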

NAB generates responder-specific, content-specific, and, if appropriate, challenger-specific attestations and employs existing cryptographic methods to secure them. It also ensures that attestations cannot be double-spent and cannot be misused by botnets (botnets can forge attestations only within a very limited time window).

Verifier

The verifier is co-located with the server processing requests. When invoked, the verifier is passed both the attestation and the request. Based on this information (plus the cryptographic checks described in the paper), the verifier checks the validity of the attestation. The authors also discuss different application-specific verification policies for spam, DDoS, and click-fraud.

Evaluation Results

Major results from the evaluation are:

  • The TCB can be really small (about 500 SLOC).
  • The attester’s CPU cost is on the order of 10^7 instructions per attestation.
  • Fewer than 250 SLOC of changes are enough to make simple applications NAB-enabled.
  • In the worst case, NAB can suppress 92% of spam, 89% of non-human (possibly DDoS) activity, and 87% of automated clicks without false positives.
  • The verifier can withstand a DDoS attack from 100,000 bots and can handle more than 1,000 requests/second.

Discussion

The authors have done a good job of discussing several alternatives in multiple cases instead of sticking to their chosen one, which provides insight into botnet behavior and explains why some solutions might not work. Their solution still has some loopholes, but it will probably work to some extent. The requirement of changing all the applications could be a turn-off. However, the fact that NAB does not discriminate against unattested traffic makes that problem go away and allows incremental deployment.

One problem that I can think of with this approach is that the applications are not in the TCB, and the requests for attestations do not seem to go through a trusted channel. What are the fallback options if a malicious entity corrupts/removes attestations just after they leave the trusted base of the attester? Also, the verifiers do not seem to be in a TCB. What if one is compromised?

Scalable Application Layer Multicast

S. Banerjee, B. Bhattacharjee, C. Kommareddy, “Scalable Application Layer Multicast,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Deployment of network-layer multicast requires support from the underlying infrastructure. Like many other proposals/protocols with similar requirements, it faced the same fate: no one deployed it. This paper presents a scalable application-layer multicast protocol, NICE, designed for low-bandwidth data streaming applications with large receiver sets. Data packets in NICE are replicated at end hosts, which technically makes it an overlay, and so it does not need support from network providers.

The NICE protocol arranges the set of end hosts into a hierarchy and maintains the hierarchy through different events. The members at the very top of the hierarchy maintain (soft) state about O(log N) other members. Logically, each member keeps detailed state about other members that are near it (based on end-to-end latency) in the hierarchy and only limited knowledge about the other members of the group. Hosts in each layer are partitioned into clusters of sizes between k and 3k-1, where k is a constant. Each cluster has a leader, the graph-theoretic center of the cluster. The leaders of the clusters in layer Li are the members of layer Li+1, and so on. The following properties hold for the distribution of hosts across layers:

  • A host belongs to a single cluster at any layer.
  • If a host is present in layer Li, it must be the cluster leader in each of the lower layers (Li-1…L0).
  • If a host is not present in layer Li, it cannot be present in any higher layers (Li+1…).
  • Each cluster is size bounded between k and 3k-1.
  • There are at most log_k N layers, and the highest layer has only a single member.

There are separate control and data planes in NICE. The neighbors on the control topology exchange periodic soft state refreshes and form a clique in each layer. The data topology in each layer is a source-specific tree.

NICE assumes a special host, the rendezvous point (RP), which is known a priori to all members. When a new host wants to join a multicast group, it contacts the RP. The RP responds by returning the members of the highest layer. The joining node then contacts the closest of these members, which refers it to the members of its cluster in the lower layer (the closest member has such a cluster because it is a leader in the lower layer). This process continues iteratively until the joining node finds its place in the lowest layer. The message overhead of the joining process is O(k log N) query-response pairs. When a node leaves, it removes itself from all the layers, and new leaders are chosen. Periodic heartbeat messages keep state information fresh and refine (split/merge) the hierarchy when necessary.
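
The join procedure can be sketched as a walk down the hierarchy; the layers data structure and the latency function below are illustrative stand-ins for the protocol’s query-response messages, not NICE’s wire format.

```python
import random

def latency(a, b):
    # Stand-in for an end-to-end latency probe between two hosts.
    random.seed(hash((min(a, b), max(a, b))))
    return random.uniform(1, 100)

def nice_join(new_host, layers):
    """Descend the NICE hierarchy to place new_host in an L0 cluster.

    layers[i] maps each cluster leader in layer L_i to its cluster's member list;
    every member of L_i is, by construction, the leader of some cluster in L_{i-1}.
    This data model is purely illustrative."""
    top = len(layers) - 1
    candidates = next(iter(layers[top].values()))   # members of the top cluster (from the RP)
    leader = None
    for i in range(top, 0, -1):
        leader = min(candidates, key=lambda m: latency(new_host, m))
        candidates = layers[i - 1][leader]          # its cluster one layer down
    layers[0][leader].append(new_host)              # join the closest L0 cluster
    return leader

# Two layers: "A" and "C" lead L0 clusters and together form the L1 cluster led by "A".
layers = [{"A": ["A", "B"], "C": ["C", "D", "E"]}, {"A": ["A", "C"]}]
print(nice_join("Z", layers), layers[0])
```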

The authors compare the performance of NICE against the NARADA protocol (probably the best at the time) along the following dimensions: quality of the data path (using stretch and stress measures), recovery from host failure, and control traffic overhead. NICE is similar to NARADA in terms of data-path stretch and failure recovery, but it puts lower stress on routers and links. The most important result is that NICE has orders of magnitude lower control overhead for groups of size > 32, which makes it much more scalable. In addition to simulations, the paper also presents some smaller-scale wide-area experiments whose results are consistent with the simulations.

Critique

The idea presented in this paper is neat and straightforward, but at the same time obvious. By creating hierarchies, a lot of problems can be solved, but such solutions come at the expense of fault tolerance. The authors just assume that the RP won’t fail, or that if it fails it will recover quickly, but they avoid the details of what might happen if things go wrong. In short, there is no clear notion of reliability.

NICE is just a glorified overlay. The reasons why application-layer multicast should work have also turned out to be not-so-correct. Network operators do not like protocols that try to bypass them and mess with their traffic engineering; eventually, they end up blocking or throttling those protocols. The same might happen to NICE, if anyone were using it. The paper presents no evidence that discusses this aspect of NICE-underlay interaction.

Scalable Reliable Multicast

S. Floyd, V. Jacobson, S. McCanne, C-G Liu, L. Zhang, “A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing,” ACM SIGCOMM Conference, (August 1995). [PDF]

Summary

Unlike unicast transmission, where the requirements for reliable, sequential data delivery are fairly general, different multicast applications have different requirements for reliability. The authors argue that designing a one-size-fits-all multicast protocol – one that simultaneously meets the functionality, scalability, and efficiency requirements of all applications – is not possible. In this paper, they propose Scalable Reliable Multicast (SRM), a framework for supporting diverse multicast requirements in different applications by integrating application semantics (in the spirit of Application Level Framing (ALF)) and session semantics (in the spirit of Light-Weight Sessions (LWS)).

In order to ensure reliability, some party must take responsibility for loss detection and recovery. In TCP, the sender is that responsible party. However, in a multicast scenario, the sender would face ACK implosion from all the receivers in the multicast domain. In addition, the sender would have to keep track of the changing set of active receivers and their corresponding windows; RTT estimation algorithms as well as concepts like the congestion window also require rethinking. The authors argue that receiver-based reliability, where responsibility is distributed among the many receivers instead of being piled onto the single sender, is a better fit for multicast. Another design choice is to name data in application data units (ADUs) instead of using generic protocol-specific naming; ADUs keep application semantics intact and provide a level of indirection that can be exploited for anonymity purposes.

SRM is discussed using a distributed whiteboard (wb) application. The whiteboard separates drawings into pages. Each member can create a page, and any member can manipulate any page. Each member has a globally unique identifier, as does each page. Names and identifiers are persistent and refer to the same entity over time. The authors also assume the presence of IP multicast datagram delivery and do not separate senders from receivers. Also, wb has no requirement for ordered data delivery because the operations are idempotent and can be performed in any order.

Whenever new data is generated by wb, it is multicast to the group. Each member of the group individually keeps track of its losses (detected by gaps in sequence numbers) and retransmissions. Each member sends low-rate, periodic session messages to the group announcing the highest sequence number it has received. The same messages also carry timestamps that are used to estimate distances between members. When a receiver detects missing data, it waits for a random amount of time (using a request timer) before sending a repair request; the wait time is a function of its distance from the source. Repair requests and retransmissions are multicast to the whole group, which suppresses repair-request implosion (from other receivers who also missed the same data). If a host receives a request for the same data before its request timer goes off, it does an exponential backoff and resets its request timer. Any host with the requested data can reply, but it too waits a random amount of time (using a repair timer) based on its distance from the requester. As in the previous case, this prevents response implosion. If a host receives the repair before its repair timer goes off, it cancels the timer.
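
A rough sketch of the timer logic is shown below; the constants, the distance values, and the simple backoff bookkeeping are illustrative, not SRM’s exact parameter choices.

```python
import random

def request_delay(dist_to_source, c1=2.0, c2=2.0):
    """Random wait before multicasting a repair request for missing data: drawn
    uniformly from [c1*d, (c1+c2)*d], so nearer receivers tend to fire first and
    suppress everyone else. c1/c2 are tunable knobs; the defaults here are arbitrary."""
    return random.uniform(c1 * dist_to_source, (c1 + c2) * dist_to_source)

def repair_delay(dist_to_requester, d1=1.0, d2=1.0):
    """Random wait before answering someone else's repair request (same idea)."""
    return random.uniform(d1 * dist_to_requester, (d1 + d2) * dist_to_requester)

class PendingRequest:
    """Minimal suppression logic for a receiver that detected a loss: if another
    host's request for the same data is heard first, back off exponentially and
    re-arm the request timer instead of sending a duplicate request."""
    def __init__(self, now, dist_to_source):
        self.dist = dist_to_source
        self.backoff = 1
        self.fire_at = now + request_delay(self.dist)

    def on_duplicate_request_heard(self, now):
        self.backoff *= 2
        self.fire_at = now + self.backoff * request_delay(self.dist)

req = PendingRequest(now=0.0, dist_to_source=10.0)
req.on_duplicate_request_heard(now=5.0)     # someone nearer asked first: back off
print(round(req.fire_at, 2))
```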

The authors analyze the framework using three example topologies: chains, which illustrate deterministic suppression; stars, which explore probabilistic suppression; and bounded-degree trees, which combine both suppression mechanisms. Based on this analysis, they come up with parameters and observations that are best described in the paper. In short, by tuning those parameters the framework can be adapted to different application requirements.

Finally, SRM is extended to support adaptive adjustment of random timer algorithms based on the past behavior of request/repair algorithms. Simulations show that the adaptive algorithms quickly reduce the average number of repairs with little penalty in additional delay, and they are effective in controlling the number of duplicates over a range of scenarios.

Comments

The ideas presented in the paper are simple but intuitively effective. The best part is the analysis of the framework using three different example topologies. The biggest complaint about this paper, as well as about some others from the same time period, is the absence of axes or legends in the graphs, which is very annoying for readers. Another observation is the extensive use of simulations.

There seems to be a lot of similarities between SRM and opportunistic routing protocols in the wireless networks. I wonder whether and which aspects of the opportunistic protocols can be used in a multicast scenario similar to SRM or vice versa.

X-Trace: A Pervasive Network Tracing Framework

R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, “X-Trace: A Pervasive Network Tracing Framework,” NSDI’07, (April 2007). [PDF]

Summary

Diagnosing problems in modern distributed and networked systems is notoriously hard due to the involvement of multiple administrative domains, application and network protocols, and lack of determinism and reproducibility. X-Trace is a cross-layer, cross-application tracing framework designed to reconstruct the behavior and workflow of a system with an aim to understand the causal paths in network protocols.

Design Principles

X-Trace is guided by the following three design principles:

  1. The trace request is sent in-band to capture the actual condition of the datapath.
  2. The collected trace data is sent out-of-band so that reports do not depend on the very network whose failures and errors the trace may be documenting.
  3. The entity requesting the trace is decoupled from the entities receiving the trace data, to preserve the privacy of the different administrative domains (ADs) on the datapath and to give them the flexibility to decide whether or not to share tracing information with other domains.

Basic Workflow

A user or operator invokes X-Trace (transparently or not) when initiating an application task, by inserting X-Trace metadata with a unique task identifier in the resulting request. The X-Trace metadata is replicated into each layer and contains the task identifier, optional TreeInfo (ParentID, OpID, EdgeType) used later to reconstruct the overall task tree, and another optional destination field. This metadata is propagated down the layers of the network stack (from the application to the network layer; the link layer can also be supported using a shim header) using the pushDown() primitive, and recursively among the subsequent requests on different nodes resulting from the original task using the pushNext() primitive. Eventually, all nodes and layers relevant to a task carry X-Trace tracing information.
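
A toy sketch of the metadata propagation might look like the following; the class layout, field handling, and report format are simplifications of what the paper actually specifies, and the helper names are my own.

```python
import os

class XTraceMetadata:
    """Toy model of X-Trace metadata: a TaskID plus TreeInfo (ParentID, OpID, EdgeType).
    The field layout and report contents are simplified relative to the paper."""

    def __init__(self, task_id, parent_id=None, edge_type=None):
        self.task_id = task_id
        self.parent_id = parent_id
        self.edge_type = edge_type
        self.op_id = os.urandom(4).hex()

    def push_down(self):
        # Copy the metadata into the next lower layer on the same node;
        # the current operation becomes the parent of the lower-layer operation.
        return XTraceMetadata(self.task_id, parent_id=self.op_id, edge_type="down")

    def push_next(self):
        # Propagate to the causally next operation (e.g., the request sent to the
        # next node); again the current operation becomes the parent.
        return XTraceMetadata(self.task_id, parent_id=self.op_id, edge_type="next")

def report(md, node, layer, reports):
    # Out-of-band report, keyed by TaskID and later used to rebuild the task tree.
    reports.setdefault(md.task_id, []).append(
        {"node": node, "layer": layer, "op": md.op_id,
         "parent": md.parent_id, "edge": md.edge_type})

reports = {}
http_md = XTraceMetadata(task_id="T1")
report(http_md, "client", "HTTP", reports)
report(http_md.push_down(), "client", "TCP", reports)     # same node, lower layer
report(http_md.push_next(), "server", "HTTP", reports)    # next node in the task
print(len(reports["T1"]))                                  # -> 3 reports for task T1
```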

When a node sees X-Trace metadata in a message at its particular layer, it generates a report, which is later used to reconstruct the datapath. Reports are identified by the relevant TaskID and stored in a local database within an AD. As a result, traces for each segment of the datapath are stored in their respective ADs. To build the complete end-to-end trace and the resulting task tree from all the reports, one must be able to collect all the segments of the trace.

Use Cases

This paper presents experiences of using X-Trace in three different scenarios: a simple web request with its accompanying recursive DNS queries, a web hosting site, and an overlay network (i3). In all three cases the authors show that X-Trace can reconstruct the actual data, execution, and propagation paths and represent them in a task tree (when there are no errors). Whenever something goes wrong – at any layer of any node – X-Trace can pinpoint it. X-Trace is more precise than other systems because it works with individual components at much deeper levels, and thus has more information for deciding what went wrong, unlike its counterparts that work on larger components and fail to diagnose the finer details.

Discussion

The overall idea of the paper is pretty straightforward, but implementing and executing it is a complicated and huge undertaking. The authors have done a great job of building such a large framework and making it work.

However, there are many concerns regarding its deployability, complexity, performance, overhead, security, and privacy. One great thing is that the authors are aware of these problems; they discuss each one of them in the paper and propose workarounds for some, giving the impression that X-Trace might fall into the category of “eventually deployable” rather than “trophy project”.

NetFPGA: A Tool for Network Research and Education

G. Watson, N. McKeown, M. Casado, “NetFPGA: A Tool for Network Research and Education,” 2nd Workshop on Architecture Research using FPGA Platforms (WARFP), (February, 2006). [PDF]

Summary

Building custom hardware systems for conducting network experiments as well as for educational purposes is complex and expensive. On the other end of the spectrum, software routers like Click provide an easy development interface but fail to achieve hardware-comparable performance. NetFPGA takes the middle ground by providing an easily programmable hardware platform for developing network devices (e.g., NICs, switches, routers) with reasonable performance and flexibility.

User interaction with NetFPGA was initially designed to be remote; users would design, simulate, and debug their intended hardware logic in Verilog, upload it using a simple web interface to configure FPGAs on remote devices, and then download and observe the results. The initial version of NetFPGA, built as a stand-alone PCB, did not have an on-board CPU; controller software ran on remote computers connected through the Internet, and all communication with the board was via its Ethernet ports. NetFPGA was later redesigned as a PCI board, and an on-board CPU was added.

The authors also mention the use of NetFPGA in designing RCP-capable routers and in implementing shunting to enable a high-speed layer-7 intrusion detection and prevention system (IDS), by providing a controllable bridge that the IDS can use to manage traffic.

A Policy-aware Switching Layer for Data Centers

D. Joseph, A. Tavakoli, I. Stoica, “A Policy-Aware Switching Layer for Data Centers,” ACM SIGCOMM Conference, (August 2008). [PDF]

Summary

Middlebox (e.g., firewall, load balancer) management in existing data center networks is inflexible and requires manual configuration by system administrators, which results in frequent misconfigurations. This paper presents a policy-aware switching layer (PLayer) at layer 2 that addresses the hardships of data center middlebox deployment using two design principles, “separate policy from reachability” and “take middleboxes off the physical paths”, to provide correctness of traffic flow, flexibility of middlebox placement/configuration, and efficiency by minimizing unnecessary middlebox traversals.

Basic Workflow

The PLayer consists of enhanced layer-2 switches (pswitches). Unmodified middleboxes are plugged into a pswitch in the same way servers are plugged into regular switches. Pswitches depend on policies specified by network administrators to forward frames. Policies define the sequence of middleboxes to be traversed by different traffic and take the form [Start Location, Traffic Selector] → Sequence, where the left side defines the applicable traffic (frames arriving from the Start Location whose 5-tuples match the Traffic Selector) and the right side specifies the sequence of middlebox types to be traversed. Note that PLayer does not support traversing multiple middleboxes of the same type within one policy; hence sequences are expressed in terms of middlebox types.

Policies are centrally stored at a policy controller and, after a static analysis that discards detectably buggy policies, are automatically converted into rules by the PLayer. Rules are stored in the pswitches’ rule tables and take the form [Previous Hop, Traffic Selector] : Next Hop. A centralized middlebox controller also monitors middlebox liveness and informs pswitches about middlebox churn. Whenever a pswitch receives a frame, it finds the matching rule in the rule table and forwards the frame to the specified Next Hop. Pswitches use a flow-direction-agnostic consistent hash on a frame’s 5-tuple to select the same middlebox instance for all frames in both the forward and reverse directions of a flow.
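
A small sketch of how a pswitch might consult its rule table and pick a middlebox instance is shown below; the rules, instance lists, and hashing details are illustrative assumptions on my part, not PLayer’s implementation.

```python
import hashlib

# Hypothetical rule table for one pswitch; a rule is
# [Previous Hop, Traffic Selector] -> Next Hop (a middlebox type).
RULES = [
    {"prev": "core_router",   "selector": {"dst_port": 80}, "next": "firewall"},
    {"prev": "firewall",      "selector": {"dst_port": 80}, "next": "load_balancer"},
    {"prev": "load_balancer", "selector": {"dst_port": 80}, "next": "web_servers"},
]

# Live instances of each middlebox type, as reported by the middlebox controller.
INSTANCES = {"firewall": ["fw1", "fw2"], "load_balancer": ["lb1"], "web_servers": ["web_vlan"]}

def pick_instance(mbox_type, five_tuple):
    """Flow-direction-agnostic instance selection: hash the 5-tuple with the two
    endpoints sorted so both directions of a flow map to the same instance. (A real
    pswitch would use a proper consistent-hashing ring to limit churn on failures.)"""
    src, sport, dst, dport, proto = five_tuple
    key = "|".join(sorted([f"{src}:{sport}", f"{dst}:{dport}"]) + [proto])
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    instances = INSTANCES[mbox_type]
    return instances[digest % len(instances)]

def forward(prev_hop, frame):
    for rule in RULES:
        if rule["prev"] == prev_hop and all(frame.get(k) == v
                                            for k, v in rule["selector"].items()):
            return pick_instance(rule["next"], frame["five_tuple"])
    return "default_l2_forwarding"

frame = {"dst_port": 80, "five_tuple": ("10.0.0.1", "5555", "10.0.1.2", "80", "tcp")}
print(forward("core_router", frame))   # e.g., "fw2"; the choice is stable for this flow
```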

Implementation Details

  • Forwarding infrastructure: A pswitch consists of two independent, decoupled parts: the Switch Core, which does the forwarding, and the Policy Core, which redirects frames to the middleboxes dictated by policies. The Policy Core itself consists of multiple modules: RuleTable, InP, OutP, and FailDetect. The Switch Core appears as a regular Ethernet switch to the Policy Core, while the Policy Core appears as a multi-interface device to the Switch Core. Frames redirected by the Policy Core are encapsulated inside an Ethernet-II frame to preserve the original MAC addresses, which is required for correctness.
  • Supporting unmodified middleboxes: To support middleboxes without any modification, pswitches
    1. ensure that only relevant frames in standard Ethernet format reach middleboxes and servers,
    2. use non-intrusive techniques (e.g., stateless in-line device in the worst case) to identify the previous hop, and
    3. support a wide variety of middlebox addressing requirements.
  • Supporting non-transparent middleboxes/stateful pswitches: In order to support middleboxes that change frame headers or content (e.g., load balancers), the concept of per-segment policies is introduced. Middlebox instance selection is supported by consistent hashing and policy hints, and requires stateful pswitches in more complex scenarios. A stateful pswitch keeps track of forward and reverse flows to select the same middleboxes in both directions even if frame headers are modified.
  • Churn behavior: PLayer can face different kinds of churn: network, policy, and middlebox. PLayer employs a 2-stage, versioned policy and middlebox information dissemination mechanism, intermediate middlebox types, and stateful pswitches to handle the different scenarios. The authors show that, no matter what, policies are never violated.

Performance Evaluation and Analysis

PLayer was evaluated on the DETER testbed using a handful of regular machines, and the evaluation mostly focuses on showing that it works. However, such small-scale experiments fail to capture its effectiveness in large data center scenarios. Microbenchmarks show an 18% decrease in TCP throughput and a 16% increase in latency for basic pswitches, and a 60% drop in throughput and a 100% increase in latency with middlebox offloading. Most of the time is spent on rule lookup and frame encapsulation/decapsulation.

The authors have also provided a semi-formal analysis of PLayer to show guaranteed correctness and consistency at the expense of availability.

Critique

While interesting, this paper suffers from severe performance setbacks; especially, the large increase in latency will most likely make deployment impossible. Even though the authors argue that, inside data centers, the latency introduced by offloading would be negligible, without real data it is hard to rely on that assurance.

As the authors point out, there are several limitations in terms of indirect paths, policy specification mechanisms, effects of packet misclassification, incorrect physical wiring of middleboxes, unsupported policies, and more complex switches. However, the prospect of manageability gains is promising enough to consider those challenges in the future.

Several special cases rely on stateful pswitches, which are not very practical at least in the existing networking landscape.

One last thing: there are a lot of similarities between PLayer and OpenFlow from a high-level view. I wonder whether OpenFlow (which appeared after PLayer) was influenced by PLayer, and how much.

Internet Indirection Infrastructure

I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, “Internet Indirection Infrastructure,” ACM SIGCOMM Conference, (August 2002). [PDF]

Summary

Internet Indirection Infrastructure (i3) is an overlay network that offers a rendezvous-based communication abstraction, decoupling the act of sending from the act of receiving in order to support diverse communication services (e.g., multicast, anycast, mobility). Instead of sending packets directly to their destinations over the IP network, senders in i3 send packets to intermediate i3 nodes (addressed by logical identifiers); receivers interested in receiving packets sent to particular identifiers must explicitly express their interest by inserting triggers. In their simplest form, packets are pairs (id, data), where id is an m-bit identifier and data is the payload, and triggers are pairs (id, addr), where id is the trigger identifier and addr is the receiver’s address.
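
The basic rendezvous abstraction is easy to sketch; the code below models exact-match triggers only (identifier stacks and inexact matching are discussed below), and the identifiers and addresses are made up.

```python
class I3Node:
    """Toy model of i3's rendezvous abstraction with exact-match triggers only;
    identifier stacks and inexact (longest-prefix) matching are omitted."""

    def __init__(self):
        self.triggers = {}                       # id -> set of receiver addresses

    def insert_trigger(self, ident, addr):
        self.triggers.setdefault(ident, set()).add(addr)

    def remove_trigger(self, ident, addr):
        self.triggers.get(ident, set()).discard(addr)

    def send(self, ident, data, deliver):
        # Forward the payload to every receiver whose trigger matches the id; with
        # one trigger this is unicast, with several it behaves like multicast.
        for addr in self.triggers.get(ident, ()):
            deliver(addr, data)

node = I3Node()
node.insert_trigger("id-group-42", "10.0.0.5")
node.insert_trigger("id-group-42", "10.0.0.9")
node.send("id-group-42", b"hello",
          deliver=lambda addr, data: print(f"deliver {data!r} to {addr}"))
```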

Extensions to the Basic Model

  1. Inexact identifier matching: Instead of exact-matching all m bits to find the appropriate triggers for a packet, an exact-match threshold k (k < m) is defined; a match exists when a trigger identifier matches a packet identifier in at least the first k bits and is also the longest prefix match. By changing/exploiting k, i3 provides increased flexibility to support anycast, locality- and capacity-aware i3 server selection, and hot-spot avoidance. To facilitate inexact matching, all ids agreeing in their first k bits must reside on the same server.
  2. Stack of identifiers: Instead of using scalar identifiers, the authors propose identifier stacks, which are vectors of identifiers or addresses. In this case, packets carry (idstack, data) and triggers are represented by (id, idstack). This generalized form allows a source to send a packet through a series of identifiers (similar to source routing). Similarly, a trigger can send a packet to another intermediate identifier instead of directly to the receiver. A packet is forwarded over the i3 overlay until one of the identifiers in its idstack matches a trigger id; the matching identifier is then replaced by the trigger’s idstack and the process continues. Identifiers that match no trigger are popped off the packet’s idstack, and if no trigger can be found for a packet, it is eventually dropped.
  3. Private triggers: i3 makes a distinction between public triggers, which are known to everyone and used to initiate end-to-end communication, and private triggers, which are temporary triggers between two endpoints used for subsequent secure communication. Private triggers are established using public triggers.
  4. Potpourri: In addition to the aforementioned extensions/generalizations, the authors also propose a hierarchy of triggers to support large-scale multicast, backup triggers for robustness, and caching and randomization of trigger identifiers for routing efficiency.

Implementation Issues

The performance of i3 depends on the underlying overlay network. The authors implemented i3 on top of Chord for its robustness, scalability, lookup efficiency, stability, and self-organization. Each i3 node/server is assigned a unique identifier in the Chord ring, and Chord lookups are used to find the identifiers on a packet’s identifier stack while forwarding it. How to efficiently perform exact or inexact trigger matching is not very well defined.

The authors spend some time discussing the security concerns (e.g., eavesdropping, trigger hijacking, DoS attacks) of using i3. The solutions are mostly simple in theory but are expected to lower performance.

Evaluation

Latency stretch (the ratio of inter-node latency using i3 to inter-node latency using IP) is used as the main performance metric for evaluating i3. Simulations show that i3 does not perform as well as the IP network when servers are placed randomly. Caching and heuristics for proximity-aware placement significantly improve that performance. The overheads of trigger insertion and packet forwarding are found to be manageable, albeit not too impressive.

Critique

The simple abstraction presented in this paper manages to generalize multiple communication patterns that plague the current Internet. In addition, i3 is a great mash-up of many existing ideas, which IMO is a good thing for any systems or networking paper. However, i3’s performance and security aspects are not so simple, and even with the proposed increasingly complex tricks they still might fall short in real-life scenarios.

The evaluation section is probably the weakest part of the paper. A comparison against basic multicast seems to be a big missing piece. In addition, much of the clever stuff was not or could not be implemented/simulated, which fails to remove doubts about its applicability.