“Surviving Failures in Bandwidth-Constrained Datacenters” at SIGCOMM’2012

Update: The camera-ready version is on my publications page!

My internship work from last summer has been accepted for publication at SIGCOMM’2012 as well; yay!! In this work, we allocate machines to datacenter applications under bandwidth and fault-tolerance constraints, which are at odds: allocating for bandwidth tries to pack machines close together, whereas a fault-tolerant allocation spreads machines out across multiple fault domains.

Datacenter networks have been designed to tolerate failures of network equipment and provide sufficient bandwidth. In practice, however, failures and maintenance of networking and power equipment often make tens to thousands of servers unavailable, and network congestion can increase service latency. Unfortunately, there exists an inherent tradeoff between achieving high fault tolerance and reducing bandwidth usage in the network core; spreading servers across fault domains improves fault tolerance but requires additional bandwidth, while deploying servers together reduces bandwidth usage but also decreases fault tolerance. We present a detailed analysis of a large-scale Web application and its communication patterns. Based on that, we propose and evaluate a novel optimization framework that achieves both high fault tolerance and a significant reduction in bandwidth usage in the network core by exploiting the skewness in the observed communication patterns.
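To make the tradeoff concrete, here is a toy Python sketch, not the paper’s actual formulation: the machine names, racks, and traffic numbers are made up for illustration. It scores a placement by (a) the traffic it sends across fault-domain boundaries (core bandwidth) and (b) the worst-case fraction of machines lost when a single fault domain fails.

```python
# Toy illustration of the bandwidth vs. fault-tolerance tradeoff.
# NOT the paper's algorithm; machines, racks, and traffic are hypothetical.

from itertools import combinations

def core_bandwidth(placement, traffic):
    """Total traffic between machine pairs placed in different fault domains."""
    return sum(traffic.get((a, b), 0) + traffic.get((b, a), 0)
               for a, b in combinations(placement, 2)
               if placement[a] != placement[b])

def worst_case_loss(placement):
    """Fraction of machines lost if the single worst fault domain fails."""
    counts = {}
    for domain in placement.values():
        counts[domain] = counts.get(domain, 0) + 1
    return max(counts.values()) / len(placement)

# Four machines with skewed traffic: m0 and m1 talk a lot, m2 and m3 barely talk.
traffic = {("m0", "m1"): 100, ("m2", "m3"): 5}

concentrated = {"m0": "rack1", "m1": "rack1", "m2": "rack1", "m3": "rack1"}
spread       = {"m0": "rack1", "m1": "rack2", "m2": "rack3", "m3": "rack4"}
skew_aware   = {"m0": "rack1", "m1": "rack1", "m2": "rack2", "m3": "rack3"}

for name, p in [("concentrated", concentrated), ("spread", spread),
                ("skew-aware", skew_aware)]:
    print(f"{name:13s} core bw = {core_bandwidth(p, traffic):3d}, "
          f"worst-case loss = {worst_case_loss(p):.2f}")
```

On this made-up input, concentrating everything gives zero core bandwidth but loses all machines to one failure, spreading everything does the opposite, and co-locating only the chatty pair recovers most of the bandwidth savings while keeping much of the fault tolerance, which is the kind of skew the framework exploits.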

During my Master’s, I worked on several variations of a similar problem called virtual network embedding in the context of network virtualization (INFOCOM’2009, Networking’2010, VISA’2010, ToN’2012).

This year, 32 of the 235 submitted papers were accepted at SIGCOMM, seven of which have at least one Berkeley author.
