X-Trace: A Pervasive Network Tracing Framework

R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, “X-Trace: A Pervasive Network Tracing Framework,” NSDI’07, (April 2007). [PDF]

Summary

Diagnosing problems in modern distributed and networked systems is notoriously hard due to the involvement of multiple administrative domains, application and network protocols, and lack of determinism and reproducibility. X-Trace is a cross-layer, cross-application tracing framework designed to reconstruct the behavior and workflow of a system with an aim to understand the causal paths in network protocols.

Design Principles

X-Trace is guided by the following three design principles:

The trace request is sent in-band to capture the actual condition of the datapath.
The collected trace data is sent out-of-band to avoid network failure and other errors, which the trace is actually reporting back.
The tracing requester entity is decoupled from the tracing receiver entities to provide privacy to different administrative domains (AD) on the datapath and to provide them with the flexibility to decide whether or not to share tracing information with other domains.

Basic Workflow

A user or operator (transparently or non-transparently) invokes X-Trace when initiating an application task, by inserting X-Trace metadata with a unique task identifier in the resulting request. X-Trace metadata information is replicated into each layer and contains the task identifier, an optional TreeInfo:(ParentID, OpID, EdgeType) to later reconstruct the overall task-tree, and another optional destination field. This metadata is propagated down to the lower layers of the network stack (from application to network layer, link layer can also be supported using shim header) using the pushdown() primitive as well as recursively among subsequent requests in different nodes resulting from the original task using the pushnext() primitive. Eventually, for each task, all relevant nodes and layers will carry X-Trace tracing information.

When a node sees X-Trace metadata in a message at its particular layer, it generates a report, which is later used for reconstruction of the datapath. Reports are identified by the relevant TaskID and stored in a local database within an AD. As a result, traces regarding each segment of the datapath are stored in respective ADs. To be able to build the complete end-to-end tracing information and resulting task-tree using all the reports, one must be able to collect all the segments of traces.

Use Cases

This paper presents experiences of using X-Trace on three different scenarios: a simple web request and accompanying recursive DNS queries, a web hosting site, and an overlay network (i3). In all three cases the authors show that X-Trace can build the actual data-, execution-, and propagation-path and represent them in a task-tree (in case of no errors). Whenever there is something wrong – in any layer of any node – X-Trace can pinpoint that. The whole thing is more precise than other systems because X-Trace works with individual components to much deeper levels, and thus have more information to take any decision on what went wrong unlike its counterparts who work on larger components and fail to diagnose tiny details.

Discussion

The overall idea of the paper is pretty straightforward, but implementing/executing it is a very complicated and a huge undertaking. The authors have done a great job in building such a large framework and making it work.

However, there are too many concerns regarding its deployability, complexity, performance, overhead, security, and privacy issues. One great thing is that the authors are aware of the problems, and they discussed each one of them in the paper and proposed workarounds for some of them to give the impression that X-Trace might fall into the category “eventually deployable” instead of “trophy project“.

One thought on “X-Trace: A Pervasive Network Tracing Framework”

Randy Katz says:

November 18, 2009 at 8:43 PM

The good news is that x-Trace can be used to trace and diagnose large scale systems. Coral uses it, 25M file fetches per day, globally distributed, and several bugs were uncovered. This is documented in Rodrigo’s dissertation.

Mosharaf Chowdhury