
Benchmarking OS Bypass In MPI Implementations

William "Bill" Lawry - Christopher R. Wilson - Arthur B. Maccabe

Abstract:

Communication protocols traditionally delegate message-handling routines to the Operating System (OS). Because these routines execute on the host processor, they take valuable execution time away from the application. By utilizing processors at the network interface, message-handling routines can bypass the OS and the host processor. This decoupling allows communication to progress on the network interface in parallel with computation on the host processor, and such parallel progress has a significant effect on application runtime.

Netperf, a commonly used tool for benchmarking TCP and UDP performance, measures host processor overhead during communication. However, Netperf is inappropriate for general MPI (Message Passing Interface) benchmarking because of limitations in its CPU utilization measurement methods. High-performance MPI applications generally run one process per node, yet the documented Netperf methods require one of the following: specific vendor kernels, which limits portability; a separate timing process, whose context switches distort the utilization measurements; or the less accurate pstat.

Because of the limitations associated with Netperf, we have developed a portable benchmark for MPI applications that measures the relationship between overall communication bandwidth and CPU availability. After considering several process models, we selected a model with one process on each of two nodes and with no kernel modifications. Based on this model, we implemented two benchmark methods.

The first method, called the Polling method, is essentially a standard ping-pong benchmark with messages flowing in both directions to measure aggregate bandwidth. This method uses MPI calls to periodically poll for the arrival of new messages and to reply to them. Upon completing a predetermined amount of work, the method uses the time needed for that work to compute bandwidth and availability. Adjusting the polling interval allows us to demonstrate the tradeoff between bandwidth and availability. However, this method cannot expose a limitation present in many MPI implementations: they require follow-up MPI calls to make progress during the exchange of large messages.
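
The following C/MPI sketch is an illustration of the polling structure described above, not the benchmark's actual code; the message size, work loop, message count, and polling interval (MSG_BYTES, TOTAL_WORK, N_EXCH, POLL_EVERY) are hypothetical values chosen for the sketch. Each of the two ranks interleaves local work with nonblocking tests for the peer's next message and replies whenever one arrives; shrinking POLL_EVERY favors bandwidth, while enlarging it favors CPU availability.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES  (1 << 20)    /* illustrative message size (1 MiB)   */
#define N_EXCH     200          /* messages sent by each rank          */
#define TOTAL_WORK 50000000L    /* predetermined amount of work        */
#define POLL_EVERY 100000L      /* work units between polls: the knob  */

static volatile double sink;

static void work_unit(void) { sink += 1.0; }   /* stand-in host computation */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* one process on each of two nodes */
    int peer = 1 - rank;

    char *out = calloc(1, MSG_BYTES), *in = calloc(1, MSG_BYTES);
    MPI_Request rreq, sreq;
    long work_done = 0, sent = 1, recvd = 0;

    double t0 = MPI_Wtime();
    MPI_Irecv(in,  MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(out, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &sreq);

    while (work_done < TOTAL_WORK || recvd < N_EXCH) {
        /* Work phase: compute between polls. */
        for (long i = 0; i < POLL_EVERY && work_done < TOTAL_WORK; i++) {
            work_unit();
            work_done++;
        }

        /* Poll: has the peer's next message arrived?  If so, reply. */
        int arrived = 0;
        if (recvd < N_EXCH)
            MPI_Test(&rreq, &arrived, MPI_STATUS_IGNORE);
        if (arrived) {
            recvd++;
            if (sent < N_EXCH) {
                MPI_Wait(&sreq, MPI_STATUS_IGNORE);      /* previous send done    */
                MPI_Isend(out, MSG_BYTES, MPI_BYTE, peer, 0,
                          MPI_COMM_WORLD, &sreq);        /* send the next message */
                sent++;
            }
            if (recvd < N_EXCH)
                MPI_Irecv(in, MSG_BYTES, MPI_BYTE, peer, 0,
                          MPI_COMM_WORLD, &rreq);        /* post for the next one */
        }
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    double elapsed = MPI_Wtime() - t0;

    /* Aggregate bandwidth: bytes moved in both directions over elapsed time. */
    printf("rank %d: %.1f MB/s aggregate, %.3f s to finish work and messages\n",
           rank, 2.0 * N_EXCH * MSG_BYTES / 1e6 / elapsed, elapsed);

    free(out);
    free(in);
    MPI_Finalize();
    return 0;
}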

The second method, termed Post-Work-Wait, makes no MPI calls during its work phase and so can detect any need for follow-up MPI calls. Each process posts a collection of sends and receives. One of these processes then executes a predetermined amount of work and waits for its communication to complete. This process reports three times: the time taken to post the sends and receives, the time taken to complete the work, and the time spent waiting for the communications to complete. The sequence is repeated for statistical purposes. The primary control variable is the amount of work between the posting and waiting phases; if the wait time decreases as the amount of work increases, this is evidence of concurrent computation and communication.
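
The C/MPI sketch below shows one plausible shape for the Post-Work-Wait sequence; the message count, message size, trial count, and work loop (N_MSGS, MSG_BYTES, N_TRIALS, do_work) are illustrative assumptions, not the authors' actual parameters or code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)     /* illustrative message size (1 MiB)   */
#define N_MSGS    4             /* sends and receives posted per trial */
#define N_TRIALS  50            /* repetitions for statistics          */

static volatile double sink;

static void do_work(long units)         /* work phase: no MPI calls here */
{
    for (long i = 0; i < units; i++)
        sink += 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);  /* one process on each of two nodes */
    int peer = 1 - rank;

    long work_units = (argc > 1) ? atol(argv[1]) : 10000000L;  /* control variable */
    char *sbuf = calloc(N_MSGS, MSG_BYTES);
    char *rbuf = calloc(N_MSGS, MSG_BYTES);
    MPI_Request req[2 * N_MSGS];
    double t_post = 0.0, t_work = 0.0, t_wait = 0.0;

    for (int trial = 0; trial < N_TRIALS; trial++) {
        MPI_Barrier(MPI_COMM_WORLD);              /* align the two processes */

        /* Post phase: start every send and receive. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < N_MSGS; i++) {
            MPI_Irecv(rbuf + (size_t)i * MSG_BYTES, MSG_BYTES, MPI_BYTE,
                      peer, i, MPI_COMM_WORLD, &req[i]);
            MPI_Isend(sbuf + (size_t)i * MSG_BYTES, MSG_BYTES, MPI_BYTE,
                      peer, i, MPI_COMM_WORLD, &req[N_MSGS + i]);
        }
        double t1 = MPI_Wtime();

        /* Work phase: only the measured process computes, making no MPI calls. */
        if (rank == 0)
            do_work(work_units);
        double t2 = MPI_Wtime();

        /* Wait phase: block until all posted operations complete. */
        MPI_Waitall(2 * N_MSGS, req, MPI_STATUSES_IGNORE);
        double t3 = MPI_Wtime();

        t_post += t1 - t0;
        t_work += t2 - t1;
        t_wait += t3 - t2;
    }

    if (rank == 0)      /* the working process reports the three times */
        printf("post %.6f s  work %.6f s  wait %.6f s (averages over %d trials)\n",
               t_post / N_TRIALS, t_work / N_TRIALS, t_wait / N_TRIALS, N_TRIALS);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Sweeping work_units upward and watching whether the reported wait time shrinks is the evidence of overlap described above; an implementation that needs follow-up MPI calls to progress large messages will instead show a wait time that stays roughly constant as the work grows.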

To date, we have used these methods to benchmark two different MPI implementations, MPICH/GM and MPI on Portals 3.0, and we plan to benchmark others. We also analyzed a single, meaningful performance metric that is a function of both communication bandwidth and host CPU availability.



