Scalable Systems Lab
Benchmarking OS Bypass In MPI Implementations
Bill Lawry
Riley Wilson
Arthur B. Maccabe
31 January 2002
Outline
- Introduction / Background
- High-Performance Computing Issues
- Benchmark Process Models
- Benchmark Methods
- Interpreting Benchmark Results
- Summary
Introduction
Scalable Systems Lab
- Design and Implementation of Large Scale, High-Performance
Computing Systems
- Lab on FEC 3rd floor, south end of the building
- Jemez/Bulk Cluster
Presenting
- Benchmarking (90%)
- Fundamentals of System Courses
(CS 341/481/587)
Background
- High-Performance Computing
Source: ``Effects of Communication Latency, Overhead, and Bandwidth
in a Cluster Architecture''
Martin, Vahdat, Culler, and Anderson;
University of California Berkeley
Background
Performance Metrics
- Bandwidth
- Latency
- Gap
- Overhead
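These names match the parameters of the LogP/LogGP network model used in the Berkeley study cited here; assuming the standard definitions are what is meant (an assumption, since the slide lists only the names):

    \[
    \begin{aligned}
    \text{overhead } (o) &= \text{time the host CPU itself spends sending or receiving a message}\\
    \text{latency } (L)  &= \text{transit time of a small message through the network}\\
    \text{gap } (g)      &= \text{minimum interval between consecutive message injections at a node}\\
    \text{bandwidth}     &= 1/G \text{, the per-node long-message transfer rate (LogGP extension)}
    \end{aligned}
    \]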
Background
``On a cluster of workstations, applications displayed the strongest
sensitivity to network overhead, slowing by as much as a factor of 50
when overhead is increased to roughly 100 microseconds.''
Source: ``Effects of Communication Latency, Overhead, and
Bandwidth in a Cluster Architecture''
Martin, Vahdat, Culler, and Anderson;
University of California Berkeley
HP Computing Issues
Sources of overhead:
- Kernel Mode / Context Switching
- Message Control (e.g. Request To Send / Clear To Send Protocol)
- Others
Terms related to overhead: processor availability, user application slowdown.
HP Computing Issues
Cost of Hardware Interrupt
processor: 500 MHz AMD-K6(tm)-2
OS: Linux Kernel 2.2.14-5.0
network card: Myrinet M2L-PCI64/2-3.0
HP Computing Issues
Request To Send / Clear To Send Protocol
(long message protocol; sketched below)
- Want aggregate performance metrics
- Want insight into CPU and NIC interaction
(bypass)
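A minimal, hypothetical emulation of the RTS/CTS idea above, written at user level with MPI; real implementations run this handshake inside the library, and the tags and sizes here are made up. The point is the extra control round trip, part of the message-control overhead, that happens before any payload moves.

    /* User-level emulation of a rendezvous (RTS/CTS) exchange: the sender
       announces a long message, the receiver replies when a buffer is
       ready, and only then does the payload move. */
    #include <mpi.h>

    #define TAG_RTS  1
    #define TAG_CTS  2
    #define TAG_DATA 3
    #define LEN      (1 << 20)                /* 1 MB payload */

    int main(int argc, char **argv)
    {
        int rank, size, len = LEN;
        char ack = 0;
        static char buf[LEN];                 /* zero-initialized payload buffer */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2)
            MPI_Abort(MPI_COMM_WORLD, 1);     /* sketch assumes exactly two ranks */

        if (rank == 0) {                      /* sender */
            MPI_Send(&len, 1, MPI_INT, 1, TAG_RTS, MPI_COMM_WORLD);      /* request to send */
            MPI_Recv(&ack, 1, MPI_CHAR, 1, TAG_CTS, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                  /* clear to send   */
            MPI_Send(buf, len, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD);   /* move the data   */
        } else {                              /* receiver */
            MPI_Recv(&len, 1, MPI_INT, 0, TAG_RTS, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, TAG_CTS, MPI_COMM_WORLD);     /* buffer is ready */
            MPI_Recv(buf, len, MPI_BYTE, 0, TAG_DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }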
HP Computing Issues
Developers aim to reduce overhead at the library level, system level,
and network level.
- Decoupling Computation and Communication
- Host Processor Bypass
A variety of systems implement these techniques.
Traditional Metrics & Tools
Performance Metrics
- Bandwidth, MB/s, 2-way vs. 1-way
- CPU Availability: fraction of CPU time available to user
applications.
Host Processor Bypass (OS and Application)
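One plausible way to compute the CPU-availability metric listed above (an assumed definition, not taken from the slides): time a fixed block of work with no message traffic, time the same block while messages are being handled, and take the ratio:

    \[
    \text{CPU availability} \approx
    \frac{T_{\text{work, no messaging}}}{T_{\text{work, with messaging}}}
    \]

A value near 1 means message handling steals little CPU time from the application; a value near 0 means the host processor spends most of its time on communication.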
Traditional Metrics & Tools
- Existing Tools (non-MPI)
- Netperf
- Cycle Soak
- Process Models
- Disadvantages
- Overhead of context switching
- Need a good scheduling balance between processes
- Lack of portability
Benchmark Process
Selected Model
- two nodes to facilitate communication
- one process for each node
- one process performs work, does timings, monitors bandwidth
- both processes do message handling
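A minimal MPI skeleton of this model, assuming (as the list above says) one process per node, with rank 0 doing the timed work and rank 1 acting as the message partner; the structure and sizes are illustrative, not the benchmark's actual code. The method sketches that follow fill in rank 0's loop.

    /* Selected process model: two nodes, one MPI process each.  Rank 0
       does the work, the timing, and the bandwidth bookkeeping; both
       ranks handle their end of the message traffic. */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE (64 * 1024)

    int main(int argc, char **argv)
    {
        int rank, size;
        static char buf[MSG_SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2)
            MPI_Abort(MPI_COMM_WORLD, 1);     /* one process on each of two nodes */

        if (rank == 0) {
            /* worker: performs the work, does the timings, monitors bandwidth */
            double t0 = MPI_Wtime();
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();
            printf("%d bytes in %.6f s (%.2f MB/s, crude one-message estimate)\n",
                   MSG_SIZE, t1 - t0, MSG_SIZE / (t1 - t0) / 1e6);
        } else {
            /* partner: only handles message traffic for the worker */
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }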
Benchmark Methods
- Polling: a work loop with a periodic test for message arrival
(sketched below)
- Post-Work-Wait: a timed series of post, work, and wait cycles
(ping-pong vs simultaneous messaging; timings revisited later)
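A hedged sketch of the polling method under the two-rank skeleton above (the work unit, message size, and counts are made up): rank 0 alternates small units of work with MPI_Test on an outstanding receive while rank 1 streams messages at it. Running the same loop with no incoming traffic gives the baseline for the availability ratio shown earlier.

    /* Polling-method sketch (simplified, not the benchmark's actual code). */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE (64 * 1024)
    #define NMSG     200

    static void work_unit(void)               /* small fixed chunk of compute */
    {
        volatile double x = 0.0;
        for (int i = 0; i < 100000; i++)
            x += i * 0.5;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        static char buf[MSG_SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2)
            MPI_Abort(MPI_COMM_WORLD, 1);

        if (rank == 0) {
            int received = 0, work_done = 0, flag;
            MPI_Request req;
            double t0 = MPI_Wtime();

            MPI_Irecv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            while (received < NMSG) {
                work_unit();                              /* keep computing ...   */
                work_done++;
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE); /* ... and poll briefly */
                if (flag) {
                    received++;
                    if (received < NMSG)                  /* re-arm for next message */
                        MPI_Irecv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
                }
            }
            printf("%d work units while receiving %d messages in %.3f s\n",
                   work_done, NMSG, MPI_Wtime() - t0);
        } else {
            for (int m = 0; m < NMSG; m++)                /* message stream */
                MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }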
Benchmark Methods
MPI Standard
A Few Basic Library Calls
- Blocking Calls (MPI_Recv, MPI_Send)
- Non-blocking Calls (MPI_Irecv, MPI_Isend)
- Receives Before Sends (performance issue: a pre-posted receive lets data land directly in the user buffer; an unexpected message may be buffered and copied)
- MPI_Wait (blocks until completion) vs MPI_Test (returns immediately with a flag)
Benchmark Methods
Variables
- Primary: Loop iterations
- Secondary: Message size
(work with and without message handling)
Benchmark Methods
Restricted application bypass
Interpreting Benchmark Results
- MPICH/GM - Jemez/Bulk
- Portals 3.0 - Jemez/Bulk
- Wenbin Zhu's Work on Portals
- EMP (Ohio State)
- ASCI Red (Sandia)
Typical Results - Poll
Typical Results - PWW
Revisit PWW Timings
Durations of Interest:
- Total cycle
- Pre-post (post plus work)
- Work
- Wait
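An illustrative reconstruction of how these durations can be captured in the post-work-wait method (simultaneous-messaging variant, made-up sizes and work; not the benchmark's actual code): each rank posts non-blocking communication, works, then waits, with MPI_Wtime around each phase. The slide's pre-post duration is simply post plus work.

    /* Post-work-wait cycle with per-phase timing: total cycle, post, work,
       and wait durations accumulated over many iterations. */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_SIZE (64 * 1024)
    #define ITERS    100

    static void do_work(void)                 /* fixed, message-free work */
    {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++)
            x += i * 0.5;
    }

    int main(int argc, char **argv)
    {
        int rank, size, peer;
        static char sbuf[MSG_SIZE], rbuf[MSG_SIZE];
        double t_post = 0, t_work = 0, t_wait = 0, t_total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2)
            MPI_Abort(MPI_COMM_WORLD, 1);
        peer = 1 - rank;                      /* both ranks exchange simultaneously */

        for (int it = 0; it < ITERS; it++) {
            MPI_Request req[2];
            double t0 = MPI_Wtime();
            MPI_Irecv(rbuf, MSG_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, MSG_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
            double t1 = MPI_Wtime();          /* post finished */
            do_work();
            double t2 = MPI_Wtime();          /* work finished */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            double t3 = MPI_Wtime();          /* wait finished */

            t_post  += t1 - t0;
            t_work  += t2 - t1;
            t_wait  += t3 - t2;
            t_total += t3 - t0;
        }

        if (rank == 0)
            printf("avg per cycle: total %.1f us  post %.1f us  work %.1f us  wait %.1f us\n",
                   1e6 * t_total / ITERS, 1e6 * t_post / ITERS,
                   1e6 * t_work / ITERS, 1e6 * t_wait / ITERS);

        MPI_Finalize();
        return 0;
    }

Roughly, if the message transfer overlaps the work, the wait shrinks as the work grows while the total cycle stays nearly flat; if the transfer only progresses inside the wait call, the total grows with the work and the wait stays large.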
Portals vs MPICH/GM Time - Jemez/Bulk
Wenbin Zhu Modifications
Summary
Benchmarking OS Bypass in MPI Implementations
Attained
- Bandwidth (user perspective)
- CPU Availability (user perspective)
- Aggregate performance measures using a single process per node
- View into operating system bypass
- View into application bypass
Future:
- More testing on different systems
- Combine the two methods into one package
- Qualify performance data (statistics, comparisons)
Scalable Systems Lab