SCICOMP4 Abstracts and Presentation Materials



Abstracts and Presentation Materials for the Tutorials and Talks

Tutorials

One day of the meeting (date to be announced) will be dedicated to tutorials, which will be presented by IBM staff in the same rooms as the regular sessions.
Tutorial abstracts are provided below.

Presentations

The regular sessions will last two and a half days.
Presentation abstracts, given below, are linked to their respective items in the meeting agenda. Aside from the Keynote Address, abstracts in this section are divided into two sets: those by IBM staff and those by SP system users.

Presentation Materials

Presentation materials that are provided will be linked to the abstracts below.


Tutorial Abstracts

Tutorial 1: High Performance on the Power 4 for Real-World Codes

Bob Blainey, IBM
Charles Grassl, IBM
Mark Fahey, ORNL
James B. White III, ORNL

The pSeries models 630, 670, and 690 ("Regatta") use the POWER4 processor. Though the processor architecture and instruction set have few changes from POWER3 CPUs, the overall system design differs substantially from that of Winterhawk- and Nighthawk-based systems.

The POWER4 systems have the same programming paradigm as used on POWER3 systems: both shared memory and distributed memory, or threads and tasks. The POWER4 systems have an extra level of cache and another level of memory hierarchy. These system features, along with more shared resources, have subtle effects on programming techniques.

In this tutorial we will discuss both performance tuning and programming techniques. We will discuss performance tuning from three points of view:

  1. The macro architecture and its resources.
  2. The compiler and its strategies.
  3. Case studies with production applications.


Tutorial 2: The Paraver Toolset

Prof. Jesus Labarta
CEPBA/UPC
Jordi Girona 1-3, Modulo D-6
08034 Barcelona
WWW: http://www.cepba.upc.es
Phone: +34 93 401 69 87
Fax: +34 93 401 70 55
e-mail: jesus@cepba.upc.es
Spain

Performance tuning of parallel programs presents great difficulty due to the huge number of factors that influence performance and the intricate ways in which they interact.

To properly understand the behaviour of a program under such circumstances, it is of key importance to have very flexible tools that can be used to test the hypotheses the analyst may form. Throughout the analysis, one answer leads to the next question, often in an unpredictable direction.

Paraver is a trace analysis tool that lets an analyst fully extract the huge amount of information that is actually captured by a single trace. Flexibility, level of detail and powerful statistics are the major strengths of the tool.

The tutorial will briefly describe the basic approaches in tools for performance analysis and then move quickly to the structure and concepts behind Paraver from the user's point of view. Concepts will be presented accompanied by live demos.

Implementation details of the instrumentation for MPI and OpenMP will be given along with generic considerations on general tracing issues.

The tutorial will go beyond the description of the functionalities of the tools. We will show how the tool has been used to better understand the behaviour of a wide range of codes and systems, and the types of information obtained. One example is the discussion of how the analysis of tracing overhead can reveal information about OS kernel issues such as the frequency of interrupts and the skew of those interrupts between processors. Other examples show how the user can reverse engineer features of an MPI implementation without any access to its source code. To demonstrate the flexibility of the tools, we will show how it is possible to analyse numerical MPI, OpenMP, and mixed applications on both Power3 and Power4.

We will also show examples of the analysis of UTE traces generated by the PEbenchmarker tool, showing how Paraver lets the user extract the tremendous amount of information that UTE and AIXtrace files in general contain.

Dimemas is a simulator that, based on a trace that characterizes the application, is capable of generating a Paraver trace showing what the performance of the program would have been on a machine characterized by a certain latency, bandwidth and connectivity. We will briefly present Dimemas and show how you can analyze with Paraver the traces it generates.
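
As a rough illustration of what it means to characterize a machine by latency and bandwidth, the following sketch estimates communication time with a simple linear cost model. It is not Dimemas's actual simulation algorithm; the function names, machine names, and parameter values are assumptions chosen for illustration, and connectivity is not modeled at all.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed latency plus size divided by bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + message_bytes / bandwidth_bytes_per_s


def replay(message_sizes, latency_s, bandwidth_bytes_per_s):
    """Total time for a list of message sizes sent one after another."""
    return sum(transfer_time(n, latency_s, bandwidth_bytes_per_s)
               for n in message_sizes)


if __name__ == "__main__":
    sizes = [1_000, 1_000_000]  # hypothetical message sizes in bytes
    # Two hypothetical machines: (latency in seconds, bandwidth in bytes/s).
    for name, lat, bw in [("slow", 50e-6, 100e6), ("fast", 5e-6, 1e9)]:
        print(name, replay(sizes, lat, bw))
```

Replaying the same list of message sizes against different parameter pairs gives a first-order view of how sensitive a communication pattern is to latency versus bandwidth.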


Presentation Abstracts

Keynote Address

Facts and Wishful Thinking about the Future of Supercomputing

Horst D. Simon
Lawrence Berkeley National Laboratory
Berkeley, CA 94720
(510) 486-7377
hdsimon@lbl.gov
http://www.nersc.gov/~simon

The installation of the 40 Tflop/s Earth Simulator in Japan in early 2002 has presented computational science in the U.S. with a serious challenge. Not only does the Earth Simulator (ES) constitute an extraordinary foreign investment made in computing for fundamental science, but as the new #1 system on the TOP500 list, the ES (and the architectural choices made in its design) forces the US supercomputing community to a critical re-evaluation of current trends and assumptions in computer architecture.

The computational science research community has been vocal about a focused US response to the Earth Simulator, and vigorously started formulating a new initiative led by the DOE Office of Science. Unfortunately the computer vendor response was less enthusiastic. For example IBM circulated a brief titled "IBM's response to the Japanese Earth Simulator", which can be best characterized as "business as usual will do the job". Based on application experience on NERSC's 5 Tflop/s "Seaborg" platform, and on our own technology assessment, I will make the case that the current vendor response is based on wishful thinking and is insufficient for maintaining the US leadership in scientific supercomputing in the long term. Ignoring this problem will hurt progress in computational science. As a community, vendors, users, and funding agencies must develop a new model of cooperation, which assures both vendor profitability, and rapid technological innovation for the benefit of science.

Horst Simon is the Director of the NERSC Center and Computational Research Divisions, at Lawrence Berkeley Laboratory.

Presentation Materials are here.


IBM Presentations

IBM Overview of HPC

Peter Ungaro, IBM

Peter Ungaro will review the current state of the HPC market and discuss IBM's HPC solutions and future directions.

Presentation Materials are here.


IBM Processor Architecture Roadmap

Don Grice, IBM

The session will cover the POWER4 architecture and its application in SMP servers and clusters. Emphasis will be on the POWER4 chip itself, with a focus on performance-oriented features such as bus bandwidths and out-of-order execution capability. Integration of the chip into various SMP configurations and incorporation of these SMPs into HPC clusters will be included.

Presentation Materials are here.


TurboSHMEM/MPI

David Klepacki, IBM
Advanced Computing Technology Center
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA

Release 2.0 of TurboSHMEM/MPI is now available. This release provides dramatic performance improvements on both Power3 and Power4 systems. The barrier and the sum/min/max_to_all functions (i.e., the allreduce functions), where improvement was much needed, are now faster by as much as 400%. In fact, these particular functions have been optimized enough to warrant using them in place of the equivalent MPI functions. So, the MPI barrier and allreduce functions can now be improved simply by linking the turboSHMEM library to your MPI application. In addition, although not a traditional function in SHMEM, a shmem_alltoall() function has been added. This allows SHMEM codes to utilize this simple call, rather than the current technique of using a loop of IX_put/get calls.

Presentation Materials are here.


IBM Compiler Roadmap

Bob Blainey, IBM

Bob Blainey will discuss the current status of the C, C++ and Fortran compiler products for Power4 and the features and technology being explored for future releases.

Presentation Materials are here.


IBM End User Tools

Luiz DeRose, IBM

Presentation Materials are here.


AIX and Threads

Tom Mathews, IBM

Presentation Materials are here.


File Systems and Parallel I/O

Bob Curran, IBM

Presentation Materials are here.


MPI Scaling

William Tuel, IBM

Presentation Materials are here.


Future Vision for IBM HPC

Jamshed Mirza, IBM

This session will explore IBM's direction in high performance computing over the next 3-5 years. We will discuss technology challenges for HPC, including packaging, power, cooling, management, ease of use and capability. We will look at the Blue Gene/L project and its relevance to our future in HPC, and the synergy and interplay between IBM's eLiza program for autonomic computing and Grid Computing.

Presentation Materials are here.


User Presentations

Session I. MPI Tools

PERUSE: An MPI Performance Revealing Extensions Interface

Rossen Dimitrov rossen@mpi-softtech.com
MPI Software Technology, Inc.
Anthony Skjellum, Terry Jones, Bronis de Supinski, Ron Brightwell, Curtis Janssen, MaryDell Nochumson

This talk presents PERUSE, a performance-revealing extensions interface to MPI. This interface is intended to provide greater insight into the performance related processes and interactions between application software, system software, and MPI message-passing middleware than the standard MPI Profiling interface (PMPI) defined by the MPI specification. This additional level of detailed performance information can be used by scalable parallel algorithm developers for tuning purposes and by developers of performance monitoring tools.

The MPI Forum defined PMPI as a mechanism for application developers to obtain high-level performance information about the behavior of both the application algorithm and the parallel system. PMPI has proven its usefulness, and a large number of tools have emerged to address the needs of parallel performance evaluation. However, PMPI provides for relatively high-level profiling and does not give enough insight into the complex processes that take place within the MPI library or in the software layers below MPI with which the MPI library interacts. These issues become of vital importance for understanding the performance and scalability of ultra-scale parallel platforms, such as the ones used by the DOE ASCI program.

The main objective of PERUSE is to provide detailed information about events and activities that take place within the MPI library or in the lower software layers but can be observed by the MPI library. PERUSE aims at giving more information about the internal state and processing of the MPI library than just looking at MPI as a black box (the level of profiling available from PMPI). This information can then be used in application and system level performance evaluation, which is not possible now by only using the standard PMPI interface. In this respect, PERUSE complements PMPI.

PERUSE's API is designed to be portable across platforms and MPI libraries. However, the intended level of detail and precision that is sought does not allow for high-level abstractions of the MPI services. In order to respond to its objectives, PERUSE makes certain assumptions about the implementation of the MPI libraries – these assumptions are based on past experience of the MPI development community. Since the MPI standard does not mandate any specific implementation but only compliance with the standard, not all assumptions made by PERUSE may be valid for all MPI libraries. In order to accommodate this, PERUSE proposes a variable level of compliance that will allow MPI libraries to provide the information that will be precise and useful to the users and will not require these libraries to attempt to satisfy some possibly artificial abstraction layer.

PERUSE will provide capabilities for profiling point-to-point message passing, message queues, collective communication operations, derived datatypes, parallel file I/O, and one-sided communication. The level of detail will be profiling of a single operation for all relevant categories. Capabilities for collecting and presenting statistical information over a number of operations will also be provided. PERUSE is intended to be processing overhead and memory conscious so it can provide a precise view on the profiled events. Performance transparency of PERUSE implementations will be critical for the usefulness of this interface.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-149392-ABS.

Presentation Materials are here.


pyMPI -- Parallel, Distributed Python

Patrick Miller patmiller@llnl.gov
LLNL

Scripting languages have long been used to access and control simulations -- Mathematica, Matlab, Scheme, Python, and even BASIC have been used. However, high performance platforms are typically parallel today. More than that, the memory spaces are typically distributed. The MPI interface and associated implementations opened a new, portable way to program in C and FORTRAN. Adding an MPI implementation to a traditional scripting language opens a new vista for programming on distributed machines.

The talk will describe the pyMPI tool, an Open Source extension of Python that allows users to write parallel scripts and to control parallel codes with parallel scripts. It implements many MPI library functions and allows users to write and prototype rather general parallel programs. When coupled with high performance Python extensions, it provides an excellent framework for building scientific applications. I'll talk about strategies and issues for implementing Python extensions for high performance, parallel codes.

Presentation Materials are here.


Distributed Dynamic Correctness Testing of MPI Programs

Bronis R. de Supinski bronis@llnl.gov
Jeffrey S. Vetter
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

Debugging MPI applications can be difficult. Software complexity, data races, and scheduling dependencies can make simple programming errors very difficult to locate with manual debugging techniques. Worse, few debugging tools are even targeted to MPI abstractions, and error messages from MPI implementations are often misleading when the programmer uses MPI incorrectly. As a result, users rely on a spectrum of time-consuming and complicated ad-hoc techniques to locate MPI programming errors. Clearly, MPI programmers need tools that simplify the code development process.

We previously presented Umpire, an innovative tool that dynamically analyzes any MPI application for typical MPI programming errors. Examples of these errors include resource exhaustion and configuration-dependent buffer deadlock. Umpire performs this analysis on unmodified application codes at runtime by using the MPI profiling layer. Our original implementation required shared memory communication between MPI tasks and, thus, was limited to running on a single SMP-node.

We have implemented a distributed memory version of Umpire. We overcome the shared memory requirement through techniques that are similar to those employed in high-performance multi-threaded MPI implementations. In addition, we have significantly improved the robustness of this tool and extended the range of errors that it detects. This version of Umpire has identified several MPI programming errors in Sphinx, a widely available MPI benchmark suite, and initial performance results with several applications are promising. This talk will present our distributed design and preliminary results; we will also discuss key issues for identifying complex MPI programming errors, such as deadlocks involving MPI_Recv with MPI_ANY_SOURCE.
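
To make the deadlock problem concrete, here is a minimal sketch of cycle detection in a wait-for graph. It is a generic textbook technique, not Umpire's algorithm, which the abstract does not describe. The sketch assumes each blocked task waits on exactly one named task; a receive with MPI_ANY_SOURCE can be satisfied by any sender, which this simple model does not capture and which is part of what makes such deadlocks harder to identify. The function name and data layout are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of tasks forming a wait cycle, or None.

    waits_for maps a blocked task rank to the single rank it is waiting on.
    Tasks that are absent from the mapping are not blocked.
    """
    for start in waits_for:
        seen = []
        task = start
        # Follow the chain of waits until it ends or revisits a task.
        while task in waits_for and task not in seen:
            seen.append(task)
            task = waits_for[task]
        if task in seen:
            # The cycle begins at the first revisited task.
            return seen[seen.index(task):]
    return None


if __name__ == "__main__":
    # Hypothetical state: 0 waits on 1, 1 waits on 2, 2 waits on 0.
    print(find_deadlock({0: 1, 1: 2, 2: 0}))
```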

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-149019 ABS.

Presentation Materials are here.


Session II. Optimization

Performance of the ALE3D code on IBM Systems

Ping Wang wang32@llnl.gov
Lawrence Livermore National Laboratory
Rob Neely, Bob Cooper, Richard Sharp
Lawrence Livermore National Laboratory

ALE3D is a three-dimensional time-dependent multi-physics parallel code which includes Arbitrary Lagrangian Eulerian (ALE) Hydrodynamics (explicit and implicit), thermal transport, chemistry, advanced material modeling, and incompressible flow. To understand the performance of the code on parallel computers such as the ASCI systems at LLNL, it is important to model realistic and complex physical problems. We have developed several scaled test problems that can be used to study the behavior of the code over a large number of processors. This presentation will include a survey of various tools used for performance analysis, along with several strategies through which performance gains were obtained.

Presentation Materials are here.


Optimization and Scaling of an OpenMP LBM Code on IBM SMP nodes

Federico Massaioli federico.massaioli@caspur.it
Giorgio Amati giorgio.amati@caspur.it
CASPUR, c/o Universita' La Sapienza, P.le A. Moro 5, 00187 Roma, Italy

We describe the optimization and OpenMP parallelization on a 16-CPU IBM Power3 Nighthawk node of a simple but real production code implementing the well-known Lattice Boltzmann Method to solve Navier-Stokes equations on a regular lattice.

The fluid is represented by 19 different particle species, each one moving on the 3D lattice with fixed, preassigned velocity (streaming). At every lattice site, particle species interact, adjusting their values according to the physics of the flow (collision). Every particle species is represented with a separate three-index array. The test lattice is a grid of 256x128x128 sites. The original, vectorized code follows the usual convention of separating the local, computationally intensive, collision phase from the streaming phase. The latter is implemented with a suitable in memory shift for each particle species array.

The first step consists in the optimization of the collision phase. Careful parenthesization of computations allows the compiler to better perform common subexpression elimination, significantly reduce register spills and reach better performance.

The collision step is embarrassingly parallel, while the streaming step is affected by data dependencies, different for every array. Some of the shifts can be easily parallelized using Parallel DOs, while 10 of them cannot, and have to be managed with Parallel Sections. The resulting code scales poorly: a global speedup of 8.29 on 16 CPUs, and 5.97 for the streaming phase.
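
For readers who want to put the quoted speedups in context, the short sketch below converts them to parallel efficiency (speedup divided by CPU count) and to the Karp-Flatt experimentally determined serial fraction. These derived metrics are standard definitions added here for illustration; they are not taken from the abstract, and the function names are hypothetical.

```python
def efficiency(speedup, cpus):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / cpus


def karp_flatt(speedup, cpus):
    """Karp-Flatt metric: serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / cpus) / (1.0 - 1.0 / cpus)


if __name__ == "__main__":
    # Speedups quoted in the abstract for 16 CPUs.
    for label, s in [("global", 8.29), ("streaming", 5.97)]:
        print(f"{label}: efficiency {efficiency(s, 16):.1%}, "
              f"serial fraction {karp_flatt(s, 16):.3f}")
```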

The load imbalance in the Parallel Sections can be eliminated by rearranging the 3 loops on lattice coordinates so that the outermost is free from data dependencies: the compiler can rearrange the loops in cache friendly order. However, the speedup of the streaming step goes up just to 7.06 on 16 CPUs, as the sustained memory bandwidth of the in memory shifts (6.5 GB/s) appears saturated from 8 CPUs up.

In the streaming step, all the variables are read from and written back to memory, as during collision, where the effect is hidden by computations. Performing the streaming during the collision halves memory accesses. The introduced data dependencies can be avoided by substitution of in memory shifts with suitable shifts of array indexes. The global speedup goes to 10.4 on 16 CPUs, but the new implementation is 40% faster!
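
The idea of replacing in memory shifts with shifts of array indexes can be sketched on a one-dimensional periodic lattice. This is only an illustration of the principle: the production code described here works on 19 three-index arrays, and the class and function names below are hypothetical.

```python
def stream_by_copy(values, velocity):
    """Streaming as an in-memory shift: every value is read and rewritten."""
    n = len(values)
    return [values[(i - velocity) % n] for i in range(n)]


class ShiftedArray:
    """Streaming as an index shift: data stays in place, only an offset moves."""

    def __init__(self, values):
        self._data = list(values)
        self._offset = 0

    def stream(self, velocity):
        # No data movement; just remember how far the lattice has shifted.
        self._offset = (self._offset + velocity) % len(self._data)

    def __getitem__(self, site):
        return self._data[(site - self._offset) % len(self._data)]

    def to_list(self):
        return [self[i] for i in range(len(self._data))]


if __name__ == "__main__":
    species = [10, 20, 30, 40]
    shifted = ShiftedArray(species)
    shifted.stream(1)
    print(shifted.to_list() == stream_by_copy(species, 1))
```

Both approaches present the same logical lattice to the collision step, but the index-shift version touches no memory during streaming, which is the source of the reduced memory traffic.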

This 'fused' implementation runs at 6.7x10^-7 s per grid site, but performance degrades smoothly as the total data size grows, reaching a plateau of 1.7x10^-6 s above 4 GB. The cause can be attributed to thrashing of the Power3 Segment Lookaside Buffer, a problem that does not affect the new Power4, as preliminary measurements show.

The problem on the Power3 can be solved by storing pairs of particle species together in a single array. With suitable choices, the memory access pattern for each array still exhibits very good data locality in the cache, and the code runs at 7.2x10^-7 s per grid site, independent of the grid size.

The impact of the modifications on DSM systems and on the Power4 architecture will be discussed in the presentation.

Presentation Materials are here.


Session III. MPI Performance

MercutIO: A High-Performance Portable MPI-IO Implementation

Kumaran Rajaram kums@mpi-softtech.com
MPI Software Technology, Inc.
Anthony Skjellum, Rossen Dimitrov, David Leimbach, Vijay Velusami, Andrew Watkins, Terry Jones, Tyce Mclarty and Bronis de Supinski

MPI-IO, the parallel I/O functionality of MPI-2, is a portable interface designed specifically to achieve high performance. MPI-IO supports features, such as non-contiguous file access, collective I/O, asynchronous I/O, file pre-allocation, shared file pointers and portable data representations, that promise increased I/O parallelism and guarantee I/O portability. The I/O characteristics of scientific applications running in a parallel environment differ significantly from those running in uni-processor systems. These applications evidently access files in small, non-contiguous pieces, and these pieces tend to be regularly sized and spaced. Performing individual disk accesses to satisfy such I/O patterns would incur high I/O latency, significantly reducing the performance of the parallel application. Another problem of major concern is that most existing file systems have no support for asynchronous I/O. Non-blocking file access semantics, if supported, are polling based. As a consequence, the asynchronous I/O implementations in existing MPI-IO implementations are either polling based or blocking, depending on the support from the underlying file system. In either case, the CPU is forced to idle while waiting for the completion of I/O. The goal of this work is to address these problems and to attempt an efficient MPI-IO implementation that ensures portability with minimal performance tradeoffs.

This paper presents Bulldog Abstract File System (BAFS), an object-oriented portable I/O layer that facilitates an efficient implementation of parallel I/O APIs across disparate file systems. BAFS consists of a set of APIs that provides the necessary functionality for parallel I/O. The parallel I/O APIs are implemented on top of the BAFS APIs, which in turn are implemented efficiently across different file systems. This design ensures portability with low overhead.

This work facilitates a flexible approach to handle non-contiguous I/O access patterns. Depending upon the I/O block-size and hole size, an intelligent heuristic enables the middleware to toggle dynamically between the data agglomeration technique and the Unix-style access mechanism. The data agglomeration technique includes a two-phase algorithm for collective I/O and a data-sieving technique in the case of non-collective APIs. The asynchronous I/O implementation is based on a producer-consumer model. Threads are used to transfer data between the producer and consumer using a work queue model. Run-time options through file-info hints are provided to perform file access through blocking semantics. Data access using shared file pointers is achieved using the Unix file lock mechanism. MercutIO currently supports NFS, UFS, Parallel Virtual File System (PVFS), IBM's General Parallel File System (GPFS), and Extended Network File System (ENFS).
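
The contrast between Unix-style access and data sieving can be sketched in a few lines. The example below is a simplified illustration of the general data-sieving idea for reads, not MercutIO or BAFS code; it uses an in-memory `io.BytesIO` object as a stand-in for a file, and the function names are hypothetical. The heuristic that chooses between the two strategies is not shown.

```python
import io


def read_individually(f, requests):
    """Unix-style access: one seek and one read per (offset, length) request."""
    pieces = []
    for offset, length in requests:
        f.seek(offset)
        pieces.append(f.read(length))
    return pieces


def read_with_sieving(f, requests):
    """Data sieving: one contiguous read spanning all requests, then extraction."""
    if not requests:
        return []
    start = min(offset for offset, _ in requests)
    end = max(offset + length for offset, length in requests)
    f.seek(start)
    buffer = f.read(end - start)  # includes the unwanted "holes"
    return [buffer[offset - start:offset - start + length]
            for offset, length in requests]


if __name__ == "__main__":
    data = io.BytesIO(bytes(range(64)))
    strided = [(0, 4), (16, 4), (32, 4)]  # regularly sized and spaced pieces
    print(read_with_sieving(data, strided) == read_individually(data, strided))
```

Sieving trades extra bytes read (the holes) for fewer I/O operations, which is why the hole size relative to the block size matters when choosing between the two.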

This talk also offers a quantitative description of agglomeration of I/O requests and the asynchronous file access model, which lay the foundation for efficient MPI-IO design. Two new I/O metrics, namely degree of overlapping and degree of non-contiguity, which are essential in the performance appraisal of a parallel I/O implementation, are introduced. Preliminary experiments were conducted by researchers at Lawrence Livermore Laboratory on GPFS. Results indicate MercutIO outperforms the proprietary IBM MPI-IO implementation as well as raw POSIX for strided and segmented access patterns.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-149393-ABS.

Presentation Materials are here.


Early experiences with the IBM SP Switch-2 on a distributed pSeries 690 cluster

Hans-Hermann Frese
Zuse Institut Berlin (ZIB)

The Zuse Institut Berlin, in co-operation with the University of Hannover, has installed a distributed pSeries 690 cluster at two distant sites. Each site hosts a cluster of 12 pSeries 690 servers that are locally connected through the SP Switch-2, and the two sites are connected through a dedicated fiber link. We will show how the whole complex has been configured to provide a single point of control and a single system from the users' point of view. We will also present early performance figures for single cluster and distributed applications and compare them with other systems. We will give an outlook on future configuration changes to enhance the performance for distributed applications.

Presentation Materials are here.


A Scaling Investigation on IBM-SPs

Terry Jones trj@llnl.gov
Lawrence Livermore National Laboratory

LLNL and IBM have actively been working together to investigate MPI scalability issues exposed by MPI_Allreduce and MPI_Barrier when running large processor-count jobs. In recent months, this effort has expanded scope to include non-MPI scaling for fine-grain parallel algorithms such as ring-communication patterns and fine-grain OpenMP usage. We examine how LLNL is impacted by the problem, preliminary OpenMP results, and some thoughts on the logistics of these customer/vendor R&D investigations. Inter-node measurements of 16-way Power3 machines and intra-node measurements of 32-way Power4 machines are discussed.

Presentation Materials are here.


Session IV. Large Scale System Experiences

The LANL CY2001 Milestone

W. Robert Boland wrb@lanl.gov
Los Alamos National Laboratory

During calendar year 2001, members of the Crestone Project team at Los Alamos National Laboratory completed a National Nuclear Security Administration (NNSA) level-1 milestone for the Advanced Simulation and Computing (ASCI) Program. The complex calculation, which ran on the ASCI White machine at Lawrence Livermore National Laboratory, began on March 1 and was successfully completed on October 26. It pushed the hardware, software, and human boundaries.

In this talk we will describe and critique certain features, issues, and problems associated with the heroic efforts that were required to complete the unprecedented simulation. We will offer experience-based suggestions to the high-performance computing community on systems, tools, user services, communications, storage, code developers, and the user community.

Presentation Materials are here.


Installation, Testing, Integration, and Initial Evaluation Results of the IBM Power4 System at the Naval Oceanographic Office Major Shared Resource Center

Christine E. Cuicchi cuicchi@navo.hpc.mil
David K. Magee magee@navo.hpc.mil
Naval Oceanographic Office (NAVO) Major Shared Resource Center (MSRC)

The Naval Oceanographic Office (NAVO) Major Shared Resource Center (MSRC) presents a report discussing its newest terascale HPC system, a 1,184 processor IBM Power4 system. This system is presently the largest installation of a Power4 system in the world. The configuration and installation of this system will be discussed, and results from integration, effectiveness level testing, applications testing, and benchmarking will be presented. The system will also be compared to its NAVO MSRC predecessor and operational sister, a 1,336 processor Power3 system.

Presentation Materials are here.


Operational Numerical Weather Prediction on a large IBM/SP

George VandenBerghe gvandenb@us.ibm.com
NOAA National Center for Environmental Prediction (NCEP)

The National Weather Service National Center for Environmental Prediction (NCEP) provides weather and climate forecasts for both national and international users. The forecast process requires extreme computer power. Furthermore, delivery of enormous compute power is not sufficient; the platform must also be reliable and runtimes must be consistent. The current machines tasked with this mission are 276 and 308 node WHII SP systems. These, and a single predecessor 384 node WHI SP, have been the NCEP primary NWP compute servers since late 1999. This presentation will discuss a brief history of NCEP computers, the current SP system, the current workload, how it parallelizes, procedures to ensure runtime consistency, common optimizations that generalize to other problems, and system constraints. Overall, distributed memory computing has been an outstanding success at NCEP.

Presentation Materials are here.


Session V. The Performance Evaluation Research Center Tools

PAPI and DynaProf: Dynamic Hardware Performance Profiling on the Power Architecture

Philip Mucci mucci@cs.utk.edu
University of Tennessee

This talk will give an update on PAPI and a new version of the DynaProf tool. PAPI, the Performance Application Programming Interface, is a portable library that gives the developer sophisticated access to the hardware performance counters present on a variety of microprocessors, from Pentiums to Powers. Statistical profiling and sampling are supported as well as traditional aggregate counts.

DynaProf is a hardware performance analysis tool for dynamic instrumentation of the executable at runtime. Run time instrumentation of the object code provides numerous advantages on large scale applications running on modern architectures. DynaProf is modular in that the instrumentation it inserts has a well defined format, allowing new and customized performance probes to be built easily.

This talk will outline the current release of PAPI, discuss the status of PAPI 3.0 and demonstrate the usage of DynaProf to gather dynamic performance data.

Presentation Materials are here.


SvPablo: A Toolkit for Performance Tuning and Visualization on Parallel Systems

Ying Zhang zhang8@cs.uiuc.edu
Celso Mendes cmendes@cs.uiuc.edu
University of Illinois at Urbana-Champaign

SvPablo is a performance analysis and visualization system for both sequential and parallel platforms that provides a graphical environment for instrumenting scientific applications written in a variety of languages and for browsing their performance data correlating to the source code.

SvPablo supports both interactive and automatic instrumentation for C, Fortran 77, and Fortran 90 programs. During execution of the instrumented code, the SvPablo library captures and computes statistical performance data for each instrumented construct on each processor. Because it only maintains the statistical data, rather than detailed event traces, the SvPablo library can characterize the performance of programs that run for days and on hundreds of processors. SvPablo captures both software performance data (such as duration and counts) and hardware metrics (such as FP operations and cycles). Users select a desired set of hardware metrics via a configuration file, although SvPablo provides a default set. After program execution, the SvPablo library generates summary files for each process. These are merged post-mortem into one performance file by a utility program. This file can then be loaded and displayed in the SvPablo GUI. Performance data is presented in the SvPablo GUI via a hierarchy of color-coded performance displays, including a high-level procedure profile and detailed line-level statistics. The GUI provides a convenient means for the user to identify high-level bottlenecks (e.g. procedures) and then explore in increasing levels of detail (e.g. identifying a specific cause of poor performance at a source code line executed on one of many processors). Moreover, the user can access and load performance data from multiple prior executions, including different numbers of processors and hardware platforms. This allows one to do cross platform comparisons.
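
The advantage of keeping statistical data rather than event traces can be illustrated with a small sketch: each instrumented construct needs only a fixed-size summary, no matter how many times it executes, and per-process summaries can be combined afterwards. This is a generic illustration, not SvPablo's actual data structures or file format; the class and function names are hypothetical.

```python
class ConstructStats:
    """Fixed-size summary of the durations recorded for one construct."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def record(self, duration):
        # Memory use does not grow with the number of recorded events.
        self.count += 1
        self.total += duration
        self.minimum = min(self.minimum, duration)
        self.maximum = max(self.maximum, duration)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def merge(a, b):
    """Combine the summaries of two processes into one."""
    out = ConstructStats()
    out.count = a.count + b.count
    out.total = a.total + b.total
    out.minimum = min(a.minimum, b.minimum)
    out.maximum = max(a.maximum, b.maximum)
    return out


if __name__ == "__main__":
    proc0, proc1 = ConstructStats(), ConstructStats()
    for d in (1.0, 3.0, 2.0):  # hypothetical loop durations in seconds
        proc0.record(d)
    proc1.record(5.0)
    combined = merge(proc0, proc1)
    print(combined.count, combined.mean, combined.minimum, combined.maximum)
```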

SvPablo has been used in the performance analysis of large scientific applications on various parallel systems, including IBM SP systems and Linux clusters.

This presentation provides an overview of SvPablo, introduces tool usage, and presents results obtained using SvPablo in the performance analysis of two scientific codes on an IBM SP system.

Presentation Materials are here.


A Framework for Getting Application Signatures

Xiaofeng Gao xgao@cs.ucsd.edu
CSE department, University of California, San Diego
Allan Snavely allans@sdsc.edu
SDSC

Research in performance modeling and prediction relies heavily on application traces. But full traces of application events such as memory, compute, branch, and communication operations require gigabytes of storage. One way to relieve this pressure on the file system is to analyze instrumented trace information on the fly and to record only summarized events in the trace file. But this approach adds processing overhead to each instrumented event, so it generally incurs a severe slowdown relative to the runtime of the original uninstrumented application (and thus just trades space for time). Furthermore, it requires a complete re-run if any additional events need to be gathered or if the application is changed. Neither approach takes advantage of the fact that many applications have a regular structure that can be used to reduce the amount of information that must be stored or processed to represent their behavior.

Several researchers have proposed methods to encode and compress call graph and memory event traces. Most of these proposed methods use generic lossless compression algorithms and do not represent the application control-flow structure. Also, these approaches do not allow systematic control of the quality of compression.

We propose an efficient systematic framework to get application signatures, which are concise approximations of the full trace. Application signatures are much smaller than full event traces, but they preserve enough information about the application to enable performance understanding. In our framework, the signature directly reflects the application structure and the amount of preserved detail can be controlled by several factors in the compression scheme. Our framework embodies a multi-level approach. First, by instrumenting loop heads and tails in the application, we collect the sequence of loop executions and encode this sequence on the fly to get a highly compressed Signature of Application Execution (SAE). An SAE provides a skeleton of the execution of the application. SAEs preserve the structures of the applications and directly reflect dynamic behaviors such as compute phase changes, function calls, and loop iterations. By adjusting similarity analysis factors, we can control the compression ratio and approximation rate to make tradeoffs between time/space resources consumed and verisimilitude of the trace.
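As a rough illustration of encoding a loop execution sequence on the fly, the sketch below collapses consecutive executions of the same loop into a single entry with a count. It is a simplified stand-in, not the authors' SAE encoding: the similarity analysis, nesting, and tunable approximation described above are not modeled, and the class name and loop labels are hypothetical.

```python
class LoopSequenceEncoder:
    """Collapse consecutive executions of the same loop as they are recorded."""

    def __init__(self):
        # Each entry is [loop_id, number_of_consecutive_executions].
        self.signature = []

    def record(self, loop_id):
        if self.signature and self.signature[-1][0] == loop_id:
            self.signature[-1][1] += 1
        else:
            self.signature.append([loop_id, 1])


# A made-up execution: setup, a long compute phase, an exchange, more compute.
trace = ["init"] + ["solve"] * 1000 + ["exchange"] + ["solve"] * 1000

encoder = LoopSequenceEncoder()
for loop_id in trace:
    encoder.record(loop_id)

print(len(trace), "events ->", len(encoder.signature), "signature entries")
print(encoder.signature)
```

Even this crude scheme keeps the phase structure visible (compute, exchange, compute) while storing four entries instead of 2002 events, which is the kind of skeleton the abstract attributes to an SAE.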

Dynamic behaviors such as memory reference patterns are dependent on control flow. Thus the SAE can be used as the base for more detailed signatures. From the SAE, we can determine the most important segments of the application execution and identify how those segments interact during execution. Then, by further lightweight tracing or sampling on just these identified segments, we can generate signatures for memory reference patterns, branch target patterns, etc. Alternatively, for more accuracy, we can do comprehensive event tracing using the SAE to guide on-the-fly analysis. The SAE can enable the encoding scheme to determine boundaries and make better encoding decisions. Generally, with the SAE we can get better compression ratios and more highly structured signatures than with generic compression schemes.

We tested our approach on NPB Class W benchmarks. The sizes of SAEs varied from several hundred bytes to several kilobytes, compared to full loop execution traces that were on the order of several hundred megabytes. More importantly, we used SAEs to obtain memory reference patterns by very lightweight sampling. These consumed little time and space, whereas complete memory traces of the benchmarks would have consumed gigabytes of storage. In our experiments, the memory reference signatures deduced from SAEs with sampling were sufficient for accurately modeling the memory system performance of these benchmarks.

Presentation Materials are here.


Session VI. Mixed Programming Models

MPI and OpenMP Paradigms on Cluster of SMP Architectures: the Vacancy Tracking Algorithm for Multi-Dimensional Array Transpose

Yun (Helen) He yhe@lbl.gov
Chris H.Q. Ding chqding@lbl.gov
Lawrence Berkeley National Laboratory

We investigate remapping multi-dimensional arrays on clusters of SMP architectures under the OpenMP, MPI, and hybrid paradigms. The traditional method of array transpose needs an auxiliary array of the same size and a copy-back stage. We recently developed an in-place method using vacancy tracking cycles. The vacancy tracking algorithm outperforms the traditional 2-array method, as demonstrated by extensive comparisons. The independence of vacancy tracking cycles allows efficient parallelization of the in-place method on SMP architectures at the node level. The performance of multi-threaded parallelism using OpenMP is tested with different scheduling methods and different numbers of threads. We parallelize both methods using several parallel paradigms. At the node level, pure OpenMP outperforms pure MPI by a factor of 2.76. Across the entire cluster of SMP nodes, the hybrid MPI/OpenMP implementation outperforms pure MPI by a factor of 4.44, demonstrating the validity of the parallel paradigm of mixing MPI with OpenMP.
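The idea of moving elements along cycles so that each move fills the vacancy left by the previous one can be sketched for the simplest case, a two-dimensional transpose of a row-major array held in a flat list. This is a serial illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, and the `moved` bookkeeping list is a convenience of the sketch that a truly in-place version would replace with precomputed cycle information.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list `a`."""
    n = rows * cols
    if len(a) != n:
        raise ValueError("list length must equal rows * cols")
    last = n - 1  # the first and last elements never move
    moved = [False] * n  # sketch-only bookkeeping of already-filled positions
    for start in range(1, last):
        if moved[start]:
            continue
        saved = a[start]  # lifting this element out creates the first vacancy
        vacancy = start
        while True:
            moved[vacancy] = True
            # Index of the element that belongs in the current vacancy.
            source = (vacancy * cols) % last
            if source == start:
                a[vacancy] = saved  # the cycle closes with the saved element
                break
            a[vacancy] = a[source]  # fill the vacancy; the source becomes vacant
            vacancy = source


matrix = list(range(12))  # 3 x 4 matrix, row-major
transpose_in_place(matrix, 3, 4)
print(matrix)  # now a 4 x 3 matrix, row-major
```

Because the cycles touch disjoint sets of positions, different cycles could be handed to different threads, which is the independence property the abstract relies on for node-level parallelization.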

Presentation Materials are here.


A comparison of parallel programming paradigms on IBM SP systems

Giorgio Amati giorgio.amati@caspur.it
CASPUR c/o Universita' La Sapienza, P.le A. Moro 5 00185 Roma Italy
Massimo Bernaschi massimo@iac.rm.cnr.it
CNR IAC
Federico Massaioli federico.massaioli@caspur.it
CASPUR

We report on our experience with three different parallel implementations of the Lattice Boltzmann Method (LBM), an alternative numerical method to traditional Computational Fluid Dynamics (CFD) for the simulation of fluid flows.

The method is based on the Boltzmann kinetic equation, and over the last fifteen years significant effort has been devoted to its numerical solution.

The LBM is well suited to parallel processing. However, for large-scale simulations running on clusters of SMP systems, it is unclear which paradigm is better in terms of global performance and scalability.

In the present work we compare OpenMP, MPI and Hybrid OpenMP-MPI versions of a Lattice Boltzmann code. The main goal is to assess the soundness of the Mixed OpenMP-MPI approach for the specific SMP based cluster architecture of the IBM SP.

Performance is in fact determined by the memory and communication subsystems, so the question is whether the hybrid paradigm can make the best use of these components.

The simulation reproduces the behavior of a turbulent channel, a typical CFD application.

- We consider different test cases, from 0.5 GB up to 5 GB, in order to stress the memory subsystem and check for potential performance degradation as the problem size grows.

- Several domain decomposition schemes (along one, two and three dimensions) have been employed to test the communication subsystem with different patterns of data exchange.

This makes it possible to determine the actual impact of communication on the global performance.

- Up to 8 nodes of an IBM/SP Power 3 have been used for a total of 128 CPUs.
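To see why the choice of decomposition changes the communication pattern, the following back-of-the-envelope sketch counts the cells on the faces a subdomain shares with its neighbours for one-, two-, and three-dimensional block decompositions of the 360*180*180 lattice mentioned below. The process layouts (36 subdomains in each case), the one-cell-deep halo, and the function name are assumptions made for illustration; the count ignores how many populations per cell an LBM code actually exchanges.

```python
def halo_cells_per_subdomain(grid, procs):
    """Count cells on the faces a subdomain shares with its neighbours.

    grid  -- global lattice size (nx, ny, nz)
    procs -- number of subdomains along each axis (px, py, pz)
    Assumes the grid divides evenly, a one-cell-deep halo, and two
    neighbours along every axis that is cut (interior or periodic).
    """
    local = []
    for n, p in zip(grid, procs):
        if n % p != 0:
            raise ValueError("grid size must be divisible by process count")
        local.append(n // p)
    lx, ly, lz = local
    # Area of a subdomain face normal to x, y, and z respectively.
    face_area = (ly * lz, lx * lz, lx * ly)
    return sum(2 * area for area, p in zip(face_area, procs) if p > 1)


GRID = (360, 180, 180)
results = {}
for procs in [(36, 1, 1), (6, 6, 1), (4, 3, 3)]:
    results[procs] = halo_cells_per_subdomain(GRID, procs)
    print(procs, results[procs])
```

Under these assumptions the one-dimensional layout exchanges the most cells per subdomain but with only two neighbours, while the three-dimensional layout exchanges fewer cells in more, smaller messages, so the schemes stress latency and bandwidth differently.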

Preliminary results can be summarized as follows: up to 16 CPUs, OpenMP performance is comparable to or slightly better than that of MPI. The MPI version, for a fixed number of CPUs, does better when multiple nodes are employed. For instance, a medium-size test case (360*180*180) running on 16 CPUs is 20% faster with a two-node configuration (8 CPUs per node) than with all CPUs allocated on the same node. This can be of interest not only for the performance of a single job but also for finding the optimal scheduling of multiple jobs.

The hybrid code (OpenMP+MPI) never outperforms the "pure" MPI version in our tests although we tried different splitting schemes (this means changing the number of threads per MPI process as well as the number of threads performing communications). This last result is consistent with those produced by some recent experiments with synthetic benchmarks and real world codes.

It is unclear at this time whether these results are due to the lack of a theoretical analytical model of such an environment, to inadequate software support for the hybrid programming model (the combination OpenMP+MPI could be sub-optimal), or to intrinsic limits of the current cluster-of-SMPs platform.

Presentation Materials are here.


Implementation and Performance of an Advanced Atmospheric Dynamical Core on IBM Systems

Arthur A. Mirin mirin@llnl.gov
Lawrence Livermore National Laboratory
S.J. Lin lin@dao.gsfc.nasa.gov
W. Putman wputman@dao.gsfc.nasa.gov
NASA/Goddard Space Flight Center, Greenbelt, MD
W.B. Sawyer sawyer@geo.umnw
Swiss Federal Institute of Technology, Zurich

We discuss the implementation and performance of a novel dynamical core (dycore), developed at the NASA/Goddard Space Flight Center and included within the newly released Community Climate System Model 2 (CCSM2) and Community Atmospheric Model 2 (CAM2) of the National Center for Atmospheric Research (NCAR). The finite-volume approach used in this dycore enables the conservation of mass, momentum and energy. A semi-Lagrangian treatment is used to model the transport within the volumes bounded by flux surfaces, and the locations of the flux surfaces themselves are updated self-consistently.

Parallelization is based on a multiple domain decomposition approach together with a hybrid MPI/OpenMP programming model. Each subdomain in a decomposition is assigned an MPI task, with additional parallelism attained in shared memory. A latitude/vertical decomposition is used for advection and transport within a flux volume. The remapping of flux surfaces uses a longitude/latitude decomposition. The column physics (which is actually a separate package from the dycore) uses its own horizontally-based decomposition designed to effect optimal load balance. The various decompositions are connected with high-speed transposes.

Each decomposition is well-suited to the portion of the application to which it is assigned. However, one important phase of the transport package (the geopotential calculation) requires non-local communication in the vertical direction. Two approaches have been formulated and evaluated. One involves global transposes, and the other involves more restricted communication together with 128-bit arithmetic to effect bit-for-bit agreement across different domain decompositions.
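The reason extra precision helps with bit-for-bit agreement can be shown with a small sketch: a double-precision sum depends on how the terms are grouped, so splitting a column across subdomains can change the last bits of the result. The code below is an illustration only, not the dycore's method. Python has no native 128-bit floating point, so exact rational arithmetic (fractions.Fraction) is used as a stand-in for a wider accumulator; the function names and test values are hypothetical.

```python
from fractions import Fraction


def naive_sum(values):
    """Plain left-to-right double-precision accumulation."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunk(values, nchunks):
    """Split values into nchunks contiguous pieces, one per subdomain."""
    size = -(-len(values) // nchunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def decomposed_sum(values, nchunks):
    """Each subdomain sums its piece in double precision, then partials are combined."""
    return naive_sum([naive_sum(piece) for piece in chunk(values, nchunks)])


def decomposed_sum_wide(values, nchunks):
    """Same decomposition, but partial sums are carried exactly and rounded once."""
    partials = [sum(map(Fraction, piece)) for piece in chunk(values, nchunks)]
    return float(sum(partials))


column = [1e16, 1.0, -1e16, 1.0]  # values chosen to expose rounding
print(decomposed_sum(column, 1), decomposed_sum(column, 2))
print(decomposed_sum_wide(column, 1), decomposed_sum_wide(column, 2))
```

With double-precision partial sums the two decompositions disagree, while the wide accumulation returns the same value for both. A 128-bit accumulator is not exact, but it pushes the rounding far enough below double precision that the final rounded result is much less sensitive to the grouping.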

The communication layer supports both traditional MPI and MPI-2 remote memory access (RMA). Although RMA leads to significantly improved performance on the SGI Origin, there is very little difference in performance between the two MPI methods on the IBM.

The OpenMP parallelization of the transport is predominantly in the vertical direction, which is also one of the decomposition directions. This limits the degree of shared memory parallelism for many cases of interest. As these parallel regions are long outer loops in the vertical coordinate, the only way to increase the degree of shared memory parallelism is through nested constructs. We are presently implementing nested parallel loops in the code and will be testing on the Compaq until IBM supports nested OpenMP parallelism.

The codes that implement this dycore are large, complicated codes that tax the capabilities of the operating system and the Fortran-90 compiler (particularly the OpenMP outlining and interprocedural analysis). CAM2 will typically fail on one IBM while working on a similar machine at a different location. We are fortunate enough to have computer accounts on production platforms at four different locations. This facilitates intercomparison and debugging.

This is LLNL Report UCRL-JC-149125. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.

Presentation Materials are here.


Session VII. Tools

Using ZeroFault to Find Memory Errors and Leaks in Large Parallel AIX Applications

John C. Gyllenhaal gyllen@llnl.gov
LLNL

ZeroFault is a powerful commercial AIX-specific memory-error and leak-detection tool. Although it works well on serial and multi-threaded applications, without the use of a wrapper script, it doesn't work well on parallel MPI applications. This talk will present a brief overview of ZeroFault's features, the details of how to make ZeroFault (and other serial tools) work well at scale with wrapper scripts, and eight tips and tricks that you should know before using ZeroFault on parallel scientific applications.

Presentation Materials are here.


The Advanced Computational Software (ACTS) Collection

Osni Marques oamarques@lbl.gov
Tony Drummond ladrummond@lbl.gov
Lawrence Berkeley National Laboratory

During the past decades there has been a continuous growth in the number of physical and societal problems that have been successfully studied and solved by means of computational modeling and simulation. Distinctively, a number of these are important scientific problems ranging in scale from the atomic to the cosmic. These problems are increasingly dependent on high performance computer simulations that utilize robust computational tools such as those available in the Advanced Computational Software (ACTS) Collection.

The DOE ACTS Collection is a set of computational tools developed primarily at DOE laboratories and is aimed at simplifying the solution of common and important computational problems on parallel machines. They tackle a number of computational issues that are common to a large number of scientific applications, mainly implementation of numerical algorithms, and support for code development, execution, interoperability, and optimization.

In the talk we will briefly describe the tools currently in the collection, show examples of how these tools have been used, discuss lessons that we have learned, and outline our agenda for creating a reliable software infrastructure for scientific computing.

Presentation Materials are here.


Session VIII. Applications

An Integrated Parallel Simulation Environment for Electrostatic and Electromagnetic Field Distributions in High Voltage Components

Martin Schulz schulzm@in.tum.de
Cornell University
Carsten Trinitis trinitic@in.tum.de
TU-Muenchen, Germany

When designing components in High Voltage Engineering (such as transformers, insulators, or switchgear for energy supply), having precise knowledge of the electric field distribution is crucial. This applies to both electrostatic and electromagnetic fields. For instance, if the electrostatic field strength is too high, there is a risk of flashovers between electrodes during operation. Likewise, electromagnetic fields can lead to eddy currents, which in turn can cause overheating of components. Thus, it is desirable for the engineer to be able to predict such problems by simulating the electric field. Two different solvers have been implemented for electrostatic and electromagnetic problems, respectively. The field calculation process consists of generating and solving linear equation systems with typically tens of thousands of unknowns. Since those computations are quite compute-intensive, parallel versions have been implemented for both electrostatic and electromagnetic problems.

ABB and LRR-TUM are working on a joint project to port these two applications to Linux-based clusters and to integrate them into ABB's workflow. Currently two clusters with eight nodes each (one connected with Switched Fast Ethernet and one with SCI) are installed and serve as production systems for ABB research. Extensions based on IBM clustering systems using Linux are planned for the near future. These plans are supported by the good results achieved on the current systems, with both codes showing high efficiency in almost all cases. The only noticeable exception is the electromagnetic simulation, which has unfavorable scaling properties when run on Fast Ethernet. This behavior is caused by the intensive communication induced by the underlying numerical algorithms. It can, however, profit significantly from the low-latency, high-bandwidth communication of SCI, which again leads to almost perfect efficiency.

Besides further work on improved algorithms and on the integration of automatic parametric optimizers, the efforts in the next few years will be directed towards the establishment of an ABB intranet grid infrastructure.

This project will continue the partnership between ABB and LRR-TUM as the academic partner and will be carried out in close cooperation with IBM Germany, ABB's strategic IT partner. The goal is to enable ABB business units, which are responsible for final product design, to generate simulation problems and to directly submit them to a company-wide simulation service. This eliminates the need for expensive local compute servers and allows a company wide pooling of resources with a low total cost of ownership.

As a first step toward this goal, an initial Java client/server system has been designed and implemented, which allows the user to submit geometric input data sets to a cluster for batch processing. On the client side, users are provided with a file upload facility for input data, which are then transferred to the server installed on the cluster. The corresponding simulation is then started, and the results are transferred back to the client for postprocessing and visualization. This prototype is currently being extended to handle advanced security models as well as more flexible resource scheduling when several target clusters are available.

Presentation Materials are here.


Parallel Medical and Genomics Applications on Power3 and Power4 Machines

Amit Majumdar majumdar@sdsc.edu
San Diego Supercomputer Center, University of California San Diego

Currently, some researchers in the medical and Genomics communities are regularly using parallel applications to speed up their simulations. The use of parallel machines is essential for these communities to complete simulations in a reasonable time. Three parallel applications from the medical and Genomics areas will be presented in this talk. We will describe the science behind the applications, how the applications are parallelized, the parallel performance, bottlenecks and scaling results on hundreds of processors, and a comparison of performance between the Power3 and Power4 architectures.

The first medical application is from the field of image-guided neurosurgery and addresses the problem of brain shift, the shape deformations the brain undergoes during surgery. This intra-operative application improves surgical visualization and navigation and reduces the amount of tumor remaining after surgery. This application tracks key surfaces of brain locations in the image sequence using a deformable surface matching algorithm. The volumetric deformation field of the objects is inferred from the displacements at the boundary surfaces using a linear elastic biomechanical finite element method. A parallel version of the application uses the PETSc software to iteratively solve the matrix system of equations. This application is developed and used as an intra-operative research tool by the Surgical Planning Laboratory of the Brigham and Women's Hospital, Harvard Medical School.

The second medical application is from the field of nuclear medicine. The particular application deals with the quantification of Iodine-131 (I-131) for internal dosimetry, specifically to treat B-cell non-Hodgkin's lymphoma, the nation's fifth leading cause of cancer. I-131 quantification is carried out by Single Photon Emission Computed Tomography (SPECT). The multiple high-energy gamma-ray emissions of I-131 make imaging and quantification difficult. Monte Carlo simulation is a valuable tool for assessing the problems associated with I-131 imaging such as scatter, penetration and attenuation. Monte Carlo codes typically used in nuclear medicine do not explicitly model collimator scatter. Although computationally tedious, including collimator interactions is a prerequisite for accurate simulation of high-energy photon emitters such as I-131. A major limitation of Monte Carlo SPECT codes is speed, especially when complete photon transport through both the phantom and collimator is simulated. We will describe a parallel SPECT code, developed jointly with researchers at the University of Michigan, and its scaling results on 512 processors.

The third application is from the field of Genomics and Proteomics. This application is used specifically for matching amino acid sequences of the malaria parasite against existing protein databases to identify the parasite's proteins. The amino acid sequences are generated experimentally in the laboratory using the technique of mass spectrometry. The SEQUEST code, developed by researchers from The Scripps Research Institute (TSRI), is used for the matching. A parallel version of the SEQUEST code has been developed; it solves the protein identification problem in hours instead of the weeks required by the serial version of the code. This is joint work between researchers at the Naval Medical Research Center, TSRI and SDSC.

Presentation Materials are here.