SCICOMP4 Abstracts and Presentation Materials



Abstracts and Presentation Materials for the Tutorials and Talks

Tutorials

The first day of the meeting will be dedicated to tutorials, which will be presented by IBM staff in the same rooms as the regular sessions.
Tutorial abstracts are provided below.

Presentations

The regular sessions will last two and a half days.
Presentation abstracts, given below, are linked to their respective items in the meeting agenda. Aside from the Keynote Address, abstracts in this section are divided into two sets: those by IBM staff and those by SP system users.

Presentation Materials

Presentation materials that are provided will be linked to the abstracts below.


Tutorial Abstracts

Tutorial 1: Creating Tools using DPCL

by Ted Hoover, IBM

The Dynamic Probe Class Library (DPCL) is a C++ class library that provides an innovative way to quickly develop tools through the process of dynamic instrumentation. This tutorial provides in-depth coverage on how you can use DPCL to create tools to analyze running applications by dynamically inserting probes (code patches) into the application. DPCL concepts, terminology, and capabilities are discussed. Attendees will learn the fundamentals of creating various types of probes that can be used to gather application data, and the steps required to place these probes within the application's structure.

The URL for the Open Source DPCL work is http://oss.software.ibm.com/developerworks/opensource/dpcl.

Session Objectives:

Presentation Materials are here.


Tutorial 2: A Glimpse into Power 4 Performance Programming

by John McCalpin and Bob Blainey, IBM

Presentation Materials are here.


Presentation Abstracts

Keynote Address
From Nano to Peta: ORNL Collaborations with IBM - Thomas Zacharia, ORNL

Oak Ridge National Laboratory (ORNL) is collaborating with IBM on a variety of projects that span a wide range of scales, from nano to peta. At the nano end, IBM and ORNL researchers are using computational molecular-dynamics simulations to investigate metal clusters for nano-scale electronic structures. At the peta-scale, ORNL is investigating fault-tolerant, scalable algorithms for applications and systems software for IBM's Blue Gene project. Between these scales, the two organizations are working together on high-performance storage, evaluation of early supercomputer systems, and clustering software. With these examples, the presentation will provide details of the scientific progress made possible by the combined resources of a national laboratory and a leading technology corporation.

Dr. Thomas Zacharia is an Associate Laboratory Director at ORNL, heading the Computing and Computational Sciences Directorate.

Presentation Materials are here.


IBM Presentations

Power 4 and Beyond - Peter Ungaro

This presentation will provide a view into IBM's high-end POWER4 roadmap, from our first products in late 2001 through multiple generations extending into the 2004-2005 timeframe. Targeted products for the high-end HPC marketplace that are planned for announcement over this timeframe will be discussed.

Presentation Materials are here.


Production Tools - Ted Hoover

The application development environment is an area that is very important to developers in the scientific and technical computing community. This talk presents an update on the areas in which IBM is investing resources to develop products that support end users and tool development efforts for this environment. The first part of this talk gives a brief update of the current state of IBM's tools for high performance computing. The second part provides information about an Open Source development project called Dynamic Probe Class Library (DPCL) and how DPCL can be used to quickly develop tools through the process of dynamic instrumentation. The last part of the talk will present some research activities that leverage the technology found in DPCL in addition to activities to support other tool requirements of the high performance computing community.

Presentation Materials are here.

Ted Hoover
Senior Software Engineer
Team Lead - Application Development Tools
IBM Poughkeepsie
Poughkeepsie NY
(845) 433-7693
hoov@us.ibm.com


Compilers - Bob Blainey

The IBM compiler group is currently designing and implementing updated C, C++ and Fortran compilers for the pSeries eServers based on the Power4 processor. I will describe the work in progress with particular focus on performance issues. I will cover new and enhanced compiler features including OpenMP, Fortran 2000, support for the Power 4 processor, automatic parallelization and vectorization, and scalar optimization.

Bob Blainey
STSM, Compiler Development
IBM Toronto Laboratory
email: blainey@ca.ibm.com
telephone: 416-448-4264, t/l 778-4264

Presentation Materials are here.


AIX5L and Thread Scheduling

This presentation discusses several key aspects of the next generation of the AIX operating system. AIX5L introduces a new, scalable 64-bit kernel to support higher capacity hardware systems and larger application workloads. As part of the presentation, details of the 64-bit kernel architecture and design are covered along with the customer migration path to the scalable 64-bit kernel environment. Discussion is also focused on the new 64-bit application binary interface provided by AIX5L for the purpose of application scalability. The topics of pthreads and thread scheduling support from AIX Version 4 to AIX Version 5 are also discussed.

Tom Mathews
Senior Technical Staff Member
Manager, AIX Kernel Architecture & Design
(512) 838-3903, T/L 678-3903, Fax (512)838-3484
tmathews@us.ibm.com

Presentation Materials are here.


GPFS: An Enterprise Class Cluster File System - Rama Govindaraju

This talk will focus on the usage models for a cluster file system (CFS), the key trends that we observe from those usage models, the key GPFS design features for scalability, performance, and availability, and future work.

Presentation Materials are here.


The ACTC Application Performance Tools for Scientific Programs - Current Status and On-going Work

Luiz DeRose
Advanced Computing Technology Center
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA

In this talk I will present an overview of the Advanced Computing Technology Center application performance analysis tools for tuning and optimization of applications running on the IBM SP. The ACTC tools were designed to help users understand the behavior of applications in complex parallel environments. I will also cover the new components of the HPM Toolkit, HPMprof and HPMviz. HPMprof is based on automatic instrumentation to provide a function-level profile with hardware performance counter information. HPMviz provides careful mapping of performance data to source code constructs, as well as hints that allow users to correlate the behavior of the application with hardware components. Additionally, I will present the Simulator Guided Memory Analyzer (SiGMA), a data-centric tool that is designed to provide feedback to users with regard to the use of the memory hierarchy.

Presentation Materials are here.


Status of ACTC; TurboSHMEM: A High Performance Implementation of the Cray SHMEM API for the IBM RS/6000 SP

David Klepacki
Advanced Computing Technology Center
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA

Running Cray SHMEM applications on the IBM SP has always been a challenge. Due to the very different semantics of the MPI-2 Put/Get specifications, it has not been possible to port from the SHMEM API efficiently using MPI. However, the IBM LAPI interface is ideal for this purpose. IBM LAPI is a zero-copy, one-sided communication layer that is part of the IBM PSSP software. It is the lowest layer of communication software that is device-independent and open to the SP programmer. A complete implementation of the Cray SHMEM API has been developed for the IBM SP based upon LAPI. The "turbo" qualification is meant to emphasize that optimizations have been included to exploit the shared memory on the SP nodes. This presentation will discuss some of these optimizations and their impact on application performance.

Presentation Materials are here.


What are all these environment variables for?

Parallel Environment provides a large number of environment variables to control how the message passing library works. This talk will review the internal architecture of the MPI library and indicate how the various environment variables affect its behavior.

The talk will also summarize some recent studies on the scalability of the MPI_Allreduce function.

Dr. William G. Tuel, Jr.
Communication Protocol Development
Unix Development Laboratory, IBM Server Group
Poughkeepsie, NY 12601
Phone (845) 433-7850 (T/L 293-)
Email: tuel@us.ibm.com

Presentation Materials are here.


Architecture Roadmap - John McCalpin

This talk will present a view of IBM's plans for future chip, system, and operating system architectures, and how IBM's integrated product line will evolve with them.

Presentation Materials are here.


Blue Gene and Blue Light: The Return of Massively Parallel Systems! - David Klepacki

Blue Gene and Blue Light are two massively parallel systems being developed at the IBM T. J. Watson Research Center. There are many similarities between the two projects. They are both based on cellular architectures, in which a basic building block is replicated many times according to a regular pattern. They both aim at very high-performance, from hundreds of Teraflops to a Petaflop. They both rely on a host system for management and operation. And they both push the envelope in terms of scalability and sheer level of parallelism exploited. Nevertheless, the two projects are also very different in terms of hardware approach, programming models, and system software organization. During this talk, I will give an overview of the architecture of both systems, present proposed programming models for both of them, and discuss the different technologies that each project exercises.

Presentation Materials are here.


High Performance Computing Roadmap at IBM

Jamshed Mirza
IBM Corp.

This talk will describe IBM's future roadmap in the area of High Performance Computing systems, and how the system hardware and software for these systems will evolve over the next several years. This includes both RS/6000 SP and Linux Clusters.

Presentation Materials are here.


User Presentations

Benchmarking SMP Memory System Performance

Bronis R. de Supinski bronis@llnl.gov
Andy Yoo ayoo@llnl.gov
Lawrence Livermore National Laboratory
Frank Mueller mueller@cs.ncsu.edu
North Carolina State University
Sally A. McKee sam@cs.utah.edu
University of Utah

Main memory speeds are currently slower than CPU speeds, and this situation will get worse - CPU speeds double approximately every 18 months, while main memory speeds double about every 10 years. As a result, main memory accesses already significantly impact application performance and they will become even more important in the future. Nonetheless, several important issues governing memory system performance in current systems are poorly understood.
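The widening gap described above can be made concrete with a little arithmetic. The following is a minimal sketch that takes the two doubling periods quoted in the abstract (18 months and 10 years) at face value; the function name `speed_gap` and its parameters are illustrative and do not come from the talk.

```python
def speed_gap(years, cpu_doubling_years=1.5, mem_doubling_years=10.0):
    """Return how much the CPU-to-memory speed ratio grows over `years`.

    Assumes pure exponential growth with the doubling periods quoted in
    the abstract; real hardware trends are not this regular.
    """
    cpu_growth = 2.0 ** (years / cpu_doubling_years)
    mem_growth = 2.0 ** (years / mem_doubling_years)
    return cpu_growth / mem_growth


# Growth of the gap over one decade under these assumptions.
decade_gap = speed_gap(10)
```

Under these assumptions the ratio of CPU speed to memory speed grows by roughly a factor of fifty per decade, which is why memory accesses weigh more heavily on application performance over time.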

Current memory system benchmarks measure important aspects of memory systems. The STREAM benchmark, based on a small set of simple vector kernels, measures sustained main memory bandwidth for unit-stride accesses. Alternatively, Larry McVoy's lmbench and HBenchOS, derived from lmbench by Aaron Brown and Margo Seltzer, measure memory system latencies for all levels of the memory hierarchy; these tests support access patterns with strides greater than one. HBenchOS and lmbench also include memory bandwidth tests, but these are again limited to unit-stride accesses. Although useful for characterizing uniprocessor memory system performance, these benchmarks cannot directly determine key memory system features such as cache sizes, line sizes, and associativity, although many of these can be deduced from their timing results.

Real applications exhibit a variety of access patterns, including non-unit strides and indirect accesses through pointers or indirection vectors. The irregular access patterns produced by indirection are particularly important since they are not well-suited to current memory system optimizations such as caching or prefetching. As noted above, the STREAM benchmark is limited to unit-stride accesses; the lmbench-style tests do support non-unit strides, but they do not test the effect of irregular access patterns.
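The three access patterns named above differ only in the order in which array elements are touched. The sketch below builds the index streams for each pattern and walks an array through them. It is an illustration only: the function names are hypothetical, the shuffled index list is just one way to model an indirection vector, and Python is not suitable for measuring memory performance, so a real benchmark would time the equivalent loops in C or Fortran.

```python
import random


def unit_stride(n):
    """Indices 0, 1, 2, ...: the pattern STREAM measures."""
    return list(range(n))


def strided(n, stride):
    """Every `stride`-th index: the pattern lmbench-style tests support."""
    return list(range(0, n, stride))


def indirection_vector(n, seed=0):
    """A shuffled permutation of 0..n-1, standing in for irregular accesses."""
    index = list(range(n))
    random.Random(seed).shuffle(index)
    return index


def gather_sum(data, index):
    """Read data[index[k]] for each k, as a loop through an indirection vector does."""
    total = 0.0
    for i in index:
        total += data[i]
    return total


data = [float(i) for i in range(16)]
regular_total = gather_sum(data, unit_stride(16))
irregular_total = gather_sum(data, indirection_vector(16))
```

Both walks read the same elements and produce the same sum; only the order differs, which is what defeats caching and prefetching in the irregular case.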

In addition, symmetric multiprocessors (SMPs) have become the dominant building block of high-performance computer systems. With these systems, several CPUs can access the memory system concurrently. How well the memory system supports the concurrent accesses varies widely - bus-based systems offer little support for concurrent accesses to main memory, while crossbar-based systems claim to eliminate the bottleneck. Unfortunately, no existing memory benchmarks test the effect of concurrent memory accesses (the STREAM web site does describe how to extend the benchmark with OpenMP to test this issue, but the implementation is not provided).

This talk will present a new set of memory benchmarks that test the effect of irregular access patterns and of concurrent memory system accesses. These tests are derived from HBenchOS and have been parallelized with OpenMP. For increased portability, a Pthreads version is also being developed. In addition, we have implemented a new set of tests, based on hardware performance monitors, that determines key memory system architectural features directly. We will present results of these tests on several platforms, focusing on IBM SP systems available at Lawrence Livermore National Laboratory. Our tests demonstrate that although the crossbar-based memory system of Nighthawk nodes supports concurrent main memory accesses much better than the bus-based memory system of Silver nodes, per-thread memory performance can still degrade significantly.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-145223-abs.

Presentation Materials are here.


MPI On-node and Large Processor Count Scaling Performance

Terry Jones trj@llnl.gov
Linda Stanberry lstanberry@llnl.gov
Lawrence Livermore National Laboratory

Current trends suggest future SP offerings at the high end will have an increasing number of processors per node, and an increasing number of total aggregate processors per SP system. As processor count continues to climb, the scalability of parallel infrastructure such as the MPI collective operations becomes more important. We investigate the performance degradation for groups of shared-memory multiprocessor nodes as the number of CPU-intensive threads approaches the number of available processors. We compare results for 14 and 15 MPI tasks per node with those for 16 tasks per node on an SP system comprised of 16-way Nighthawk-2 nodes. Similarly, results for 3 tasks per node are compared with those for 4 tasks per node on 4-way Silver nodes. Operational issues associated with processor binding techniques and priority adjustment techniques are discussed. Performance characteristics are evaluated for variability (noisiness) and optimal performance (fastest time observed). Results are shown for a synthetic benchmark designed to isolate MPI_Allreduce characteristics, as well as a scientific application.
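The two quantities named above, variability and fastest observed time, can be computed from a list of repeated timings. This is a minimal sketch; the abstract does not say which statistics the authors used, so the population standard deviation and the mean-to-best ratio here are assumptions, as is the function name `summarize_timings`.

```python
import statistics


def summarize_timings(seconds):
    """Summarize repeated timings of one operation.

    "best" is the fastest time observed; "stdev" and "mean_over_best"
    are two simple (assumed) ways to express noisiness.
    """
    best = min(seconds)
    mean = statistics.mean(seconds)
    return {
        "best": best,
        "mean": mean,
        "stdev": statistics.pstdev(seconds),
        "mean_over_best": mean / best,
    }


# Made-up timings for illustration, not measurements from the talk.
summary = summarize_timings([1.0, 1.0, 2.0, 4.0])
```

A mean-to-best ratio well above 1 indicates that typical runs are much slower than the best case, which is the kind of noise the talk attributes to fully loaded nodes.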

Presentation Materials are here.


Performance Evaluation of Allreduce

Patrick H. Worley worleyph@ornl.gov
Oak Ridge National Laboratory

The allreduce collective operation is an important component of many parallel scientific application codes. We examine the performance of a number of different implementations of allreduce on the Winterhawk II system at ORNL and the Nighthawk II system at NERSC, including the MPI_Allreduce supplied with the native MPI library. We describe how performance varies as a function of vector length and processor count, how important it is to choose the optimal allreduce implementation, and how the optimal implementation varies with how the allreduce is used. We also describe differences between average and best observed performance, and comment on practical implications of this difference.
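For readers unfamiliar with how an allreduce can be implemented, the sketch below simulates recursive doubling, one well-known algorithm, over a list of per-rank values. The abstract does not say which implementations were compared, so this choice is an assumption. No MPI library is involved: each list element stands in for one rank, and the function name is hypothetical.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce over len(values) ranks.

    At each step every rank combines its partial sum with that of the
    partner whose rank differs in one bit. Requires a power-of-two rank
    count. Returns (per-rank results, number of exchange steps).
    """
    ranks = len(values)
    if ranks == 0 or ranks & (ranks - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    distance = 1
    steps = 0
    while distance < ranks:
        # All ranks exchange simultaneously, so build the new list from the old one.
        current = [current[rank] + current[rank ^ distance] for rank in range(ranks)]
        distance *= 2
        steps += 1
    return current, steps


results, steps = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
```

Every rank ends with the full sum after log2(P) exchange steps. How that step count trades off against message size is one reason the best implementation can depend on vector length and processor count.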

Presentation Materials are here.


Analysis of the SPEC OMP2001 Benchmarks with Paraver

Jesus LaBarta jesus@cepba.upc.es
Judit Gimenez judit@cepba.upc.es
CEPBA-UPC

In this presentation we will apply OMPtrace and Paraver to the analysis of some of the recently available SPEC OMP2001 OpenMP benchmarks.

Rather than performing the measurements and reporting the total elapsed times and speedups for the different benchmarks, our objective is to demonstrate how Paraver can be used to identify where the performance problems are.

The general process for each benchmark consists of obtaining traces with a basic set of hardware counters for runs with various numbers of threads. A first visual inspection of the Paraver view that displays whether each thread is computing or in the idle loop can easily identify the degree of sequentiality and load imbalance in the program. The views displaying the execution of the compiler-generated routines as a function of time are also useful for getting a general impression of the program behavior.

A major new feature of Paraver is the 2D quantitative analysis capability. This is a very flexible analysis module that sits on top of the equally flexible semantic value generation capability of Paraver. It is thus possible to compute and display statistics such as the number of instructions executed by each routine, the average miss ratio within each parallel loop, or the actual IPC/MFLOPS achieved by each routine. In this way it is possible with Paraver to rapidly obtain much more information than a typical profiling tool would report. The statistics can be obtained for the interesting parts of the trace, which can be identified through the visual inspection and zooming capabilities of Paraver.

By applying the 2D analysis to traces from runs with different numbers of processors and comparing the statistics, it is possible to identify which parts of the code scale well or badly and to explain why.

Presentation Materials are here.


IMPRINT: A Tool for Performance Monitoring and Application Steering

Tim Kaiser kaiser@sdsc.edu
San Diego Supercomputer Center

We will demonstrate how IMPRINT can be easily used to do performance monitoring and application steering.

Tracing and profiling tools are often difficult to use, involving a steep learning curve that is quickly forgotten. Most "tool"-based tracing is performed as a post-processing step - after the code is completed. This can generate large trace files.

Many people choose to use their own profiling tools. They insert print statements into the source code to do the tracing, compile and run the executable. The print statements are later removed prior to production runs of the code.

IMPRINT is a tool to dynamically insert print statements, timing and other routines into running applications. IMPRINT is designed foremost to be easy to use. It is intuitive, providing the look of a source code browser. Users select subroutine calls in the source code browser to instrument. They then select the type of instrument from a menu. The instrumentation is then inserted into the running application, providing feedback to the user such as entry and exit of a routine, timings, and hardware counter values.
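The kind of feedback described above (routine entry and exit plus timings) is what a hand-written tracing wrapper reports. The sketch below shows that manual, source-level approach in Python. It is not IMPRINT's interface and does no dynamic instrumentation of a running binary; the names `traced` and `solve` are hypothetical.

```python
import functools
import time


def traced(func, log=print):
    """Wrap `func` so each call reports entry, exit, and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log(f"enter {func.__name__}")
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            log(f"exit {func.__name__} after {elapsed:.6f} s")
    return wrapper


def solve(n):
    """Stand-in for a routine the user wants to observe."""
    return sum(range(n))


# Collect the trace messages in a list instead of printing them.
events = []
traced_solve = traced(solve, log=events.append)
result = traced_solve(10)
```

With this approach the wrapper has to be added and removed in the source by hand. That is the edit-compile-run cycle a dynamic tool avoids.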

A recent addition to IMPRINT is the ability to do application steering. With IMPRINT users can modify a string that can be read from within their Fortran or C program to change the direction of the simulation.

We will discuss how IMPRINT was built using the IBM DPCL package and "run anywhere" Java.

Presentation Materials are here.


A User View of the NERSC High Performance Storage System (HPSS)

Thomas M. DeBoni TMDeBoni@LBL.GOV
Nancy L. Meyer u767@nersc.gov
Harvard Holmes HHHolmes@lbl.gov
Lawrence Berkeley National Laboratory

Common uses and capabilities of HPSS at NERSC will be presented, including a description of the hardware and networking systems, access and authentication methods, scripting capabilities, and other aspects.

Presentation Materials are here.


HPSS MPI-IO: A Standard Parallel Interface to HPSS File Systems

William Loewe wel@llnl.gov
Lawrence Livermore National Laboratory

HPSS MPI-IO provides an alternative interface to the HPSS Client API library for applications written for a distributed memory programming model using message passing. It coordinates and simplifies parallel access to HPSS files from multiple processes. It makes the functionality of MPI-IO available to HPSS users:

Presentation Materials are here.


HPSS Status and Futures

Randall D. Burris burrisrd@ornl.gov
Oak Ridge National Laboratory

The High Performance Storage System (HPSS) implements storage management for over forty extremely large installations, including ORNL, DOE ASCI sites, high-energy physics experiments, NASA, universities and others. This talk will briefly introduce the architecture of HPSS as a prelude to descriptions of the features of the current release and the plans for feature and performance improvements slated for the next two years.

Presentation Materials are here.


An overview of ASCI Software Pathforward Program

Jeff Brown jeffb@lanl.gov
Los Alamos National Laboratory

To make effective use of increasingly complex ASCI compute platforms, a comparable rapid advance in development and run-time software is necessary. As the scale of ASCI platforms further diverges from the commercial market we can expect that software provided by platform vendors will not be sufficient to meet the needs of the ASCI program. In order to address this concern the ASCI program is making strategic investments in the development of an integrated suite of multi-platform parallel code development tools and run-time systems ensuring that essential core software technology exists for current and future ASCI scale platforms.

In FY99, the ASCI program established a PathForward project, Ultrascale Tools, to advance the parallel software development environment and placed a three-year contract with Etnus to accelerate their parallel debugger development with features and scalability that keep pace with rapid improvements in compute platforms and simulations. In FY00, a PathForward project to accelerate runtime system and performance tool technologies was initiated. Contracts were placed with MSTI and KAI to develop third-party message passing, application threads, and diagnostic technologies.

This session will present an overview of the ASCI software PathForward program followed by briefings on all three projects, including key accomplishments, current work, and plans through the end of the contracts.

Presentation Materials are here.


ChaMPIon/Pro(TM): A Multidevice MPI 1.2 Implementation for ASCI Blue and White using LAPI and Shared Memory

Anthony Skjellum tony@mpi-softtech.com
Rossen P. Dimitrov rossen@mpi-softtech.com
Andrew Watkins andrew@mpi-softtech.com
MPI Software Technology, Inc.

Bronis de Supinski bronis@llnl.gov
Lawrence Livermore National Laboratory

The ASCI Pathforward Ultrascale Tools Initiative has supported the creation of ultrascale MPI implementations, sponsoring MPI Software Technology, Inc in particular to transform its commercial MPI/Pro (R) technology into ASCI-relevant middleware. ChaMPIon/Pro(TM) is the result of the first 18 months of effort under this three-year program. While ChaMPIon/Pro targets all ASCI platforms, this talk concerns itself in particular with support for the IBM-based ASCI Blue and ASCI White Systems. Basic requirements underlying all realizations of this MPI-1.2-compliant software are thread safety, attention to scalability, strong adherence to the MPI-mandated progress rule, and overlapping of communication and computation. Furthermore, ChaMPIon/Pro supports third-party tools such as Etnus' Totalview(R) and Pallas' Vampir(TM). ChaMPIon/Pro is a first-principles, fresh design of MPI that is not burdened by legacies such as being based on early reference implementations of the standard.

Particular attention was given to the IBM systems in this effort. Foremost was the creation of a multidevice MPI-1.2 architecture that exploits both LAPI and shared memory. A significant design/evaluation process preceded this implementation, and this contributed to an overall strategy for supporting low-overhead, low-latency, high-bandwidth communication, with the additional requirements that asynchronous communication progress independently of MPI references by the application, and that overlap of communication and computation be possible for sufficiently large messages.

Among specific design choices are the following. This effort has particularly concentrated on the exploitation of LAPI_Amsend, LAPI counters, and the LAPI header handler completion of short messages. For long messages, this effort has exploited the LAPI_Get functionality. Extremely lightweight, short-term locking is achieved for certain mutual exclusion situations through exploitation of the snoopy cache architecture of the Power architecture and appropriate assembly-coded routines to enforce serialization when needed. Furthermore, the shared-memory architecture utilized does not suffer from an O(N^2) memory requirement for N-way SMP processors, and so is relevant to relatively large-scale SMP deployments.

Findings include that ChaMPIon/Pro exceeds the bandwidth of IBM's MPI for messages in the range of 16K bytes to 256K bytes, using LAPI for cross-box communication. Furthermore, the middleware achieves lower latency than IBM's MPI for short messages in SMP mode. These findings are remarkable in that IBM's MPI works at a lower level than is available through LAPI, which at least means an extra copy compared to their internal interface. Likewise, it is apparently possible for the IBM implementation to save a copy compared to ChaMPIon/Pro, also by accessing non-public services. ChaMPIon/Pro has been subjected to and passes a significant suite of MPI-1.2 compliance tests, together with a suite of ASCI-relevant applications.

This talk mentions future steps with regard to ChaMPIon/Pro on ASCI Blue and White, including the roll out over the next year of topology-aware, highly tuned collective communication, MPI-2 features, better support for third-party performance tools, and further enhancements of MPI-1.2 functionality.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-145222-abs.

Presentation Materials are here.


An Integrated Performance Visualizer for MPI/OpenMP Programs

Don Gunning don.gunning@intel.com
KAI Software division of Intel Americas

Cluster computing has emerged as a de facto standard in parallel computing over the last decade. Now, researchers have begun to use clustered, shared-memory multiprocessors (SMPs) to attack some of the largest and most complex scientific calculations in the world today. To program these clustered systems, people have begun using a combination of MPI and OpenMP. However, analyzing the performance of MPI/OpenMP programs is difficult because, while there are several existing tools to analyze performance of either MPI or OpenMP programs, there is no existing performance analysis tool that integrates performance data from both.

To remedy this situation, KAI Software and Pallas GmbH have partnered with the Department of Energy through an ASCI Pathforward contract to develop a tool called Vampir/GuideView, or VGV. This tool combines the richness of the existing tools, Vampir for MPI, and GuideView for OpenMP, into a single, tightly-integrated performance analysis tool. From the outset, its design targets performance analysis on systems with thousands of processors.

However, systems with small numbers of processors will benefit from the innovations developed for this new tool, as well. The tool was designed with the idea that analyzing performance on large computer systems should not be materially different from analyzing performance on small systems.

The main goals of the VGV project are to create an integrated MPI/OpenMP performance analysis tool that presents its data effectively, and that scales well to even the largest systems currently available. To accomplish these goals, KAI and Pallas cooperated to synthesize instrumentation and recording of the data. We designed a framework in which performance data is organized hierarchically and parallel processing is used to sort and filter the data.

The integrated tool instruments the program at compile time, generates a trace file at runtime, and does post-run performance analysis and presentation with an integrated combination of Vampir and GuideView that presents data hierarchically. The user can select a region of the displayed data that seems interesting, process the data by sorting and filtering it, then display the results. All performance data is associated with the lines in the source code that produced it, which the user may browse by clicking on the data display. Performance problems such as load imbalance are displayed clearly with the VGV display.

Since VGV is targeted at the largest of systems, the project is addressing scalability constraints, such as the limited space on a display screen, limited disk storage capacity, and human limitations for absorbing large amounts of displayed data. The tool also attempts to limit its impact on the running of the program.

VGV is designed to make extensive use of event compression, combination and summarization to limit the size of the trace files generated. It also makes use of multiple, hierarchically-structured trace files. The VGV display uses vertical scrolling, back-to-front stacking of time-lines, and click-activated drill-down into the performance data.
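To illustrate what event compression can mean for a trace, the sketch below collapses runs of identical consecutive events into a single record with a repeat count. Run-length encoding is used here only as a simple stand-in; the abstract does not describe VGV's actual compression or summarization scheme, and the event names are made up.

```python
def compress_events(events):
    """Collapse consecutive repeats into (event, count) records."""
    compressed = []
    for event in events:
        if compressed and compressed[-1][0] == event:
            compressed[-1][1] += 1          # extend the current run
        else:
            compressed.append([event, 1])   # start a new run
    return [(name, count) for name, count in compressed]


# Hypothetical trace: a loop issuing repeated sends, one receive, two barriers.
trace = ["send", "send", "send", "recv", "barrier", "barrier"]
records = compress_events(trace)
```

Loops that repeat the same communication pattern thousands of times shrink to a handful of records this way. Timestamps are not preserved, so a real tracing tool needs a richer scheme.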

VGV is initially available on the SP and will be extended to other platforms.

Work performed under the auspices of the U.S. Dept. of Energy by University of California LLNL under contract W-7405-Eng-48 by KAI Software, Intel Americas, Inc & Pallas GmbH & Lawrence Livermore National Laboratory.

Presentation Materials are here.


A Technical Update on TotalView 5.0 for the SP

Chris Gottbrath chrisg@etnus.com
Etnus L.L.C.

This will be an introduction to the TotalView debugger, highlighting features of the new 5.0 version that are useful in debugging MPI programs on SP clusters. TV 5.0 provides an enhanced user interface, a new graphical message queue display, and new functionality to facilitate debugging multi-threaded applications. Topics covered will also include the traditional strengths of the TotalView debugger: parallel barriers and breakpoints, viewing and graphing data distributed across the cluster, patching code into a running program, and working with the contents of MPI message queues.

Presentation Materials are here.


Porting and Performance Evaluation of Scientific Computation Libraries

Shuxia Zhang szhang@msi.umn.edu
University of Minnesota Supercomputing Institute

Scientific and technical computing is one of the most demanding data processing environments. IBM has long been a leader in this field with its powerful SP system. However, the lack of support for some commonly used scientific computation libraries remains a problem that prevents large-scale computations from running on the fastest machines. It is also necessary to know which kinds of numerically intensive computing tasks can or cannot achieve the expected performance.

This talk will present a series of benchmark runs to address some performance issues that apply the commonly used math and scientific libraries on SP, including some intrinsic functions, the available FFT libraries verse ESSL FFT, the Parallel Algebraic Recursice Multilevel Solvers and PETSc. In addition, we will address the needs of porting the commonly used scientific libraries on SP, especially in the 64-bit distributed parallel regime. We encourage the SP users during the talk to discuss their needs and concerns in the library regards so that IBM can improve their existing products and provide the libraries, which do not exist on SP yet.

Presentation Materials are here.


The Implementation of Asynchronous I/O Servers in NCEP's Eta Model on the IBM SP

Jim Tuccillo tuccillo@us.ibm.com
IBM

The NCEP Eta model is an operational, limited-area, short-range numerical weather prediction model used by the National Weather Service for numerical guidance over the North American continent. The model is parallelized using a domain-decomposition technique and MPI for halo exchanges. Asynchronous I/O servers have recently been introduced into the model to improve its operational performance. These servers are additional MPI tasks responsible for handling preliminary post-processing calculations as well as performing the I/O of the files for post-processing or model restart. These activities can occur asynchronously with the MPI tasks performing the model integration and effectively reduce the model integration time by overlapping I/O and computation. The asynchronous I/O servers are created at model startup time and the communication between the tasks performing the model integration and the I/O serving is handled through MPI intercommunicators. The overall design, some implementation details of the I/O servers, the impact on NCEP operations, and the overall scalability of the code will be presented.
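The overlap of I/O and computation described above can be illustrated with a small sketch. This is not the Eta model's implementation: it swaps MPI tasks and intercommunicators for a Python thread and a queue so that it runs with the standard library alone, and every name in it (`io_server`, `integrate`, `run`, the toy "field") is a hypothetical stand-in.

```python
import queue
import threading


def io_server(work_queue, written):
    """Stand-in for an I/O server task: receive fields and 'write' them."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: the integration has finished
            break
        step, field = item
        # Hypothetical post-processing: reduce the field to its mean.
        written.append((step, sum(field) / len(field)))


def integrate(n_steps, work_queue):
    """Stand-in for the tasks that perform the model integration."""
    state = [0.0, 1.0, 2.0, 3.0]
    for step in range(n_steps):
        state = [x + 1.0 for x in state]     # toy "time step"
        work_queue.put((step, list(state)))  # hand off a copy, keep computing
    work_queue.put(None)
    return state


def run(n_steps=3):
    """Start the server first, then integrate while it drains the queue."""
    work_queue = queue.Queue()
    written = []
    server = threading.Thread(target=io_server, args=(work_queue, written))
    server.start()
    final_state = integrate(n_steps, work_queue)
    server.join()
    return final_state, written
```

The point of the design is that `integrate` never waits for output to complete: it hands off a copy of the data and moves on to the next step, which is the role the abstract assigns to the dedicated I/O server tasks.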

Presentation Materials are here.


Assessing Performance of Hybrid MPI/OpenMP Programs on SMP Clusters

Edmond Chow echow@llnl.gov
Lawrence Livermore National Laboratory

Computational experiences with hybrid message passing and multithreading techniques on SMP clusters generally show poorer performance than pure message passing approaches. This paper attempts to understand the performance of hybrid MPI and OpenMP programs by decomposing and describing performance using four parameters: multithreading efficiency, relative cache efficiency, network interface efficiency, and message passing scaled efficiency. These parameters are used to assess a sparse matrix-vector product kernel, which is typical of many parallel scientific computations, running on an IBM SP computer. Tests with various problem sizes using up to 216 nodes (864 processors) reveal, for example, the benefit of using a hybrid implementation compared to an MPI implementation when the computation uses small messages and is not network bandwidth limited. Otherwise, the MPI implementation generally shows superior performance.
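For readers unfamiliar with the kernel being assessed, the following is a minimal serial sketch of a sparse matrix-vector product. The abstract does not state the storage format used in the study; compressed sparse row (CSR) storage is assumed here, and the function and argument names are illustrative only.

```python
def csr_matvec(values, col_index, row_ptr, x):
    """Compute y = A @ x for a matrix A held in CSR form (assumed format).

    values    -- nonzero entries of A, stored row by row
    col_index -- column of each entry in `values`
    row_ptr   -- row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_index[k]]
        y.append(total)
    return y


# The 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
VALUES = [2.0, 1.0, 3.0, 4.0, 5.0]
COL_INDEX = [0, 2, 1, 0, 2]
ROW_PTR = [0, 2, 3, 5]
```

In a hybrid version one would typically give each MPI task a block of rows and divide the outer row loop among threads within a node. The entries of `x` owned by other tasks are what must be communicated, which is where message size and network bandwidth enter.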

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48, UCRL-JC-145332-abs.

Presentation Materials are here.


The Catalina scheduler

Kenneth Yoshimoto
San Diego Supercomputer Center

The Catalina scheduler was developed to serve as a maintainable, extendable external scheduler for LoadLeveler and other resource managers. Support of the GridForum Advance Reservation API was one of the goals of the development effort. Python was chosen as the development language, to encourage readability.

Catalina is similar to the Maui Scheduler in many ways:

Catalina is missing several important Maui (version 3.0.3.7) features, including fairshare, multiple jobs per node, job statistics tracking, and workload profiling.

Catalina is consistent with the GridForum Advance Reservation API, although it does not support all features of the API. Supported features include creating a reservation, modifying a reservation, binding a reservation to a job and canceling a reservation. Unsupported features are two-phase commit, claiming a reservation, and registering a callback function.

Catalina has several other distinguishing features. The scheduler supports shortpools. These are sets of nodes guaranteed to be available before a set lag time. Catalina allows the use of arbitrary Python code for filtering of jobs allowed to use a reservation or for filtering nodes considered for use in a reservation.
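As an illustration of filtering jobs with arbitrary Python code, a filter could be as simple as a predicate applied to each job. The sketch below is hypothetical: the job fields (`id`, `user`, `nodes`) and the function names are assumptions made for this example and are not taken from Catalina's actual interface.

```python
def make_job_filter(allowed_users, max_nodes):
    """Build a predicate deciding whether a job may use a reservation."""
    def job_filter(job):
        return job["user"] in allowed_users and job["nodes"] <= max_nodes
    return job_filter


def eligible_jobs(jobs, job_filter):
    """Return the ids of the jobs that the filter accepts, in queue order."""
    return [job["id"] for job in jobs if job_filter(job)]


# Hypothetical queue contents.
JOBS = [
    {"id": "j1", "user": "alice", "nodes": 8},
    {"id": "j2", "user": "bob", "nodes": 4},
    {"id": "j3", "user": "alice", "nodes": 64},
]
```

Because the filter is ordinary code rather than a fixed set of options, a site can express policies the scheduler's authors did not anticipate. The same pattern would apply to filtering the nodes considered for a reservation.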

Performance and functionality have proven adequate in production on the SDSC Blue Horizon IBM SP.

Presentation Materials are here.


The Tool Gear Infrastructure Project

John Gyllenhaal gyllen@llnl.gov
John May johnmay@llnl.gov
Brock Wilcox wilcox10@llnl.gov
Lawrence Livermore National Laboratory

The goal of the Tool Gear Infrastructure project is to provide the gear for building new ASCI-scale parallel tools quickly. Tool Gear provides a set of components that are typically needed by small to moderate data volume tools.

This includes:

An overview of the Tool Gear modules under development will be presented, as well as our experiences with them on LLNL's IBM SP systems. It is our intent to release Tool Gear as open-source software when completed.

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, UCRL-JC-143800-abs.

Presentation Materials are here.


Using Hardware Performance Monitoring for Application Performance Tuning on the IBM Power3

Shirley Moore shirley@cs.utk.edu
Philip Mucci mucci@cs.utk.edu
Daniel Terpstra terpstra@cs.utk.edu
University of Tennessee

PAPI is a cross-platform interface to hardware performance counters found on most modern microprocessors. The reference implementation of PAPI consists of a platform-independent portion and a number of platform-dependent substrates. PAPI includes a proposed standard set of events, of which as many as possible are mapped to native events (sometimes to a combination of native events) available on a given platform. PAPI also provides access via standard routines to native events and counting modes. In addition, PAPI provides user callbacks when a counter overflows a threshold and SVR4-compatible profiling based on any counter event. Where multiplexing is not supported by the operating system (as on the IBM), PAPI implements software multiplexing of counter events, thus allowing more events than the number of physical counters to be counted simultaneously. Where the required support from the operating system is available (as on the IBM), PAPI supports virtualized per-thread counters for threads packages such as OpenMP and Pthreads.
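The idea behind software multiplexing can be sketched as follows. This is a toy simulation of the general technique (time-slice the events across the physical counters, then scale each observed count by the fraction of time the event was actually counted), not PAPI's code or API. The event names and the constant event rates are assumptions chosen to keep the arithmetic easy to follow.

```python
def round_robin_groups(events, n_counters):
    """Split events into groups that each fit on the physical counters."""
    return [events[i:i + n_counters] for i in range(0, len(events), n_counters)]


def multiplexed_estimates(true_rates, n_counters, n_slices, slice_time=1.0):
    """Estimate full-run counts when only n_counters events count at once.

    true_rates maps each event name to events per unit time (toy model).
    """
    events = sorted(true_rates)
    groups = round_robin_groups(events, n_counters)
    observed = {event: 0.0 for event in events}
    active = {event: 0.0 for event in events}
    for time_slice in range(n_slices):
        # Only one group is resident on the counters during each slice.
        for event in groups[time_slice % len(groups)]:
            observed[event] += true_rates[event] * slice_time
            active[event] += slice_time
    total_time = n_slices * slice_time
    # Scale each count up by (total time) / (time the event was counted).
    return {
        event: observed[event] * total_time / active[event]
        for event in events
        if active[event] > 0
    }


# Hypothetical event names and rates, with two physical counters.
RATES = {"cycles": 100.0, "fp_ops": 10.0, "l1_misses": 4.0, "loads": 30.0}
```

With constant rates the scaled estimates match the true totals exactly. On real programs the rates vary between slices, so multiplexed counts are statistical estimates whose accuracy depends on the run being long relative to the slice length.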

Presentation Materials are here.