SCICOMP5 Abstracts and Presentation Materials



Abstracts and Presentation Materials for the Tutorials and Talks

Tutorials

One day of the meeting (date to be announced) will be dedicated to tutorials, which will be presented by IBM staff in the same rooms as the regular sessions.
Tutorial abstracts are provided below.

Presentations

The regular sessions will last two and a half days.
Presentation abstracts, given below, are linked to their respective items in the meeting agenda. Aside from the Keynote Address, abstracts in this section are divided into two sets: those by IBM staff and those by SP system users.

Presentation Materials

Presentation materials that are provided will be linked to the abstracts below.


Tutorial Abstract

Tutorial 1: Optimization and Tuning on Power 4 Systems
Charles Grassl, Bob Blainey, and Luiz Derose, IBM


The pSeries model 690 ("Regatta") and recently introduced model 670 systems use the POWER4 processor. Though the processor architecture and instruction set have few changes, the overall system design is much different from previous POWER3 based systems.

The POWER4 systems have the same programming paradigm as used on POWER3 systems: both shared memory and distributed memory, or threads and tasks. The POWER4 systems have an extra level of cache and another level of memory hierarchy. These system features, along with more shared resources, have subtle effects on programming techniques.

In this tutorial we will discuss both the architectural features and how to exploit them in programming techniques. We will also provide a thorough review of the performance features of the latest compilers and how they can be used for maximum impact on POWER4.

Presentation materials are here:

"Performance Programming with IBM pSeries Compilers" in Adobe PDF Format
"Optimization and Tuning on Power 4 Systems" in Adobe PDF Format


Presentation Abstracts

Keynote Address:

Prof. P. J. Durham, Director of the Computational Science and Engineering Dept., CLRC

Presentation materials are here:


IBM Presentations

IBM Scientific and Technical Computing Roadmap
Carol Crothers, IBM


Presentation materials are here:


MPI Update
Dick Treumann, IBM


Perhaps the most valuable extension to MPI provided by MPI-2 is MPI-IO.

MPI-IO provides the programmer with techniques to declare the file IO an application will use in a way which can be exploited by the MPI implementation to improve IO performance. Much of the value of MPI-IO depends on well designed use of MPI_Datatypes as fileviews. I will present an overview of how the MPI-IO API is designed to exploit fileviews and how the IBM implementation of MPI-IO is structured.

After presenting this overview of MPI-IO I will give some suggestions for using MPI-IO effectively. Most of the suggestions will apply to any MPI-IO implementation but there will also be a discussion of IBM hints for MPI-IO.
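As a rough illustration of the fileview idea described in this abstract, the following pure-Python sketch models which byte offsets of a shared file each task would see under a simple strided view. It does not call MPI, and the function name `view_offsets` and the parameters `block_bytes` and `nblocks` are assumptions made for this example only. In an MPI program, a comparable view would typically be declared with a derived datatype (for example one built with MPI_Type_vector) passed to MPI_File_set_view.

```python
def view_offsets(rank, ntasks, block_bytes, nblocks):
    """Return the file byte offsets visible to `rank` under a strided view.

    Model (an assumption for illustration): the file is a sequence of blocks
    of `block_bytes` bytes dealt round-robin to `ntasks` tasks. The view of
    task `rank` starts at a displacement of rank * block_bytes and repeats
    with a stride of ntasks * block_bytes.
    """
    displacement = rank * block_bytes
    stride = ntasks * block_bytes
    offsets = []
    for b in range(nblocks):
        start = displacement + b * stride
        offsets.extend(range(start, start + block_bytes))
    return offsets


if __name__ == "__main__":
    # Three tasks, two-byte blocks, two blocks per task.
    for r in range(3):
        print(r, view_offsets(r, ntasks=3, block_bytes=2, nblocks=2))
```

Because each task declares its whole access pattern up front in this way, an implementation is free to combine the requests of several tasks into larger contiguous file operations.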

Presentation materials are here: Adobe PDF Format


Production Tools Update
Dave Wootton, IBM


This talk will cover DPCL and the tools in the latest Parallel Environment release. DPCL is a C++ class library which provides the infrastructure for building application development tools. The tools in Parallel Environment include tools for tracing an MPI application and for profiling applications.

Presentation materials are here: Adobe PDF Format


ACTC Tools and Libraries
David Klepacki, IBM ACTC


This presentation will highlight some of the application development tools for MPI and SHMEM applications. The emphasis will be on TurboSHMEM. Running Cray SHMEM applications on the IBM SP has always been a challenge. Due to the very different semantics of the MPI-2 Put/Get specifications, it has not been possible to port from the SHMEM API efficiently using MPI. However, the IBM LAPI interface is ideal for this purpose. IBM LAPI is a zero-copy, one-sided communication layer that is part of the IBM PSSP software. It is the lowest layer of communication software that is device-independent and open to the SP programmer. A complete implementation of the Cray SHMEM API has been developed for the IBM SP based upon LAPI. The "turbo" qualification is meant to emphasize that optimizations have been included to exploit the shared memory on the SP nodes. This presentation will discuss some of these optimizations and their impact on application performance.

Presentation materials are here: Adobe PDF Format


The HPM Toolkit: An Interface for Hardware Measurements on IBM Systems
Luiz Derose, IBM ACTC


In this talk I will discuss the AIX interface to access the hardware performance counters on the Power4 and present the Hardware Performance Monitor (HPM) Toolkit, which was developed for hardware performance measurements of applications running on IBM SP, Regatta, and Linux systems. The HPM Toolkit is an easy to use interface for data capture and analysis that supports serial and parallel applications, written in Fortran, C, and C++. It was designed to collect hardware events with low overhead and minimum measurement error, and to display a rich set of derived metrics. These derived metrics allow users to correlate the behavior of the application to one or more of the components of the hardware.
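To make the notion of derived metrics concrete, here is a minimal sketch that turns raw event counts into a few ratios. The counter names (`cycles`, `instructions`, `fp_ops`, `loads`, `l1_misses`) and the function `derived_metrics` are hypothetical and chosen for illustration; they are not the event names or the interface of the HPM Toolkit.

```python
def derived_metrics(counts, wall_seconds):
    """Compute a few derived metrics from raw hardware event counts.

    `counts` is a dict of hypothetical counter names to event totals;
    `wall_seconds` is the measured wall-clock time of the instrumented region.
    """
    return {
        # Instructions completed per processor cycle.
        "ipc": counts["instructions"] / counts["cycles"],
        # Millions of floating-point operations per second.
        "mflops": counts["fp_ops"] / wall_seconds / 1e6,
        # Fraction of loads that missed the first-level cache.
        "l1_miss_ratio": counts["l1_misses"] / counts["loads"],
    }


if __name__ == "__main__":
    sample = {
        "cycles": 2_000_000,
        "instructions": 3_000_000,
        "fp_ops": 1_000_000,
        "loads": 500_000,
        "l1_misses": 25_000,
    }
    for name, value in derived_metrics(sample, wall_seconds=0.5).items():
        print(f"{name}: {value:.3f}")
```

Ratios of this kind relate application behavior to a specific hardware component, such as the floating-point units or the first-level cache.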

Presentation materials are here:


Dynamic Logical Partitioning on AIX Power 4 Systems
Stephen Peckham, IBM


In December 2001, IBM introduced Power 4 SMP machines, which provide new capabilities, such as being able to partition the resources of the machine into multiple, logical systems. In this presentation, I will discuss enhancements to AIX on Power 4 systems, such as dynamic logical partitioning (DLPAR), memory affinity, capacity upgrade on demand, and large page support.

Presentation materials are here: Adobe PDF Format


Ab-initio Material modeling in IBM Research: extending the size limit via code optimization
Alessandro Curioni, IBM Zurich


Presentation materials are here:


High Performance Computing Roadmap at IBM
Jamshed Mirza, IBM


HPC systems based on IBM's Power4/AIX and Linux Clusters have been widely deployed over the last year and are productively being used. The talk will briefly discuss the outlook for HPC systems from IBM over the next few years. While the primary focus will be on the pSeries/AIX line of systems, we will also discuss Linux systems for HPC.

Presentation materials are here:


IBM Compiler Update
Bob Blainey, IBM


The IBM compiler group is in the final stages of releasing VisualAge C/C++ 6.0 and XL Fortran 8.1. Both of these are major releases, bringing together new language function, new performance advances and full support for the new Power4-based pSeries servers. I will describe the new release content with particular focus on performance issues, including new compiler features such as OpenMP 2.0, C99, and Fortran 200x as well as enhancements to optimization capabilities. I will also provide a peek into the work going on for the following releases.

Presentation materials are here: Adobe PDF Format


File Systems Update
Dave Craft, IBM


This session will cover General Parallel File System status and directions for AIX SP, AIX HACMP, and Linux platforms.

Presentation materials are here: Adobe PDF Format


High Performance Computing at IBM: AIX, Linux and the Grid
David Klepacki, IBM ACTC


This presentation will outline IBM's directions in High Performance Computing, including MPP trends as well as distributed computing via the Grid. The emphasis will be on cellular architectures, such as the popular systems code-named "Blue Gene". Such architectures integrate processor, memory, and communication fabric into a single computational "cell" that can be replicated, in principle, indefinitely. The first prototype machines of this research will consist of tens of thousands of processor cells, easily capable of achieving total floating-point performance in the range of 50 - 300 Teraflops. The machines are designed to use standard programming models, such as MPI message-passing, along with standard high-level programming languages such as Fortran and C/C++. Many scientific disciplines will benefit from this computing architecture, including fluid dynamics, chemistry, materials science, engineering, and the life sciences.

Presentation materials are here: Adobe PDF Format


User Presentations

Hybrid Monte Carlo Parallel Algorithms for Matrix Computation
Vassil Alexandrov, University of Reading, Department of Computer Science

We consider hybrid (fast stochastic approximation and deterministic refinement) algorithms for Matrix Inversion (MI) and Solving Systems of Linear Equations (SLAE). Monte Carlo methods are used for the stochastic approximation, since it is known that they are very efficient in finding a quick rough approximation of an element or a row of the inverse matrix, or of a component of the solution vector. In this paper we show how the stochastic approximation of the MI can be combined with a deterministic refinement procedure to obtain MI with the required precision and further solve the SLAE using MI. We employ a splitting A = D - C of a given non-singular matrix A where D is a diagonally dominant matrix and matrix C is a diagonal matrix. In our algorithm for solving SLAE and MI, different choices of D can be considered in order to control the norm of the matrix T = D^{-1}C of the resulting SLAE and to minimize the number of Markov chains required to reach a given precision. Experimental results with dense and sparse matrices are presented. The algorithms run on an IBM SP 3 machine in Reading.

Examples drawn from various applications such as Air Pollution Modelling and Information Retrieval are presented.
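The two-stage idea in this abstract (a rough stochastic inverse followed by deterministic refinement) can be sketched in a few lines of pure Python. This is a sketch under stated assumptions, not the authors' algorithm: it assumes the simple splitting D = diag(A) with iteration matrix T = D^{-1}C, it estimates the Neumann series of T with random walks, and it uses the Newton-Schulz iteration as a stand-in for the refinement procedure, which the abstract does not specify. All function names are hypothetical.

```python
import random


def matmul(P, Q):
    """Plain dense matrix product for lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def residual(A, X):
    """Largest absolute entry of A X - I."""
    AX = matmul(A, X)
    n = len(A)
    return max(abs(AX[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


def mc_inverse(A, chains=2000, max_steps=30, seed=0):
    """Monte Carlo estimate of inv(A) from the series sum_k T^k D^{-1}."""
    rng = random.Random(seed)
    n = len(A)
    # Assumed splitting: D = diag(A), C = D - A, hence T = D^{-1} C.
    T = [[0.0 if i == j else -A[i][j] / A[i][i] for j in range(n)]
         for i in range(n)]
    abs_T = [[abs(t) for t in row] for row in T]
    row_sum = [sum(row) for row in abs_T]
    X = []
    for i in range(n):
        acc = [0.0] * n  # accumulates row i of sum_k T^k
        for _ in range(chains):
            state, weight = i, 1.0
            acc[state] += weight
            for _ in range(max_steps):
                if row_sum[state] == 0.0:
                    break  # no outgoing transitions from this state
                # Jump to column j with probability |T[state][j]| / row_sum.
                nxt = rng.choices(range(n), weights=abs_T[state])[0]
                # Importance weight T / p equals sign(T) * row_sum.
                sign = 1.0 if T[state][nxt] > 0 else -1.0
                weight *= sign * row_sum[state]
                state = nxt
                acc[state] += weight
        # inv(A) = (I - T)^{-1} D^{-1}: scale column j by 1 / A[j][j].
        X.append([acc[j] / chains / A[j][j] for j in range(n)])
    return X


def refine(A, X, iterations=8):
    """Newton-Schulz refinement X <- X (2I - A X); needs ||I - A X|| < 1."""
    n = len(A)
    for _ in range(iterations):
        AX = matmul(A, X)
        R = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)]
             for i in range(n)]
        X = matmul(X, R)
    return X


if __name__ == "__main__":
    A = [[10.0, 1.0, -2.0], [1.0, 8.0, 1.0], [-1.0, 2.0, 9.0]]
    X0 = mc_inverse(A)
    print("residual after Monte Carlo stage:", residual(A, X0))
    print("residual after refinement:", residual(A, refine(A, X0)))
```

The random walks converge only when the row sums of |T| are below one, which is why the choice of D matters: a smaller norm of T shortens the chains and reduces the variance of the estimate. The rows of the inverse are estimated independently, so the stochastic stage parallelizes naturally.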

Presentation materials are here:


Strategies for Scaling Parallel I/O on the IBM SP
David Skinner, Berkeley Lab, NERSC


This presentation details several strategies for moving data between parallel applications and disk on NERSC's IBM SP, seaborg.nersc.gov. A quantitative comparison and summary of strategies for parallel I/O are provided. Issues of scaling with respect to total data size and number of concurrent tasks are addressed.

The first section outlines the hardware and software configuration specific to NERSC. The next section presents the implementation goals and constraints. The following subsections present strategies to accomplish these goals. Each description includes an overview and example of source code for writing and reading files. The last section provides conclusions.

Presentation materials are here: PowerPoint Source


Programming next generation HPC systems: the mixed-mode model
Lorna Smith, Mark Bull; EPCC, Edinburgh University


Combining MPI and OpenMP to program SMP clusters is an obvious strategy, as it combines portability with the ability to exploit the hardware architecture to its full potential. In this talk, the advantages, disadvantages and pitfalls of this strategy will be discussed and illustrated with real examples. The characteristics of both application and architecture which determine the success of the strategy will be identified, i.e. will it deliver better performance than a pure MPI implementation? For example, benefit may be obtained if the MPI implementation suffers from: poor scaling with MPI processes due to load imbalance or too fine a grain problem size, memory limitations due to the use of a replicated data strategy, or a restriction on the number of MPI process combinations.

Presentation materials are here: PowerPoint Source


Saudi Aramco's Experiences with IBM SP Clusters
Yasir A. Rafie, Anthony A. Sirtautas, Marvin H. Arens, Joseph Cignoli, Michael J. Critchley, Mohammed A. Rashed, Abdulaziz S. Sharikh, Abdulaziz A. Buali, Hameed A. Hussain; Saudi Aramco, Geophysical Applications Division Seismic Processing Support Group


Saudi Aramco has been processing seismic data on supercomputers since the early 1980s. It has seen a transition from the various Cray supercomputers through different types of IBM SP nodes. Saudi Aramco is currently listed 18th in the Top500 Super Computer sites maintained by www.top500.org. It currently holds the rank of 47th with its 32x16 CPU (512 CPUs in total) NighthawkII IBM SP Cluster. Also installed is more than 441TB of online disk space, directly accessible by each node on the cluster using GPFS.

The presentation will start with an overview of how the supercomputers used at Saudi Aramco for seismic processing have evolved to the present day. A brief comparison of performance figures between the different systems will be shown.

A more detailed description of the current IBM SP cluster will be shared. Experiences on how the current production system is utilized, from an application point of view, using key technologies like GPFS, LoadLeveler and MIO, will be highlighted. Issues that have been encountered and resolved and current outstanding issues will be discussed.

As a wrap up, future activities will be touched upon.

Presentation materials are here: PowerPoint Source


An Operational Parallel Weather Prediction System
Mats Hamrud, David Dent; ECMWF


The Integrated Forecasting System (IFS) provides the software for operational weather forecasting and observational analysis at ECMWF. It must run efficiently over substantial numbers of processors in a time critical way. An overview of the software will be presented together with some indications of its parallel efficiency when executing on several computing platforms.

Presentation materials are here: PowerPoint Source


Computational Chemistry Applications: Performance on High-End and Commodity-class Computers
Martyn Guest, Daresbury Laboratory


Commodity-based clusters, at face value, offer the potential of a viable, cost-effective alternative for the provision of High Performance Computing. In this paper we compare the performance of a variety of clusters built from commodity "off the shelf" components in the support of major research and production codes, with current high-end hardware such as the IBM SP, Compaq AlphaServer SC and SGI Origin 3800. The results concentrate on the application area of computational chemistry.

Benchmark data on nine commodity-based systems (CS1-CS9) featuring Intel IA32 and IA64, AMD Athlon and Alpha CPU architectures coupled to traditional Beowulf interconnect, such as Myrinet and Ethernet, are presented. Furthermore, we provide performance data on systems utilising both the Quadrics QSNet and SCALI SCI interconnect technology, together with initial results from the IBM SP/Regatta-H.

Presentation materials are here: Adobe PDF Source


Study of OpenMP with MPI for ECMWF's production weather code (IFS) on IBM Nighthawk2
John Hague, IBM UK; Deborah Salmond, ECMWF


IFS was designed to be run on a massively parallel distributed memory system using MPI. More recently, OpenMP was introduced to enhance the possibility for parallelisation on systems comprising shared memory nodes. Where possible, OpenMP directives have been put at a high level, which results in very efficient parallelisation. However, many small parallel regions have also been added to ensure as much as possible of the IFS is inside an OpenMP parallel region. Some of these smaller parallel regions do not give very good scalability.

We have found that when running the mixed MPI/OpenMP version on IBM Nighthawk2, the MPI performance is very good, and so this study concentrates on factors which influence the OpenMP scalability.

It was found that when running two or more OpenMP threads the OpenMP scalability was not as great as expected. In fact it was often better to increase the number of MPI tasks rather than the number of OpenMP threads. This fact may come as a surprise to many people and this paper investigates the reasons for this less than optimal OpenMP scalability.

The main areas which are considered are :

Each of these reasons is investigated using simple test programs, and compared with the actual measurements obtained from IFS using the hardware performance monitor.

The paper concludes that there are easily understandable reasons for the OpenMP scalability being less than optimal and that, for IFS, increasing the number of MPI tasks can be a better way of achieving greater parallelisation.

Presentation materials are here: PowerPoint Source


On the effect of environment variables
Jesus Labarta, Judit Gimenez, Jordi Caubet; CEPBA-UPC


The actual behavior of the MPI implementation is controlled by a number of environment variables and the actual network used. A very good presentation of the concepts behind such environment variables was given at SciComp4. A typical way to quantify the performance implications of such settings is to run an application with different configurations and report the elapsed times. To measure the effect of individual environment variables, microbenchmarks that specifically stress that feature are frequently used.

The objective of this talk is to analyze and try to understand in detail the actual effect of different settings of such environment variables on the behavior of real applications rather than microbenchmarks. The analysis will be performed with Paraver and will be focused on the detailed implications for individual communication requests instead of just reporting a global elapsed time.

The methodology used consists of running the applications (sweep3d, NAS benchmarks, ...) under different environments and analyzing in detail aspects such as the distribution (histogram) of time duration of the different MPI calls. A study of the effect of environment variables on hardware counter derived metrics (instruction counts, cache misses, ...) will be presented.

The analysis will extensively rely on the newly developed 2D analysis capability of Paraver.

Presentation materials are here: Adobe PDF Source


TotalView on the SP: debugging in the large
James Cownie, Etnus LLC


This talk will discuss TotalView on the SP. I discuss the experience gained in debugging codes on large SP machines such as ASCI Blue and ASCI White, and the improvements which have been and continue to be made to TotalView to support such large scale machines.

I also describe the new features of TotalView 5.1, which is due for release later this year.

Presentation materials are here: PowerPoint Source


Early Evaluation Results from the p690 System at Oak Ridge National Laboratory
Patrick H. Worley, Thomas H. Dunigan; Oak Ridge National Laboratory


We describe the p690 system at ORNL and results from our initial evaluation studies. Data from both standard and custom benchmarks are used to examine serial, per node, and multiple node performance. In particular, we look at the effect of memory bandwidth on performance, and contrast full 32-processor "turbo" nodes, nodes that have been partitioned into four 8-processor LPARs, and turbo nodes that are used like 16 processor "HPC" nodes. We finish with a comparison between systems from IBM and other vendors using applications drawn from global climate modelling, fusion, and astrophysics.

Presentation materials are here: HTML Files


Some peculiarities in the memory subsystem of the IBM pServer 690
Ulrich Schwardmann, GWDG Göttingen


The memory subsystem of the IBM pServer 690 has four levels. All components except the Level1-cache are shared resources between the processors. This memory hierarchy and its complicated interdependences of course have an impact on the performance of code, serial as well as parallel. On a system with an unbalanced memory configuration, for instance, processors without direct memory access can in special cases achieve better performance figures than processors that are connected directly to memory. The talk will present such peculiarities in the performance figures of serial and parallel basic kernel loops as a function of vector length, and will try to explain them.

Presentation materials are here: HTML Files


Performance Evaluation of KISTI Out-of-core Sparse Solver
Jeong Ho Kim, Minsu Joh, Sangsan Lee; KISTI (Korea Institute of Science and Technology Information)


A high-performance out-of-core sparse solver is being developed by the KISTI supercomputing center, Korea. This solver is based on an element- or domain-wise multifrontal algorithm and is designed to solve huge finite element analysis problems. An out-of-core technique is used extensively to solve problems as large as possible within a given physical memory size. Furthermore, distributed-memory parallelization makes it possible for the solver to handle much larger problems which do not fit in a single physical memory. In this presentation, the solver and our target applications for it will be introduced briefly. Then, the results of serial and parallel performance evaluation on various systems including POWER4 will be presented, followed by an analysis of the performance results.

Presentation materials are here:


Porting scientific codes to the IBM Regatta: initial experiences
Renate Dohmen, Computing Centre Garching of the Max-Planck-Gesellschaft


The Computing Centre Garching provides computing, archiving and network services for a great number of users from various disciplines of basic research. The high computational demands of the users made it necessary to enter the so-called high-performance computing business quite early. For about 10 years the computing centre has operated parallel computers in addition to scalar and vector machines. In 1995, a large Cray T3E with more than 800 processors was installed. Meanwhile all important codes have been parallelized and are running with great success on the T3E. For reasons of portability many of them use MPI as communication library; some of them, however, use Cray's SHMEM library to achieve optimal performance. Similarly, many codes use portable numerical libraries such as BLAS, LAPACK, ScaLAPACK and NAG; some, however, use the native Cray scilib for one reason or another.

Now that the Cray is getting on in years, a new machine has been bought as its successor. This machine is an IBM Regatta, a cluster of Power-4 SMP nodes, each with 32 processors. We started with a small machine within the framework of an early shipment program in October 2001. The present system, which was installed in January 2002, consists of 6 nodes, but the final system will be even larger. Work is now under way to port codes from other machines, above all from the Cray T3E, to the IBM Regatta. The experiences gained in porting scientific codes are as diverse as the codes themselves. The presentation will report our initial experiences with respect to performance, the programming environment, and other topics of significance for the acceptance of the machine among our users.

Presentation materials are here:
        UU-Encoded GZipped file of PostScript slide files
        HTML file containing links to individual PostScript slide files