SCICOMP10 Abstracts and Presentation Materials

Abstracts and Presentation Materials for the Tutorials and Talks

Tutorials

Presentations

Presentation abstracts, given below, are linked to their respective items in the meeting agenda. Aside from the Keynote Address, abstracts in this section are divided into two sets: IBM presentations and user presentations.

Presentation Materials

Presentation materials that are provided will be linked to the abstracts below.


Tutorial Abstracts

Tutorial 1: Compilers, Message Passing and Optimization for POWER4 and HPS (Federation) Systems

Roch Archambault, IBM
Charles Grassl, IBM
Pascal Vezolle, IBM

optimization

resources

architecture

HPS

compilers

IBM recently introduced the POWER5 processor, the successor to the POWER4. The new processor is broadly similar to the earlier POWER3 and POWER4 processors, but has several important differences and new features.

The memory hierarchy of POWER5 systems has changed: the third level of cache is shared only by the pair of processors on a chip. The programming consequence of this design change is that "blocking" sizes should be larger.
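As a hedged illustration of what "blocking" means here (this example is ours, not from the tutorial materials), the C sketch below tiles a matrix multiply; the block size BS is an assumed tuning parameter that would be chosen larger on POWER5 to match the larger share of L3 cache available to each processor pair.

    /* Blocked (tiled) matrix multiply: each BS x BS tile of b is reused
       from cache before being evicted.  N and BS are assumptions; BS
       would be tuned to the cache level being targeted, and N must be a
       multiple of BS in this simplified sketch. */
    #define N  1024
    #define BS 128

    static double a[N][N], b[N][N], c[N][N];  /* assume a, b are filled in */

    void dgemm_blocked(void)
    {
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double s = c[i][j];        /* partial sum so far */
                        for (int k = kk; k < kk + BS; k++)
                            s += a[i][k] * b[k][j];
                        c[i][j] = s;
                    }
    }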

POWER5 systems retain many features of POWER4 systems, including multiple page sizes and logical partitioning. New features include Simultaneous Multithreading (SMT), which allows two threads to run simultaneously on the same processor.

The latest switch product, named the High Performance Switch (HPS) and also known as "Federation", is replacing the previous switch product, the SP Switch 2, also known as "Colony". The HPS has very different performance characteristics from its predecessor: more than five times the bandwidth, and latency reduced by a factor of three.

We will describe the latest C and Fortran compilers and their specific features relevant to the POWER5 processor. We will also describe the performance optimization facilities available in the C and Fortran compilers and the most effective tactics for leveraging them.

We will follow this with an overview of optimization techniques that exploit the features of the POWER5 processor. We will also discuss the use, exploitation, and ramifications of memory affinity, large memory pages, and SMT.

We will also discuss message passing over the new HPS switch. The new switch has slightly different tuning characteristics from previous pSeries switches, and this tuning involves several new environment variables.
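As a hedged aside (ours, not the presenters'), latency and bandwidth claims like those above are typically checked with a point-to-point ping-pong microkernel such as the sketch below; the message size and repetition count are arbitrary assumptions, and the program expects to be run with at least two MPI tasks.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREPS  1000
    #define NBYTES (1 << 20)   /* 1 MB messages; an arbitrary choice */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(NBYTES);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {           /* rank 0 sends first, then waits */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {    /* rank 1 echoes every message */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;

        if (rank == 0)   /* each round trip moves 2 * NBYTES */
            printf("avg round trip %g s, bandwidth %g MB/s\n",
                   dt / NREPS, 2.0 * NBYTES * NREPS / dt / 1e6);
        MPI_Finalize();
        free(buf);
        return 0;
    }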


Tutorial 2: GPFS: Programming, Configuration and Performance Perspectives

Ray Paden, IBM

talk intro

talk

GPFS is a mature, robust parallel file system available on IBM systems running either AIX or Linux. It supports the simple-to-use POSIX I/O API in a manner that delivers superior performance when adroitly used and configured, and it also provides extensions to the POSIX API that address selected challenging performance issues. This presentation will examine GPFS features useful to the HPC applications programmer, including coding examples. It will also examine configuration alternatives that yield different performance profiles and scaling-versus-cost trade-offs. The results of several benchmark studies will be presented to illustrate these points.
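As a hedged illustration of "adroit use" of the plain POSIX API on a parallel file system (ours, not Dr. Paden's), the sketch below issues large, aligned sequential writes, which generally let GPFS stripe I/O efficiently across disks. The 4 MB transfer size is an assumption standing in for the file system block size, and the GPFS-specific hint extensions mentioned above are left out.

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define XFER (4 * 1024 * 1024)  /* assumed to match the GPFS block size */

    /* Write nblocks large, aligned chunks sequentially; the buffer
       contents are irrelevant for a bandwidth test. */
    int write_big(const char *path, int nblocks)
    {
        char *buf;
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (posix_memalign((void **)&buf, XFER, XFER) != 0) {
            close(fd);
            return -1;
        }
        for (int i = 0; i < nblocks; i++)
            if (write(fd, buf, XFER) != XFER) {   /* one full-size request */
                free(buf);
                close(fd);
                return -1;
            }
        free(buf);
        return close(fd);
    }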

Tutorial 3: Deploying Linux-Based HPC Clusters

Robin Goldstone, LLNL and
TBD, IBM

talk materials

For the past several years, LLNL has invested significant resources in the design and deployment of Linux-based HPC clusters. This tutorial will cover what it takes to make a large Linux cluster work well at a supercomputing center, and the unique capabilities that Linux and open source bring. The presentation will highlight significant aspects of LLNL's HPC Linux cluster solution.

A matrix of Linux tools and capabilities for xSeries and pSeries platforms will be presented and compared. A demonstration of CHAOS cluster management tools will be provided.

Tutorial 4: Bridging Programming Languages with Babel

Tom Epperly, LLNL

talk materials

Babel enables arbitrary mixing of C, C++, Fortran77, Fortran90, Java, and Python at maximum performance for scientific computation. This means languages are mixed in the call stack of a single executable: no messaging, no data copying, and no interpreted middleware. Far from a lowest common denominator solution, Babel actually adds features like polymorphism, exception handling, dynamic loading, and efficient multi-dimensional arrays to languages that don't support them natively. Our Scientific Interface Definition Language (SIDL) defines the object model that Babel supports uniformly across languages.


Presentation Abstracts

Keynote Address

Dr. John R. Boisseau

Director Texas Advanced Computing Center
University of Texas at Austin

Integrating Resources from Personal Scale to Terascale for Scientific Research

talk

The transition of supercomputers from being predominantly Crays to more modular systems such as IBM SPs, and eventually Linux clusters, removed the huge gap between the capabilities of low-end and high-end systems, enabling a spectrum of computing power matched to workload requirements (and budgets). This transition instigated the philosophy of "from desktop to teraflop", in which researchers would have access to computing systems of various capabilities in their offices, departments, institutions, and national centers, and be able to choose the appropriate resource for each particular task. The advent of grid computing then stimulated a vision of integrating this diverse range of computing power more tightly to simplify and automate tasks and to increase throughput and capability through resource scheduling and coordination. However, while there is much hype in the field of grid computing, this ultimate vision has not yet been achieved consistently. There are still relatively few production computing grids, and most are focused on a narrow range of computational systems (e.g., all PCs in idle-cycle-harvesting grids, or all high-end systems in projects such as TeraGrid) or are more focused on data management than metascheduling.

The Texas Advanced Computing Center (TACC) and IBM have partnered to develop, deploy, and operate a comprehensive campus cyberinfrastructure, UT Grid, to tackle the most challenging issues in this vision of grid computing. The goals of this project include integrating personal-scale systems (the desktops and laptops of researchers and educators) as interfaces to and components of a production campus grid with diverse resources, ranging from these PCs to lab workstations, departmental servers and clusters, and the terascale systems at TACC: the entire spectrum of campus computing capabilities. Furthermore, the project will integrate data storage systems, data collections, visualization systems and displays, and instruments into this campus grid. Finally, the project will explore new research and education models for using this diversity of resources, including using terascale resources from personal-scale devices in labs and classrooms.


IBM Presentations

IBM Keynote Address

Peg Williams, IBM


Presentation Materials are here.


User Presentations

Info will be provided as it becomes available. Accepted submissions will be posted in late July.

Early experiences with Datastar
Giridhar Chukkapalli

talk materials

In this talk, we will report initial experiences with Datastar, a 176-node POWER4 system with the Federation interconnect. We will present both widely accepted benchmark results and full-scale SDSC applications, and will compare the single-processor and parallel performance of Datastar with that of its POWER3-based predecessor, Bluehorizon. Preliminary results indicate that users can expect a 2.5- to 4-fold per-processor performance improvement over Bluehorizon.


Federation Performance on the p690 cluster at ORNL
Patrick H. Worley

talk materials

The Federation interconnect was installed on the p690 cluster at Oak Ridge National Laboratory early in 2004. We describe both communication microkernel and communication-sensitive application code performance on this cluster using the June GA Federation microcode. Where available, performance is compared with that of earlier microcode versions and with the Colony interconnect on the same cluster of p690s.


Memory Debugging with TotalView
Chris Gottbrath

talk materials

This talk will discuss the recent addition of memory debugging functionality to the Etnus TotalView debugger. TotalView gives AIX and Linux-Power users a new way to debug memory problems, whether in a single process or across an entire cluster. The talk will introduce the architecture and capabilities of Etnus's memory debugging technology and show how it is applied in the cluster context.
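For readers unfamiliar with the class of problems involved, the hedged example below (ours, not from the talk) packs together three classic heap errors that memory debuggers of this kind are designed to flag:

    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *a = malloc(16);
        char *b = malloc(16);

        memcpy(a, "0123456789abcdef", 17);  /* heap overrun: 17 bytes into 16 */
        free(b);
        free(b);                            /* double free */
        return 0;                           /* 'a' is never freed: a leak */
    }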


Scalable High-Performance and Parallel Implementation of Out-of-Core QR and LU Factorizations
Brian Gunter

talk materials

We discuss new algorithms for the QR and LU factorizations of dense matrices that are sufficiently large that they must be stored on disk. Unique to our approach are modifications to the standard QR factorization and to LU factorization with partial pivoting that allow main memory to be filled with tiles (square submatrices) rather than the more customary slabs (blocks of columns). Slab-based algorithms suffer from the fact that as matrices become larger, the number of columns that fit in memory becomes smaller, which adversely affects performance. Tile-based algorithms were previously proposed only for the Cholesky factorization and for LU factorization with limited pivoting (which affects stability). The new approaches retain stability characteristics similar to those of the standard in-core algorithms while improving scalability. Experimental results from high-performance sequential and parallel implementations, along with applications to Earth science, are presented to support the theory. This research suggests that the expense of high-performance sequential and parallel systems can be lowered by reducing the amount of memory required for these kinds of problems, which is particularly important for embedded systems and systems designed for reduced power consumption. Performance on IBM and other systems will be presented.
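As a hedged back-of-the-envelope version of the slab-versus-tile argument (the notation is ours, not the authors'):

    With $M$ words of memory available for an $n \times n$ matrix, a slab
    of $k$ full columns must satisfy $kn \le M$, so the slab width
    $k \le M/n$ shrinks as the matrix grows; a square $b \times b$ tile
    needs only $b^2 \le M$, i.e. $b \le \sqrt{M}$, independent of $n$.
    Tiles therefore preserve their size, and with it their in-core
    efficiency, at any problem scale.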


Dense Linear Algebra Libraries
Robert van de Geijn

talk materials

We present a new paradigm for the development and implementation of high-performance linear algebra libraries with functionality similar to the BLAS and LAPACK. Traditional approaches are inherently evolutionary for a number of reasons. For example, correctness of the libraries is established through extensive (exhaustive) testing and the test of time. If changes are to be made to such libraries, they should be incremental so that this correctness is inherited by the new libraries. As a result, adding new functionality to these libraries and/or targeting new architectures is an expensive task, since changes must adhere to a philosophy inherited from the age of LINPACK and EISPACK. Our project takes a fresh (revolutionary) look at such library development. We have shown that, given a mathematical description of the operation to be computed, many algorithmic variants for computing the operation can be systematically and even automatically derived. The derivation provides a constructive proof of correctness. We have developed APIs for coding these algorithms so that the correctness of the algorithm implies the correctness of the implementation. These APIs encompass Matlab and sequential C, as well as distributed-memory parallel implementations. By choosing the most appropriate algorithm for an architecture or level of the memory hierarchy, performance that exceeds traditional LAPACK implementations can be achieved.
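As a hedged illustration of the style of algorithm that results (a generic sketch, not the authors' API), here is an unblocked right-looking Cholesky variant in C, with comments naming the partitioned updates from which such variants are derived:

    #include <math.h>

    /* Unblocked right-looking Cholesky of an n x n SPD matrix stored in
     * column-major order with leading dimension lda; only the lower
     * triangle is referenced.  Each iteration applies the partitioned
     * updates
     *     alpha11 := sqrt(alpha11)
     *     a21     := a21 / alpha11
     *     A22     := A22 - a21 * a21^T   (lower triangle only)
     * which fall out of the kind of systematic derivation described
     * above. */
    void chol_lower(double *A, int n, int lda)
    {
        for (int j = 0; j < n; j++) {
            A[j + j * lda] = sqrt(A[j + j * lda]);     /* alpha11 */
            for (int i = j + 1; i < n; i++)
                A[i + j * lda] /= A[j + j * lda];      /* a21 */
            for (int k = j + 1; k < n; k++)            /* A22 update */
                for (int i = k; i < n; i++)
                    A[i + k * lda] -= A[i + j * lda] * A[k + j * lda];
        }
    }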


Survey of MPI Call Usage
Terry Jones

talk materials

The MPI specification provides a rich set of function calls to perform various message-passing operations. We have collected MPI usage information for a dozen applications common to the LLNL computer center through the use of the profiling library mpiP. This talk will present results from our effort, including which calls appear most often, which calls consume the largest percentage of time, and which message sizes are most prevalent.
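As a hedged sketch of the mechanism involved (greatly simplified; mpiP's actual implementation records far more), a profiling library uses the MPI standard's name-shifted PMPI interface to intercept calls:

    #include <mpi.h>
    #include <stdio.h>

    /* The profiling interface lets a library define MPI_Send itself and
     * forward to the name-shifted PMPI_Send, accumulating statistics on
     * the way through. */
    static long   send_calls = 0;
    static double send_time  = 0.0;

    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = PMPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_time += PMPI_Wtime() - t0;
        send_calls++;
        return rc;
    }

    int MPI_Finalize(void)
    {
        /* report per-process totals before shutting MPI down */
        printf("MPI_Send: %ld calls, %g s total\n", send_calls, send_time);
        return PMPI_Finalize();
    }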


Modelling Core-Collapse Supernovae
Eric Myra

talk materials

A long-standing unresolved problem in astrophysics is a theory that explains the origin of the heavy elements in the universe. Stellar evolution theory can explain the processes by which nuclides of carbon and heavier elements are formed, and analysis of supernova light curves strongly suggests that supernovae are responsible for the relative abundances of these nuclides throughout the universe. However, a convincing theory of how supernova explosions eject material free of a star's gravitational pull has remained elusive.
For some 40 years, this "supernova problem" has challenged computational physicists and pushed the limits of the highest-performance computers of the day. Recently, supported by a Department of Energy SciDAC grant, our collaboration has developed scalable iterative solvers for radiation hydrodynamics that allow scaling to thousands of processors. This allows, for the first time, multidimensional supernova models to be computed using state-of-the-art physics.
I will present some recent results of our research effort at Stony Brook, performed on the POWER3 system at NERSC, emphasizing the computational advances that have enabled this progress. I will also discuss our current challenges and anticipated needs, for both high-performance computing hardware and application and systems software.


Mpipview: an MPI performance profile viewer
John Gyllenhaal

talk

Mpipview is a new, easy-to-use GUI for mpiP, a lightweight, scalable, portable, and freely available MPI profiling library. This talk will provide a brief overview of mpiP's features, describe the new GUI, and outline plans for future work.


MPI Application and Library Performance
David Skinner

talk materials

This talk treats a variety of topics related to deploying and optimizing MPI-based applications on the IBM SP. Information on application performance, variability in performance, and memory usage is presented within the context of code microkernels and a few selected applications. Comparisons of different MPI libraries are presented, as is initial work to characterize the diverse scientific workload currently running at NERSC.


On Optimizing Collective Communication
Avi Purkayastha

talk

In this talk we will discuss issues related to the high-performance implementation of collective communication operations on a distributed-memory computer architecture. Using a combination of known techniques and algorithms, along with careful exploitation of MPI communication modes, we have developed implementations that improve performance in most situations compared to those currently supported by MPICH. We show initial results obtained on an Intel Pentium 4 processor cluster.
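As a hedged illustration of the kind of algorithm at issue (a textbook technique, not the authors' implementation), the sketch below builds a broadcast from point-to-point calls using a binomial tree, reaching all p ranks in ceil(log2 p) rounds:

    #include <mpi.h>

    /* Binomial-tree broadcast from rank 0 using point-to-point messages:
     * in round k, every rank that already holds the data sends it to the
     * rank 2^k away, doubling the number of ranks covered each round. */
    void bcast_binomial(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int k = 1; k < size; k <<= 1) {
            if (rank < k && rank + k < size)          /* already has data */
                MPI_Send(buf, count, type, rank + k, 99, comm);
            else if (rank >= k && rank < 2 * k)       /* receives this round */
                MPI_Recv(buf, count, type, rank - k, 99, comm,
                         MPI_STATUS_IGNORE);
        }
    }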


Experience with a large HPS Regatta System
Klaus Wolkersdorfer

talk materials

In early 2004, Forschungszentrum Juelich (FZJ) installed a 41-node p690+ cluster with the High Performance Switch (HPS). This talk will focus on the experience of bringing this system to the latest HPS level (PTF 7). The talk will also present new and old measurements of HPS performance (latency and bandwidth), as well as some figures from selected applications.


DPOMP: An Infrastructure for Performance Monitoring of OpenMP Applications
Bernd Mohr

talk materials

Unlike MPI, which includes a standard monitoring interface (PMPI), OpenMP does not yet provide a standardized performance monitoring interface. To simplify the design and implementation of portable OpenMP performance tools, Mohr et al. [1] proposed POMP, a performance monitoring interface for OpenMP. The proposal builds on experience from previous implementations of monitoring interfaces for OpenMP [2][3][4].

In this talk we present DPOMP, a POMP instrumentation infrastructure based on dynamic probes. This implementation, built on top of DPCL, is the first implementation of POMP based on binary modification rather than on a compiler or pre-processor. The advantage of this approach lies in its ability to insert performance instrumentation into the binary, without requiring access to the source code or recompilation, whenever a new set of instrumentation is required. This contrasts with the most common instrumentation approach, which statically augments source code with calls to specific instrumentation libraries. In addition, since it relies only on the binary, this POMP implementation is programming-language independent.

DPOMP takes as input an OpenMP application binary and a POMP-compliant performance monitoring library. It reads the application binary as well as the binary of the POMP library, and instruments the application binary so that the corresponding POMP monitoring routines are called at locations that represent events in the POMP execution model. From the user's point of view, the amount of instrumentation can be controlled through environment variables that describe the level of instrumentation for each group of OpenMP events, as proposed by the POMP specification. From the tool builder's point of view, instrumentation can also be controlled by the set of POMP routines provided by the library; i.e., instrumentation is applied only to those events that have a corresponding POMP routine in the library. In addition, DPOMP supports instrumentation of user functions as well as MPI functions.

In the presentation, we will first briefly describe the main DPCL features, as well as the IBM compiler and run-time library aspects that make our dynamic instrumentation tool for POMP possible. Then we will show how users can build their own performance monitoring libraries, and will present two POMP-compliant libraries, POMPROF and the KOJAK POMP library, which provide profiling and tracing of OpenMP applications, respectively.
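As a hedged sketch of what such a user-built monitoring library might look like (the routine names and signatures below are illustrative placeholders, not the exact POMP interface):

    #include <stdio.h>
    #include <omp.h>

    /* Sketch of a monitoring library in the spirit of POMP: the
     * instrumentation layer arranges for callbacks like these to fire at
     * OpenMP events in the target binary.  For simplicity this version
     * times one region at a time and assumes the callbacks fire on the
     * master thread only. */

    static double region_t0;

    void monitor_parallel_enter(int region_id)   /* illustrative name */
    {
        (void)region_id;
        region_t0 = omp_get_wtime();             /* region entry time */
    }

    void monitor_parallel_exit(int region_id)    /* illustrative name */
    {
        printf("region %d: %.6f s on up to %d threads\n",
               region_id,
               omp_get_wtime() - region_t0,      /* elapsed wall time */
               omp_get_max_threads());
    }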