SCICOMP3 Abstracts



Abstracts of the Tutorials and Presentations

Tutorials

The first day of the meeting will be dedicated to tutorials, which will be presented by IBM staff in the same rooms as the regular sessions. Tutorial abstracts are shown below.

Presentations

The regular sessions will last two and a half days. A draft schedule is shown below, but it is subject to change.
Presentation abstracts, below, will be linked to the presentation schedule in the meeting agenda. Abstracts in this section are divided into two sets: those by IBM staff and those by SP system users.


Tutorial Abstracts

Tutorial 1: Performance analysis of parallel programs on the SP

by Jesus Labarta, CEPBA, and Luiz DeRose, IBM

The tutorial will be focused on training attendees on the use of two tools for performance analysis of parallel programs on SP systems:

Both tools support MPI, OpenMP, and mixed-mode programs. It is assumed that attendees have some experience developing parallel programs and, in particular, are familiar with the concepts of MPI and OpenMP.

The full-day tutorial has a very practical orientation. The morning sessions will be devoted to describing the tools' functionality and their use. Hands-on sessions will be held in the afternoon. Although some example source codes will be available, attendees may bring their own source or binary codes to analyze (please contact Jesus Labarta before the tutorial if you want to analyze your own SP application). An objective of the tutorial is to demonstrate that a lot of information on the behavior of a given code can be obtained within one day.

This tutorial is limited to 24 attendees.


Tutorial 2: Performance Programming on the SP

by Bob Walkup and David Klepacki, IBM Research

Presentation materials in a Unix TAR archive.

Do you want to write system programs using Perl? If yes, then this tutorial is not for you. This tutorial is for application developers who want to learn how to optimize their Fortran and C/C++ codes for the IBM RS/6000 SP system. Among many things, you will learn:


Do you need to take this tutorial? Take the following test:

If you did not understand the answers to more than one or two of the above, or if a loved one has ever left you for someone with better SP programming skills, then this tutorial is for you. This full day tutorial also has a very practical orientation. Hands-on time will be available to see these techniques put into action.

***** Bonus! *****
FREE MPItrace library to the first 250 people who register for and complete Tutorial 2! By simply relinking your MPI application with this library (and running the resulting executable), you can receive invaluable MPI performance information in ASCII format. Example output:


This is the trace file for the MINIMUM MBYTES COMMUNICATED 258720103.000000 MBYTES.

Information for MPI_Allreduce: AVG. Length = 8.00, CALLS = 9355, WALL = 3.596, CPU = 3.450
MPI_Allreduce: Total BYTES = 74840, BW = 0.021 MBYTES/WALL SEC., BW = 0.022 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
       8.00        9355         0.021        0.022      3.5962     3.4500

Information for MPI_Barrier: AVG. Length = 0.00, CALLS = 3, WALL = 0.017, CPU = 0.020
MPI_Barrier: Total BYTES = 0, BW = 0.000 MBYTES/WALL SEC., BW = 0.000 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
       0.00           3         0.000        0.000      0.0172     0.0200

Information for MPI_Bcast: AVG. Length = 5.80, CALLS = 66, WALL = 0.013, CPU = 0.010
MPI_Bcast: Total BYTES = 383, BW = 0.031 MBYTES/WALL SEC., BW = 0.038 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
       3.97          36         0.018          INF      0.0079     0.0000
       8.00          30         0.052        0.024      0.0046     0.0100

Information for MPI_Scatter: AVG. Length = 1008.00, CALLS = 31, WALL = 0.088, CPU = 0.080
MPI_Scatter: Total BYTES = 31248, BW = 0.355 MBYTES/WALL SEC., BW = 0.391 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
    1008.00          31         0.355        0.391      0.0881     0.0800

Information for MPI_Comm_rank: AVG. Length = 0.00, CALLS = 1, WALL = 0.000, CPU = 0.000
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
       0.00           1         0.000         NaNQ      0.0000     0.0000

Information for MPI_Comm_size: AVG. Length = 0.00, CALLS = 1, WALL = 0.000, CPU = 0.000
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
       0.00           1         0.000         NaNQ      0.0000     0.0000

Information for MPI_Isend: AVG. Length = 2003.69, CALLS = 43023, WALL = 0.893, CPU = 0.890
MPI_Isend: Total BYTES = 8.62045e+07, BW = 96.481 MBYTES/WALL SEC., BW = 96.859 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
      48.00        7336         2.525        1.853      0.1395     0.1900
     168.00       14672         7.613        8.500      0.3238     0.2900
    1488.00        7005        76.509       86.862      0.1362     0.1200
    5208.00       14010       248.176      251.600      0.2940     0.2900

Information for MPI_Recv: AVG. Length = 2003.69, CALLS = 43023, WALL = 7.481, CPU = 7.340
MPI_Recv: Total BYTES = 8.62045e+07, BW = 11.523 MBYTES/WALL SEC., BW = 11.744 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
      48.00        7336         0.428        0.518      0.8218     0.6800
     168.00       14672         4.193        3.913      0.5878     0.6300
    1488.00        7005         4.346        4.474      2.3986     2.3300
    5208.00       14010        19.866       19.720      3.6728     3.7000

Information for MPI_Wait: AVG. Length = 2003.69, CALLS = 43023, WALL = 3.739, CPU = 3.740
MPI_Wait: Total BYTES = 8.62045e+07, BW = 23.055 MBYTES/WALL SEC., BW = 23.049 MBYTES/CPU SEC.
AVG. Length  # of Calls  MB/WALL Sec.  MB/CPU Sec.  WALL Secs.  CPU Secs.
      48.00        7336         3.790        5.869      0.0929     0.0600
     168.00       14672        12.670       12.324      0.1945     0.2000
    1488.00        7005       110.526       94.759      0.0943     0.1100
    5208.00       14010        21.733       21.651      3.3573     3.3700

Total Communication Information: WALL = 15.8277, CPU = 15.53, MBYTES = 258.72

*****************RESOURCE STATISTICS*******************************
The total amount of wall time = 26.229613
The total amount of time in user mode = 27.750000
The total amount of time in sys mode = 0.720000
The maximum resident set size (KB) = 5320
Average shared memory use in text segment (KB*sec) = 1589140
Average unshared memory use in data segment (KB*sec) = 12256108
Average unshared memory use in stack segment(KB*sec) = 0
Number of page faults without I/O activity = 2444
Number of page faults with I/O activity = 6
Number of times process was swapped out = 0
Number of times filesystem performed INPUT = 0
Number of times filesystem performed OUTPUT = 0
Number of IPC messages sent = 0
Number of IPC messages received = 0
Number of Signals delivered = 71
Number of Voluntary Context Switches = 4
Number of InVoluntary Context Switches = 35
*****************END OF RESOURCE STATISTICS*************************
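The bandwidth (BW) figures in the sample output appear to be total bytes divided by elapsed seconds, with 1 MBYTE taken as 10^6 bytes. This is an inference from the numbers shown, not something the trace itself states. A minimal Python sketch of that arithmetic follows; the function name `bandwidth_mb_per_sec` is hypothetical and is not part of MPItrace.

```python
def bandwidth_mb_per_sec(total_bytes, seconds):
    """Bytes per second expressed in MB/s, assuming 1 MB = 1e6 bytes.

    Returns None when no time was recorded; the sample trace prints
    INF or NaNQ in that situation.
    """
    if seconds == 0:
        return None
    return total_bytes / seconds / 1e6


# Figures copied from the MPI_Allreduce and MPI_Isend summary lines.
allreduce_bw = bandwidth_mb_per_sec(74840, 3.596)
isend_bw = bandwidth_mb_per_sec(8.62045e7, 0.893)

# Average message length is total bytes divided by the number of calls.
allreduce_avg_length = 74840 / 9355

print(f"MPI_Allreduce: {allreduce_bw:.3f} MB per wall second")
print(f"MPI_Isend:     {isend_bw:.1f} MB per wall second")
print(f"MPI_Allreduce average length: {allreduce_avg_length:.2f} bytes")
```

The MPI_Isend result differs slightly from the 96.481 in the trace, presumably because the trace divides by an unrounded wall time rather than the three-decimal value printed in the summary line.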

Extremely low overhead timer usage allows very accurate measurements. So don't wait. Register today!!


Presentation Abstracts

IBM Compiler Update - Bob Blainey

The IBM compiler group is currently designing and implementing updated C, C++ and Fortran compilers for the upcoming pSeries eServers based on the Power4 processor. I will describe the work in progress with particular focus on performance issues. I will cover new and enhanced compiler features including OpenMP 2.0, Fortran 2000, support for the Power4 processor, automatic parallelization and vectorization, and scalar optimization.

Bob Blainey is a Senior Technical Staff Member in the Application Development Technology Centre at the Toronto Lab. Bob's technical responsibility encompasses many areas of compiler technology with a specific focus on leading edge compiler optimization. Bob has partial responsibility for C, C++ and Fortran compiler products on the pSeries eServer, C and C++ compiler products on the zSeries eServer, and Java JIT compilers on various platforms.

Presentation in three Postscript files:

  • scicomp1.ps
  • scicomp2.ps
  • scicomp3.ps

    Bob Blainey
    STSM, Compiler Development
    IBM SWS Toronto Laboratory
    email: blainey@ca.ibm.com
    telephone: 416-448-4264, t/l 778-4264


    TurboSHMEM: A High Performance Implementation of the Cray SHMEM API for the IBM RS/6000 SP - David Klepacki

    Running Cray SHMEM applications on the IBM SP has always been a challenge. Due to the very different semantics of the MPI-2 Put/Get specifications, it has not been possible to port from the SHMEM API efficiently using MPI. However, the IBM LAPI interface is ideal for this purpose. IBM LAPI is a zero-copy, one-sided communication layer that is part of the IBM PSSP software. It is the lowest layer of communication software that is device-independent and open to the SP programmer. A complete implementation of the Cray SHMEM API has been developed for the IBM SP based upon LAPI. The "turbo" qualification is meant to emphasize that optimizations have been included to exploit the shared memory on the SP nodes. This presentation will discuss some of these optimizations and their impact on application performance.

    Presentation in Adobe PDF format.

    David Klepacki
    klepacki@us.ibm.com
    Advanced Computing Technology Center
    IBM T. J. Watson Research Center
    Yorktown Heights, NY, USA


    A DSVM system for non linear structural analysis on IBM RS/6000 SP - Francois Thomas

    We present a programming environment well suited to solving large computational structural mechanics problems on modern parallel computers. It is an extension of the development environment of the finite element code CASTEM 2000 that gives the user a global view of all objects of the parallel application. To ease the implementation of parallel applications, this system hides data transfers between processors and allows direct reuse of modules of the original sequential code. It is an object-based shared virtual memory system that allows parallelism by data distribution (for non-structured data) or by control distribution; it is therefore well suited to "mechanic" parallelism. To validate this programming environment, domain decomposition techniques have been used. Numerical examples are presented to validate the proposed parallel approach.

    Presentation in PowerPoint format.

    Dr Francois Thomas
    IT Specialist - HPC support
    Francois Thomas/France/IBM@IBMFR
    Email : ft@fr.ibm.com
    Tel: 33-4-67344061

    Dr. Thomas's Thesis is in Computational Fluid Dynamics, at the ENSAM-Paris VI University. His IBM experience began in IBM France in the Scientific Computing organization in 1990, and he has been involved in scientific and technical computing since then. He has worked closely with IBM customers on parallelizing computational fluid dynamics and electromagnetics codes. He is currently scientific benchmark manager in the Products & Solutions Support Center (PSSC) in Montpellier and has been involved in many large RFPs. His certifications include IBM certified specialization in AIX System Administration. His areas of expertise include parallel programming on distributed and shared memory architecture; numerically intensive code parallelization (combustion, electromagnetics, etc.); Fortran, C code optimization and tuning; scientific benchmarking on IBM RS/6000 SP; SP software (Parallel Environment, LoadLeveler, GPFS); and AIX/SP Performance tuning.


    MPI Directions - Richard Treumann

    This talk will provide an update on IBM's prioritization of MPI-2 functionality: in other words, what we think should come next in MPI-2 support. We will take a look at what is going on with the MPI Forum and corrections to the MPI-2 Standard. The most significant issue concerns MPI_Alltoallw, which is not ideal for 64-bit use but should be. We will also look at some other issues related to MPI and the porting of applications to 64 bits.

    Presentation in Interleaf Graphics format.

    Dick Treumann
    RS/6000 SP Development
    IBM Poughkeepsie Unix Development Lab
    Dept 0lva / MS P963
    2455 South Road
    Poughkeepsie, NY 12601
    Tele (845) 433-7846
    Fax (845) 433-8363


    A Technical Overview of Europe's Largest Internet Service Platform Powered by IBM RS/6000 SP Systems - Stefan Radtke

    During the past few years Deutsche Telekom AG has built the largest Internet Service Platform (ISP) in Europe, currently serving more than 8.5 million users. The core functionality of the platform is based on IBM RS/6000 SP systems, consisting of about 500 SP nodes.

    During the lecture an overview of some services provided by the platform will be given, focusing on RADIUS (Remote Dial In User Service) and Web-Cache functionality. We will show in detail how these services were implemented and how the RS/6000 SP functionalities were leveraged to create a highly available and scalable solution. We describe the tools and techniques used as well as how security issues were covered. Furthermore services like streaming video, audio and their integration into the existing architecture will be illustrated.

    Today the geographically distributed system handles millions of logins and accounting records, performs a huge number of transactions each hour, and serves about 450,000,000 HTTP requests per day. The load balancing and clustering mechanisms used to distribute the load to different types of SP systems and nodes will be shown in detail. These mechanisms dynamically take into account the different capabilities of various SP nodes, processor and memory configurations. Based on traffic observations, typical user access patterns can be given for the HTTP protocol. The resulting bandwidth consumption will be used to show why caching is an important feature of an ISP infrastructure, to the benefit of users, network and service providers.

    The tools and methods used for performance monitoring and analysis will be described as well as specially developed simulation tools which were used for capacity estimation and forecasts of system enhancements required in the future. Also a brief description of software distribution in the distributed architecture is part of the lecture.

    Finally, some of the physical IBM RS/6000 SP and storage configurations are discussed considering performance and/or availability requirements.

    Presentation in Adobe PDF format.

    Dr. Stefan Radtke
    Stefan.Radtke@de.ibm.com
    IBM Global Services - Sector Communications, Germany


    Recent user experiences with Blue Horizon at the San Diego Supercomputer Center - Robert Sinkovits, Giri Chukkapalli, Stuart Johnson

    Blue Horizon, an 1152-processor IBM machine located at the San Diego Supercomputer Center, is currently ranked as the 8th most powerful supercomputer in the world. In this talk, we will describe our experiences on Blue Horizon with an emphasis on the effect of two recent system upgrades on application performance.

    In the first of these upgrades, the Nighthawk I nodes were replaced with Nighthawk II nodes. All user applications exhibited better performance, but several obtained speedups that were greater than would be expected purely on the basis of the relative clock speeds. In addition to benchmarking production codes, we also performed careful measurements of L1, L2, and main memory bandwidths. We will discuss the relevance of our findings for user applications.

    The second major upgrade involved replacing the old Trailblazer switch with the newer Colony switch. While most users have seen better performance due to the lower communications latencies and higher bandwidths, several new problems related to the new switch software were discovered. We will describe both the code features that must be avoided in order to run on the Colony switch and, equally important, the anomalous scaling behavior of IBM's implementation of the MPI collective operations.

    Presentation in PowerPoint format.

    Robert Sinkovits, Giri Chukkapalli, Stuart Johnson
    sinkovit@sdsc.edu
    San Diego Supercomputer Center


    Node-Level Coscheduling of Parallel Jobs on Clustered SMP Machines - Greg Johnson

    The development of large parallel computing systems - systems with hundreds or thousands of processors - is increasingly focused on designs involving clusters of multiprocessor shared memory (SMP) nodes connected by a fast network. Nodes typically consist of several (from 2 to 16+) commodity microprocessors with shared access to an amount of memory local to the node. Blue Horizon, an 1152-processor IBM SP (144 nodes each with 8 Power3 II processors) located at the San Diego Supercomputer Center, is one example of this type of machine.

    Jobs on this and similar machines are often scheduled to run solo on a given set of nodes. Multiple applications may run together on unique groups of nodes; however, no application may run on any processor within a set of nodes currently in use by another.

    In this talk we will examine the effects on system throughput - the wall-clock time required to complete a given job mix - of coscheduling two applications on different processors of the same set of nodes. That is, given eight 8-processor nodes, how is system throughput affected by running job A on 4 processors of each node and job B on the remaining 4 processors of each node? Early timings have shown surprising results.
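To make the two placements concrete, here is a small Python sketch that builds both layouts for the eight 8-processor nodes mentioned above. It only illustrates which processors each job would occupy; the function names are hypothetical, and nothing here models timings or the actual scheduler on Blue Horizon.

```python
NODES = 8
CPUS_PER_NODE = 8


def solo_layout():
    """Each job owns whole nodes: A on nodes 0-3, B on nodes 4-7."""
    half = NODES // 2
    return {(node, cpu): ("A" if node < half else "B")
            for node in range(NODES) for cpu in range(CPUS_PER_NODE)}


def coscheduled_layout():
    """Jobs share every node: A on CPUs 0-3, B on CPUs 4-7 of each node."""
    half = CPUS_PER_NODE // 2
    return {(node, cpu): ("A" if cpu < half else "B")
            for node in range(NODES) for cpu in range(CPUS_PER_NODE)}


def nodes_used(layout, job):
    """Sorted list of nodes on which the given job has at least one CPU."""
    return sorted({node for (node, _cpu), owner in layout.items() if owner == job})


def cpus_used(layout, job):
    """Total number of CPUs assigned to the given job."""
    return sum(1 for owner in layout.values() if owner == job)


for name, layout in (("solo", solo_layout()), ("coscheduled", coscheduled_layout())):
    print(name, "job A:", cpus_used(layout, "A"), "CPUs on",
          len(nodes_used(layout, "A")), "nodes")
```

In both layouts each job gets 32 processors; what changes is whether a job shares its nodes' memory and switch adapter with the other job, which is the effect the talk sets out to measure.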

    Presentation in Adobe PDF format.

    Greg Johnson
    johnson@sdsc.edu
    University of California San Diego


    The Paradyn Parallel Performance Tool Project - Barton P. Miller

    Paradyn is a system for measuring the performance of parallel and distributed programs. Paradyn contains two key technologies that allow it to measure large-scale, heterogeneous programs without requiring source code modifications, recompiling, or even relinking; to measure already-running programs (such as servers); and to help automate the search for performance bottlenecks.

    The first technology is dynamic instrumentation, which allows us to insert, change, and remove instrumentation code from a running program. With dynamic instrumentation, we modify the program while it is executing. At the moment that a request is made for performance data, Paradyn inserts the necessary instrumentation into the code. By deferring instrumentation decisions until runtime, we are able to dynamically control the amount of overhead we cause. The dynamic instrumentation is now encapsulated in the DynInst API, available as a library. This API allows any tool builder access to Paradyn's ability to patch binary programs during execution. This API offers a machine-independent interface to machine-level code patching.
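The idea of inserting and removing instrumentation while a program runs can be sketched by analogy in Python. This is only an analogy: the DynInst API patches machine code in a running binary, whereas the sketch below swaps a function reference at runtime. The names `insert_probe`, `remove_probe`, and `program` are hypothetical and do not belong to Paradyn or DynInst.

```python
import time
from types import SimpleNamespace


def solve(n):
    """Stand-in for application work."""
    return sum(i * i for i in range(n))


# Stand-in for the running program whose code we patch.
program = SimpleNamespace(solve=solve)


def insert_probe(target, name, stats):
    """Replace target.name with a timing wrapper; return the original."""
    original = getattr(target, name)

    def probe(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            stats["calls"] += 1
            stats["seconds"] += time.perf_counter() - start

    setattr(target, name, probe)
    return original


def remove_probe(target, name, original):
    """Restore the uninstrumented function, removing all overhead."""
    setattr(target, name, original)


stats = {"calls": 0, "seconds": 0.0}
program.solve(1000)                      # runs uninstrumented
original = insert_probe(program, "solve", stats)
program.solve(1000)                      # measured
program.solve(1000)                      # measured
remove_probe(program, "solve", original)
program.solve(1000)                      # uninstrumented again
print(stats["calls"], "instrumented calls")
```

Only the two calls made while the probe is installed are counted, which mirrors the point that overhead is incurred only while performance data is being requested.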

    Paradyn's second technology, automation of the searching for performance bottlenecks, is embodied in the Performance Consultant (PC). The PC is able to direct the dynamic instrumentation to help locate parts of the program that are consuming the most resources.

    I will describe the main features and mechanisms in Paradyn, including new Performance Consultant techniques that use the program's call graph and results from previous runs of the program.

    The current release (3.2) of Paradyn runs on Solaris (SPARC and x86), AIX (including the SP3), Linux, Windows/NT, Irix, or heterogeneous combinations of these systems.

    Presentation in PowerPoint format.

    Barton P. Miller
    Computer Sciences Department, University of Wisconsin
    bart@cs.wisc.edu


    Interactive Multi-mode Program INstrumentation Tool, IMPRINT - Timothy H. Kaiser

    Most people debug and time programs by manually inserting print statements into source and recompiling. This talk will present the Interactive Multi-mode Program INstrumentation Tool or IMPRINT.

    There are two motivations for developing IMPRINT. The first is to provide a tool to simplify and speed the task of debugging and tracing serial and parallel codes. That is, we wish to provide a tool that is a step above print statements but does not have the complexity of a full-featured debugger.

    IMPRINT is based on the IBM-developed Dynamic Probe Class Library (DPCL) and on PAPI, both Parallel Tools Consortium projects. IMPRINT allows users to instrument code on the fly. Print statements, timing routines, and hardware counters may be inserted into a running application; no recompilation is required. IMPRINT is intuitive, providing the look of a source code browser. Users select subroutine calls in the source code browser to instrument. They then select the type of instrumentation from a menu. The instrumentation is then inserted into the running application, providing feedback to the user such as entry and exit of a routine, timings, and hardware counter values for a particular routine.

    The second motivation for IMPRINT is to provide a research tool in tool development. IMPRINT is being used to study methodologies for interacting with parallel programs. The JAZZ zoomable user interface is being integrated with IMPRINT to study how data can be presented meaningfully and to study how the tracing and debugging process can be intuitively controlled.

    Timothy H. Kaiser, Ph.D.
    tkaiser@sdsc.edu
    San Diego Supercomputer Center, University of California San Diego


    Analysis of the Sweep3D code with Paraver - Jesus Labarta, Judit Gimenez

    This presentation will describe how Paraver, a visualization tool, has been used to analyze the performance of the US DOE ASCI Sweep3D benchmark running on the IBM SP platform. Sweep3D uses a multidimensional wavefront algorithm for "discrete ordinates" deterministic particle transport simulation. A mixed-mode MPI+OpenMP version was obtained as the starting point for the analysis. The use of the DPCL-based instrumentation package OMPItrace and of Paraver as the visualization tool allowed us to obtain a good picture of the behavior of the different versions of the code. As a result of this deep understanding, different modifications have been introduced in the code, allowing us to improve the performance of the application running in all modes (sequential, MPI and OpenMP).

    The presentation begins by describing the structure of the application. Starting with the analysis of the sequential version, the visualization of different performance indices allows us to identify potential locality problems and ways to address them.

    In the originally available version, the performance of the MPI version was much better than that of the OpenMP version. Visualizing thread activity, hardware counter values and other derived metrics, it was possible to identify proper parallelization approaches to end up with an OpenMP version of comparable performance to that of the MPI version (both optimized).

    A side effect of this analysis is the observation that two features that would help OpenMP to achieve at least equivalent performance to MPI codes are the support for nested parallelism and support for pipelined computations. Although requiring a significant programming effort, MPI programmers frequently use 2D parallelization schemes and pipelined computations as a way to achieve parallelism in codes with high compute to surface ratio and dependencies. Such extensions to OpenMP would support this important class of parallelization approaches.

    Presentation in PowerPoint format.

    Presentation in Adobe PDF format.

    Jesus Labarta, Judit Gimenez
    jesus@cepba.upc.es
    CEPBA-IBM Research Institute (Technical University of Catalonia -UPC)


    IBM Tools: Update and Directions - Ted Hoover

    The application development environment is an area that is very important to developers in the scientific and technical computing community. This talk presents an update on the areas in which IBM is investing resources to develop products that support end users and tool development efforts for this environment. The first part of this talk gives a brief update of the current state of IBM's tools for high performance computing. The second part provides information about an Open Source development project called Dynamic Probe Class Library (DPCL) and how DPCL can be used to quickly develop tools through the process of dynamic instrumentation. The last part of the talk will present some research activities that leverage the technology found in DPCL in addition to activities to support other tool requirements of the high performance computing community.

    Presentation in HTML format.

    Ted Hoover
    Senior Software Engineer
    Team Lead - Application Development Tools
    IBM Poughkeepsie
    Poughkeepsie NY
    (845) 433-7693
    hoov@us.ibm.com


    ACTC Tools for Application Performance Analysis of Scientific Programs - Luiz DeRose

    Application developers have been facing new and more difficult performance tuning and optimization problems as parallel architectures become more complex. In this talk I will present an overview of the application performance analysis tools under development at the Advanced Computing Technology Center for tuning and optimization of applications running on the IBM SP. These tools were designed to help users understand the behavior of applications on complex parallel environments. They are based on dynamic instrumentation, access to hardware performance counters, hints to allow users to correlate the behavior of the application to hardware components, and careful mapping of performance data to source code constructs.

    Presentation in Adobe PDF format.

    Luiz A. DeRose
    Research Staff Member
    Advanced Computing Technology Center
    IBM Research - Yorktown Heights, NY
    (914)945-2828
    T/L: 862-2828
    Fax: (914)945-4269
    laderose@us.ibm.com


    Blue Gene and Blue Light: The Return of Massively Parallel Systems! - Jose E. Moreira

    Blue Gene and Blue Light are two massively parallel systems being developed at the IBM T. J. Watson Research Center. There are many similarities between the two projects. They are both based on cellular architectures, in which a basic building block is replicated many times according to a regular pattern. They both aim at very high-performance, from hundreds of Teraflops to a Petaflop. They both rely on a host system for management and operation. And they both push the envelope in terms of scalability and sheer level of parallelism exploited. Nevertheless, the two projects are also very different in terms of hardware approach, programming models, and system software organization. During this talk, I will give an overview of the architecture of both systems, present proposed programming models for both of them, and discuss the different technologies that each project exercises.

    Jose Moreira is a Research Staff Member and Manager, Blue Gene System Software, at the IBM Thomas J. Watson Research Center. His main responsibilities include the development of system kernel and libraries for Blue Gene, and of host infrastructure for both Blue Gene and Blue Light.

    Jose E. Moreira
    Research Staff Member
    IBM Thomas J. Watson Research Center
    Yorktown Heights NY 10598-0218
    phone: 1-914-945-3987
    fax: 1-914-945-4425
    e-mail: jmoreira@us.ibm.com


    How to use GPFS - A Few Performance Examples - Stefan Andersson

    The General Parallel File System (GPFS) is the IBM implementation of a parallel filesystem. This presentation shows a few examples of how a programmer can access it in order to take advantage of the performance GPFS can deliver.

    Stefan Andersson is currently working for IBM Heidelberg in the Scientific and Technical Benchmark Team. He has an MS in Mathematics from the University of Heidelberg. He began his work with IBM at the IBM Scientific Center, Heidelberg in 1990. He has been involved in parallel computing on the IBM RS/6000 SP since 1992. From 1997 until 2000 he was working in the benchmark center in Poughkeepsie. His areas of expertise include performance tuning for the POWER2 and POWER3 architectures and tuning and coding for distributed and shared memory on the IBM RS/6000 SP.

    Presentation in Interleaf Graphics format.

    Presentation in Adobe PDF format.

    S&TC Benchmarking Team EMEA
    Team member of ACTC EMEA
    IBM RS/6000 and NUMA-Q -- High Performance & Parallel Computing
    Phone +49-6221-593312
    Fax: ...-59-3400
    eMail: s.andersson@de.ibm.com
    Mobile : +49-(0)172-6330129


    An Integrated Performance Visualizer for MPI/OpenMP Programs - Werner Krotz-Vogel

    Cluster computing has emerged as a de facto standard in parallel computing over the last decade. Now, researchers have begun to use clustered, shared-memory multiprocessors (SMPs) to attack some of the largest and most complex scientific calculations in the world today. To program these clustered systems, people have begun using a combination of MPI and OpenMP. However, analyzing the performance of MPI/OpenMP programs is difficult because, while there are several existing tools to analyze performance of either MPI or OpenMP programs, there is no existing performance analysis tool that integrates performance data from both.

    To remedy this situation, KAI Software and Pallas GmbH have partnered with the Department of Energy through an ASCI Pathforward contract to develop a tool called Vampir/GuideView, or VGV. This tool combines the richness of the existing tools, Vampir for MPI, and GuideView for OpenMP, into a single, tightly-integrated performance analysis tool. From the outset, its design targets performance analysis on systems with thousands of processors.

    However, systems with small numbers of processors will benefit from the innovations developed for this new tool, as well. The tool was designed with the idea that analyzing performance on large computer systems should not be materially different from analyzing performance on small systems.

    The main goals of the VGV project are to create an integrated MPI/OpenMP performance analysis tool that presents its data effectively, and that scales well to even the largest systems currently available. To accomplish these goals, KAI and Pallas cooperated to synthesize instrumentation and recording of the data. We designed a framework in which performance data is organized hierarchically and parallel processing is used to sort and filter the data.

    The integrated tool instruments the program at compile time, generates a trace file at runtime, and does post-run performance analysis and presentation with an integrated combination of Vampir and GuideView that presents data hierarchically. The user can select a region of the displayed data that seems interesting, process the data by sorting and filtering it, then display the results. All performance data is associated with the lines in the source code that produced it, which the user may browse by clicking on the data display. Performance problems such as load imbalance are displayed clearly with the VGV display.

    Since VGV is targeted at the largest of systems, the project is addressing scalability constraints, such as the limited space on a display screen, limited disk storage capacity, and human limitations for absorbing large amounts of displayed data. The tool also attempts to limit its impact on the running of the program.

    VGV is designed to make extensive use of event compression, combination and summarization to limit the size of the trace files generated. It also makes use of multiple, hierarchically-structured trace files. The VGV display uses vertical scrolling, back-to-front stacking of time-lines, and click-activated drill-down into the performance data.

    VGV is initially available on the SP and will be extended to other platforms.

    Work performed under the auspices of the U.S. Dept. of Energy by University of California LLNL under contract W-7405-Eng-48 by KAI Software, Intel Americas, Inc & Pallas GmbH & Lawrence Livermore National Laboratory

    Presentation in HTML format.

    Presentation in PowerPoint format.

    Werner Krotz-Vogel
    Manager Product Marketing
    Pallas GmbH
    Hermuelheimer Str. 10
    D-50321 Bruehl, Germany
    Werner.Krotz-Vogel@pallas.com
    phone +49-2232-1896-0
    direct +49-2232-1896-21
    fax +49-2232-1896-29
    http://www.pallas.com


    OpenMP for Distributed Systems - Don Gunning

    The OpenMP standard for shared memory parallelism has met with great success in the high performance computing community. While OpenMP simplifies and standardizes parallel programming for users of shared memory systems, those owning SP systems must still resort to message passing or hybrid parallel programming methods to exploit parallelism across multiple nodes. KAI Software has extended OpenMP with a single new directive, "threadshared", which introduces a new storage class and enables shared memory parallelism across multiple single- or multi-CPU nodes in a cluster.

    We will discuss the "threadshared" OpenMP extension, implementation of a distributed shared memory system, performance considerations, and results obtained with KAI's prototype implementation on SDSC's Blue Horizon system.

    Presentation in PowerPoint format.

    Don Gunning
    don.gunning@intel.com
    Intel - Kuck and Associates


    Spectral Simulation of Turbulent Wall Flows - Juan Carlos del Álamo de Pedro, Javier Jiménez Sendín

    One of the most massive direct numerical simulations (DNS) of a turbulent channel flow is being performed. The aim is to contribute to the understanding of the very large coherent structures present in turbulent wall flows. DNS involves resolving all the significant length and time scales present in the flow, without any subgrid modelling. This implies that the memory cost grows as the 9/4 power of the Reynolds number (Re), a nondimensional parameter that expresses the ratio between the advection of momentum and its molecular diffusion. In typical flows of interest this parameter is very high, of order 10^3 to 10^8. In addition, the computational box has to be big enough to allow very large structures to develop, which further increases the numerical cost. Our simulation uses 1536 x 1536 x 291 grid points in the longitudinal, spanwise and normal directions, respectively, for a total memory demand of roughly 10 GB. It has to run for approximately 40000 time steps. Because of this huge computational cost the code has to be parallel, and message passing (MPI) has been used. The numerical method is fully spectral: Fourier in the x and z directions, in which periodicity is assumed, and Chebyshev in the y direction, in order to allow the implementation of walls. This method gives the highest accuracy and resolution, as a consequence of using global information to calculate the variables associated with each grid point. This feature of spectral methods requires a parallelization technique different from the usual domain decomposition techniques, where communication takes place only between certain subdomains of the computational box. In the present case, each processor has to operate on data belonging to different directions of physical space at different times, so the parallelization is accomplished by transpositions of the data.
    If N is the number of nodes on which the code is running, then the amount of data communicated per time step is 70(1 - 1/N) GB, so the communication demand ranges from 35 GB (N = 2) up to nearly 70 GB. Because of the size of the code, great care has been taken to minimize cache misses: the arrangement of the data in memory is changed from one part of the code to another in order to maximize the locality of memory accesses. In addition, several non-contiguous data types are used to speed up communication. The resulting code is twice as fast as other codes used to carry out similar calculations and, in spite of its very high communication demand, it has proved to scale remarkably well on SP3 architectures with a small number of nodes.
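
    The communication estimate above can be checked with a short sketch. It assumes that the data are split evenly among the N nodes and that a transposition changes the distributed direction, so each node keeps 1/N of its data and sends the rest; the 70 GB figure is taken from the abstract, while the function names and the toy matrix size are hypothetical.

```python
def communicated_gb(n_nodes, total_gb=70.0):
    """Data exchanged per time step, following the abstract's 70(1 - 1/N) GB."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return total_gb * (1.0 - 1.0 / n_nodes)


def transpose_off_node_fraction(n_nodes, rows_per_node=4):
    """Fraction of a square matrix that changes owner in a global transpose.

    Before the transpose each node owns a block of rows; afterwards it owns
    the matching block of columns. An element is communicated when the node
    that owns its row differs from the node that owns its column.
    """
    size = n_nodes * rows_per_node
    moved = 0
    for i in range(size):
        for j in range(size):
            owner_before = i // rows_per_node  # node holding row i
            owner_after = j // rows_per_node   # node holding column j
            if owner_before != owner_after:
                moved += 1
    return moved / (size * size)


# Volume for a few node counts; it approaches but never reaches 70 GB.
volumes = {n: communicated_gb(n) for n in (2, 4, 8, 64)}
```

    The counted fraction equals 1 - 1/N, which is where the factor in the abstract's formula comes from under these assumptions.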

    Presentation in PowerPoint format.

    Juan Carlos del Álamo de Pedro & Javier Jiménez Sendín
    juanc@torroja.dmt.upm.es
    E.T.S.I. Aeronáuticos, U.P.M.


    Using MPI-IO on GPFS within a Weather Forecast Code - Nicholas Allsopp

    The Integrated Forecast System (IFS) code is a parallel MPI program that runs multiple tasks, each of which writes its output to a single global file at the end of each 'output time interval'. It can therefore write output multiple times during a given run. We were interested in finding the fastest way to get the data into a single global file, compared with gathering it on a single task and having that task perform a serial write. For this reason we chose to investigate the performance gain that the General Parallel File System (GPFS) could bring to an existing application when parallel MPI-IO calls are used.
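
    The two output strategies compared above can be sketched without MPI. The example below is not the IFS code: it uses ordinary Python file I/O on a temporary directory, and the task outputs, file names, and function names are hypothetical. It only shows the idea behind an explicit-offset write such as MPI_File_write_at, where each task places its own block at a known position in the shared file instead of sending it to one writer.

```python
import os
import tempfile


def write_via_gather(path, blocks):
    """Baseline: one task gathers every block and writes the file serially."""
    with open(path, "wb") as f:
        f.write(b"".join(blocks))


def block_offsets(blocks):
    """Byte offset of each task's block within the single global file."""
    offsets, position = [], 0
    for block in blocks:
        offsets.append(position)
        position += len(block)
    return offsets, position


def write_at_offsets(path, blocks):
    """Each task writes only its own block at its own offset."""
    offsets, total_size = block_offsets(blocks)
    with open(path, "wb") as f:
        f.truncate(total_size)  # create the file at its final size
    # Visit the tasks in reverse to show that write order does not matter.
    for rank in reversed(range(len(blocks))):
        with open(path, "r+b") as f:
            f.seek(offsets[rank])
            f.write(blocks[rank])


def files_match():
    """Write the same hypothetical task outputs both ways and compare."""
    blocks = [bytes([65 + rank]) * (4 * (rank + 1)) for rank in range(4)]
    with tempfile.TemporaryDirectory() as tmp:
        gathered = os.path.join(tmp, "gathered.bin")
        parallel = os.path.join(tmp, "offsets.bin")
        write_via_gather(gathered, blocks)
        write_at_offsets(parallel, blocks)
        with open(gathered, "rb") as a, open(parallel, "rb") as b:
            return a.read() == b.read()
```

    Both routes produce the same file; in a real run the offset-based route lets the tasks write concurrently, which is the case a parallel file system is meant to serve.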


    IBM Solutions for Life Sciences Deep Computing - J. J. Porta

    Very complex and difficult problems are being made tractable by the emerging capabilities in large scale computing, data management and communications. Combining these capabilities with advances in algorithms, analytic methods, modeling and simulation, visualization, data management, and software infrastructures is enabling valuable scientific, engineering, and business opportunities in Life Sciences.

    The IBM Computational Biology Center and The Deep Computing Institute have assembled a worldwide team of scientists to work on a number of research projects involving computational biology, chemistry and material science. Detailed studies by these scientists are aimed at providing new clues for medical diagnostics, the synthesis and design of novel materials and the analysis of genes and their relationships.

    IBM Life Sciences has the mission to bring together the vast array of IBM resources, from research and e-business expertise to data and storage management and high performance computing, to develop and offer new solutions for the Life Sciences market, including biotechnology, genomic/proteomic, e-health, pharmaceutical, and agriscience industries.

    This talk will discuss examples of computational and data intensive Life Sciences problems being addressed by IBM Research and the solutions being developed by IBM Life Sciences.

    Dr. Juan José Porta
    Corporate Technology, Technical Strategy Development
    IBM Corporation
    Route 100
    Somers NY 10589, USA
    Phone: +1-914-766-2213
    Fax: +1-914-766-7212
    porta@us.ibm.com


    AIX5L - Tom Mathews

    This presentation discusses several key aspects of the next generation of the AIX operating system. AIX5L introduces a new, scalable 64-bit kernel to support higher capacity hardware systems and larger application workloads. As part of the presentation, details of the 64-bit kernel architecture and design are covered along with the customer migration path to the scalable 64-bit kernel environment. Discussion is also focused on the new 64-bit application binary interface provided by AIX5L for the purpose of application scalability. The topics of pthreads and thread scheduling support from AIX Version 4 to AIX Version 5 are also discussed.

    Presentation in Interleaf Graphics format.

    Tom Mathews
    Senior Technical Staff Member
    Manager, AIX Kernel Architecture & Design
    (512) 838-3903, T/L 678-3903, Fax (512)838-3484
    tmathews@us.ibm.com


    Performance Aspects of the POWER3 Processor - Steve White

    This presentation will cover the major processor implementation details which affect application performance, including unit types, dispatch and issue considerations, pipeline interlocks, branch prediction mechanisms, and caches. An overview of the POWER3-based systems describes system-dependent details such as L2 sizes and processor/bus frequencies. The POWER3 hardware prefetch mechanism will also be discussed.

    The last half of the presentation will focus on performance measurements and tuning experiences. Benchmarks referenced include Linpack, stream, and sPPM. Tuning advice briefly covers profiling tools and analysis techniques before concentrating on improving performance via compiler usage (directives and options) and source code modification.


    POWER4 System Structure - Joel M. Tendler

    POWER4 systems are slated for introduction later this year. Operating at over 1 GHz, these systems include a new microprocessor and a new system interconnection. This talk describes the microarchitecture, including the memory subsystem. Areas that affect application performance in a technical environment are described.

    Presentation in Adobe PDF format.

    Dr. Joel M. Tendler
    Program Director, Technology Assessment, IBM Server Group
    Phone: 512-838-2838 (T/L 678-2838) Fax: 512-838-7637 (T/L 678-7637)
    Internet: jtendler@us.ibm.com