EARLY EXPERIENCE ON NAVO'S 2 TFLOPS WINTERHAWK II
Alan J. Wallcraft
Naval Research Laboratory
SCICOMP 2000
August 14, 2000
NAVO MSRC
One of four ``Major Shared Resource Centers'' within the DOD High Performance Computing Modernization Program
http://www.navo.hpc.mil
http://www.hpcmo.hpc.mil
Two large systems:
Cray T3E-900
1084 processors
Number 13 on TOP500
IBM SP WinterHawk II
1336 processors (334 nodes)
Number 4 on TOP500
Primary target is large jobs each using:
100 to 500 processors
8 to 24 wall hours
How do these machines differ on this workload?
CRAY T3E VS IBM SP WINTERHAWK II
Cray T3E
Single processor per node
Low communication latency
Good balance between system components
floating point, local memory size and bandwidth, remote memory latency and bandwidth
MPI or SHMEM
IBM SP Winterhawk II
Four processors per node
Moderate communication latency
Significant imbalance between system components
Lack of full memory crossbar
Relatively high inter-node switch latency
MPI or MPI plus OpenMP
IBM SP is potentially more capable
Many more configuration options
Tuning may be more important than on T3E
NRL LAYERED OCEAN MODEL (NLOM)
Tiled data parallel coding style
Application specific communication API
MPI-1
SHMEM
Co-Array Fortran / SPMD OpenMP Fortran
uni-processor
Computational routines
Explicitly tiled
CM Fortran / HPF?
Tile-loop-level OpenMP Fortran
Semi-implicit time scheme
Large horizontal extent
Very few layers in the vertical
2-D domain decomposition
1-D for Helmholtz's equation
Vertical always ``on-chip''
Articles in Fall 1999 and Spring 2000
NAVO MSRC Navigator Magazine
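The 2-D domain decomposition above can be illustrated with a minimal sketch that splits a horizontal grid into near-equal rectangular tiles. This is not NLOM's code: the function names and the 16 by 12 processor grid are assumptions made for illustration, and the 2048 by 1344 extent is borrowed from the NA824 benchmark described later.

```python
def tile_bounds(n_points, n_tiles, index):
    """Half-open range [start, stop) owned by tile `index` along one axis.

    Any remainder is spread over the lowest-numbered tiles, so tile
    sizes differ by at most one point.
    """
    base, extra = divmod(n_points, n_tiles)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def decompose(nx, ny, px, py):
    """Map each tile (i, j) of a px by py processor grid to its x and y ranges."""
    tiles = {}
    for j in range(py):
        for i in range(px):
            tiles[(i, j)] = (tile_bounds(nx, px, i), tile_bounds(ny, py, j))
    return tiles


# Hypothetical layout: 2048 by 1344 points on a 16 by 12 processor grid.
tiles = decompose(2048, 1344, 16, 12)
print(len(tiles), "tiles; tile (15, 11) owns", tiles[(15, 11)])
```

The vertical dimension is deliberately absent from the sketch, since every tile keeps all layers local.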
HALO BENCHMARK
Simulates NLOM 2-D ``halo'' exchange
N by N subdomain with N = 2...1024
Separate versions for each programming model
Easy to get running
Compare exchange strategies
Inter-compare programming models
No checks for correctness
Implement best method in real application to confirm that it works
More realistic low level benchmark than ``ping-pong''
Premium on low latency
Also true of NLOM as a whole
Poor HALO performance implies poor NLOM performance
See
ftp://ftp7300.nrlssc.navy.mil/pub/wallcraf/HALO.tar.gz
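For readers unfamiliar with the pattern, the following is a serial sketch of a 2-D halo exchange, written in plain Python rather than MPI or SHMEM. It is not taken from the HALO benchmark: the one-point halo width, the periodic boundaries, and all function names are assumptions chosen to keep the example short. Each N by N tile copies one edge strip from each of its four neighbours; corner points are left unfilled.

```python
def make_tiles(field, px, py, n):
    """Cut a (py*n) by (px*n) field into n by n tiles with a one-point halo."""
    tiles = {}
    for tj in range(py):
        for ti in range(px):
            tile = [[None] * (n + 2) for _ in range(n + 2)]
            for j in range(n):
                for i in range(n):
                    tile[j + 1][i + 1] = field[tj * n + j][ti * n + i]
            tiles[(ti, tj)] = tile
    return tiles


def exchange_halos(tiles, px, py, n):
    """Fill each tile's four edge halos from its neighbours (periodic wrap).

    Only interior points are read and only halo points are written,
    so the order in which tiles are visited does not matter.
    """
    for (ti, tj), tile in tiles.items():
        west = tiles[((ti - 1) % px, tj)]
        east = tiles[((ti + 1) % px, tj)]
        south = tiles[(ti, (tj - 1) % py)]
        north = tiles[(ti, (tj + 1) % py)]
        for k in range(1, n + 1):
            tile[k][0] = west[k][n]       # west halo <- neighbour's last column
            tile[k][n + 1] = east[k][1]   # east halo <- neighbour's first column
            tile[0][k] = south[n][k]      # south halo <- neighbour's last row
            tile[n + 1][k] = north[1][k]  # north halo <- neighbour's first row


# Small test field whose value encodes its global position.
px, py, n = 3, 2, 4
width = px * n
field = [[row * width + col for col in range(width)] for row in range(py * n)]
tiles = make_tiles(field, px, py, n)
exchange_halos(tiles, px, py, n)
```

In a message-passing version each of the four copies becomes a message of N words, which is why small N puts the premium on latency rather than bandwidth.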
HALO BENCHMARK (II)
Same switch on two IBM SP's
NightHawk I (8 CPUs per node)
WinterHawk II (4 CPUs per node)
Switch only supports 4 MPI processes per node
NightHawk I is slightly slower
Large difference between 1 and 4 MPI processes per node
4 MPI processes for MPI alone
1 MPI process for MPI+OpenMP
Difference in performance favors MPI+OpenMP
HALO BENCHMARK (III)
IBM SP with 4 MPI processes comparable to Origin 2800
IBM SP with 1 MPI process comparable to Sun E10000
Sun E10000 has industry's best MPI
Not a practical IBM SP configuration
HALO BENCHMARK (IV)
One-sided communication usually beats MPI on global memory hardware
Sun E10000 has subset of MPI-2 PUT/GET
Faster (lower latency) than MPI-1 message passing
Cray T3E using SHMEM
Lowest latency and highest bandwidth
IBM SHMEM is very slow
SHMEM requires both LAPI and MPI
Only two SHMEM tasks per node
Native LAPI is presumably similar in performance
IBM SHMEM and LAPI not suitable for low latency applications
HALO BENCHMARK (V)
HALO exchange only involves four remote processors per MPI process
Should run at same speed on any number of processors
If switch is perfectly scalable
WinterHawk II showing little slowdown from 24 to 768 processors
With 3 MPI processes per node
Perhaps a small bandwidth reduction above 200 processors
HALO BENCHMARK (VI)
WinterHawk II with 4 MPI processes per node
Significant slowdown above 96 processors (24 nodes)
Wild variation in performance
Might be due to the Operating System ``stealing'' cycles from the application
Often have to ``reserve'' one or more processors for the O/S when running load balanced SPMD applications
For example, on a 32-processor Origin it is faster to run NLOM on 28, rather than on all 32, processors
HALO BENCHMARK (VII)
Test the O/S overhead theory by running multiple OpenMP threads and spinning the slave threads on a BARRIER while performing the HALO exchange
Simulates a typical MPI+OpenMP setup.
All cases are on 96 nodes
Using 1 MPI process per node
1 thread and 3 threads are virtually identical
4 threads is significantly slower
Using 2 MPI processes per node
2 threads is significantly slower
All cases involving 4 tasks per node show poor MPI performance.
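The pattern in these results can be restated as simple arithmetic: a configuration is slow exactly when MPI processes times threads per process occupies all four CPUs of a node. The sketch below only tabulates the cases listed above; the function and variable names are illustrative assumptions, and it reports no timings.

```python
CPUS_PER_NODE = 4  # WinterHawk II node size, from the slides above


def busy_cpus(mpi_per_node, threads_per_mpi):
    """CPUs kept busy on one node by a given MPI+OpenMP layout."""
    return mpi_per_node * threads_per_mpi


# (MPI processes per node, threads per MPI process) for the cases listed above.
cases = [(1, 1), (1, 3), (1, 4), (2, 2)]

# A layout is "saturated" when it leaves no CPU free for the O/S.
saturated = {case: busy_cpus(*case) == CPUS_PER_NODE for case in cases}

for (mpi, threads), full in saturated.items():
    spare = CPUS_PER_NODE - busy_cpus(mpi, threads)
    note = "no CPU left for the O/S" if full else f"{spare} CPU(s) spare"
    print(f"{mpi} MPI x {threads} threads: {note}")
```

The two saturated layouts, 1 by 4 and 2 by 2, are the ones reported as significantly slower, which is consistent with the O/S overhead theory but does not prove it.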
NA824 AND NA825 BENCHMARKS
NRL Layered Ocean Model (NLOM)
1/32° N. Atlantic Subtropical Gyre
NA824: 2048 by 1344 by 5
NA825: 4096 by 2688 by 5
Typical I/O and data sampling
3.05 model days
Excludes initialization time
Because a typical run is 30.5 to 91.5 days
Internal subroutine-level timers
Operation count from hardware trace on single node
Only ``useful'' flops are counted
Not our largest problem size
1/32° Global Ocean (72S-65N):
8192 by 4608 by 6
See
ftp://ftp7300.nrlssc.navy.mil/pub/wallcraf/NA824.tar.gz
WINTERHAWK II NLOM TUNING
Compilation:
mpxlf -O3 -qstrict
-qarch=pwr3 -qtune=pwr3 -qcache=auto
-qfloat=hsflt -qunroll
-qalias=noaryovrlp -qalias=nopteovrlp -qalign=4k
-lessl -lmass
mpxlf_r -O3 -qstrict
-qarch=pwr3 -qtune=pwr3 -qcache=auto
-qfloat=hsflt -qunroll
-qalias=noaryovrlp -qalias=nopteovrlp -qalign=4k
-qsmp=noauto:omp -qnosave -lessl_r
MPI run time:
MP_EUILIB = us
MP_EAGER_LIMIT = 65536
MP_SHARED_MEMORY = yes
MP_SINGLE_THREAD = yes
OpenMP run time:
OMP_NUM_THREADS = 3
OMP_NUM_TASKS = 3
SPINLOOPTIME = 10000
NLOM NA824 BENCHMARK (I)
Mflops always based on the same single processor operation count
So total Mflops is inversely proportional to wall time
A constant ``speed per processor'' indicates perfect scalability
IBM SP WinterHawk II
Using 3 MPI processes per (4 processor) node
IBM SP NightHawk I
Using 4 MPI processes per (8 processor) node
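The speed metric described on this slide can be written out as a short sketch. The operation count and wall times below are made-up round numbers chosen only to make the arithmetic visible; they are not measurements from NA824, and the function names are assumptions.

```python
def total_mflops(operation_count, wall_seconds):
    """Total speed: a fixed operation count divided by measured wall time."""
    return operation_count / wall_seconds / 1.0e6


def mflops_per_processor(operation_count, wall_seconds, processors):
    """Speed per processor; constant across runs means perfect scalability."""
    return total_mflops(operation_count, wall_seconds) / processors


# Hypothetical inputs, for illustration only (not benchmark data).
OPS = 1.2e13                 # single-processor operation count, held fixed
runs = {48: 2500.0, 96: 1250.0}  # processors -> wall seconds

for procs, wall in runs.items():
    print(procs, total_mflops(OPS, wall), mflops_per_processor(OPS, wall, procs))
```

Because the operation count never changes, halving the wall time while doubling the processor count leaves the per-processor figure unchanged, which is the perfect-scaling case.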
NLOM NA824 BENCHMARK (II)
Same results as above but for Total Speed in Mflops
The 100 Mflops per processor curve is included for reference
NLOM NA825 BENCHMARK (I)
On the WinterHawk II the appropriate measure of performance is per node, not per processor
Because unused processors will be idle.
In principle they could be used by another job, but shared nodes are not a good idea
The 4-MPI processes per node curves are ``best case''
They show up to 30% variation between runs
Often slower than 3-MPI processes
Super-linear performance indicates a strong ``cache effect''
3 tasks per node faster on large node counts
MPI beats MPI+OpenMP, except on 299 nodes
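The per-node accounting can be made concrete with a little arithmetic. The sketch below assumes whole nodes are allocated to a job (consistent with the advice against shared nodes) and uses a 400 Mflops per node reference level, which is four processors at 100 Mflops each. The 288-process job size and the function names are illustrative assumptions.

```python
CPUS_PER_NODE = 4               # WinterHawk II
REFERENCE_MFLOPS_PER_NODE = 400.0  # 4 processors at 100 Mflops each


def nodes_needed(mpi_processes, mpi_per_node):
    """Whole nodes occupied by a job (ceiling division)."""
    return -(-mpi_processes // mpi_per_node)


def required_mflops_per_process(mpi_per_node):
    """Per-process speed needed to reach the per-node reference level."""
    return REFERENCE_MFLOPS_PER_NODE / mpi_per_node


# Hypothetical 288-process job placed 3 or 4 per node.
for per_node in (3, 4):
    print(per_node, "per node:",
          nodes_needed(288, per_node), "nodes,",
          round(required_mflops_per_process(per_node), 1),
          "Mflops per process needed")
```

With one processor per node left idle, each of the three remaining processes has to run about a third faster to deliver the same per-node speed, so 3 tasks per node is a net gain only when the fully loaded case loses more than that.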
NLOM NA825 BENCHMARK (II)
Same results as above, but for Total Speed in Mflops
The 400 Mflops per node curve is included for reference
OPENMP MICOM
Miami Isopycnic Coordinate Ocean Model
Region size: 135 x 256 x 16
Loop-level OpenMP directives
IBM OpenMP Compilation:
xlf_r -qnosave -qsmp=noauto:omp
KAI OpenMP Compilation:
guidef77 -qnoswapomp -qnosave
Default OpenMP run time:
OMP_NUM_THREADS =
Works well, except for IBM OpenMP
Optimal IBM OpenMP run time:
OMP_NUM_THREADS =
SPINLOOPTIME = 500
YIELDLOOPTIME = 500
OPENMP MICOM BENCHMARK (I)
Speedup using only OMP_NUM_THREADS at run time
IBM does not do well in this default mode
OPENMP MICOM BENCHMARK (II)
Identical to the first version
Except for additional environment variables on IBM
Sun E10000 scales the best
SGI Origin also scales well
KAI better than native OpenMP compiler on IBM
Identical scaling on Nighthawk and Winterhawk
IBM and Compaq scaling less well than Sun and SGI
OPENMP MICOM BENCHMARK (III)
Wall time is obviously the most important criterion
Depends on scalability and processor speed
Compaq ES-40 and IBM WinterHawk II are fastest
But they only have 4 processors
Origin showing the best combination of scalability and performance
This is a relatively old machine
There are now 300 MHz Origin 2800s, and the new Origin 3800s
CONCLUSIONS
WinterHawk II about 2x faster per CPU than Cray T3E
Switch does not scale to large numbers of nodes when using all four processors (for MPI or for MPI + OpenMP)
Possibly due to O/S overhead
May be fixable via better tuning
Will the same be true of future switches?
Always reserve one CPU for O/S?
IBM SHMEM and LAPI not suitable for low latency applications
SHMEM requires both LAPI and MPI
Only two SHMEM tasks per node
Native xlf OpenMP still immature
Several common ``1st generation'' bugs
Scales less well than some other OpenMP implementations
Should improve over time