
Overcoming the Barriers to Sustained Petaflop Performance

William D. Gropp
Mathematics and Computer Science Division, Argonne National Laboratory
www.mcs.anl.gov/~gropp

Barriers 2006

Why is achieved performance on a single node so poor?

[Chart: clock rate (ns) versus date of introduction, Aug-76 through Aug-00, comparing supercomputers (Cray, NEC), RISC (HP, MIPS), CISC (Intel), and DRAM memory performance. Annotations mark where floating point is relevant versus irrelevant.]


Peak CPU speeds are stable

[Chart omitted; from http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005/]


Why are CPUs not getting faster?

Power dissipation problems will force more changes. Current trends imply chips with energy densities greater than a nuclear reactor's. This is already a problem: recent Mac laptops were recalled because they could overheat. Power limits will force new ways to get performance, such as extensive parallelism.


Where will we get (Sustained) Performance?

Algorithms that are a better match for the architectures. Parallelism and concurrency at all levels. Algorithms and hardware together, where hardware includes multicore, GPU, and FPGA.

A major challenge is to realize these approaches in code. Most compilers do poorly with important kernels in computational science. Three examples: sparse matrix-vector product, dense matrix-matrix multiply, and flux calculation.


Sparse Matrix-Vector Product

A common operation in methods that solve linear systems with an optimal number of floating-point operations. Sample code (in C):

for (row = 0; row < n; row++) {
    m = i[row+1] - i[row];      /* nonzeros in this row */
    sum = 0;
    for (k = 0; k < m; k++)
        sum += *a++ * x[*j++];  /* walk values and column indices */
    y[row] = sum;
}

Data structures are a[nnz], j[nnz], i[n+1], x[n], y[n].
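
To make the fragment concrete, here is a self-contained version; a minimal sketch that assumes 0-based CSR indexing, with a tiny illustrative matrix rather than anything from the talk's measurements:

#include <stdio.h>

/* CSR sparse matrix-vector product y = A*x.
   a[nnz]: nonzero values; j[nnz]: column indices;
   i[n+1]: row pointers (row's nonzeros are i[row]..i[row+1]-1). */
void spmv(int n, const int *i, const int *j, const double *a,
          const double *x, double *y)
{
    for (int row = 0; row < n; row++) {
        double sum = 0.0;
        for (int k = i[row]; k < i[row+1]; k++)
            sum += a[k] * x[j[k]];
        y[row] = sum;
    }
}

int main(void)
{
    /* 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]], x = (1,1,1) */
    int    i[] = {0, 2, 3, 5};
    int    j[] = {0, 2, 1, 0, 2};
    double a[] = {2, 1, 3, 4, 5};
    double x[] = {1, 1, 1}, y[3];
    spmv(3, i, j, a, x, y);
    for (int r = 0; r < 3; r++)
        printf("y[%d] = %g\n", r, y[r]);   /* expect 3, 3, 9 */
    return 0;
}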


Simple Performance Analysis

Memory motion (assuming a perfect cache: never load the same data twice, only compulsory loads):

    nnz * (sizeof(double) + sizeof(int)) + n * (2*sizeof(double) + sizeof(int))

Computation: nnz multiply-adds (MA).

With 8-byte doubles and 4-byte ints, that is roughly 12 bytes per MA. A typical workstation node can move 1-4 bytes per MA, so the maximum achievable performance is 8-33% of peak.
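
The arithmetic behind that bound can be written out directly. In the sketch below, the bandwidth and peak figures are hypothetical placeholders for some machine, not numbers from the talk:

#include <stdio.h>

int main(void)
{
    /* Bytes moved per multiply-add for CSR SpMV with a perfect
       cache: one double and one int per nonzero (the per-row
       terms amortize away when nnz >> n). */
    double bytes_per_ma = sizeof(double) + sizeof(int);   /* 12 */

    /* Hypothetical machine (placeholders, not measured values). */
    double bw   = 4.0e9;   /* sustained bytes/s from memory */
    double peak = 2.0e9;   /* peak multiply-adds/s          */

    double frac = (bw / peak) / bytes_per_ma;
    printf("max fraction of peak: %.1f%%\n", 100.0 * frac);
    return 0;
}

At 2 bytes/MA available against 12 needed, this hypothetical machine would sustain about 17% of peak, squarely inside the 8-33% range above.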


More Performance Analysis

Instruction counts:

    nnz * (2*load-double + load-int + mult-add) + n * (load-int + store-double)

That is roughly 4 instructions per MA, so the maximum performance is 25% of peak (33% if an MA can overlap one load/store). Wide instruction words can help here.

Changing the matrix data structure (e.g., exploiting small block structure) allows reuse of data in registers, eliminating some loads of x and j. Implementation improvements (tricks) cannot improve on these limits.
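
A sketch of the data-structure change just described: register blocking with 2x2 blocks, one common layout (the talk does not fix the exact format). Each block stores one column index instead of four, and each pair of x values is loaded once and reused across the block's two rows:

/* y = A*x with A stored in 2x2 block CSR: one column index per
   block instead of per entry, and each x pair loaded once and
   reused across the block's two rows. */
void spmv_bcsr2x2(int nb,            /* number of block rows      */
                  const int *bi,     /* block row pointers, nb+1  */
                  const int *bj,     /* one column index per block*/
                  const double *ba,  /* blocks, 4 doubles each    */
                  const double *x, double *y)
{
    for (int ib = 0; ib < nb; ib++) {
        double sum0 = 0.0, sum1 = 0.0;        /* kept in registers */
        for (int k = bi[ib]; k < bi[ib+1]; k++) {
            const double *b = &ba[4*k];       /* row-major block   */
            double x0 = x[2*bj[k]], x1 = x[2*bj[k]+1];
            sum0 += b[0]*x0 + b[1]*x1;
            sum1 += b[2]*x0 + b[3]*x1;
        }
        y[2*ib]   = sum0;
        y[2*ib+1] = sum1;
    }
}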


Realistic Measures of Peak Performance: Sparse Matrix-Vector Product
One vector; matrix size m = 90,708; nonzero entries nz = 5,047,120.

[Measured performance charts omitted in this extraction.]

Thanks to Dinesh Kaushik; ORNL and ANL for compute time.


What About CPU-Bound Operations?

Dense matrix-matrix product is probably the numerical program most studied by compiler writers. It is the core of some important applications and, more importantly, the core operation in High Performance Linpack (HPL). It should therefore give optimal performance.


How Successful are Compilers with CPU-Intensive Code?

[Chart from ATLAS comparing compiler-generated and hand-tuned DGEMM performance.]

Enormous effort is required to get good performance; there is a large gap between natural code and specialized code.
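
For reference, the "natural" code in such comparisons is a plain triple loop. A C rendering (the talk's measured version was Fortran):

/* Natural triple-loop product C = C + A*B (square, row-major).
   This is what a user would write; tuned libraries add blocking,
   unrolling, and careful instruction scheduling on top. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = C[i*n + j];
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}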


Consequences of the Memory/CPU Performance Gap

The performance of an application may be (and often is) limited by memory bandwidth or latency rather than by CPU clock rate. Peak performance is determined by whichever resource is operating at full speed for the algorithm: often the memory system (e.g., see STREAM results), sometimes the instruction rate or mix (including integer ops).

For example, sparse matrix-vector performance is best estimated using STREAM performance. Note that STREAM measures delivered performance to a Fortran or C program, not the memory bus rate times its width. High memory latency and a low number of outstanding loads can significantly reduce sustained memory bandwidth.
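
The idea behind STREAM, in miniature: time a simple triad loop from C and report the bandwidth actually delivered to the program. A simplified sketch, not the official benchmark:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000   /* large enough to defeat the cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)          /* STREAM-style triad */
        a[i] = b[i] + 3.0 * c[i];
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* three arrays of doubles moved: two read, one written */
    printf("~%.2f GB/s delivered\n",
           3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}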


Performance for Real Applications

The dense matrix-matrix example shows that even for well-studied, compute-bound kernels, compiler-generated code achieves only a small fraction of available performance. The Fortran code uses natural loops, i.e., what a user would write for most code; the others use multi-level blocking, careful instruction scheduling, etc.

Algorithm design also needs to take into account the capabilities of the system, not just the hardware. Example: cache-oblivious algorithms (http://supertech.lcs.mit.edu/cilk/papers/abstracts/abstract4.html).

Adding concurrency (whether multicore or multiple processors) just adds to the problems.
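
In the spirit of the cache-oblivious work cited above, a minimal sketch: recursive splitting reuses data at every level of the memory hierarchy without ever naming a cache size. It assumes row-major storage, n a power of two, and C zeroed by the caller:

/* Cache-oblivious C += A*B by recursive splitting (row-major,
   leading dimension ld). The subproblems eventually fit every
   cache level without knowing its parameters. */
void matmul_rec(int n, int ld, const double *A, const double *B,
                double *C)
{
    if (n <= 32) {                          /* small base case */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    int h = n / 2;                          /* split into quadrants */
    const double *A11 = A,        *A12 = A + h,
                 *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B,        *B12 = B + h,
                 *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;

    matmul_rec(h, ld, A11, B11, C11); matmul_rec(h, ld, A12, B21, C11);
    matmul_rec(h, ld, A11, B12, C12); matmul_rec(h, ld, A12, B22, C12);
    matmul_rec(h, ld, A21, B11, C21); matmul_rec(h, ld, A22, B21, C21);
    matmul_rec(h, ld, A21, B12, C22); matmul_rec(h, ld, A22, B22, C22);
}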


Distributed Memory Code

Single-node performance is clearly a problem. What about parallel performance? There have been many successes at scale (e.g., Gordon Bell Prizes for >100 TF on 64K Blue Gene nodes). There are some difficulties with load balancing and with designing code and algorithms for latency, but skilled programmers and application scientists have been remarkably successful.

Is there a problem? There is the issue of productivity. Consider the NAS parallel benchmark code for multigrid (mg.f).

What is the problem?

The user is responsible for all steps in the decomposition of the data structures across the processors (see the sketch below). Note that this does give the user (or someone) a great deal of flexibility, as the data structure can be distributed in arbitrary ways across arbitrary sets of processors.
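
Not the mg.f excerpt itself (it did not survive this extraction), but a minimal MPI sketch of the bookkeeping the slide describes: each rank computes by hand which block of a 1-D grid it owns and who its neighbors are.

#include <mpi.h>
#include <stdio.h>

/* Minimal 1-D block decomposition of an n-point grid. All of
   this index arithmetic is the user's responsibility in
   MPI-style codes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, n = 1000;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n / size, rem = n % size;
    int start = rank * chunk + (rank < rem ? rank : rem);
    int local = chunk + (rank < rem ? 1 : 0);
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    printf("rank %d owns [%d, %d), neighbors %d/%d\n",
           rank, start, start + local, left, right);
    MPI_Finalize();
    return 0;
}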

Another example


Manual Decomposition of Data Structures

The trick: this is from a paper on dense matrix tiling for uniprocessors! This suggests that managing data decompositions is a common problem for real machines, whether they are parallel or not; it is not just an artifact of MPI-style programming. Aiding programmers in data structure decomposition is an important part of solving the productivity puzzle.


Possible solutions

A single, integrated system is the best choice when it works (e.g., Matlab). Current terascale systems and many proposed petascale systems exploit hierarchy instead. Hierarchy has been successful at many levels, including cluster hardware and OS scalability, and we should apply it to productivity software as well. The problem is hard: it means applying classic and very successful computer science strategies to the complexity of generating fast code for a wide range of user-defined data structures.

How can we apply hierarchies? One approach is to use libraries, but these are limited by the operations envisioned by the library designer. Another is to enhance the user's ability to express the problem in source code.


Annotations

Annotations aid in the introduction of hierarchy into the software. It's going to happen anyway, so make a virtue of it, and create a market or ecosystem in transformation tools. The longer-term issue is to integrate the annotation language into the host language to ensure type safety, ensure consistency (both syntactic and semantic), allow closer debugger integration, and open additional optimization opportunities through information sharing.


Examples of the Challenges

Fast code for DGEMM (dense matrix-matrix multiply): the code generated by ATLAS is omitted to avoid blindness; example code from the superscalar GEMM-based Level 3 BLAS of Gustavson et al. appears on the next slide.

PETSc code for sparse matrix operations includes unrolling and use of registers. Code for the diagonal format is fast on cache-based systems but slow on vector systems. There is too much code to rewrite by hand for new architectures.

MPI implementations of collective communication and computation use complex algorithms for such simple operations as broadcast and reduce, far beyond a compiler's ability to create from simple code.
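
To make the point concrete, here is one of the simpler schemes an MPI library might use internally for broadcast, a binomial tree built from point-to-point messages; production implementations choose among several such algorithms by message size and topology. A sketch, not any particular library's code:

#include <mpi.h>

/* Binomial-tree broadcast from rank 0. In round k, every rank
   that already holds the data forwards it a distance of 2^k. */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int dist = 1; dist < size; dist *= 2) {
        if (rank < dist) {
            if (rank + dist < size)
                MPI_Send(buf, count, type, rank + dist, 0, comm);
        } else if (rank < 2 * dist) {
            MPI_Recv(buf, count, type, rank - dist, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}

The broadcast completes in ceil(log2(P)) rounds instead of the P-1 sends of a naive loop, which is exactly the kind of restructuring no compiler will derive from simple code.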


A fast DGEMM (sample)

SUBROUTINE DGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB,
$                  BETA, C, LDC )
...
UISEC = ISEC-MOD( ISEC, 4 )