The limits of single-thread performance
- As discussed before, CPU frequencies are no longer increasing
because of heat limitations
- Data and control dependencies limit the level of single-thread
(micro) parallelism
- Together these sharply limit single-thread performance
- So increased performance must come from multi-thread (macro)
parallelism
If we can effectively use multi-thread parallelism, it may even be
beneficial to employ less-than-maximal single-thread performance
- power consumption is an important factor in costs for large
systems (an extreme ex.: the Cray Jaguar consumes 7 megawatts)
- systems are now benchmarked in operations per second per watt (SPECpower; text,
pp. 49-50)
Providing multiple threads
Multiple threads can be provided by
- multithreading a single core
- putting multiple cores on a chip
- putting multiple chips in a system
Multithreading (text, sec. 7.5)
- allows multiple threads to share functional units of a single
processor
- a separate copy of process state (PC, registers) is kept for each
thread
- fine-grain multithreading
switches threads after each instruction
- reduces stalls due to data hazards, cache misses
- coarse-grain multithreading
switches only on stalls
- simultaneous multithreading
takes advantage of functional unit parallelism on
dynamically scheduled, multiple-issue processors by running
instructions from multiple threads simultaneously
- many current CPUs support multithreading
- Sun UltraSPARC T2 chip (2007) has 8 cores which can each run 8
threads
- relatively simple cores, small cache
- well designed for server market
- Intel has "hyperthreading" with 2 threads per core on Core i7
and Atom
Multiple cores per chip
- 2, 4, and 6 cores per chip now common
- steady increase reflecting Moore's law
- Intel just announced a prototype 48-core chip
Multiple chips in a system
- 2 or 4 processor chips on a board
- with small server boards ("blades"), up to 128 blades in a rack
Using multiple threads: algorithms and applications
Applications differ in the degree to which they can be parallelized and
the communication required between threads
- web servers are very favorable: lots of parallelism
(multiple users), little communication
- physical simulations (weather, nuclear, biological, fluid flow
[planes]) -- parallelize by dividing space, communicate boundary values
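The divide-space-and-exchange-boundaries idea can be sketched in a few lines. Below is an illustrative toy (not from the text): a 1-D heat-diffusion step where two workers each own half the rod, and each step they only need to exchange the single boundary ("ghost") cell with their neighbor. All names and the boundary handling are assumptions made for the sketch.

```python
# Sketch: parallelizing a physical simulation by dividing space.
# Each worker owns a slice of a 1-D rod; per step, it needs only the
# boundary cells of its neighbors (the "communicate boundary values" idea).

def step(cells, left_ghost, right_ghost):
    """One averaging (diffusion) step over one worker's slice,
    using ghost cells received from neighboring workers."""
    padded = [left_ghost] + cells + [right_ghost]
    return [(padded[i - 1] + padded[i + 1]) / 2.0
            for i in range(1, len(padded) - 1)]

# Two "workers" each own half of the rod (hot left half, cold right half).
left = [100.0, 100.0, 100.0, 100.0]
right = [0.0, 0.0, 0.0, 0.0]

for _ in range(3):
    # Communication phase: exchange boundary values between workers.
    l_edge, r_edge = left[-1], right[0]
    # Compute phase: each worker updates its own slice independently.
    left = step(left, left[0], r_edge)       # outer edges reuse own value
    right = step(right, l_edge, right[-1])
```

After a few steps heat has flowed across the worker boundary, even though each worker only ever saw one cell of the other's data per step; this low communication-to-computation ratio is what makes such simulations parallelize well.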
Amdahl's law (text, p. 51):
- if we speed up a fraction P of our program by a factor of S,
overall speedup is
1 / ( (1 - P) + (P / S) )
- with lots of threads, speed-up of overall program is often
limited by portion that remains sequential
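The formula above is easy to evaluate directly; a minimal sketch (function name is my own) showing how the sequential portion caps the overall speedup:

```python
def amdahl_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of the
    program is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Doubling the speed of half the program gains only ~1.33x overall:
print(amdahl_speedup(0.5, 2))
# Even with effectively unlimited threads on 90% of the program,
# the remaining sequential 10% caps overall speedup near 10x:
print(amdahl_speedup(0.9, 1e9))
```

This is the quantitative form of the point above: as s grows without bound, speedup approaches 1 / (1 - P), set entirely by the sequential fraction.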
Communication (text, sec. 7.3 and 7.4)
Shared-memory multiprocessors
- provide a single address space seen by all processors
- generally the case for multithreading and multicore
- supports a high degree of communication: processes
communicate through shared variables in memory
- facilitates conversion of serial programs to parallel
- requires considerable hardware to handle heavy communication
- bus is cheap but communication bottleneck
- crossbar switch is expensive
- intermediate networks (2d grid, n-cube (text, sec. 7.8))
provide intermediate solutions
- uniform memory architecture:
all memory has roughly the same access time
- non-uniform memory
architecture: some memory can be accessed much more
quickly (typically, local to processor)
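A minimal sketch of the shared-memory model using Python threads: the threads communicate through an ordinary shared variable, and a lock stands in for the hardware coherence/synchronization support discussed above (all names here are illustrative).

```python
import threading

# Shared-memory communication: all threads see one address space,
# so they can communicate simply by reading/writing a shared variable.
counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # without the lock, concurrent updates can be lost
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Note how little the "parallel" code differs from a serial loop, which is the sense in which shared memory "facilitates conversion of serial programs to parallel"; the cost is the synchronization (and, in hardware, the coherence traffic) hidden behind that lock.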
Message passing multiprocessors
- requires explicit sending and receiving operations for
communication between processors
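The contrast with shared memory can be sketched with Python's `multiprocessing` module: the two processes share no variables, so every value must be explicitly sent and received over a channel (the pipe here stands in for the interconnect; names are illustrative).

```python
import multiprocessing as mp

def worker(conn):
    """Child process: no shared variables with the parent --
    all communication is an explicit receive followed by a send."""
    x = conn.recv()        # explicit receive
    conn.send(x * 2)       # explicit send
    conn.close()

def run_demo():
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(21)   # explicit send to the other processor
    reply = parent_conn.recv()
    p.join()
    return reply

if __name__ == "__main__":
    print(run_demo())  # prints 42
```

Compared with the shared-memory version, the programmer must restructure the algorithm around sends and receives, but no coherent shared memory (and its supporting hardware) is required.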