Wednesday, 28 August 2013

Multi-cores, multi-threading, multiprocessors

The limits of single-thread performance

  • As discussed before, CPU frequencies are no longer increasing because of heat limitations
  • Data and control dependencies limit the level of single-thread (micro) parallelism
  • Together these sharply limit single-thread performance
  • So increased performance must come from multi-thread (macro) parallelism
If we can use multi-thread parallelism effectively, it may even be beneficial to accept less-than-maximal single-thread performance
  • power consumption is an important cost factor for large systems (an extreme example:  the Cray Jaguar consumes 7 megawatts)
  • systems are now benchmarked in operations per second per watt (SPECpower;  text, pp. 49-50)

Providing multiple threads

Multiple threads can be provided by
  • multithreading a single core
  • putting multiple cores on a chip
  • putting multiple chips in a system
Multithreading (text, sec. 7.5)
  • allows multiple threads to share functional units of a single processor
  • a separate copy of the thread state (PC, registers) is kept for each thread
  • fine-grain multithreading switches threads after each instruction
    • reduces stalls due to data hazards, cache misses
  • coarse-grain multithreading switches only on stalls
  • simultaneous multithreading takes advantage of functional unit parallelism on dynamically scheduled, multiple-issue processors by running instructions from multiple threads simultaneously
  • many current CPUs support multithreading
    • Sun UltraSPARC T2 chip (2007) has 8 cores which can each run 8 threads
      • relatively simple cores, small cache
      • well designed for server market
    • Intel has "hyperthreading" with 2 threads per core on Core i7 and Atom
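The difference between fine-grain and coarse-grain switching can be sketched with a toy model. Everything here is an assumption made for illustration: the function name `cycles_to_finish`, a single-issue core, and the encoding of each thread as a list of stall lengths (the number of cycles the thread must wait after that instruction, for example on a cache miss). The model also ignores the pipeline refill cost that real coarse-grain switching pays, so it should not be read as a performance prediction.

```python
def cycles_to_finish(threads, policy):
    """Cycles needed to issue every instruction on a single-issue core.

    threads: list of threads; each thread is a list of stall lengths, where
             entry k is the number of cycles the thread waits after issuing
             instruction k (0 means no stall).
    policy:  "fine" rotates to the next ready thread every cycle;
             "coarse" stays on the current thread until it stalls.
    """
    if policy not in ("fine", "coarse"):
        raise ValueError("policy must be 'fine' or 'coarse'")
    n = len(threads)
    pc = [0] * n          # next instruction index, one per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    current = 0           # thread that issued most recently
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # Fine-grain starts the search at the next thread; coarse-grain
        # starts at the current thread and only moves on if it cannot issue.
        start = (current + 1) % n if policy == "fine" else current
        order = [(start + k) % n for k in range(n)]
        chosen = next(
            (t for t in order
             if pc[t] < len(threads[t]) and ready_at[t] <= cycle),
            None,
        )
        if chosen is not None:
            stall = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            ready_at[chosen] = cycle + 1 + stall
            current = chosen
        cycle += 1        # an idle cycle if no thread was ready
    return cycle


if __name__ == "__main__":
    one = [[0, 3, 0]]
    two = [[0, 3, 0], [0, 3, 0]]
    for policy in ("fine", "coarse"):
        print(policy, cycles_to_finish(one, policy), cycles_to_finish(two, policy))
```

In this toy model, two threads sharing the core finish in fewer cycles than running them back to back, because one thread issues while the other is stalled.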
Multiple cores per chip
  • 2, 4, and 6 cores per chip now common
  • steady increase reflecting Moore's law
  • Intel just announced a prototype 48-core chip
Multiple chips in a system
  • 2 or 4 processor chips on a board
  • with small server boards ("blades"), up to 128 blades in a rack

Using multiple threads:  algorithms and applications

Applications differ in the degree to which they can be parallelized and the communication required between threads
  • web servers are very favorable:  lots of parallelism (multiple users), little communication
  • physical simulations (weather, nuclear, biological, fluid flow [planes]) -- parallelize by dividing space, communicate boundary values
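The "divide space, communicate boundary values" pattern can be sketched on a 1-D grid. This is a minimal illustration, not a real simulation code: the averaging update and the function names `step` and `step_decomposed` are assumptions, and the subdomains are processed one after another here rather than on separate threads. The point is that each subdomain needs only one value from each neighbor (a "halo" value) to update its own cells.

```python
def step(u):
    """One averaging step on a 1-D grid; the two end points stay fixed."""
    if len(u) < 3:
        return list(u)
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def step_decomposed(u, parts=2):
    """Same step, computed subdomain by subdomain.

    Each subdomain sees only its own cells plus one boundary value from each
    neighbor, which is what a thread would have to receive before updating.
    """
    n = len(u)
    bounds = [n * k // parts for k in range(parts + 1)]
    new = []
    for k in range(parts):
        lo, hi = bounds[k], bounds[k + 1]
        local = u[lo:hi]
        left_halo = u[lo - 1] if lo > 0 else None    # from the left neighbor
        right_halo = u[hi] if hi < n else None       # from the right neighbor
        for j, x in enumerate(local):
            i = lo + j
            if i == 0 or i == n - 1:
                new.append(x)                        # fixed end point
            else:
                left = local[j - 1] if j > 0 else left_halo
                right = local[j + 1] if j < len(local) - 1 else right_halo
                new.append((left + x + right) / 3.0)
    return new


if __name__ == "__main__":
    grid = [0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
    print(step(grid))
    print(step_decomposed(grid, parts=3))
```

The communication per step is two values per subdomain regardless of subdomain size, which is why larger subdomains give a better ratio of computation to communication.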
Amdahl's law (text, p. 51):
  • if we speed up a fraction P of our program by a factor of S, overall speedup is
                                1 / ( (1 - P) + (P / S) )
  • with lots of threads, speed-up of overall program is often limited by portion that remains sequential
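As a quick numerical check of the formula above, the sketch below evaluates it for a few values of S. The function names `amdahl_speedup` and `speedup_limit` are made up for this example. The second function uses the fact that as S grows without bound the P / S term vanishes, leaving 1 / (1 - P).

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run time is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def speedup_limit(p):
    """Upper bound on the speedup as s grows without bound: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    for s in (2, 8, 64, 1024):
        print(f"P = 0.90, S = {s:4d}: speedup = {amdahl_speedup(0.90, s):.2f}")
    print(f"limit for P = 0.90: {speedup_limit(0.90):.2f}")
```

With P = 0.90, no number of threads can push the overall speedup past 10, since the remaining 10% of the program still runs sequentially.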

Communication (text, sec. 7.3 and 7.4)

Shared-memory multiprocessors
  • provide a single address space seen by all processors
  • generally the case for multithreading and multicore
  • supports a high degree of communication:  processes communicate through shared variables in memory
  • facilitates conversion of serial programs to parallel
  • requires considerable hardware to handle heavy communication
    • a bus is cheap but becomes a communication bottleneck
    • crossbar switch is expensive
    • intermediate networks (2d grid, n-cube (text, sec. 7.8)) provide intermediate solutions
  • uniform memory access (UMA):  all memory has roughly the same access time
  • non-uniform memory access (NUMA):  some memory can be accessed much more quickly (typically, memory local to the processor)
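Communication through shared variables can be sketched with Python's standard `threading` module. The function name `shared_counter_demo` and the counter workload are assumptions for illustration. All threads read and write the same variable, and a lock serializes the updates so that none is lost. Note that in the standard CPython interpreter threads do not execute Python code truly in parallel, so this shows the programming model rather than a speedup.

```python
import threading


def shared_counter_demo(num_threads=4, increments=10_000):
    """Have several threads update one shared variable; return its final value."""
    total = 0                      # the shared variable, visible to every thread
    lock = threading.Lock()        # protects the read-modify-write on total

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:             # only one thread updates total at a time
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(shared_counter_demo())
```

No data is sent explicitly: the threads cooperate only by reading and writing the same memory location, which is what makes converting a serial program comparatively easy.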
Message passing multiprocessors
  • requires explicit sending and receiving operations for communication between processors
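For contrast, the sketch below expresses a parallel sum in message-passing style. It is only an illustration under stated assumptions: the function name `message_passing_sum` is made up, and `queue.Queue` objects between threads stand in for the send and receive operations of a real message-passing machine, where each processor would have its own private memory. The workers touch no shared variables. Each one receives its data in a message and sends its result back in another.

```python
import queue
import threading


def message_passing_sum(values, num_workers=2):
    """Sum values by sending chunks to workers and receiving partial sums."""
    inboxes = [queue.Queue() for _ in range(num_workers)]  # one channel per worker
    results = queue.Queue()                                # channel back to the coordinator

    def worker(inbox):
        chunk = inbox.get()            # explicit receive: wait for the data
        results.put(sum(chunk))        # explicit send: return the partial sum

    threads = [threading.Thread(target=worker, args=(inbox,)) for inbox in inboxes]
    for t in threads:
        t.start()
    for i, inbox in enumerate(inboxes):
        inbox.put(values[i::num_workers])                  # explicit send of each chunk
    total = sum(results.get() for _ in range(num_workers)) # explicit receives
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(message_passing_sum(list(range(101)), num_workers=4))
```

Every transfer of data is visible in the program text as a `put` or `get`, which is the main difference from the shared-memory style.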
