The PACE Runtime System
The purpose of the PACE Runtime System (RTS) is to measure the performance of
program executions with three aims:
- to help identify important program regions worthy of intensive optimization,
- to provide data to support feedback-directed optimization, and
- to provide a harness that supports measurement-driven online parameter selection.
With each generation, microprocessor-based computer systems have become increasingly
sophisticated with the aim of delivering higher performance. With this sophistication comes
complexity. Today, nodes in microprocessor-based systems are typically equipped
with one or more multicore microprocessors.
Individual processor cores support additional levels of parallelism
typically including pipelined execution of multiple instructions, short vector operations,
and simultaneous multithreading. In addition, microprocessors rely on deep
multi-level memory hierarchies to reduce latency and improve data
bandwidth to processor cores.
As the complexity of microprocessor-based systems has increased, it has become
harder for applications to achieve a significant fraction of peak performance. Attaining high performance requires
careful management of resources at all levels. To date, the rapidly increasing
complexity of microprocessor-based systems has outstripped the capability of
compilers to map applications onto them effectively. In addition, the memory subsystems in microprocessor-based
systems are ill suited to data-intensive computations that voraciously consume
data without significant spatial or temporal locality. Achieving high performance with
data-intensive applications on microprocessor-based systems is particularly
difficult and often requires careful tailoring of an application to reduce the
impedance mismatch between the application's needs and the target platform's
capabilities.
To help compilers improve their ability to map applications onto modern microprocessor-based
systems, the PACE RTS will collect detailed performance measurements of program
executions to determine both where optimization is needed and which problems are
the most important targets for optimization. With detailed insight into an application's performance
shortcomings, the PACE compiler will be better equipped to select and employ
optimizations that address them.
The RTS will include a harness to support online feedback-directed optimization. During compilation, the Platform-Aware Optimizer (PAO) may determine that certain parameters might benefit from runtime tuning. For instance, the best parameter settings for a tiled loop nest may depend upon the cache footprints of other threads running concurrently. To leverage RTS support for online tuning, the PAO will present the RTS with a closure that contains a tuple of initial parameter values (e.g., extents for each dimension of a data tile), a specification of the bounds of the parameter space, a generator function that will explore the parameter space and suggest new parameter tuples, and a parameterized version of the user’s function that will be invoked with the current tuple of parameters. During execution, the RTS will use the provided closure to adjust parameter values to select a configuration that delivers the best performance. Information about the results of online tuning will be provided to PACE's machine learning tools for future use.
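The four parts of the closure described above (initial parameter tuple, parameter-space bounds, generator function, and parameterized user function) can be illustrated with a small sketch. This is not the PACE interface: the names TuningClosure, doubling_generator, wall_clock, and tune are hypothetical, the doubling search strategy is an assumption chosen for brevity, and a real harness would interleave measurement with normal execution rather than run a separate search loop.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator, Sequence, Tuple

Params = Tuple[int, ...]
Bounds = Sequence[Tuple[int, int]]  # inclusive (low, high) per parameter


@dataclass
class TuningClosure:
    """Hypothetical stand-in for the closure the PAO presents to the RTS."""
    initial: Params                                          # initial parameter tuple
    bounds: Bounds                                           # bounds of the parameter space
    generator: Callable[[Params, Bounds], Iterator[Params]]  # suggests new tuples
    kernel: Callable[..., None]                              # parameterized user function


def doubling_generator(initial: Params, bounds: Bounds) -> Iterator[Params]:
    """Assumed strategy: double one parameter at a time up to its upper bound."""
    for i, (_, high) in enumerate(bounds):
        value = initial[i] * 2
        while value <= high:
            yield initial[:i] + (value,) + initial[i + 1:]
            value *= 2


def wall_clock(kernel: Callable[..., None], params: Params) -> float:
    """Default cost: elapsed seconds for one invocation of the kernel."""
    start = time.perf_counter()
    kernel(*params)
    return time.perf_counter() - start


def tune(closure: TuningClosure, measure=wall_clock):
    """Measure the initial tuple and each suggested tuple; keep the cheapest."""
    best_params = closure.initial
    best_cost = measure(closure.kernel, closure.initial)
    for params in closure.generator(closure.initial, closure.bounds):
        # Skip any suggestion that falls outside the declared parameter space.
        if not all(lo <= p <= hi for p, (lo, hi) in zip(params, closure.bounds)):
            continue
        cost = measure(closure.kernel, params)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


def tiled_kernel(tile_i: int, tile_j: int) -> None:
    """Placeholder for a tiled loop nest parameterized by tile extents."""
    pass


def synthetic_cost(kernel, params):
    """Deterministic stand-in for a measurement, cheapest at tile extents (32, 16)."""
    kernel(*params)
    return abs(params[0] - 32) + abs(params[1] - 16)


closure = TuningClosure((8, 8), [(8, 64), (8, 64)], doubling_generator, tiled_kernel)
best, cost = tune(closure, measure=synthetic_cost)
print(best, cost)  # (32, 8) 8
```

The synthetic cost function replaces wall-clock timing only so that the example is reproducible. Because the assumed generator varies one dimension at a time from the initial tuple, it never visits (32, 16); a generator that reacts to earlier measurements could explore the space more effectively.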