The Platform-Aware Compilation Environment

Preliminary Design Document

September 15, 2010

1The Platform-Aware Compilation Environment project (PACE) is funded by the Defense Advanced Projects Research Agency (DARPA) through Air Force Research Laboratory (AFRL) Contract FA8650-09-C-7915 with Rice University. PACE is part of the Architecture-Aware Compilation Environment program (AACE).

The opinions and findings in this document do not necessarily reflect the views of either the United States Government or Rice University.
Credits

The Platform-Aware Compiler Environment (PACE) project is an inter-institutional collaboration.

<table>
<thead>
<tr>
<th>Organization</th>
<th>Location</th>
<th>Principal Contacts</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rice University (lead)</td>
<td>Houston, TX, USA</td>
<td>Keith D. Cooper, PI</td>
</tr>
<tr>
<td></td>
<td></td>
<td>John Mellor-Crummey</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Erzsébet Merényi</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Krishna Palem</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Vivek Sarkar</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Linda Torczon</td>
</tr>
<tr>
<td>ET International</td>
<td>Newark, DE, USA</td>
<td>Rishi Khan</td>
</tr>
<tr>
<td>Ohio State University</td>
<td>Columbus, OH, USA</td>
<td>P. Sadayappan</td>
</tr>
<tr>
<td>Stanford University</td>
<td>Palo Alto, CA, USA</td>
<td>Sanjiva Lele</td>
</tr>
<tr>
<td>Texas Instruments, Inc.</td>
<td>Dallas, TX, USA</td>
<td>Reid Tatge</td>
</tr>
</tbody>
</table>

The PACE team includes a large number of additional colleagues and collaborators:

Rajkishore Barik, Jebe Bevan, Jean-Christophe Beyler, Zoran Budimlić, Michael Burke, Vincent Cave, Lakshmi Chakrapani, Phillipe Charles, Jack Dennis, Sebastien Donadio, Guohua Jin, Timothy Harvey, Thomas Henretty, Justin Hoelwinski, Zhao Jishen, Sam Kaplan, Kirk Kelsey, Rene Pečnik, Louis-Noël Pouchet, Atanas Rountev, Jeffrey Sandoval, Arnold Schwaighofer, Jun Shirako, Ray Simar, Brian West, Yonghong Yan, Anna Youseffi, Jisheng Zhao

1 Rice University
2 ET International
3 Ohio State University
4 Stanford University
5 Texas Instruments, Incorporated

Technical Contacts: Keith D. Cooper 713-348-6013 keith@rice.edu
Linda Torczon 713-348-5177 linda@rice.edu
Vivek Sarkar 713-348-5304 vsarkar@rice.edu

Design Document Master: Michael Burke 713-348-4476 mgb2@rice.edu

Administrative Contacts: Penny Anderson 713-348-5186 anderson@rice.edu
Lena Sifuentes 713-348-6325 lenas@rice.edu
Darnell Price 713-348-5200 darnell@rice.edu

Web Site: http://pace.rice.edu
Contents

1 Overview of the PACE System .................................................. 1
  1.1 Introduction ................................................................. 1
    1.1.1 Motivation ............................................................. 1
    1.1.2 Roadmap for the Design Document ............................... 2
  1.2 Structure of the PACE System ............................................. 2
    1.2.1 Information Flow in the PACE System ............................. 4
    1.2.2 Storing Knowledge in a Distributed Fashion ..................... 7
  1.3 Adaptation in the PACE Compiler ....................................... 8
    1.3.1 Characterization-Driven Optimization ............................ 8
    1.3.2 Offline Feedback-Driven Optimization ........................... 9
    1.3.3 Online Feedback-Driven Optimization ............................ 10
    1.3.4 Machine Learning ................................................. 10

2 Resource Characterization in the PACE Project ......................... 13
  2.1 Introduction ............................................................... 13
    2.1.1 Motivation ............................................................. 13
    2.1.2 Approach ............................................................... 14
  2.2 Functionality ............................................................... 15
    2.2.1 Interfaces ............................................................... 15
    2.2.2 Inputs ................................................................. 15
    2.2.3 Output ................................................................. 16
  2.3 Method ...................................................................... 17
    2.3.1 Producing Characteristic Values ................................. 18
    2.3.2 Reporting Characteristic Values ................................. 21

3 An Overview of the PACE Compiler ......................................... 25
  3.1 Introduction ............................................................... 25
  3.2 Functionality ............................................................... 25
    3.2.1 Input and Output ..................................................... 25
    3.2.2 Interfaces ............................................................... 26
    3.2.3 The Optimization Plan .............................................. 27
  3.3 Components of the PACE Compiler ..................................... 28
  3.4 Paths Through the PACE Compiler ..................................... 30
  3.5 Optimization in the PACE Compiler .................................. 32
  3.6 Software Base for the PACE Compiler ................................ 33
## CONTENTS

4 **PACE Application-Aware Partitioner**  
4.1 Introduction and Motivation .............................................. 35  
4.2 Functionality .............................................................. 35  
  4.2.1 Input ................................................................. 36  
  4.2.2 Output ............................................................... 36  
  4.2.3 Out of Scope ......................................................... 36  
4.3 Method ................................................................. 36  
  4.3.1 Call Tree Processing: Stage 1 .................................. 37  
  4.3.2 Partitioning: Stage 2 .............................................. 37  
  4.3.3 Source Reorganization: Stage 3 ............................... 38  
4.4 Results ................................................................. 39  
  4.4.1 SPEC Benchmarks .................................................. 39  
4.5 Summary ............................................................... 39  
4.6 Command Line Options .................................................. 39  
  4.6.1 Build Options File ................................................ 39  
  4.6.2 Pruning ............................................................. 40  
  4.6.3 Function Limit .................................................... 40  
  4.6.4 Graphs .............................................................. 41  
  4.6.5 Line Limit ........................................................ 41  
  4.6.6 Program Name .................................................... 41  
  4.6.7 Output Directory ................................................ 41  
  4.6.8 Partitioner Type .................................................. 41  
  4.6.9 Array Padding ...................................................... 41  
  4.6.10 RPU Graph ........................................................ 42  
  4.6.11 Verbose .......................................................... 42  
  4.6.12 Profile ............................................................ 42  
4.7 Build Options File Format ............................................... 42  
4.8 Array Padding .......................................................... 42  

5 **PACE Platform-Aware Optimizer Overview**  
5.1 Introduction .............................................................. 45  
5.2 Functionality ............................................................ 45  
  5.2.1 Input ................................................................. 45  
  5.2.2 Output ............................................................... 45  
5.3 Method ................................................................. 47  
  5.3.1 Front end ........................................................... 48  
  5.3.2 Program Analyses ................................................ 48  
  5.3.3 Legality Analysis ................................................ 48  
  5.3.4 Cost Analysis: Memory Hierarchy ............................ 49  
  5.3.5 Cost Analysis: PAO-TAO Query Interface ................. 50  
  5.3.6 Transcription ...................................................... 51  
  5.3.7 The Optimization Plan ......................................... 51  
  5.3.8 PAO Parameters for Runtime System ....................... 51  
  5.3.9 Guidance from Runtime System ............................ 52  

6 **The Polyhedral Framework**  
6.1 Introduction .............................................................. 53  
  6.1.1 Motivation ......................................................... 53  
  6.1.2 Background ........................................................ 54  
6.2 Functionality .......................................................... 54
## 10 The PACE Runtime System

10.1 Introduction .......................................................... 93
   10.1.1 Motivation ....................................................... 93
10.2 Functionality ........................................................ 94
   10.2.1 Interfaces ....................................................... 95
   10.2.2 Input ........................................................... 95
   10.2.3 Output .......................................................... 95
10.3 Methods .............................................................. 96
   10.3.1 Measurement .................................................... 96
   10.3.2 Profile Analysis ............................................... 98
   10.3.3 Analyzing Measurements to Guide Feedback-directed Optimization .... 99
   10.3.4 Online Feedback-directed Parameter Selection .......................... 99
10.4 Results ............................................................... 100

## 11 Machine Learning in PACE

11.1 Introduction - Machine Learning for Compiler Optimization .................... 101
   11.1.1 Motivation ....................................................... 101
   11.1.2 Prior Work ...................................................... 102
11.2 Functionality ........................................................ 103
   11.2.1 What Machine Learning Will Accomplish ................................ 103
   11.2.2 Optimization Tasks Identified for Machine Learning .................. 104
11.3 Methodology ........................................................ 110
   11.3.1 Abstraction of PACE Problems For Machine Learning .................. 110
   11.3.2 Challenges From a Machine Learning Point Of View .................... 112
   11.3.3 Candidate Machine Learning Approaches ................................ 114
   11.3.4 Productivity metric for Machine Learning ............................. 116
   11.3.5 Infrastructure .................................................. 117
11.4 Conclusions ......................................................... 118

## A Automatic Vectorization in the PACE Compiler

A.1 Overview .............................................................. 119
A.2 Functionality ........................................................ 120
   A.2.1 Input ............................................................ 121
   A.2.2 Output .......................................................... 123
A.3 Method ............................................................... 124
   A.3.1 Dynamic Programming .......................................... 124
## Acronyms Used in This Document

<table>
<thead>
<tr>
<th>Acronym</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>AACE</td>
<td>The DARPA Architecture-Aware Compilation Environment Program, which funds the PACE Project</td>
</tr>
<tr>
<td>PACE</td>
<td>The Platform-Aware Compilation Environment Project, one of four efforts that form AACE; this document describes the design of the PACE environment.</td>
</tr>
<tr>
<td>AAP</td>
<td>The Application-Aware Partitioner, a component of the PACE compiler</td>
</tr>
<tr>
<td>API</td>
<td>Application Programming Interface</td>
</tr>
<tr>
<td>AST</td>
<td>Abstract Syntax Tree</td>
</tr>
<tr>
<td>CFG</td>
<td>Control-Flow Graph</td>
</tr>
<tr>
<td>DARPA</td>
<td>Defense Advanced Research Projects Agency</td>
</tr>
<tr>
<td>gcc</td>
<td>Gnu Compiler Collection, a widely-used open-source compiler infrastructure</td>
</tr>
<tr>
<td>HIR</td>
<td>High-Level Intermediate Representation</td>
</tr>
<tr>
<td>ILP</td>
<td>Instruction-Level Parallelism</td>
</tr>
<tr>
<td>IR</td>
<td>Intermediate Representation</td>
</tr>
<tr>
<td>ISA</td>
<td>Instruction-Set Architecture</td>
</tr>
<tr>
<td>LLVM</td>
<td>An open-source compiler that is used as the code base for the PACE TAO</td>
</tr>
<tr>
<td>LLVM IR</td>
<td>The low-level, SSA-based IR used in LLVM and the TAO</td>
</tr>
<tr>
<td>ML</td>
<td>The PACE Machine Learning subproject and tools</td>
</tr>
<tr>
<td>MPI</td>
<td>A standard API for programming distributed-memory parallel machines</td>
</tr>
<tr>
<td>OpenMP</td>
<td>A standard API for programming shared-memory parallel computers</td>
</tr>
<tr>
<td>PAO</td>
<td>The Platform-Aware Optimizer, a component of the PACE compiler</td>
</tr>
<tr>
<td>PAO→TAO</td>
<td>The translator from the SAGE III IR to the LLVM IR, a component of the PACE compiler; also called the Rose-to-LLVM translator</td>
</tr>
<tr>
<td>POSIX</td>
<td>An international standard API for operating system functionality</td>
</tr>
<tr>
<td>RC</td>
<td>The PACE Resource Characterization subproject and tools</td>
</tr>
<tr>
<td>RCFG</td>
<td>Region Control-Flow Graph</td>
</tr>
<tr>
<td>RISC</td>
<td>Reduced Instruction-Set Computer</td>
</tr>
<tr>
<td>RST</td>
<td>Region Structure Tree</td>
</tr>
<tr>
<td>RTS</td>
<td>The PACE Runtime System subproject and tools</td>
</tr>
<tr>
<td>RPU</td>
<td>Refactored Program Unit, produced by the PACE AAP</td>
</tr>
<tr>
<td>SAGE III IR</td>
<td>The IR used in Rose, an open source compiler that is the code base for the PACE PAO</td>
</tr>
<tr>
<td>SCoP</td>
<td>Static Control Part, a single loop nest that is amenable to polyhedral transformations</td>
</tr>
<tr>
<td>SSA</td>
<td>Static Single-Assignment form</td>
</tr>
<tr>
<td>TAO</td>
<td>The PACE Target-Aware Optimizer, a component of the PACE compiler</td>
</tr>
<tr>
<td>TLB</td>
<td>Translation Lookaside Buffer, a structure in the memory hierarchy that caches information on virtual to physical page mapping</td>
</tr>
</tbody>
</table>
Chapter 1

Overview of the PACE System

The Platform-Aware Compilation Environment (PACE) is an ambitious attempt to construct a portable compiler that produces code capable of achieving high levels of performance on new architectures. The key strategies in PACE are the design and development of an optimizer and runtime system that are parameterized by system characteristics, the automatic measurement of those characteristics, the extensive use of measured performance data to help drive optimization, and the use of machine learning to improve the long-term effectiveness of the compiler and runtime system.

1.1 Introduction

The Platform-Aware Compilation Environment (PACE) Project is developing tools and techniques to automate the process of retargeting an optimizing compiler to a new system. The basic approach is to recast code optimization so that both the individual optimizations and the overall optimization strategy are parameterized by target system characteristics, to automate the measurement of those characteristics, and to provide both immediate runtime support and longer term intelligent support (through machine learning) for the parameter-driven optimization. PACE is part of a larger effort, the DARPA-sponsored Architecture-Aware Compiler Environment (AAACE) program.¹

The PACE environment approaches performance portability in a new way. Rather than focusing on the details of generating machine code for a new system, PACE relies on a pre-existing native C compiler for the target system and focuses on generating customized C source code for that compiler. In essence, PACE uses the native compiler as its code generator. As with all good compilers, PACE tries to transform the code that it feeds the code generator into a shape from which the code generator can produce efficient code.

1.1.1 Motivation

Over the last twenty years, the average time to develop a high-quality compiler for a new system has ranged between three and five years. Given the rapid evolution of modern computer systems, and the correspondingly short lifetimes of those systems, the result is that quality compilers appear for a new system only at the end of its useful lifetime, or later.

Several factors contribute to the lag time between appearance of a new computer system and the availability of high-quality compilation support for it. The compiler may need to deal with new features in the target system’s instruction set architecture (ISA). Existing optimizations must

¹ The PACE project is funded by the Defense Advanced Projects Research Agency (DARPA) through Air Force Research Laboratory (AFRL) Contract FA8650-09-C-7915 with Rice University. The opinions and findings in this document do not necessarily reflect the views of either the United States Government or Rice University.
be retargeted to the new system;\(^2\) those optimizations may not expose the right set of parameters to simplify retargeting. Finally, the new system may present system-level features that are not well addressed by existing optimizations, such as the DMA interfaces on the IBM CELL processor. In such cases, the retargeting effort may require invention and implementation of new transformations to address system-specific innovations.

The \textsc{P}ACE system attacks the first two problems.

- \textsc{P}ACE will rely on a native C compiler for code generation—that is, to emit the appropriate assembly language code. Native compilers will vary in the quality of the code that they produce. For a native compiler that optimizes code well, \textsc{P}ACE will leverage that investment and rely on the native compiler for the transformations that it does well. For a less capable native compiler, \textsc{P}ACE will include a suite of optimizations that can produce low-level C code that compiles directly into reasonably efficient assembly code.

- \textsc{P}ACE will include a suite of transformations that are parameterized by target-system characteristics, both hardware and software. These transformations will use specific, measured characteristics to model the target system and will reshape the code accordingly. These transformations will be retargeted by changing the values of the system characteristics that they use as parameters. The behavior of the compiler will change with the values of the system characteristics.

\textsc{P}ACE does not address the final problem, inventing new optimizations for radical new features. It will, however, free the compiler writer to focus on new transformations to address new architectural features.

Thus, \textsc{P}ACE transforms the problem of tuning the optimizer for a new system into the problem of deriving values for key system characteristics. \textsc{P}ACE includes a set of portable tools that measure those characteristics. Thus to retarget the optimizer, an installer runs the characterization tools and installs the compiler.

Finally, because the values of some important characteristics cannot be determined accurately until runtime, \textsc{P}ACE includes a runtime system that can adjust optimization parameters in compiler-generated code. The runtime system makes specific and precise measurements of runtime performance. It is capable of identifying rate-limiting resources by code region. It can report the results of these analyses to either the end user or to the other components in the \textsc{P}ACE system.

1.1.2 Roadmap for the Design Document

This chapter provides a survey of the structure and functionality of the \textsc{P}ACE system, along with discussion of system-wide design decisions. § 1.2 provides a description of the major software components components of the \textsc{P}ACE system, shown in Figure 1.1. The later chapters of this document describe those components in more detail. Table 1.1 shows how the remaining chapters of this design document map into the software components of the \textsc{P}ACE system.

1.2 Structure of the \textsc{P}ACE System

The \textsc{P}ACE project has four major components: the \textsc{P}ACE Compiler, the \textsc{P}ACE Runtime System, the \textsc{P}ACE Resource Characterization tools, and the \textsc{P}ACE Machine Learning tools. Figure 1.1 shows the major components of the \textsc{P}ACE system.

- The \textsc{P}ACE Compiler is a source-to-source optimizing compiler that tailors application code for efficient execution on the target system. It accepts as input parallel programs written in C

\(^2\)Datta et al. showed that variations in target machine architecture necessitate different optimization strategies for stencil computations [31]. Equally important, follow-on analysis showed that code tailored for any machine in their study performed poorly on any other machine [54].
with either MPI or OpenMP calls. It produces, as output, a C program that has been tailored to the system’s measured characteristics.

- **The PACE Runtime System** provides support for program execution. It measures application performance and reports those results to both the user and other PACE tools. It works in concert with the PACE Compiler to provide runtime tuning of specific optimization parameters, such as tile sizes for blocking.

- **The PACE Resource Characterization Tools** measure the performance-sensitive characteristics of the target system that are of interest to the PACE Compiler and the PACE Runtime System. The tools measure the resources available to a C program, which may differ from the documented limits of the underlying hardware.

- **The PACE Machine Learning Tools** perform offline analysis of application performance, using data from the runtime system, and of compiler behavior. The tools develop recommendations for the compiler and the runtime system. The tools may also play a role in analyzing the impact of sharing on available resources.
To configure an installation of the PACE system on a new computer, the system manager installs the software, produces a configuration file, and invokes the characterization tools. The characterization tools produce the data used by the other components in PACE to tailor the system and its behavior to the target system.

The configuration file contains base information about the target system and its software. The content of the configuration file is the subject of ongoing negotiations between the AACE Task 1 and Task 2 teams. The format of the configuration file is specified by the Task 2 teams. All AACE teams will use the same configuration file format.

1.2.1 Information Flow in the PACE System

Figure 1.2 expands on Figure 1.1 to show the tools that make up the major PACE components and to show the interfaces between the various tools. Thick black lines represent the flow of code. Thin lines represent the flow of information other than code; they are colored to match the tool that generates the information. (Double-ended lines are colored to match one of the two tools they join.) The chapters that describe the individual components (see Table 1.1) provide detail on how each of these interfaces work.

The Compiler

To compile an application using the PACE tools, the programmer must create a directory that contains the source code for the application and any libraries that are to be optimized with it. PACE will create, within the application directory, a working directory to hold its work products (e.g., refactored versions of the source code, annotations, performance results, and records of prior com-
1.2. STRUCTURE OF THE PACE SYSTEM

This working directory becomes part of the PACE system’s distributed repository. To produce executable code, the programmer invokes the compiler driver.

To compile code, the programmer invokes the compiler driver in the application directory. The compiler driver then sequences the individual components of the PACE Compiler to optimize the application and to produce executable code for it (see § 3.3). Depending on the application and its optimization plan (see § 1.2.2), the compiler driver may use the Application-Aware Partitioner (AAP), the Platform-Aware Optimizer (PAO), the PAO→TAO IR Translator, the Target-Aware Optimizer (TAO), and the native compiler to create one of three distinct compilation paths.

- The compiler driver may follow the full compilation path, using all of the PACE tools to optimize the application and generate transformed C source code, which it then compiles with the native C compiler.
- If the native compiler has strong optimization capabilities, the compiler driver may follow a short compilation path, in which it relies on the native compiler to perform some of the optimization. This path uses a subset of the PACE Compiler components.
- If target system is one for which the PACE Compiler provides backend support, the compiler driver may use PACE Compiler components to optimize the code and to generate native code for the application.

In each of these scenarios, the compiler driver also invokes the linker to create the actual executable code. During compilation, the PAO may invoke the TAO to obtain low-level, detailed information about the expected performance of alternate code sequences (see § 9.3.4).

The Runtime System

The Runtime System (RTS) provides performance monitoring and runtime parameter tuning. The PACE Compiler prepares an executable for the RTS by including the runtime hooks necessary to initialize the RTS, and by constructing a measurement script that sets environment variables and flags that control and direct the measurement system. The user invokes the executable through a measurement script.

When invoked, the RTS will interpose itself between the application and the operating system to intercept events such as program launch and termination, thread creation and destruction, signal handler setup, signal delivery, loading and unloading of dynamic libraries, and MPI initialization and finalization. It then launches the application, monitors its behavior using a variety of mechanisms (see § 10.3.1), and records the results.

The runtime system also provides an interface for runtime selection of optimization parameters. The compiler rewrites the code region into an optimized, parameterized form and builds the various data structures and support routines needed by the RTS harness for online feedback-directed optimization (see § 10.3.4).

Managing Annotations

One of the system-wide challenges in PACE is providing coordination and communication among the distinct PACE tools. The RTS measures performance information. For that information to be of use to the rest of PACE, it must relate to a specific compiled executable and a specific run of that executable. Rather than build and maintain a central repository, PACE

---

3Since the PACE Target-Aware Optimizer is built on top of the open-source LLVM system, this option exists on systems that have a native LLVM backend. LLVM already supports several backends. Expanding that set is not in the scope of the PACE project. We will explore the tradeoff between direct generation of native code and using the LLVM bytecode to JIT compilation path.

4It is possible to invoke a PACE-compiled executable without invoking the RTS. The preferred mechanism to achieve that goal is to invoke it through the measurement script, with the appropriate parameter settings to disable runtime performance monitoring.
uses a distributed repository that stores such information with the application’s source code. To avoid the need for large, complex maps that relate annotations to specific compilations and executions, PACE embeds duplicate copies of needed information directly into the executable code as static data. For example, the compiler injects into each object file a concise representation of the optimization plan that produced it. The RTS, in turn, collects those records, adds them to the performance results, annotates the package with information about the target machine on which the run occurred, and writes this information into the annotations subdirectory of the application’s working directory.

To ensure that the other PACE tools find all the relevant performance history records, the RTS must register the fact that it has run an application with both the PACE Compiler and the PACE Machine Learning tools. Each of those systems provides a callback to the RTS that records the name of the executable, the time of the run, and the location of the performance history information.

The Characterization Tools

The PACE resource characterization (RC) tools are a standalone package designed to measure the performance characteristics of a new system that are important to the rest of the PACE system, and to provide a simple consistent interface to that information for the other PACE tools. The RC tools are written in a portable style in the C programming language; they rely on entry points from the standard C libraries and the POSIX operating system interface. The specific characteristics that PACE will measure in Phase 1 of the AACE program are described in § 2.2.3.

Measured versus Absolute Numbers In many cases, the RC tools capture an effective number for the parameter, rather than the actual number provided by the underlying hardware. The effective quantity is, in general, defined as the amount of that resource available to a C program. For example, the effective number of floating-point registers available to a C program depends, in part, on the number of such registers provided by the underlying hardware. However, the compiler is almost always the limiting factor in this case. If the hardware provides 16 floating-point registers, but the compiler cannot allocate more than 14 of them to variables in the application code, then the effective number should be 14 registers.\(^5\)

In some cases, a hardware characteristic may not be discernible from a C program. In those cases, the PACE Compiler cannot rely upon that characteristic in optimization, since the C code cannot control the behavior. Associativity in the memory hierarchy is a good example of this problem. If the L2 cache on a processor is physically mapped, the mapping between a source-level data structure, such as a large array, and its cache locations depends on the mapping of virtual memory pages to physical page frames, and the tools cannot measure the cache associativity with any certainty.

Methodology In general, the PACE RC tools provide one or more microbenchmarks to measure a given characteristic. A microbenchmark is a small code kernel designed to provoke a specific response, coupled with code to analyze the kernel’s behavior. Typically, that response is a change in the time that the kernel requires to perform a fixed number of operations. Automatic analysis of the kernel’s behavior can be complex; effects that a human can read easily from a graph can be difficult to isolate numerically.

The RC tools produce information that can be accessed through two distinct interfaces: one designed for the grading tools built by the AACE Task 2 teams and the other designed for internal

\(^5\)For example, some register allocators will reserve a small number of registers to use for values that are used too infrequently to merit their own register. If the compiler reserves two registers for that purpose, it reduces the effective number available to the application. The PACE Compiler, which will control the number of simultaneously live floating-point values, should not plan on using those last two registers. Similarly, sharing in the cache hierarchy (between instructions and data or between cores) can lead to effective cache sizes that are significantly smaller than the hardware cache size.
use in the PACE system. The grading interface is a flat ASCII file in an XML schema designed by the Task 2 teams. The internal interface is a procedural interface that PACE tools can call to obtain individual values.

The Machine Learning Tools

The PACE Machine Learning (ML) tools will augment specific decision making processes within the PACE system, through analysis of past experience and behavior. Over time, the ML tools will improve the behavior of the other PACE tools. A modern compilation environment, such as PACE, can produce reams of data about the application itself, the process used to compile it, and its behavior at runtime. Unfortunately, the application’s runtime performance can depend in subtle ways on an unknown subset of that information, and neither humans nor algorithmic programs are particularly good at discerning those relationships.

The ML tools are tied closely to specific components in the PACE system, where they provide additional input, in the form of directives, refined input parameters, or changes to optimization plans (see Figure 11.1 on page 103). The ML tools draw their inputs from the other PACE tools, as shown in Figure 1.2. The tools query the resource-characterization interface directly; they find the input data from the compiler and the runtime system where it is stored with the application.6 The ML tools will have their own private repository where they can store their context, data, and results.

To facilitate offline learning, the PACE system needs a mechanism that invokes the offline portions of the ML tools on a regular basis. PACE will use the POSIX crontab facility to schedule regular invocations of the offline PACE ML tools. Problems that are solved online will invoke the appropriate ML tools directly.

1.2.2 Storing Knowledge in a Distributed Fashion

Early diagrams of the PACE system included a centralized knowledge base. While those diagrams are sound in concept, many minor problems arise in the implementation and use of the tools if the knowledge base is a central, shared structure. Thus, the implementation of the PACE system will store its knowledge about an application with the application’s source code. As a result, information and annotations generated by the PACE tools are stored in multiple locations. These locations form a distributed repository, rather than a centralized repository.

Consider, for example, the collection of information that governs how the PACE Compiler optimizes and translates an application. In a traditional compiler, that control information is encoded in a series of command-line flags to the compiler. While such flags are useful, their very form limits their ability to express complex control information. In the PACE system, each application has an associated optimization plan that specifies how the compiler should optimize the code. The optimization plan is a persistent document that specifies both the compilation path and the optimizations that the compiler should use. It may also include parameters to individual optimizations, suggested application orders for those optimizations, or commands to control the individual components.

Since each of the compiler components consults the optimization plan, the various components can modify each other’s behavior by making changes to the optimization plan. This simple mechanism facilitates feedback-driven adaptive compilation, by allowing an adaptive controller to explore and evaluate the space of possible optimization strategies over multiple compile-execute cycles. It also allows one phase of compilation to change the behavior of another; for example, if the TAO discovers (late in translation) that some inline substitution in the AAP has made some procedure too long for the native compiler to handle, it can modify the parameters that govern the

6Each execution of a PACE-compiled executable will use a call-back mechanism to register the location of its compilation and performance data. The call-back mechanism will record that information in the ML tools’ repository.
### Table 1.2: Kinds of Adaptation in the PACE Compiler

<table>
<thead>
<tr>
<th>Kind of Adaptation</th>
<th>Characteristic Driven</th>
<th>Offline Feedback-Driven</th>
<th>Online Feedback-Driven</th>
<th>Machine Learning</th>
</tr>
</thead>
<tbody>
<tr>
<td>Time Frame</td>
<td>Install time</td>
<td>Across compiles</td>
<td>Runtime</td>
<td>Across compiles</td>
</tr>
<tr>
<td>Affects</td>
<td>All applications</td>
<td>One application</td>
<td>One application</td>
<td>All applications</td>
</tr>
<tr>
<td>Adapts to</td>
<td>System</td>
<td>System Application</td>
<td>System Application Data</td>
<td>System Application</td>
</tr>
<tr>
<td>Initiated by</td>
<td>RC tools</td>
<td>various</td>
<td>PAO</td>
<td>ML tools</td>
</tr>
<tr>
<td>Changes Behavior of</td>
<td>AAP, PAO, TAO</td>
<td>AAP, PAO, TAO</td>
<td>RTS</td>
<td>AAP, PAO, TAO</td>
</tr>
<tr>
<td>Persistence</td>
<td>Until next run of RC tools</td>
<td>Short-term</td>
<td>Records results for ML and PAO</td>
<td>Long-term</td>
</tr>
</tbody>
</table>

Adaptation is a key strategy embodied in the PACE compiler. All of the compiler components can change their behavior in response to either internal or external feedback. Adaptation in the PACE compiler falls into two categories: short-term adaptation that tailors the behavior of one executable and long-term learning that changes the behavior of the compiler. We will implement four different mechanisms to achieve adaptation: (1) characterization-driven adaptation, (2) offline feedback-driven adaptation, (3) online feedback-driven optimization, and (4) long-term machine learning. The mechanisms are summarized in Table 1.2 and described in the following sections.

In combination, these four mechanisms provide the compiler with the ability to adapt its behavior to the target system, the application, and the runtime situation. These mechanisms will allow the PACE system to be flexible in its pursuit of runtime performance. We anticipate that interactions between these mechanisms will produce complex optimization behavior.

#### 1.3.1 Characterization-Driven Optimization

The concept of characterization-driven optimization forms the core of both the AACE program and the PACE project. AACE compiler environments will include tools that measure performance.

---

7 A centralized knowledge base can create the situation where the user either cannot run an executable unless it has network access to the knowledge base or the user loses all feedback information from such runs. Neither is a good scenario.
1.3. ADAPTATION IN THE PACE COMPILER

critical characteristics of the target system and transformations that use those measured characteristics as an integral part of the optimization process. In the PACE compiler, for example, the non-polyhedral loop optimizations will use the measured parameters of the memory hierarchy to choose tile sizes, while the tool that regenerates C source code will tailor the number of concurrently live values to the number of such values that the target system’s compiler can maintain in registers. We will also investigate new transformations suggested by the characterization work.\(^8\)

Characterization-driven adaptation is a simple form of long-term learning. It relies on algorithmic adaptation to pre-determined parameters. The compiler writers identify parameters that the RC tools should measure. They implement the transformations that use the results from the RC tools. This process automatically adapts the transformation to the target system; it does not take into account any properties of the application or its data set.

Characterization-driven optimization makes its adaptation at installation time, when the RC tools run. The adaptation can be repeated by running the RC tools to generate a new target-system characterization. The results of this adaptation are persistent; they last until the RC tools are re-run.

1.3.2 Offline Feedback-Driven Optimization

The second strategy for adaptation in the PACE compiler is the use of offline feedback-driven optimization. This strategy produces a short-term adaptation. The actual mechanism for implementing feedback-directed optimization in PACE is simple. The AAP, PAO, and TAO each consult the application’s optimization plan before they transform the code (see §3.2.3). Changes to the optimization plan change the behavior of these components. This design simplifies the implementation and operation of an adaptive compiler. It does not, however, provide a clear picture of how PACE will perform offline, feedback-driven adaptation.

In principle, any component in the PACE system can change the optimization plan for the current compilation of an application. In practice, we will explore three strategies for controlling offline feedback-driven adaptation.

- The compiler driver may use an external adaptive controller to change the optimization plan across multiple compile-execute cycles. We anticipate that this mechanism would modify gross properties of optimization, such as the specific transformations applied and their relative order or the compilation path (full, short, or LLVM backend).

- Any phase of the compiler may contain an optimization pass that performs self-adaptation. For example, the non-polyhedral loop optimizations in the PAO might consider several transformation strategies; to choose among them, it can generate each alternative version of the loop nest and invoke the PAO–TAO query mechanism to have the TAO estimate some aspects of performance. In a similar way, the TAO might consider multiple strategies for algebraic reassociation and choose between them based on an estimate of execution efficiency from the instruction scheduler.

- One phase of the compiler may change the optimization plan for another phase, based on the code that it generates. We envision this facility as serving two different needs. It allows one phase to disable transformations that might reduce the impact of a transformation that it has applied. For example, the PAO might disable loop unrolling in the TAO to prevent the TAO from de-optimizing a carefully tiled loop nest. This adaptation occurs within a single compilation.

Alternatively, one phase might provide feedback to another phase in the next compilation. For example, if the TAO discovers that the code needs many more registers than the target

\(^8\)The polyhedral optimizations generate code that is parameterized by tile sizes; the mechanism that selects values for those parameters will use the results generated by the RC tools.
system (hardware + compiler) can supply, it might change the AAP’s optimization plan to forbid inline substitution in the region. Similarly, it might tell the PAO to reduce its unroll factors.

While these offline feedback-driven adaptations can produce complex behavior and subtle adaptations, their primary impact is short term; they affect the current compilation (or, perhaps, the next one). They do not build predictive models for later use, so they are not learning techniques.\(^9\)

### 1.3.3 Online Feedback-Driven Optimization

The third strategy for adaptation in the PACE system is the use of online feedback-driven optimization. Because the performance of optimized code can depend on the runtime state of the system on which it executes, even well planned and executed transformations may not produce the desired performance. Issues such as resource sharing with other cores and other processors and interference from the runtime behavior of other applications can degrade actual performance.

To cope with such dynamic effects, PACE includes a mechanism that lets the compiler set up a region of code for runtime tuning. The PAO establishes runtime parameters to control the aspects of the code that it wants the runtime to adjust. It generates a version of the code for that region that uses these control parameters to govern the code’s behavior. Finally, it creates a package of information that the RTS needs to perform the runtime tuning (see \(\S\) 10.3.4). The RTS uses that information to find, at runtime, settings for the control parameters that produce good performance. The result should be an execution that tunes itself to the actual runtime conditions.

As an example, consider blocking loops to improve locality in the memory hierarchy. The compiler could assume that it completely understood memory behavior and use fixed tile sizes. Alternatively, it could recognize that interference from other threads and other applications can impact optimal tile size, and thus it could generate code that read tile dimensions from a designated place in memory. In this latter scheme, the runtime system could use performance counter information, such as the L2 cache miss rate, to judge performance and vary the tile size accordingly.

The PACE RTS both defines and implements an API for online, feedback-driven optimization (see \(\S\) 10.3.4). The API lets the compiler register tunable parameters and suggested initial values, and provides a runtime search routine (an adaptive controller) that the RTS can use to vary those parameters. The RTS will collect the data needed by the runtime search routine and ensure that it is invoked periodically to reconsider the parameter values.

Online feedback-directed optimization produces a short-term adaptation of the application’s behavior to the runtime situation—the dynamic state of the system and the input data set. The technique, by itself, does not lead to any long-term change in the behavior of either the PACE system or the application. However, the RTS will record the final parameter values along with its record of the the application’s performance history. Other components in PACE may use these final parameter values as inputs to long-term learning.

### 1.3.4 Machine Learning

The fourth strategy for adaptation in the PACE system is to apply machine learning techniques to discover relationships among target system characteristics, application characteristics, compiler optimization plans, and variations in the runtime environment. Machine learning is, by definition, a long-term strategy for adaptation. The PACE ML tools will derive models that predict appropriate optimization decisions and parameters. We have identified several specific problems to attack with ML techniques (see \(\S\) 11.2.2).

\(^9\)In the ACME system, we coupled this kind of adaptation with a persistent memoization capability and randomized restart. The result was a longer-term search incrementalized across multiple compilation steps [25].
1.3. ADAPTATION IN THE PACE COMPILER

A central activity in the design of a machine-learning framework for each of these problems is the design of a feature vector for the problem—the set of facts that are input to the learned model. The PACE system provides an information rich environment in which to perform learning; the ML tools have the opportunity to draw features from any other part of the environment—the RC tools, the compiler tools, and the RTS tools. The determination of what features are necessary to build good predictive models for various compiler optimizations is an open question and a significant research issue in PACE. We anticipate that, in the project’s early stages, we will need to construct tightly controlled experiments to build the input data. As the project progresses, we will study how to distill the necessary data from performance records of actual applications compiled with the PACE system.

The goal of learning research in the PACE project is to create a process that will automatically improve the PACE system’s behavior over time. Offline learning tools will examine records of source code properties, optimization plans, and runtime performance to derive data on optimization effectiveness, and to correlate source-code properties with effective strategies. This knowledge will inform later compilations and executions.

The PACE compiler will use ML-derived models directly in its decision processes. As the ML models mature, the compiler will replace some static decision processes and some short-term adaptive strategies with a simpler implementation that relies on predictions from ML-derived models.
Resource Characterization plays a critical role in the PACE Project’s strategy for building an optimizing compiler that adapts itself and tunes itself to new systems. The PACE Compiler and the PACE Runtime System need access to measurements of a variety of performance-related characteristics of the target computing system. The goal of the PACE Resource Characterization subproject is to produce those measured values.

2.1 Introduction

The ability to derive system performance characteristics using portable tools lies at the heart of the AACE Program’s vision and the PACE Project’s strategy for implementing that vision. The Resource Characterization (RC) subproject of PACE is building tools, written in a portable style in the C language, to measure the specific performance characteristics that are of interest to the PACE Compiler and the PACE Runtime System.

2.1.1 Motivation

The PACE Compiler and the PACE Runtime System rely on the values of a number of performance-related system parameters, or characteristics, to guide the optimization of an application program. The RC subproject is developing tools that produce those specific values in reliable, portable ways.

The design of the PACE Compiler and Runtime System both limits and focuses the RC subproject. The PACE compiler is a source-to-source optimizing compiler. While the system will be capable of generating native code for a limited set of target processors, the primary mode of operation for the compiler is to generate the transformed program as a C program. This strategy, dictated by the AACE program goals, provides portability. It also prevents the PACE compiler from applying some optimizations, such as instruction scheduling. These limitations, in turn, provide a sharp focus for the RC subproject; the RC tools should not measure characteristics that the other PACE tools cannot use.

With this restriction, the RC project still has many characteristics that it must measure. (See Table 2.2 for a full list of the characteristics measured in Phase 1 of the AACE Program). As an example, consider the information needs of the PAO’s non-polyhedral loop transformations. The transformations need to know the geometry of the cache hierarchy—that is, for each level of the hierarchy, the size, the associativity, and the granularity (line size or page size) of that level. Thus, the RC tools should derive those numbers.

Why not obtain the numbers from reading the manufacturer’s documentation? The AACE pro-
gram depends on a strategy of deriving these characteristics rather than supplying them in some configuration file. Our experience, over the first year of the RC project, suggests that this strategy is actually critical, for several reasons.

1. The compiler needs to understand the characteristics as they can be seen from a C source program. For example, the documentation on a multicore processor may list the level two data cache size as 512 kilobytes.1 The amount of level two cache available to the program, however, will depend on a number of factors, such as the size of the page tables and whether or not they are locked into the level two cache, the number of processors sharing that level two cache, and the sharing relationship between the instruction and data cache hierarchies. In short, the number in the documentation would mislead the compiler into blocking for a larger cache than it can see.

2. The documentation, even from the largest manufacturers, is often incomplete or inaccurate. Documentation on the memory hierarchy focuses on the capacity of the largest level; it rarely describes the delay of a level one cache or TLB miss. Equally problematic, the documents provide inconsistent information; we studied one processor manual that provides multiple conflicting latencies for the integer divide operation, none of which match the numbers that our carefully constructed microbenchmark measures.

3. The characteristics themselves can be composite effects that result from the interaction of multiple factors. For example, the AAP and the PAO want to understand the rough cost of a function call for use in the decision algorithms that guide both inlining and outlining. The cost of a function call depends, however, on specific details of the target system’s calling convention, the manner in which the native compiler generates code for the call, the number of parameters and their source-language types, and the presence or absence of optimizations for recursive calls and leaf-procedure calls in the native compiler. The amalgamation of all these factors makes it difficult, if not impossible, to derive reasonably accurate numbers from reading the manufacturer’s manuals.

In addition, the PACE system is intended to adapt itself to both current and future architectures. From this perspective, the design of the RC system should minimize its reliance on idiosyncratic knowledge of current systems and current interfaces. The AACE program assumes that future systems will support the POSIX standard interfaces. Thus, the RC tools rely on POSIX for interfaces, such as a runtime clock for timing, and for information about operating system parameters, such as the page size in the virtual memory system.2 They cannot, however, assume the presence of other runtime interfaces to provide the effective numbers for system characteristics that the PACE compiler and runtime system need. Thus, the PACE Project derives numbers for most of the characteristics used by the PACE Compiler and Runtime System.

2.1.2 Approach

To measure the system characteristics needed by the PACE Compiler and Runtime System, the RC project uses a series of microbenchmarks—small programs designed to expose specific characteristics. Each microbenchmark focuses on eliciting a specific characteristic from the system—from the cost of an integer addition through the ability of the compiler to vectorize specific loops. This approach produces a library of microbenchmark codes, along with a harness that installs and runs those codes.

---

1 Many manufacturers provide an interface that exposes model-dependent system parameters, such as the size and structure of the levels in the memory hierarchy. For example, Intel processors support its cpuinfo protocol. Unfortunately, such facilities vary widely in their syntax and the set of characteristics that they support. PACE cannot rely on their presence.

2 Page size and line size are measurements where the effective size and the actual size are, in our experience, identical.
2.2. FUNCTIONALITY

Development of microbenchmark codes has occupied the majority of our time in Phase 1 of the project. The individual microbenchmarks include both a code designed to elicit the effect and a code that analyzes the results and reduces them to one or more characteristic values. Developing the Phase 1 microbenchmarks has been challenging. Designing a code to elicit the desired effect (and only that effect) has required, in every case, multiple iterations of the design-implement-test cycle. In many cases, it has required the invention of new measurement techniques. The analysis of results can be equally challenging. The experimental results that expose a given effect contain noise. They often expose interference from other effects. The data analysis problems are, in some cases, harder than the problem of exposing the effect.

The result of this effort is a library of microbenchmarks that both elicit system behavior and analyze it. Those codes, written in a portable style of C, rely on the POSIX interface to system resources and on a handful of common POSIX tools, such as the make utility. In their current state, they provide the compiler with a sharp picture of the resources available on a new system.

2.2 Functionality

2.2.1 Interfaces

The primary task of the RC tools is to produce data used by the other major components of the PACE system: the PACE Compiler, the PACE Runtime System, and the PACE Machine Learning tools. As shown in Figure 2.1, the RC tools take as their primary input a system configuration file. The tools use the native C compiler, system calls supported in the POSIX standard, and some additional software tools, as specified in Table 2.1.

2.2.2 Inputs

The primary input to the RC tools is the configuration file for the target system. This file, whose format is specified by the AACE Task 2 teams, contains basic information on the target system. The RC tools need for this information, supplied in a human-written configuration file, to include at least:

1. The location, name, and invocation sequence for the native compiler and linker. The RC tools need syntax to invoke the native compiler, link against standard libraries, create an executable image, run that image, and connect disk files to input and output streams. (Under POSIX systems with the bash shell and the standard C libraries, much of this knowledge is standard across systems.)
2. A specific command set to compile and link for vectorization, if it is supported by the native compiler. This command set must be distinct from the default command set.

3. A set of optimization flags to use with the native compiler during resource characterization. These flags are provided by the system installer in the configuration file. These flags must include the options necessary to produce the appropriate behavior from each microbenchmark. In gcc terms, the flag `–O2` appears to be sufficient.

4. Basic information on the target system, as described on page 10 of DARPA BAA-08-30, which solicited proposals for the AACE program. This information includes microprocessors and their components; number of cores; clock rate(s) of the processors; memory architecture on a processor; memory architecture on a node or system; number of chips (memory and processors) per node; interconnection of nodes; and composition of the processing system.

### 2.2.3 Output

The PACE RC tools produce, as output, a set of measured characteristic values. Those values are available in two forms: an XML-format file that is consumed by the AACE Task 2 teams’ grading programs, and an internal format used by the interface that the RC tools provide for the PACE Compiler, Runtime System, and Machine Learning tools. § 2.2.1 provides more detail on these interfaces.

Table 2.2 shows the characteristic values measured by the PACE RC tools submitted for the end-of-phase trials in Phase 1 of the project (the first eighteen months). These characteristics range from broad measurements of target system performance, such as the number of compute-bound threads that the processor can sustain, through microarchitectural detail, such as the latency of an integer divide operation. Each characteristic will be used elsewhere in the PACE system. Table 2.4 gives a partial list of those uses. As development work on the PACE Compiler and Runtime System proceed in Phase 2 of the project, we anticipate adding to the list of characteristics.

Note that the PACE RC tools do not report units of time. Varying timer precision on different systems and the possibility of varying runtime clock speeds make it difficult for the tools to report time accurately. Instead, we report latencies in terms of the ratio between the operation’s latency and that of integer addition.

The lists shown in Table 2.2 do not include all of the characteristics that the PACE project will measure. We have additional microbenchmarks in a variety of stages of development, shown in Table 2.3. These codes will be released and documented as they reach a stable state. Neither table includes measurements of bandwidth. PACE will use the industry-standard STREAM benchmarks as our initial tool to measure bandwidth.

<table>
<thead>
<tr>
<th>Item</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>C compiler</td>
<td>Native compiler, as specified in the configuration file; must be able to produce an assembly-code listing</td>
</tr>
<tr>
<td>MPI library</td>
<td>Standard-conforming MPI library, with location and linker flags specified in configuration file</td>
</tr>
<tr>
<td>OPENMP library</td>
<td>Standard-conforming OPENMP library, with location and linker flags specified in configuration file</td>
</tr>
<tr>
<td>Utilities</td>
<td>Standard Linux commands, including autoconf, automake, awk, grep, make, sed, wc, and the bash shell</td>
</tr>
</tbody>
</table>

*Table 2.1: Software Requirements for the PACE RC Tools*
Table 2.2: PACE Characteristics Submitted to Phase 1 End-of-Phase Trials

While the RC tools should, in principle, measure all those properties of the target system and its software environment that have an impact on optimization in the PACE Compiler and Runtime System, the project’s schedule and the available personnel limit, in practice, the number of characteristics that we can measure.

The Phase 1 characteristics ignore the effect of load on the system. Phase 2 will expand the coverage of measured characteristics in several ways. We will explore the impact of load and sharing on some of the Phase 1 characteristics. We will expand the set of compiler properties that the tools measure. (The compiler characteristics will govern how the AAP, PAO, and TAO express the computation and will allow those tools to tailor the form of the transformed C source code to the strengths and weaknesses of the native compiler.) Finally, we will include new characteristics for which the compiler discovers a need. For example, the vectorization strategy proposed for the PACE compiler will be more robust if RC discovers hardware vector lengths and alignment restrictions.

2.3 Method

The PACE RC subproject is developing two kinds of tools. The first set of tools uses microbenchmarks to elicit specific behaviors from the target system and analyze the results of those tests to determine characteristic values. The second set of tools takes the results of the microbenchmark tests and produces output data for the two interfaces used by consumers of the characteristic values. The measurement and analysis tools must run under the benchmarking harness provided by the AACE Task 2 teams. That process is controlled by a set of makefiles—inputs to the utility program make. The makefiles for the testing harness also invoke the tool that generates the XML
CHAPTER 2. RESOURCE CHARACTERIZATION IN THE PACE PROJECT

<table>
<thead>
<tr>
<th>Category</th>
<th>Name</th>
<th>Units</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory</td>
<td>Access Cost by Stride</td>
<td>Function</td>
<td>Returns next “sweet spot” in stride in range 1 to page size</td>
</tr>
<tr>
<td>Operations</td>
<td>Procedure Call Cost</td>
<td>Cycles</td>
<td>Working to define what the compiler needs</td>
</tr>
<tr>
<td>MPI</td>
<td>Latencies</td>
<td>Cycles</td>
<td>Latency of MPI &amp; pthread various primitives</td>
</tr>
<tr>
<td></td>
<td>Latency structure</td>
<td>unsure</td>
<td>Assumes int32 add takes one cycle</td>
</tr>
<tr>
<td>Cache</td>
<td>I-Cache Capacity</td>
<td>Bytes</td>
<td>Effective capacity</td>
</tr>
<tr>
<td></td>
<td>I-cache—D-cache sharing</td>
<td>Graph</td>
<td>May need more than Posix</td>
</tr>
<tr>
<td></td>
<td>D-cache sharing</td>
<td>Graph</td>
<td>Needs I-cache benchmark &amp; other support</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Currently runs on Linux systems</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Sharing between cores</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Needs thread affinity support</td>
</tr>
<tr>
<td></td>
<td>Array vs. Pointer Access Costs</td>
<td>Ratio</td>
<td>Microbenchmark written</td>
</tr>
<tr>
<td></td>
<td>Access Costs</td>
<td></td>
<td>Needs further testing</td>
</tr>
<tr>
<td></td>
<td>Vector speedup Access Costs</td>
<td>Ratio</td>
<td>Microbenchmark written</td>
</tr>
<tr>
<td></td>
<td>Multiply-add Access Costs</td>
<td>Boolean</td>
<td>Microbenchmark written</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Needs further testing</td>
</tr>
</tbody>
</table>

Table 2.3: PACE Characteristics in Progress at the End of Phase 1

format file for the Task 2 teams’ grading programs. We are preparing publications, either papers or technical reports, that describe the methods used in the individual microbenchmarks. § 2.3.1 describes one microbenchmark as an example: the method used to compute one set of characteristic values—the number of concurrent integer live values that the target processor and native compiler can support. § 2.3.2 describes the interfaces that the other tools use to access the characterization results.

2.3.1 Producing Characteristic Values

Conceptually, each microbenchmark consists of two distinct pieces: a code designed to expose the characteristic and a code that analyzes the resulting data to derive a value for the characteristic. The complexity of these codes varies from characteristic to characteristic.

When the PACE compiler emits transformed C source code for an application, it should ensure that the transformed source code does not need more registers than the combination of native compiler and target processor can supply. To that end, the PACE RC tools include a microbenchmark that measures the number of concurrently live values that the native compiler can accommodate before it starts to spill those values.\(^3\) This microbenchmark, referred to as the live range microbenchmark, uses some of the same concepts that arise in other microbenchmarks, such as the latency benchmarks for arithmetic operations. The analysis for this microbenchmark is unusual in that it analyzes the assembly-language output of the native compiler (rather than executing the

---

\(^3\)In register allocation, to “spill” means to store a value in memory rather than in a register.
### Value Tool Use

<table>
<thead>
<tr>
<th>Value</th>
<th>Tool</th>
<th>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>DCache Capacity</td>
<td>PAO</td>
<td>Tiling memory hierarchy&lt;sup&gt;1&lt;/sup&gt;</td>
</tr>
<tr>
<td>DCache Line Size</td>
<td>PAO</td>
<td>Tiling memory hierarchy&lt;sup&gt;1&lt;/sup&gt;</td>
</tr>
<tr>
<td>DCache Associativity</td>
<td></td>
<td></td>
</tr>
<tr>
<td>TLB Capacity</td>
<td>PAO</td>
<td>Tiling memory hierarchy&lt;sup&gt;1&lt;/sup&gt;</td>
</tr>
<tr>
<td>TLB Page Size</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ICache Capacity</td>
<td>PAO, AAP, TAO</td>
<td>Loop unrolling</td>
</tr>
<tr>
<td></td>
<td>PAO</td>
<td>Inlining &amp; outlining decisions</td>
</tr>
<tr>
<td>Operations in Flight</td>
<td>TAO, PAO, TAO, TAO</td>
<td>Compute critical path length for PAO queries</td>
</tr>
<tr>
<td></td>
<td>TAO</td>
<td>Estimate &amp; adjust ILP</td>
</tr>
<tr>
<td></td>
<td>TAO</td>
<td>Instruction scheduling&lt;sup&gt;2&lt;/sup&gt;</td>
</tr>
<tr>
<td>Operation Latency</td>
<td>TAO, PAO, TAO, TAO, RC, TAO</td>
<td>Algebraic reassociation of expressions</td>
</tr>
<tr>
<td></td>
<td>TAO</td>
<td>Operator strength reduction</td>
</tr>
<tr>
<td></td>
<td>TAO</td>
<td>Compute critical path lengths for PAO queries</td>
</tr>
<tr>
<td></td>
<td>TAO</td>
<td>Compute throughput</td>
</tr>
<tr>
<td>Procedure Call Cost</td>
<td>AAP, PAO, TAO</td>
<td>Inlining &amp; outlining decisions</td>
</tr>
<tr>
<td>Compute-bound Threads Memory-bound Threads</td>
<td>PAO</td>
<td>Adjusting parallelism</td>
</tr>
<tr>
<td>Live Values</td>
<td>TAO</td>
<td>Answering PAO queries</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Tailoring C output to native compiler</td>
</tr>
<tr>
<td>Array vs. Pointer Costs</td>
<td>TAO</td>
<td>Tailoring C output to native compiler</td>
</tr>
<tr>
<td>Sequential vs Parallel Loop</td>
<td>PAO, TAO</td>
<td>Rewriting vector computations</td>
</tr>
<tr>
<td>Multiply-add</td>
<td>TAO</td>
<td>Algebraic reassociation</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Compute critical path lengths for PAO queries</td>
</tr>
</tbody>
</table>

<sup>1</sup> Both polyhedral transformations (see § 6) and AST-based transformations (see § 7)

<sup>2</sup> We may modify the scheduler in the native TAO backend, to use derived latencies as a way to improve portability. The TAO’s query backend (see § 9.3.4) may also perform instruction scheduling to estimate execution costs.

---

*Table 2.4: Optimizations That Use Measured Characteristics*
## Chapter 2. Resource Characterization in the PACE Project

<table>
<thead>
<tr>
<th>Code</th>
<th>Live Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td>(\text{add } r_{101} r_{102} \Rightarrow r_{103})</td>
<td>(r_{102}, r_{101})</td>
</tr>
<tr>
<td>(\text{add } r_{102} r_{103} \Rightarrow r_{104})</td>
<td>(r_{103}, r_{102})</td>
</tr>
<tr>
<td>(\text{add } r_{103} r_{104} \Rightarrow r_{105})</td>
<td>(r_{104}, r_{103})</td>
</tr>
<tr>
<td>(\text{add } r_{104} r_{105} \Rightarrow r_{106})</td>
<td>(r_{105}, r_{104})</td>
</tr>
<tr>
<td>:</td>
<td>:</td>
</tr>
</tbody>
</table>

*Stream with Register Pressure of Two*

<table>
<thead>
<tr>
<th>Code</th>
<th>Live Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td>(\text{add } r_{100} r_{102} \Rightarrow r_{103})</td>
<td>(r_{102}, r_{101}, r_{100})</td>
</tr>
<tr>
<td>(\text{add } r_{101} r_{103} \Rightarrow r_{104})</td>
<td>(r_{103}, r_{102}, r_{101})</td>
</tr>
<tr>
<td>(\text{add } r_{102} r_{104} \Rightarrow r_{105})</td>
<td>(r_{104}, r_{103}, r_{102})</td>
</tr>
<tr>
<td>(\text{add } r_{103} r_{105} \Rightarrow r_{106})</td>
<td>(r_{105}, r_{104}, r_{103})</td>
</tr>
<tr>
<td>:</td>
<td>:</td>
</tr>
</tbody>
</table>

*Stream with Register Pressure of Three*

<table>
<thead>
<tr>
<th>Code</th>
<th>Live Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td>(\text{add } r_{102-N} r_{102} \Rightarrow r_{103})</td>
<td>(r_{102}, r_{101}, \cdots, r_{102-N})</td>
</tr>
<tr>
<td>(\text{add } r_{103-N} r_{103} \Rightarrow r_{104})</td>
<td>(r_{103}, r_{102}, \cdots, r_{103-N})</td>
</tr>
<tr>
<td>(\text{add } r_{104-N} r_{104} \Rightarrow r_{105})</td>
<td>(r_{104}, r_{103}, \cdots, r_{104-N})</td>
</tr>
<tr>
<td>(\text{add } r_{105-N} r_{105} \Rightarrow r_{106})</td>
<td>(r_{105}, r_{104}, \cdots, r_{105-N})</td>
</tr>
<tr>
<td>:</td>
<td>:</td>
</tr>
</tbody>
</table>

*Stream with Register Pressure of \(N\)*

*Figure 2.2: How to Use Names to Control Register Pressure in a Stream*
2.3. METHOD

The live range benchmark consists of a suite of codes, each of which is compiled and analyzed. Each individual code contains a stream of operations that is carefully designed to exhibit a specific demand for registers, or register pressure. A “stream” of operations, in the RC context, means a sequence of operations that are constrained by data dependences to execute in a serial fashion. We control the register pressure through the naming scheme used for values in the operations. Figure 2.2 shows how the pattern of names used in a stream of add operations controls register pressure.

The microbenchmark consists of a set of codes, each with one long basic block. The set includes codes that have pressure 2, 3, 4, … 256. The test harness for this microbenchmark compiles each of these codes, using the compiler flag (from the configuration file) that produces an assembler-code listing. The analysis then measures the number of non-comment lines in the assembler code generated by the native compiler for each code. The analysis produces a series of data points, such as shown in Table 2.5.

Each line in Table 2.5 shows the results obtained for the code with a given register pressure; the register pressure is shown in the leftmost column. The next three columns show the number of non-comment lines in the assembler code listing for each of integer registers, floating-point registers, and double-precision floating-point registers.

While the results show minor variation, the major effect for integers occurs at pressure of fifteen, suggesting that the compiler begins spilling after it has used fourteen registers. With floating-point and double-precision values, the significant increase occurs after sixteen registers. Red lines in the table show the significant break points in the data. These results make intuitive sense; the linkage convention reserves a couple of integer registers for its own purposes, such as the activation record pointer and the return address, while all of the floating-point registers are available to the allocator.

Notice the noise in the data. Such noise is a larger factor in benchmarks that rely on measured running times. In designing a microbenchmark, we try to build a code that will exhibit a significant jump in the measured behavior when it reaches the characteristic value. In this microbenchmark, the length of the block determines the size of the jump when the compiler begins to spill. For this microbenchmark, the use of longer blocks amplifies the measured effect and, thus, reduces noise, so longer blocks are better, subject to the twin concerns that compile times tend to rise nonlinearly in the length of the source-code compilation unit, and that overly long blocks cause some compilers to fail. Thus, block length is chosen as a compromise between the desire to have the effect clearly delineated in the data and the problems that arise in compiling longer blocks.

**Coordinating Values Between Microbenchmarks** Some microbenchmarks need access to values computed by other tests. For example, a number of the microbenchmarks need access to the size of the level one cache. To allow for such sharing, the RC tools create a temporary directory and communicate it, using an environment variable, to the individual microbenchmarks. The individual microbenchmarks read and write files in the temporary directory.

The memory hierarchy benchmarks compute many of the characteristic values using multiple diverse tests. A summary pass then compares those results and reports consensus numbers to the software that produces the data for external consumption—either the XML interface or the internal interface to other PACE tools.

2.3.2 Reporting Characteristic Values

Two distinct kinds of tool use the values produced by the RC tools: the AACE Task 2 teams’ testing programs and the other PACE tools. The RC toolset supports a distinct interface for each of these consumers.
### Table 2.5: Measurements from the Live Range Microbenchmark Compiled for an Intel T9600

<table>
<thead>
<tr>
<th>Pressure</th>
<th>Code Length</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Integers</td>
</tr>
<tr>
<td>2</td>
<td>1070</td>
</tr>
<tr>
<td>3</td>
<td>1070</td>
</tr>
<tr>
<td>4</td>
<td>1070</td>
</tr>
<tr>
<td>5</td>
<td>1071</td>
</tr>
<tr>
<td>6</td>
<td>1071</td>
</tr>
<tr>
<td>7</td>
<td>1071</td>
</tr>
<tr>
<td>8</td>
<td>1070</td>
</tr>
<tr>
<td>9</td>
<td>1071</td>
</tr>
<tr>
<td>10</td>
<td>1073</td>
</tr>
<tr>
<td>11</td>
<td>1073</td>
</tr>
<tr>
<td>12</td>
<td>1080</td>
</tr>
<tr>
<td>13</td>
<td>1083</td>
</tr>
<tr>
<td>14</td>
<td>1096</td>
</tr>
<tr>
<td>15</td>
<td>1350</td>
</tr>
<tr>
<td>16</td>
<td>1394</td>
</tr>
<tr>
<td>17</td>
<td>1488</td>
</tr>
<tr>
<td>18</td>
<td>1518</td>
</tr>
<tr>
<td>19</td>
<td>1646</td>
</tr>
<tr>
<td>20</td>
<td>1667</td>
</tr>
</tbody>
</table>

**XML Interface**

The AACE Task 2 Teams (the Blackjack and MAACE teams) have defined an XML schema for reporting results to their testing programs. Accordingly the PACE RC toolset includes a shell script that takes the output of the various RC microbenchmarks and produces an XML report in the format specified by the XML schema. That report is a flat text file.

The RC tools record their values in a flat text file. Values are recorded, one per line, as a pair:

`name, value`

where `name` is a fully qualified name for the characteristic that matches the tags in the XML schema and `value` is the measured value. To produce the XML output file, the RC tools run a bash shell script that reads the `name, value` pairs in the order specified by the Task 2 teams’ schema and constructs the appropriate XML code. The script uses the standard utilities `grep` and `awk`.

**Interface to Other PACE Tools**

The XML interface developed for the end-of-phase tests performed by the AACE Task 2 teams is unsuitable for use by the other PACE tools for several reasons.

- The XML interface is only defined for a subset of the characteristics that we measure; for example, access cost by stride was deemed too complex to represent in a flat format. The other PACE tools need access to the full set of measured values.

- The XML interface forces the consuming program to parse and understand the XML file. The PACE tools need a simpler interface. When a tool needs the size of the level one cache or the...
rough cost of a function call, the query should be simple and direct; it should return a single unequivocal number.

- The XML schema makes no provision for annotations on a value. The interface for the PACE tools will include a facility for annotations that describe exceptional conditions, including but not limited to: the microbenchmark that produces the value failed to report a number; the value produced made no sense; and the value returned is an assumed value rather than a computed value.\(^4\)

- The PACE tools need a stable interface. The XML schema changes to accommodate the needs of the testing and evaluation process. The PACE tools need a simple procedural interface that remains stable over the development cycle of the PACE tools.

To that end, the PACE \(\text{RC}\) tools will provide a procedural interface to the values that they produce. The \(\text{RC}\) tools will store their results in an internal file, hereafter referred to as the \textit{repository}. The procedural interface will use the repository as the source for its values.

The repository will use an expanded version of the XML schema proposed by the Task’2 teams. PACE will add both new characteristics and new attributes (or fields) to the scheme to represent the annotations that the compiler needs. We have not yet completed the design of the expanded schema.

The details of the interface have not been finalized. Our initial prototype will consist of the following procedures:

\textit{Management Functions}

\begin{verbatim}
int rc_init() \hspace{1cm} Initializes the \textit{RC} interface. Returns 1 if successful or a negative number as an error code.

void rc_final() \hspace{1cm} Closes the \textit{RC} interface and deallocates its data structures. Subsequent queries will fail.
\end{verbatim}

\textit{Queries}

\begin{verbatim}
void *rc_query(char *s) \hspace{1cm} s is a string that identifies the characteristic value. The call returns a structure that contains the result.
\end{verbatim}

The format of the structure returned by \textit{rc_query} will evolve as the uses of the \textit{RC}-produced data evolves in rest of the PACE system. The structure will include both values and annotation about the veracity of the values.

\(^4\)For example, if the \textit{RC} tools failed to discover the operating system page size, a value of 4,096 bytes is a reasonable assumption on most POSIX systems.
Chapter 3

An Overview of the PACE Compiler

The PACE compiler lies at the heart of the project’s strategy to provide high-quality, characterization-driven optimization. The compiler uses a series of analyses and transformations to rewrite the input application in a way that provides better performance on the target system. The compiler supports feedback-driven optimization. It works with the RTS to implement runtime variation of optimization parameters. It has a mechanism to incorporate new optimization strategies derived by the ML tools.

3.1 Introduction

The PACE Compiler is a source-to-source optimizing compiler that tailors application code for efficient execution on a specific target system. It accepts as input parallel programs written in C with either MPI or OPENMP calls. It produces, as output, a C program that has been tailored for efficient execution on the target system.

As shown in Figure 1.2, the PACE Compiler is as a series of tools that work together to create the optimized version of the application. Each of these tools is a complex system; each is discussed in its own chapter of the design document (see Table 1.1). This chapter serves as an introduction to the separate tools in the PACE Compiler and their interactions; § 3.3 discusses each of the components. Subsequent chapters provide more detail. This chapter also addresses the high-level, cross-cutting design issues that arise in the PACE Compiler.

3.2 Functionality

While the PACE Compiler is a collection of tools, it presents the end user with functionality that is similar to the functionality of a traditional compiler. The user invokes the PACE Compiler on an input application and the compiler produces executable code. To perform its translation and optimization, the PACE Compiler draws on resources provided by the rest of the PACE system, as shown in Figure 3.1. Because most of these interfaces are internal and most of the components are automatically controlled, the internal complexity of the PACE Compiler is largely hidden from the end user.

3.2.1 Input and Output

The PACE Compiler accepts, as input, an application written in C with calls to MPI or OPENMP libraries. The compiler assumes that the input program is a parallel program; while the compiler will discover some opportunities to exploit parallelism, detection of all available parallelism is not the focus of the PACE project.

Principal Contacts For This Chapter: Keith Cooper, keith@rice.edu, Vivek Sarkar, vsarkar@rice.edu, and Linda Torczon, linda@rice.edu
The PACE Compiler produces, as output, an executable version of the input application, transformed to improve its performance on the target computer system. The compiler has several ways to generate executable code for the application. It can generate transformed C code and use the vendor-supplied native compiler to perform code generation. For some systems, it can use the LLVM backend to generate code (see §3.4). The choice of code generators will depend on the quality of the native compiler and the availability of an LLVM backend for the system.

3.2.2 Interfaces

Figure 3.1 shows the primary interfaces between the PACE Compiler and the rest of the PACE system. The PACE Compiler uses information from several sources.

- **Characterization**: The PACE RC tools measure performance characteristics of the combined hardware/software stack of the target system. The PACE Compiler uses those characteristic values, both to drive optimizing transformations and to estimate the performance of alternative optimization strategies on the target system.

- **Machine Learning**: The PACE ML tools will provide suggestions to the compiler to guide the optimization process. The ML tools communicate those suggestions by modifying the optimization plan for a given application or, perhaps, one of the default optimization plans.

- **Runtime System**: The PACE RTS will provide the compiler with measured performance data from application executions. This data will include detailed profile information. The Runtime System will pinpoint resource constraints that create performance bottlenecks.

- **Optimization Plan**: The PACE Compiler will coordinate its internal components, in part, by using an explicit optimization plan. The optimization plan is discussed in §3.2.3.

- **Configuration File**: The configuration file is provided as part of system installation. It contains critical facts about the target system and its software configuration (see §3.3). Its format is specified by the AACE Task 2 teams.

The compiler embeds, in the executable code, information of use to the Runtime System, such as a concise representation of the optimization plan and data structures and calls to support runtime tuning of optimization parameters (see §5.3.8 and 10.3.4). By embedding this information directly in the executable code, PACE provides a simple solution to the storage of information needed for
feedback-driven optimization and for the application of machine learning to the selection of optimization plans. It avoids the need for a centralized store, like the central repository in the \( R \) system of the mid-1980s. It avoids the complication of creating a unique name for each compilation, recording those in some central repository, and ensuring that each execution can contact the central repository.

### 3.2.3 The Optimization Plan

To provide the flexibility that the PACE Compiler needs in order to respond to different applications and different target systems, the compiler needs a mechanism for expressing and recording optimization plans. The primary mechanism for changing the PACE Compiler’s behavior is to suggest an alternate optimization plan. An optimization plan is a concrete representation of the transformations applied during a compilation. An optimization plan must specify, at least, the following items:

1. The location, i.e., the file system path, to the working directory for the application;
2. the compilation path taken for each RPU (e.g., full compilation path, short compilation path, or LLVM backend path).
3. The transformations that should be applied by the AAP, along with parameters to those transformations;
   For each RPU:
4. the transformations that should be applied by the PAO and the TAO (see Figure 3.3), along with any parameters or commands that control those transformations.
5. The suggested order (if any) of those transformations;
6. additional inputs, if any, to the PAO, Sage→LLVM translator, TAO, and native compiler.
When the user invokes the compiler driver to prepare an application, the compiler driver provides a default optimization plan for the first compile. The executable is prepared using that default plan; the code is executed on user-provided representative data; and the RTS gathers performance statistics on that execution.

On the second compile, the AAP uses performance measurements from the RTS to guide its optimizations; for example, context-sensitive profile information will play a role in decisions about both inline substitution and partitioning the code into RPUs. As it refactors the code, the AAP tries to place together pieces of the program that have similar rate-limiting resources—with the assumption that similar rate-limiting resources imply a similar optimization plan.

The compiler driver then initializes the optimization plan for each RPU. It uses the rate-limiting resource information from each RPU to identify an initial plan for the RPU, drawn from a set of prefabricated plans. Possible rate-limiting resources include cache locality, ILP, multicore parallelism, and register pressure. Whenever the AAP’s partition of the application into RPUs changes, the optimization plans for the RPU will be re-initialized.

The optimization plan guides the compiler components as they translate and transform the individual RPUs, in the second and subsequent compilations. The mechanism to change the behavior of those components is to change the optimization plan. In principle, any component of the compiler can change the optimization plan and affect the behavior of other components. In PACE, we will focus our attention on the kinds of changes described in §1.3.2.

The definitive representation of an optimization plan resides in a file in the application’s working directory. To simplify the handling of information in the PACE tools, the compiler will insert a concise representation of the optimization plan as data into the object code produced by any specific compilation.\(^1\)

### 3.3 Components of the PACE Compiler

The PACE Compiler has a number of major components, shown in Figure 3.2.

**Compiler Driver**

The compiler driver provides the application programmer with their interface to the PACE Compiler. To compile a program, the programmer creates a directory that contains the source code for the application and for any libraries that are to be optimized with it. Next, the programmer invokes the compiler driver on that directory.

The compiler driver has two key responsibilities: managing the compilation process and creating a location for the results of this compilation in the PACE system’s distributed repository. We will defer discussion of the first task until the other components have been described (see §3.4).

The PACE Compiler stores its work products, such as annotations, the optimization plan, intermediate files, and analysis results, in a *working directory* created inside the application’s directory. If the working directory does not yet exist, the compiler driver creates it. Within the working directory, the compiler driver creates a subdirectory for each compilation; it passes the location of this directory to each of the other components that it invokes and ensures that the location is embedded in the final executable code where the RTS can find it. The working directory becomes part of the distributed repository. It contains the records of both compilation and execution for this application.

---

\(^1\) The format of the optimization plan, in the working-directory file and in the object-code file, has not yet been determined.
3.3. COMPONENTS OF THE PACE COMPILER

Application-Aware Partitioner

The first PACE Compiler component invoked by the compiler driver is the Application-Aware Partitioner (AAP). The AAP examines the source code for the entire application and refactors it into refactored program units (RPUs) based on results from previous compilations, from previous executions, and from its own analysis in the current compilation. The AAP has two primary goals:

1. To limit the size of any single compilation unit so that the rest of the compiler can do an effective job of optimization;\(^2\) and

2. To group together procedures that have similar performance profiles and problems so that they can be compiled using the same optimization plan.

The AAP, as part of refactoring, can change the number of implementations for any given procedure using either inline substitution or procedure cloning.\(^3\) It will group together procedures into an RPU that have an affinity—either the same rate limiting resource as identified by the RTS or a frequent caller/callee relationship. It will pad, align, and reorder data structures.

Characterization Data structure padding and alignment transformations in the AAP will use the measured data on the impact of stride on access time. As part of the compiler characterization work in Phase 2, we will attempt to quantify the impact of RPU size on code quality.

Platform-Aware Optimizer

The compiler driver iterates through the RPUs created by the AAP. Each RPU serves as a compilation unit. The driver invokes the Platform-Aware Optimizer (PAO) for each RPU, passing the location of the working directory to it. The PAO applies analyses and transformations intended to tailor the application’s code to platform-wide, or system-wide, concerns. Of particular concern are efficiency in the use of the memory hierarchy and the use of thread-level parallelism. The PAO operates on the code at a level of abstraction that is close to the original C source code.

The PAO finds the optimization plan for the application in the application’s working directory. The PAO can modify the optimization plan executed by the AAP, PAO and TAO components. It modifies the AAP optimization plan across compilations, and the TAO optimization plan within a compilation, generating commands that instruct and constrain the TAO in its code transformations. PAO transformations include loop tiling, loop interchange, unrolling of nested loops, and scalar replacement (see §7). The PAO chooses transformation parameters, for example choosing unroll factors for each loop in a multidimensional loop nest, based on the results of querying the TAO through the PAO–TAO query interface (see § 5.3.5).

Polyhedral Analysis and Transformation Tools The PAO includes a subsystem that uses polyhedral analysis and transformations to reorganize loop nests for efficient memory access (see § 6). The polyhedral transformations use parameters of the memory hierarchy, derived by the PACE RC tools, and the results of detailed polyhedral analysis to rewrite entire loop nests.

Characterization The PAO uses the measured characteristics of the memory hierarchy in both the polyhedral transformations and the non-polyhedral loop transformations. In Phase 2, we will measure dynamic effects in the memory hierarchy, which should refine the effective numbers used in the PAO.

\(^2\)Evidence, over many years, suggests that code quality suffers as procedure sizes grow and as compilation unit sizes grow.

\(^3\)Specifically, it should clone procedures based on values from forward propagation of interprocedural constants in the last compilation. Out-of-date information may cause under-optimization; it will not cause a correctness problem [16].
PAO→TAO IR Translator

Because the PAO and the TAO operate at different levels of abstraction, the PACE Compiler must translate the IR used in the PAO into the IR used in the TAO. The PAO uses the abstract syntax trees in the SAGE III IR. The TAO uses the low-level linear SSA code defined by LLVM.

The translator lowers the level of abstraction of PAO IR, converts it into SSA form, and rewrites it in TAO IR. Along the way, it must map analysis results and annotations created in the PAO into the TAO IR form of the code. The compiler driver invokes the translator for each RPU, and passes it any information that it needs.

Characterization The PAO→TAO IR translator does not directly use characterization data.

Target-Aware Optimizer

The Target-Aware Optimizer (TAO) takes code in IR form that has been tailored by the PAO for the platform-wide performance characteristics and maps it onto the architectural resources of the individual processing elements. The TAO adjusts the code to reflect the specific measured capacities of the individual processors. It also provides optimizations that may be missing in the native compiler, such as operator strength reduction, algebraic reassociation, or software pipelining.

The TAO will provide three distinct kinds of backend.

- On machines where the underlying LLVM compiler has a native backend, such as the x86 ISA, the TAO can generate assembly code for the target processor.
- A C backend for the TAO will generate a source language program in C. The C backend will adjust the code for the measured strengths and weaknesses of the native compiler.
- A query backend for the TAO will answer queries from the PAO. This backend will use a generic ISA, with latencies and capacities established by measured system characteristics.

The TAO is invoked on a specific optimized RPU. One of its input parameters specifies which backend it should use in the current compilation step. Different paths through the PACE Compiler invoke the TAO. The compiler driver can invoke the TAO to produce either native assembly code, using an LLVM backend, or tailored C source code. The PAO can invoke the TAO directly with the query backend to obtain answers to specific queries (see §9.3.4).

The TAO consults the optimization plan, stored in the application's working directory, to guide its actions. The specific actions taken by the TAO in a given invocation depend heavily on (1) the choice of backend, specified by the tool that invokes it; (2) the context of prior optimization, recorded from the AAP, the PAO, and prior compilations; and (3) results from the PACE RC tools.

Characterization The TAO uses measured characteristics, such as the number of allocable registers and the relative cost of different abstractions, to shape the code in the C backend. The compiler characterization work in Phase 2 will directly affect the choice of optimizations in the TAO and the code shape produced by the C backend. In the query backend, it uses microarchitectural characteristics, such as operation throughputs and latencies, to model the code's performance.

3.4 Paths Through the PACE Compiler

The compiler driver can put together the components of the PACE Compiler in different ways. The thick lines in Figure 3.2 show the three major paths that code takes through the PACE Compiler.

Full Compilation Path The compiler driver can invoke the tools in sequence, AAP, PAO, PAO→TAO IR translator, TAO, native compiler. This path corresponds to the centerline of the figure. It
### 3.4. Paths Through the PACE Compiler

<table>
<thead>
<tr>
<th>Application-Aware Partitioner</th>
<th>Platform-Aware Optimizer</th>
<th>Target-Aware Optimizer</th>
</tr>
</thead>
<tbody>
<tr>
<td>Inlining</td>
<td>Dead code elimination&lt;sup&gt;1&lt;/sup&gt;</td>
<td>Dead code elimination&lt;sup&gt;1&lt;/sup&gt;</td>
</tr>
<tr>
<td>Outlining</td>
<td>Dead code elimination&lt;sup&gt;1&lt;/sup&gt;</td>
<td>Superblock cloning</td>
</tr>
<tr>
<td></td>
<td>Inlining</td>
<td>Tail call elimination</td>
</tr>
<tr>
<td></td>
<td></td>
<td>SW branch prediction</td>
</tr>
<tr>
<td>Procedure cloning</td>
<td>Interprocedural constant prop.</td>
<td>Local constant prop.</td>
</tr>
<tr>
<td></td>
<td>Intraprocedural constant prop.</td>
<td>Intraprocedural constant prop.&lt;sup&gt;2&lt;/sup&gt;</td>
</tr>
<tr>
<td>Partial redundancy elimination</td>
<td>Enhanced scalar replacement&lt;sup&gt;3&lt;/sup&gt;</td>
<td>Partial redundancy elimination</td>
</tr>
<tr>
<td>Enhanced scalar replacement&lt;sup&gt;3&lt;/sup&gt;</td>
<td>Algebraic reassociation&lt;sup&gt;3&lt;/sup&gt;</td>
<td>Enhanced scalar replacement&lt;sup&gt;3&lt;/sup&gt;</td>
</tr>
<tr>
<td>Algebraic reassociation&lt;sup&gt;3&lt;/sup&gt;</td>
<td>Idiom recognition</td>
<td>Algebraic reassociation&lt;sup&gt;3&lt;/sup&gt;</td>
</tr>
<tr>
<td>Synchronization reduction</td>
<td>Synchronization reduction</td>
<td>Algebraic simplification&lt;sup&gt;4&lt;/sup&gt;</td>
</tr>
<tr>
<td>If conversion</td>
<td>If conversion</td>
<td>Operator strength reduction</td>
</tr>
<tr>
<td>Scalar expansion</td>
<td>Scalar expansion</td>
<td>Tree-height balancing</td>
</tr>
<tr>
<td>Scalar replacement</td>
<td>Scalar replacement&lt;sup&gt;6&lt;/sup&gt;</td>
<td>Regeneration of SIMD loops&lt;sup&gt;5&lt;/sup&gt;</td>
</tr>
<tr>
<td>PDG-based code motion</td>
<td>PDG-based code motion</td>
<td>Lazy code motion</td>
</tr>
<tr>
<td>Polyhedral transformations</td>
<td>Polyhedral transformations</td>
<td>Prefetch insertion</td>
</tr>
<tr>
<td>Loop transformations</td>
<td>Loop transformations&lt;sup&gt;7&lt;/sup&gt;</td>
<td>Loop unrolling</td>
</tr>
<tr>
<td>Array padding</td>
<td>Array padding</td>
<td></td>
</tr>
<tr>
<td>Reorder structure nesting</td>
<td>Reorder structure nesting</td>
<td></td>
</tr>
<tr>
<td>Array &amp; loop linearization</td>
<td>Array &amp; loop linearization</td>
<td></td>
</tr>
</tbody>
</table>

1. Include multiple forms of “dead” code and multiple transformations.
2. May be redundant in the TAO.
3. Placement in PAO or TAO will depend on experimentation.
4. Includes application of algebraic identities, simplification of predicate expressions, and peephole optimization.
5. Generation of SIMD loops in C source form is tricky.
6. Expanded to include pointer-based values.
7. Will consider unroll, unswitch, fuse, distribute, permute, skew, align, reverse, tile, and shackling. Some combinations obviate need for others. Some overlap with polyhedral transformations.

---

**Figure 3.3: Optimizations Under Consideration for the PACE Compiler**

Invokes all of the tools in the PACE Compiler and directs the TAO to generate tailored C source code as output.

**Short Compilation Path** If the target system has a strong native compiler, as determined by the RC tools, the compiler driver may bypass the TAO, with the sequence AAP, PAO, native compiler. (The compiler driver directs the PAO to emit C source code.) This sequence relies on the native compiler for target-specific optimization.

**LLVM Backend Path** If an LLVM backend is available on the target machine, the compiler driver can invoke a sequence that uses the LLVM backend to generate native code, bypassing the native compiler. In this scenario, it invokes the tools in sequence AAP, PAO, PAO→TAO IR translator, TAO. The compiler driver tells the TAO to use the LLVM backend.

In any of these paths, the PAO can invoke the TAO through the PAO→TAO query interface.
3.5 Optimization in the PACE Compiler

To avoid redundant effort, the PACE Compiler should avoid implementing and applying the same optimization at multiple levels of abstraction, unless a clear technical rationale suggests otherwise. Thus, the AAP, the PAO, and the TAO each have their own set of transformations; most of the transformations occur in just one tool.

Figure 3.3 shows the set of optimizations that are currently under consideration for implementation in the PACE Compiler, along with a tentative division of those transformations among the three distinct levels of optimization in the PACE Compiler. This division is driven by the specific mission of each optimizer, by the level of abstraction at which that optimizer represents the application code, and by the kinds of analysis performed at that level of optimization. We do not expect to implement all of these transformations. Overlap between their effects will make some of them redundant. Others may address effects that cannot be seen in the limited source-to-source context of the PACE Compiler.

The canonical counterexample to this separation of concerns is dead code elimination—specifically, elimination of useless code and elimination of unreachable code. Experience shows that routine application of these transformations at various times and levels of abstractions both reduces compile times and simplifies the implementation of other optimizations. Thus, in the table, they appear in both the PAO and the TAO. While we do not see an immediate rationale for applying them in the AAP, we will test their efficacy in that setting, too.

Some optimizations require collaboration among the various tools in the PACE Compiler. Consider, for example, vectorization. The PAO may identify a loop nest that is an appropriate candidate for vectorization—that is, (1) the loop nest can execute correctly in parallel; (2) the loop executes enough iterations to cover the overhead of vector startup; (3) the PACE RC tools have shown that such a loop can run profitably in vector form; and (4) the configuration file contains the necessary compiler options and commands to allow the PACE Compiler to generate code that will vectorize. When the PAO finds such a loop, it must encode the loop in a form where the remainder of the tools will actually produce vector code for the loop.

In the full compilation path, the PAO annotates the loop nest to indicate that it can run in vector mode. The PAO→TAO IR translator encodes the SIMD operations into the LLVM IR’s vector operations. The TAO generates appropriate code, using either the C backend or the native backend. To ensure that the TAO can generate vector code for these operations, the PAO may need to disable specific TAO optimizations, such as software pipelining, block cloning, tree-height-balancing, and loop unrolling. To do so, the PAO modifies the optimization plan seen by the TAO. Finally, the C backend marks the vectorizable loops with appropriate annotations to inform the vendor compiler that they should be vectorized (e.g., IVDEP).

In the LLVM backend path, optimization proceeds as above, except that responsibility for generating vector operations lies in the LLVM backend. Again, the PAO may disable some TAO optimizations to prevent the TAO from introducing unneeded dependences that prevent vector execution.

In the short compilation path, the PAO annotates the loop nest, as in the full compilation path. In this case, however, the consumer of that annotation is the PAO pass that regenerates C code; it will mark the loop nest with appropriate annotations for the vendor compiler, as found in the system configuration file.

Appendix A provides more detail on our plans for handling vectorization in the PACE Compiler.
3.6 Software Base for the PACE Compiler

The PACE Compiler builds on existing open source infrastructure. The following table shows the systems used in each of the components of the PACE Compiler.

<table>
<thead>
<tr>
<th>Infrastructure Used in the PACE Compiler</th>
</tr>
</thead>
<tbody>
<tr>
<td>AAP</td>
</tr>
<tr>
<td>PAO EDG front end†, Rose, Candl, Pluto, CLooG</td>
</tr>
<tr>
<td>TAO</td>
</tr>
</tbody>
</table>

In the AAP, the open source systems are used to generate components of the AAP. In the PAO and TAO, the actual PACE tools are build as extensions of the open source tools.

† Licensed software
Chapter 4

PACE Application-Aware Partitioner

4.1 Introduction and Motivation

The Application Aware Partitioner (AAP) is the first tool that the PACE Compiler applies to an application. The AAP rewrites the entire application into a set of refactored program units (RPUs) in a way that may improve the ability of later stages of the compiler (the PAO, the TAO, and the native compiler) to produce better code for the application. The specific goals for refactoring are:

1. *To place related functions in the same RPU.* Functions that call one another frequently are placed in the same RPU because that frequent calling relationship should magnify any benefits from interprocedural analysis and optimization.

2. *To limit the size of translation units.* Many optimizing compilers have the property that the quality of optimization degrades with the size of the input file [26]. We limit the size of RPUs both in terms of the number of functions (to limit the size of data structures in interprocedural analysis) and the number of total lines of code (to limit the size of intraprocedural data structures). These limits should also mitigate nonlinear growth in compile time.

The RPUs are chosen by examining both the structure of the application code and execution profile data for sample runs. Careful refactoring should lead to better global and interprocedural analysis on the frequently executed portions of the code. That, in turn, should improve the application’s performance.

4.2 Functionality

We describe the RPUs as a set of functions with the following properties:

1. connected call graph;
2. few calls between RPUs (as a percentage of total calls);
3. small enough number of functions that the compiler can do complex interprocedural analysis.
4. All of the functions in an RPU exist in one translation unit, which is defined by the compiler. This can be a single file or a number of files.
5. A function may be duplicated and placed in multiple RPUs to allow more specific and aggressive optimizations. However, a need to minimize final code duplication needs to be studied.

Principal Contacts For This Chapter: Sebastien Donadio, dose@etinternational.com, Kirk Kelsy, kelsey@etinternational.com, and Vivek Sarkar, vsarkar@rice.edu
The functions in an RPU are ordered according to the original source file ordering. Each function is associated with certain information, such as whether the function appears in other RPUs, and whether the function needs to be accessible outside the RPU.

### 4.2.1 Input

The PACE compiler driver invokes the AAP with the appropriate input, which includes:

1. **Source Code and Compiler Directives**: We assume that the entire source of the application is available, but not any library sources. The AAP consults the application’s optimization plan to determine what optimizations and optimization parameters it should apply. It also uses a *build options file* produced by the compiler driver (see § 4.7).

2. **Profiling**: The RTS profiler generates application profile information, including function call counts and execution time (broken down per function). The profiler builds call trees, where each path in a tree represents an executed sequence of function calls. Each edge in the tree represents one function (the source) invoking another (the sink), and is associated with a count. Profiling information will possibly be derived from multiple input data sets. In this case the call trees (and their associated information) will be merged by the RTS.

3. **Compiler Characteristics**: Compiler characteristics produced by the resource characterization module are used by AAP. For example, the maximum size of an RPU depends on the space efficiency of the compiler’s internal representation and on how well the optimizer scales.

### 4.2.2 Output

The AAP produces as output, the following items:

1. Set of source files, each containing the functions that represent one RPU.
2. Makefile with compilation instructions for each RPU.
3. Graph representation of RPU partitioning (optionally).

It is worth noting that, beyond output item 3 above, the AAP does not explicitly provide any information intended for the user. A programmer might find it useful to have a report of which functions were moved or cloned and into which RPU they were placed; essentially a distillation of the graph representation. The contents and organization are under consideration, but some of this information can be inferred from the preprocessor directives indicating original source code location, which the AAP preserves.

### 4.2.3 Out of Scope

The following functionality is currently out of scope for the PACE deliverables, but may be of interest in the future:

- Incremental compilation and re-optimization in AAP.

### 4.3 Method

The AAP process consists of three primary stages:

1. Call tree processing
2. Partitioning
3. Source code reorganization

Figure 4.1 depicts these stages and data objects passed between them. During execution, the AAP maintains all state in memory. Although the representation is concise, it is possible for the AAP to exceed the available system memory during execution.

4.3.1 Call Tree Processing: Stage 1

The RTS produces profile results stored in an XML file (see § 10.2.3 and 10.3.2). The AAP locates this file (in the application’s working directory) and parses it, using C language XML parsing library xerces. Parsing extracts call trees and performance metrics associated with each path.

4.3.2 Partitioning: Stage 2

Partitioning seeks to group together those functions that are interdependent, while limiting the size of each partitioned unit. The resulting partitions should also have limited dependence on one another. The interdependence between functions is measured in terms of function calls. The size limit of a partition depends on which transformations are specified in the optimization plan. For example, if loop unrolling is performed, this causes an increase in the number of instructions in a function. (Some functions are data cache limited and others are instruction cache limited.)

The partitioning framework receives a graph representation of the target application from Stage 1 in libgraph format. This graph is then divided into sub-graphs. The specific algorithm and heuristics used to perform this partitioning are separate from the AAP framework itself, and is a subject of further research.

As we gain experience with the PACE compiler, we may develop a partition heuristic that groups together functions based on the similarity of their optimization plans. This approach might improve the efficiency of subsequent optimization and compilation. We also plan to explore the use of feedback from the PAO’s interprocedural analysis to the AAP. For example, the PAO might specify...
an affinity (to keep two functions together), a repulsion (to separate two functions), or a replication
(clone a procedure).

The current heuristic limits an RPU to less than 10 functions and cut edges with a small num-
ber of calls. The RPU size limit can be specified in terms of function count or lines of code. Each
resulting sub-graph then represents some set of functions, which are output as an RPU. The final
graph representation can be stored in a file using DOT representation, which can loaded by later
invocations to avoid recomputing the partitions.

Figure 4.2 illustrates a simple call tree, with nodes and edges representing functions and their
calls respectively. The numeric annotation in each node represents the number of invocations
made by its parent function.

4.3.3 Source Reorganization: Stage 3

The reorganization stage generates the final set of source files output by the AAP. For each RPU
description generated by Stage 2 the reorganizer must locate the indicated functions in the original
source code. In addition to the functions themselves, the AAP must identify ancillary code required
by each function such as type definitions, function declarations, and global data objects.

Source code reorganization is implemented with a parser built using the Bison parser generator
and Flex lexical analyzer. Prior to reorganization the C preprocessor is applied to each source file
to ensure that all necessary definitions are in place. The resulting file is then parsed to determine
the location (character position) of various elements, and the inter-dependencies between these
elements.

Refactoring the code can necessitate changes in the visibility of some functions. For example,
functions that were originally defined as static may be called, in the refactored code, from func-
tions in another RPU. To accommodate this kind of change, the AAP generates a wrapper function
for any static functions that it must expose outside of their new RPU. Each such wrapper function
receives a unique name. Calls to the static function from another RPU are adjusted to use the ap-
propriate wrapper. If a function appears in multiple RPU's because the AAP has replicated it, the AAP selects one instance as the primary implementation and marks all of the other clones as static to prevent name conflicts. If the function is an entry point to the RPU (i.e., the caller of the function is in a different RPU), then the function is given a wrapper with a unique name. When selecting the primary implementation of a function, the AAP gives preference to a clone that serves as an entry point to its RPU; doing so helps avoid some unnecessary name mangling.

4.4 Results

The application aware partitioner has been successfully applied to the SPEC CPU 2006 benchmark suite. The impact of partitioning is tested by comparing execution time of these benchmarks when compiled using a variety of RPU size constraints including the extreme cases of placing each function in a separate RPU as well as attempting to place all functions in a single RPU (i.e., infinite RPU size).

These experiments have been performed using gcc 4.4. In addition to the execution time, we compare the partitioned code size, and executable size resulting from each case. One may note that the partitioned code size is often lower than the original code size, even when the RPU size is small. This is a side-effect of the AAP importing only those declarations necessary for each RPU to compile. The executable size is effected by the AAP’s creation of multiple copies of some functions and creating wrapper functions to protect functions with static linkage.

4.4.1 SPEC Benchmarks

We tested the AAP using the SPEC CPU 2006 integer benchmarks. Figure 4.3 depicts the results of four sets of source code:

- Base: unmodified code from SPEC benchmark
- RPU\(_1\): partitioned with a maximum of 1 function per RPU
- RPU\(_{10}\): partitioned with a maximum of 10 functions per RPU
- RPU\(_\infty\): entire unpartitioned graph placed into a single RPU

4.5 Summary

We have developed an Application Aware Partitioner, which restructures source code into translation units more suitable for optimization during later compilation. The size of each translation unit can be specified for each invocation of the AAP, based on the measured constraints of the native compiler, such as the growth of compile time as a function of RPU length, and the measured resource constraints of the processor, such as instruction cache size. The AAP has been successfully applied to the SPEC CPU 2006 benchmark suite, which exhibits a variety of code styles (e.g., a mix of ANSI and K&R style). The impact of code partitioning depends on the subsequent compilation process used, and the partitioning itself should be tailored to it. The results of the AAP component will be more interesting once it is being used to produce RPU's for a specific purpose.

4.6 Command Line Options

4.6.1 Build Options File

Command line: \( -b, --build-file=<file-name> \)
Default: build-options.xml

\(^1\) Only those functions identified by the profiling system are considered.
The build options file indicates what compiler command line options should be used for each source file. These options are associated with each output RPU file, and used when pre-processing the input files. The format for this file is specified in § 4.7.

### 4.6.2 Pruning

**Command line:** `-c, --prune-threshold=< % >`

**Default:** [no pruning]

Specifies a threshold for pruning graph nodes. The value represents a percentage of the total running time (as reported by the profile). Any call-tree leaves accounting for less than this portion of run-time are pruned.

### 4.6.3 Function Limit

**Command line:** `-f, --max-functions=#`

**Default:** 10

Figure 4.3: SPEC CPU 2006 benchmark results

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Base</th>
<th>RPU_1</th>
<th>RPU_{10}</th>
<th>RPU_{∞}</th>
</tr>
</thead>
<tbody>
<tr>
<td>perlbench</td>
<td>time</td>
<td>478.42 s</td>
<td>475.85 s</td>
<td>480.86 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>375955</td>
<td>898110</td>
<td>679884</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>1321 kB</td>
<td>3075 kB</td>
<td>2650 kB</td>
</tr>
<tr>
<td>bzip2</td>
<td>time</td>
<td>199.83 s</td>
<td>198.49 s</td>
<td>199.27 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>17410</td>
<td>14359</td>
<td>12241</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>84 kB</td>
<td>104 kB</td>
<td>99 kB</td>
</tr>
<tr>
<td>gcc</td>
<td>time</td>
<td>111.83 s</td>
<td>111.49 s</td>
<td>110.88 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>1134216</td>
<td>104913</td>
<td>1017251</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>3621 kB</td>
<td>3788 kB</td>
<td>3779 kB</td>
</tr>
<tr>
<td>mcf</td>
<td>time</td>
<td>1561.52 s</td>
<td>1931.35 s</td>
<td>1947.52 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>44722</td>
<td>11639</td>
<td>8931</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>27 kB</td>
<td>60 kB</td>
<td>43 kB</td>
</tr>
<tr>
<td>milc</td>
<td>time</td>
<td>1113.85 s</td>
<td>1134.43 s</td>
<td>1168.15 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>119607</td>
<td>70346</td>
<td>59320</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>139 kB</td>
<td>453 kB</td>
<td>487 kB</td>
</tr>
<tr>
<td>gobmk</td>
<td>time</td>
<td>375.09 s</td>
<td>387.86 s</td>
<td>389.32 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>283671</td>
<td>642638</td>
<td>538406</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>4076 kB</td>
<td>7878 kB</td>
<td>7208 kB</td>
</tr>
<tr>
<td>hmmer</td>
<td>time</td>
<td>923.91 s</td>
<td>931.73 s</td>
<td>923.55 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>167000</td>
<td>145207</td>
<td>144833</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>357 kB</td>
<td>357 kB</td>
<td>357 kB</td>
</tr>
<tr>
<td>libquantum</td>
<td>time</td>
<td>1303.80 s</td>
<td>1299.48 s</td>
<td>1322.04 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>18090</td>
<td>16375</td>
<td>13835</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>54 kB</td>
<td>85 kB</td>
<td>84 kB</td>
</tr>
<tr>
<td>lbm</td>
<td>time</td>
<td>1282.16 s</td>
<td>1290.57 s</td>
<td>1282.95 s</td>
</tr>
<tr>
<td></td>
<td>lines</td>
<td>3250</td>
<td>1961</td>
<td>1499</td>
</tr>
<tr>
<td></td>
<td>binary size</td>
<td>26 kB</td>
<td>27 kB</td>
<td>26 kB</td>
</tr>
</tbody>
</table>
4.6. COMMAND LINE OPTIONS

Specifies the maximum number of functions allowed in an output RPU.

4.6.4 Graphs

Optional command line: \texttt{-g, --graphs | -G, --graphs-only}

Default: [off]

Instruct the AAP to output a representation of the intermediate graphs (the call tree before and after partitioning). The output files will be in DOT format in files named \texttt{aap.dot} and \texttt{rpu.dot} in the output directory (\S 4.6.7). If graph pruning is enabled (see 4.6.2) then an addition \texttt{aap.pruned.dot} is generated. The option \texttt{-G} additionally causes the AAP to terminate after partitioning the graph (as a result no \texttt{final.dot} file is generated).

4.6.5 Line Limit

Command line: \texttt{-l, --max-lines= #}

Default: [unlimited]

Specifies the maximum lines of code allowed in an output RPU.

4.6.6 Program Name

Command line: \texttt{-n, --program-name=<name>}

Default: \texttt{a.out}

Specifies the name to give the final compiled program in the generated Makefile.

4.6.7 Output Directory

Command line: \texttt{-o, --output=<path>}

Default: /tmp/RPUS/

Indicate where all output files should be placed.

4.6.8 Partitioner Type

Command line: \texttt{-p, --partitioner=<type>}

Default: MinCut

Options: MinCut, Recombine

Indicates partitioning options. The min-cut partitioner repeatedly divides the graph by cutting the minimum-weight (call count) edge. The recombine option invokes min-cut and then attempts to fuse some undersized RPU (a side-effect of the greedy min-cut algorithm).

4.6.9 Array Padding

The array padding transformation will rely on information from the PACE Resource Characteriza-
tion tools. Specifically, we are measuring access cost as a function of stride and will provide an interface to the compiler that takes a given array size and returns the nearest sweet spot for stride (above the size) and the expected decrease in access cost. RTS will provide some detailed information on which arrays have the most misses.

Command line: \texttt{--pad-arrays}

Default: [off]
When specified, activates array-padding analysis (see § 4.8).

### 4.6.10 RPU Graph

**Command line:** `-r, --rpu-graph=<path>`

**Default:** [off]

Specifies an RPU graph as input and implies that graph partitioning should not be done.

### 4.6.11 Verbose

**Command line:** `-v, --verbose`

**Default:** [off]

Generates extra status information during execution.

### 4.6.12 Profile

The final command line argument should be the path to the database of profile information produced by the RTS for this application. The AAP will use that profile information to drive the partitioning algorithm.

### 4.7 Build Options File Format

The compiler driver, when it creates the working directory, should construct a build options file, formatted in XML. Each input source file should be indicated with a `<file>` node with an attribute name `<path>`, the value of which should be the absolute path to said file. Each file node may optionally contain children `<option>` nodes with attributes `name` and `value`.

```xml
<?xml version="1.0"?>
<buildoptions>
  <file path="/path/to/file.c">
    <option name="-g" value="" />
    <option name="-O" value="3" />
  </file>
</buildoptions>
```

In addition to the individual file nodes, the build options can contain a `<globaloptions>` node containing some number of option nodes. These options apply to all source files.

### 4.8 Array Padding

The AAP is capable of detecting data arrays that can safely be padded for improved cache effects. The array padding analysis excludes any array where:

- The array is passed as a function argument
- The address of the array is taken
- The array is used in a cast expression
- An unsafe expression is used as index into the array

We consider an expression to be safe for array indexing as long as it is:

- A constant
• A variable assigned a constant value
• Simple increment or decrement of such variable
• Mathematical addition or subtraction of the any of the above

The AAP only reports arrays that are global to the program (e.g., that are used across multiple RPU). Data arrays that are only used within a single RPU are explicitly made static by the AAP to allow later PACE compiler stages to perform deeper, interprocedural analysis.

When array padding analysis is active, a report is generated as array-padding.xml in the output directory (see § 4.6.9). The paddable arrays are reported as children of a <array-uses> node. Each array is represented as an <array> node with a name attribute indicating the array name. Each array node will have one <rpu> node with a name attribute matching the RPU name.

Array padding must be applied consistently across all RPU that declare the array. To facilitate partial compilation the array padding analysis reports the sets of RPU that are mutually dependent. The array padding output includes a <rpu-dependencies> node containing some number of <rpu-group> nodes. Each rpu group is represented by some number of <rpu> nodes as described above.
Chapter 5
PACE Platform-Aware Optimizer
Overview

5.1 Introduction

Figure 1.2 provides a high-level overview of the PACE Compiler design. This chapter provides an overview of the design for the Platform-Aware Optimizer (PAO). The PAO component performs transformations and optimizations on a high-level, or near-source, representation of the code, which we will refer to as a high-level intermediate representation (HIR). The HIR enables a comprehensive set of analyses and transformations on both code and data. Because the Rose system serves as the primary infrastructure on which the PAO is build, the SAGE III IR from Rose serves as the HIR for PAO. The motivation for having separate Platform-Aware and Target-Aware Optimizers is to support both transformations that must be performed at the near-source level and transformations that might be better performed at a much lower level of abstraction.

5.2 Functionality

The PAO takes as input refactored program units (RPUs) generated by the Application-Aware Partitioner (AAP). It generates as output transformed versions of the RPU. The transformed versions have been optimized for measured characteristics of the target platform, identified by the Resource Characterization (RC) component, as well as by the configuration file for the target platform. Figure 5.1 illustrates the ways that PAO interacts with other parts of the system.

5.2.1 Input

The primary input for a single invocation of the PAO is C source code for an RPU, as generated by the AAP. Additional inputs (as shown in Figure 5.1) include compiler directives from the compiler driver, resource characteristics for the target platform, profile information with calling-context-based profile information for the source application, and TAO cost analysis feedback (Path 3 in Figure 5.2).

5.2.2 Output

As its output, the PAO produces a transformed HIR for the input RPU. The transformed code can be translated into either C source code or into the IR used in the Target-Aware Optimizer (TAO). This latter case uses the PAO→TAO IR translator, described in Chapter 8; the translator is also used to translate from the SAGE III IR to the LLVM IR in the in the PAO–TAO query mechanism, as shown in Figure 5.2.

Principal Contacts For This Chapter: Zoran Budimlić, zoran@rice.edu, and Vivek Sarkar, vsarkar@rice.edu
As described in § 3.3, the compiler driver invokes the PAO, passing the location of the working directory. The PAO consults the optimization plan for the application, and can modify its own optimization plan to determine how it should process the RPU (see § 3.2.3). It can also modify the optimization plan for the AAP across compilations to derive an RPU partitioning strategy that enables optimizations in the PAO; and can modify the optimization plan for the TAO within a single compilation, to instruct and constrain the TAO in its code transformations.

The two primary compilation paths through the PAO are shown as Path 1 and Path 2 in Figure 5.2.

Path 1 of Figure 5.2 shows how the PAO implements it’s part of both the full compilation path and the LLVM backend compilation path, as described in § 3.4. Along with optimized user code in SAGE III IR, the PAO produces auxiliary IR information, including profiling, aliasing, and dependence information. It may also amend the application’s optimization plan, which determines the code transformations performed by the PAO and TAO for this RPU. Next, the compiler driver invokes the PAO→TAO IR translator to convert the SAGE III IR into the LLVM IR used in the TAO. The translator associates the auxiliary information from the SAGE III IR with the new LLVM IR for the RPU. Finally, the compiler driver invokes the TAO, which uses the LLVM IR, the auxiliary information associated with it, and the optimization plan; the TAO performs further optimization and produces executable code for the RPU.

Path 2 of Figure 5.2, shows the flow of information when the PAO queries the TAO for cost estimates to guide its high-level transformations. To perform a TAO query, the PAO constructs a synthetic function for a specific code region. The synthetic function contains a transformed version of the application’s original code for which the PAO needs a cost estimate. The PAO uses the PAO→TAO IR translator to convert the synthetic function into LLVM IR and it invokes the TAO with a directive to use the query backend. Note that, on this path, the PAO directly invokes both the PAO→TAO IR...
5.3 Method

In this section, we include design details for the PAO component of the PACE compiler. These details follow from a basic design decision to use the Edison Design Group (EDG) front end for translating C source code, and the Rose transformation system with the SAGE III IR for performing high level transformations.

After the SAGE III IR is created by the front end (§ 5.3.1), the passes described below in Sections 5.3.2 - 5.3.5 are repeated until no further transformation is performed (a fixed point is reached) or until a predefined maximum number of iterations is reached. At that point, a transcription phase produces either transformed C source code or LLVM IR bitcode (see § 5.3.6).

The PACE design enforces a strict separation of concerns among three aspects of performing each transformation: 1) Legality analysis, 2) Cost analysis, and 3) IR update. This design makes it possible to modify, replace, or consider multiple variants of any one aspect of a transformation without affecting the others. As described in Chapters 6 and 7, there are two main modes of implementing transformations in the PAO. Chapter 6 summarizes the use of a polyhedral transformation framework that is capable of selecting a combination of transformations in a single unified step.
Chapter 7 summarizes the classical transformations that the PAO can apply directly to the SAGE III IR. These two modes complement each other since neither one subsumes the other.

5.3.1 Front end

The PAO relies on the EDG front end to parse the input C source code and to translate the program to a SAGE III IR. The PACE Project will not modify the EDG front-end, so we can use pre-compiled binaries of the EDG front-end for distribution.

5.3.2 Program Analyses

Before the PAO begins to explore a wide space of transformations to optimize the input program, it needs to perform some canonical program analyses to broaden the applicability of transformations as much as possible. These analyses include:

- Global Value Numbering identifies equivalent values.
- Constant Propagation identifies compile-time constants.
- Induction Variable Analysis identifies induction variables; it builds the two prior analyses.
- Unreachable Code Elimination identifies code that has no feasible execution path—that is, no path from procedure entry reaches the code.
- Dead Code Elimination identifies code that creates values that are never used.

The PAO includes an extensive framework for polyhedral analysis and transformation, described in Chapter 6. That framework depends in critical ways on the fact that the code has already been subjected to the canonical analyses outlined above. In many cases, a loop nest as expressed in the input application may not a priori satisfy the constraints to be a Static Control Part (SCoP) in the polyhedral framework ( § 6.2.1). Often, the canonical analyses transform such loop nests into a form where they satisfy the constraints needed for application of the polyhedral techniques.

Even though the canonical analyses are all well understood, their implementation in the PAO poses new research challenges for four reasons. First, the analyses in the PAO must handle both array and pointer accesses whereas previous implementations have typically restricted their attention to scalar variables and their values. Second, the PAO must analyze code that is already in parallel form (OPENMP and MPI extensions to C) whereas prior work assumes sequential code. Third, the PAO will attempt to combine these analyses to the fullest extent possible. Prior work has shown that combining analyses can produce better results than computing them independently; again, prior work has explored combined analysis of scalar variables and values. Finally, the PAO will build analyses that can be incrementally updated after each transformation is performed, which should significantly reduce the overall cost of the analyses.

The polyhedral framework also plays a role in the process of identifying, transforming, and expressing loops that can execute in vector form. The polyhedral framework identifies such loops in a natural way. It computes data dependence information for the memory accesses in the loop and its surrounding code and identifies consecutive memory accesses. This information, created in the PAO’s polyhedral analysis framework, is passed to the TAO as annotations to the IR by the PAO→TAO IR translator. The TAO uses the information to generate vector code. Appendix A provides a system-wide view of vectorization in the PACE Compiler.

5.3.3 Legality Analysis

A necessary precondition before a compiler can perform a code transformation is to check the legality of the transformation. Legality conditions for a number of well-known transformations have
5.3. METHOD

been summarized in past work e.g., [3, 62]. We summarize below the legality conditions for many of the transformations that will be performed by the PAO. While much of the research literature focuses on the data dependence tests for legality, it is also necessary to check control dependence and loop bound conditions to ensure correctness of loop transformations. The following list summarizes the legality conditions for the transformations described in this chapter.

- Loop interchange: no resulting negative dependence vector, counted loop with no premature exit, loop-invariant or linear loop bounds with constant stride (or loop invariant-loop bounds with loop-invariant stride for loops being interchanged)
- Loop tiling: same conditions as loop interchange for all loops being permuted
- Unroll-and-jam: same data dependence conditions as loop tiling, but loop bounds must be invariant
- Loop reversal: no resulting negative dependence vector, counted loop with no premature exit, arbitrary loop bounds and stride
- Unimodular loop transformation (includes loop skewing): counted loop with no premature exit, no loop-carried control dependence, loop-invariant or linear loop bounds with constant stride
- Loop parallelization: no resulting negative dependence vector, counted loop with no premature exit, arbitrary loop bounds and stride
- Loop distribution: no control + data dependence cycle among statements being distributed into separate loops
- Loop fusion: no loop-independent dependence that prevents code motion to make fused loops adjacent, and no loop-carried fusion-preventing dependence
- Scalar replacement: no interfering aliased locations
- Constant replacement: variable must have propagated constant value on all paths
- Scalar renaming, scalar expansion, scalar privatization: no true dependences across renamed locations
- Unreachable code elimination: no feasible control path to code
- Useless (dead) code elimination: no uses of defs being eliminated
- Polyhedral transformation: input loop nest must form a “Static Control Part” (SCoP); see § 6.2.1 for more details

5.3.4 Cost Analysis: Memory Hierarchy

The other precondition that the compiler must satisfy before it performs some code transformation is to check the profitability of that transformation via cost analysis. Cost analysis will play a more central role in the PACE Compiler than in many earlier compilers, because the compiler has better knowledge of the performance characteristics of the target machine, as measured by the RC tools. One particular challenge is to perform effective and accurate memory cost analysis on an HIR such as the SAGE III IR.

Consider the lowest (level 1) levels of a cache and TLB. The job of memory cost analysis is to estimate the number of distinct cache lines and distinct pages accessed by a (hypothetical) tile
of \( t_1 \times \ldots \times t_h \) iterations, which we define as \( DL_{total}(t_1, \ldots, t_h) \) and \( DP_{total}(t_1, \ldots, t_h) \), respectively. Assume that the tile sizes are chosen so that \( DL_{total} \) and \( DP_{total} \) are small enough so that no collision and capacity misses occur within a tile i.e., \( DL_{total}(t_1, \ldots, t_h) \leq \) effective cache size and \( DP_{total}(t_1, \ldots, t_h) \leq \) effective TLB size.

An upper bound on the memory cost can then estimated be as follows:

\[
COST_{total} = (\text{cache miss penalty}) \times DL_{total} + (\text{TLB miss penalty}) \times DP_{total}
\]

Our objective is to minimize the memory cost per iteration, \( COST_{total}/(t_1 \times \ldots \times t_h) \). This approach can be extended to multiple levels of the memory hierarchy.

An upper bound on the memory cost typically leads to selection of tile sizes that may be conservatively smaller than empirically observed optimal values. Therefore, we will also pursue a lower bound estimation of memory costs in the PACE project, based on the idea of establishing a lower bound \( ML \), the minimum cache size needed to achieve any intra-tile reuse. In contrast to \( DL \), the use of \( ML \) leads to tile sizes that may be larger than empirically observed optimal values. The availability of both bounds provides a limited space for empirical search and auto-tuning as opposed to a brute-force exhaustive search over all possible tile sizes.

5.3.5 Cost Analysis: PAO-TAO Query Interface

The \texttt{hir} used in the \texttt{pao} simplifies both high-level transformations and analysis of costs in the memory hierarchy. There are, however, other costs in execution that can only be understood at a lower level of abstraction. Examples include register pressure (in terms of both \texttt{MAXLIVE} and \texttt{spill-cost}, see § 9.3.4), instruction-level parallelism, simdization, critical-path length, and instruction-cache pressure. To address these costs, the PACE Compiler includes a \texttt{pao-tao} query mechanism that lets the \texttt{tao} estimate various costs on specific code fragments. The \texttt{pao} passes the particular cost estimates it needs to the \texttt{tao} through a query data structure. The \texttt{pao} uses the results of such queries to guide the selection of various parameters for the high-level transformations.

To perform a query, the \texttt{pao} creates a synthetic function that encapsulates the transformed code from the region of interest and passes it, along with auxiliary information and a query, to the \texttt{tao}.

The auxiliary information includes information on aliases, data structure alignment, and runtime profile information from the \texttt{rts}. It passes this information to the \texttt{tao}, along with a directive to use the query backend rather than a code-generating backend. For simplicity, each query includes a single synthetic function. If the \texttt{pao} needs to submit multiple queries, they can be submitted separately as a sequence of independent single queries.

For each query, the query backend of the \texttt{tao} (invoked in-core\(^1\) with \texttt{pao}) uses the optimized low-level code and values from the \texttt{rc} tools to produce cost estimates for the synthetic function. The \texttt{tao} records its estimates in the query data structure and passes it back (Path 3 on Figure 5.2) to the \texttt{pao}.

We will evaluate the following three approaches for the synthetic function generation in the \texttt{pao}:

1. Cloning only of the user code region of interest as a separate function. This will require local variables from surrounding context to be passed as reference parameters. The advantage of this approach is that there will be minimal code duplication, the disadvantage is that it will lead to a less precise analysis of cost results due to the absence of the exact context for local variables.

\(^1\)The \texttt{tao} shares a process and an address space with the \texttt{pao} to facilitate \texttt{tao} access to \texttt{pao} generated annotations and data structures. This arrangement should reduce the cost of making a \texttt{pao-tao} query.
2. Cloning of the user code region of interest along with its control and data slice from the function. This will include full intraprocedural context for the slice of the code region. This approach would produce more precise query results, but there will be some pollution of cost information by other code regions in function due to conservative slicing.

3. Cloning of the entire function containing the specific user code region. This will include full intraprocedural context for the code region. This approach would yield the most precise query results, but there still will be some pollution of cost information by other code regions in the function.

We anticipate that the second approach will yield the best cost results, but the matter needs further study and evaluation. Open questions, such as how well the synthetic function captures effects from code that surrounds the synthetic function, complicate the situation. For example, the presence of values that are live across the code captured in the synthetic function but are not present in that region can introduce inaccuracy in the results. Ignoring these values will pollute the cost estimation, while isolating them, via some technique such as live-range splitting around the region, may alter the final code.

The PAO interprets query results and chooses transformation parameters based on TAO feedback, for example choosing unroll factors for each loop in a multidimensional loop nest.

5.3.6 Transcription

When the PAO has finalized its selection of transformations, the SAGE III IR is updated to finalize all transformations. At that point, a transcription phase uses the SAGE III IR to either generate source code or LLVM IR. The compiler driver then uses the PAO→TAO IR translator to generate LLVM bitcode as input input to the Target-Aware Optimizer (TAO).

5.3.7 The Optimization Plan

The PAO uses the application’s optimization plan both to obtain guidance in its own operation and to affect the processing of code in other parts of the PACE Compiler. The initial optimization plan is generated by the compiler driver (see §3.2.3). It may be modified by other components of the PACE system, including the PACE Machine Learning tools.

The PAO will modify the optimization plan for an application to guide the application of transformations in both the AAP and the TAO. It modifies the AAP optimization plan across compilations, and the TAO optimization plan within a compilation. For example, if analysis in the PAO and the PAO–TAO query interface shows that some loop nest has unsupportable demand for registers and that the loop nest was produced by inline substitution in the AAP, the PAO may direct the AAP to avoid inlining in that function or that RPU. Similarly, on a loop that the PAO identifies as vectorizable, the PAO may direct the TAO not to apply transformations, such as tree-height balancing, that might change the code structure and prevent vector execution.

5.3.8 PAO Parameters for Runtime System

The PAO might determine that certain parameters can most likely benefit from runtime optimization. The PAO will present the RTS with a closure that contains an initial parameter tuple, a specification of the bounds of the parameter tuple space, a generator function for exploring the parameter tuple space, and a parameterized version of the user’s function to invoke with the closure containing the parameter tuple and other state. Additional detail about the RTS interface for online feedback-directed parameter selection can be found elsewhere (§10.3.4).

To illustrate how the PAO will exploit RTS support for online feedback-directed parameter selection, consider the problem of Parametric Tiling described in §6.3.6. Here, the PAO recognizes the
need to block the code to improve performance in the memory hierarchy, but it cannot determine
the optimal tile size at compile time. In this case, the PAO will prepare the loop nest for runtime
tuning by constructing the inputs for the RTS online feedback-directed parameter selection API and
rewriting the code to use that facility.

5.3.9 Guidance from Runtime System

If the application has already been profiled, the RTS will provide the PAO with high-level quanti-
tative and qualitative guidance about runtime behavior. This information may include data on
resource consumption, on execution time costs, and on specific inefficiencies in the code (see
§ 10.2.3). The PAO will use this information to determine where to focus its efforts and how to
alter the optimization plan to improve overall optimization effectiveness.
Chapter 6
The Polyhedral Framework

The Polyhedral Optimization (PolyOpt) subsystem of PAO (Platform Aware Optimizer) is being developed to perform transformations such as fusion, distribution, interchange, skewing, shifting, tiling, etc. on affine loop nests. The polyhedral transformation approach is based on the Pluto system that has shown good promise for transformation of a number of affine computational kernels. PolyOpt is being implemented as a Sage AST to Sage AST transformer integrated with Rose.

6.1 Introduction

The Polyhedral Optimization (PolyOpt) subsystem of PACE is a component of the PAO (Platform Aware Optimizer). It will enable loop transformations such as fusion, distribution, interchange, skewing, shifting, tiling, etc. to be applied to affine loop nests in a program optimized by PACE. PolyOpt is integrated with Rose. It takes as input the Sage AST representation for the refactored partitioning units (RPU, Sec. 4.2) to be optimized by PolyOpt, identifies subtrees of the AST that represent affine computations, transforms those AST subtrees to a polyhedral representation, performs loop transformations using the polyhedral representation, and finally converts the polyhedral representation back to Sage AST. The Sage AST transformations performed by PolyOpt will be preceded and followed by other (non-polyhedral) transformations described in Chapter 7. The polyhedral transformation approach is based on the Pluto system [13, 14] that has shown great promise for transformation of a number of affine computational kernels.

6.1.1 Motivation

The polyhedral model [34] provides a powerful abstraction to reason about transformations on collections of loop nests by viewing a dynamic instance (iteration) of each assignment statement as an integer point in a well-defined space called the statement’s polyhedron. With such a representation for each assignment statement and a precise characterization of inter- and intra-statement dependences, it is possible to reason about the correctness of complex loop transformations. With the conventional abstractions for data dependences used in most optimizing compilers (including gcc and vendor compilers), it is extremely difficult to perform integrated model-driven optimization using key loop transformations such as permutation, skewing, tiling, unrolling, and fusion across multiple loop nests. One major challenge with AST-based loop transformation systems is the case of imperfectly nested loops; this is seamlessly handled in a polyhedral compiler transformation framework.
### 6.1.2 Background

The input to a transformation system based on the polyhedral model is a region of code containing a sequence of loop nests. Variables that are invariant in the region (e.g., array sizes) are referred to as parameters. The main constraints imposed on the region of code are as follows (see § 6.2.1 for a complete list of constraints). Loop bounds are affine functions (i.e., \( c_1 i_1 + \ldots + c_n i_n + c_{n+1}; c_k \) are compile-time constants) of loop iteration variables and parameters; this includes imperfectly nested and non-rectangular loops. Array index expressions are also affine functions of iteration variables and parameters. Such program regions are typically the most computation-intensive components of scientific and engineering applications, and they appear often in important real-world code [10].

A statement \( s \) surrounded by \( m \) loops is represented by an \( m \)-dimensional polyhedron referred to as an iteration space polyhedron. The coordinates of a point in this polyhedron (referred to as an iteration vector) correspond to the values of the loop indices of the surrounding loops. The polyhedron can be defined by a system of affine inequalities derived from the bounds of the loops surrounding \( s \); each point in the polyhedron corresponds to a run-time instance of \( s \).

A significant advantage of using a polyhedral abstraction of statements and dependences is that compositions of loop transformations have a uniform algebraic representation that facilitates integrated global optimization of multi-statement programs. In particular, it is possible to represent arbitrary compositions of loop transformations in a compact and uniform manner, and to reason about their cumulative effects through well-defined algebraic cost functions. In contrast, with the traditional model of data dependence that is used in most optimizing compilers, it is very difficult to model the effect of a sequence of loop transformations. Previous work using unimodular transformations and iteration-reordering transformations (see for instance [7, 70, 63]) were limited to modeling the effect of sequences of iteration-reordering transformations. However, they could not accommodate transformations that changed a loop body such as distribution and fusion, or transformations on imperfect loop nests.

Further, global optimization across multiple statements is not easily accomplished (e.g., transformation of imperfectly nested loops is a challenge). Phase ordering effects as well as rigid and less powerful optimization strategies are all factors that make syntax-based transformations of loop nests less powerful than polyhedral-based ones for optimizing affine loop nests [36].

### 6.2 Functionality

The polyhedral transformation framework in PACE takes as input the Sage ASTs for all functions in an input RPU. In each AST, it identifies code fragments (i.e., AST subtrees) that can be targets of polyhedral optimizations. Each such fragment is referred to as a Static Control Part (SCoP). Each SCoP is analyzed and transformed; the result is a new subtree which is then inserted in the place of the original subtree in the function’s AST.

#### 6.2.1 Static Control Part (SCoP) Code Fragments

A SCoP is an AST subtree with a particular structure which allows powerful polyhedral analyses and optimizations. A conceptual grammar for a SCoP can be defined as follows:\(^2\)

---

\(^1\)A hyperplane is an \( n - 1 \) dimensional affine subspace of an \( n \)-dimensional space. A half-space consists of all points of an \( n \)-dimensional space that lie on one side of a hyperplane (including the hyperplane); it can be represented by an affine inequality. A polyhedron is the intersection of finitely many half-spaces.

\(^2\)This is an abstract description of the structure of the code; an actual SCoP will, of course, respect the grammar of the C language.
6.2. FUNCTIONALITY

\[
\langle \text{SCoP} \rangle ::= \langle \text{ElementList} \rangle \\
\langle \text{ElementList} \rangle ::= \langle \text{Element} \rangle | \langle \text{Element} \rangle \langle \text{ElementList} \rangle \\
\langle \text{Element} \rangle ::= \langle \text{Statement} \rangle | \langle \text{Loop} \rangle | \langle \text{If} \rangle \\
\langle \text{Loop} \rangle ::= \text{for} \langle \text{IteratorVariable} \rangle = \langle \text{LowerBound} \rangle , \langle \text{UpperBound} \rangle \{ \langle \text{ElementList} \rangle \} \\
\langle \text{If} \rangle ::= \text{if} \langle \text{Expression} \rangle \text{comp_op} \langle \text{Expression} \rangle \\
\{ \langle \text{ElementList} \rangle \} \text{else} \{ \langle \text{ElementList} \rangle \}
\]

Expressions and statements Each loop bound must be an affine expression \( c_1 i_1 + \ldots + c_n i_n + c_{n+1} \) where \( c_k \) are compile-time constants. The two expressions compared in operator \( \text{comp_op} \) in an if-statement must also be affine. Inside an statement, each index expression \( e_k \) in an array access expression \( a[e_1] \ldots [e_d] \) (where \( a \) is a \( d \)-dimensional array) must be an affine expression.\(^3\)

For every statement, each expression which denotes a memory location must be a scalar variable \( x \) or an array access expression \( a[e_1] \ldots [e_d] \). No pointer dereferences \( *p \) or accesses to structure fields \( s.f \) or \( p->f \) are supported. Conservatively, function calls are also disallowed in a statement. It may be possible to relax this last constraint by allowing calls to side-effect-free functions, assuming that such function names are provided by external sources.

Iterators, parameters, and affine expressions All scalar variables that appear anywhere in the SCoP can be classified into three disjoint categories:

- Loop iterators; there must not be any reference to a loop iterator which is a write, beside the \( for \) loop increment
- Parameters: not iterators; there must not be any reference to a parameter which is a write
- Modified variables: all variables referenced in the scop that are not loop iterators nor parameters

Expressions in loop bounds, if-statements, and array access expressions must be affine in SCoP parameters and in iterators of surrounding SCoP loops. Checking that an expression is of the form \( c_1 i_1 + \ldots + c_n i_n + c_{n+1} \) (where \( c_k \) are compile-time constants) is not enough; variables \( i_k \) need to be SCoP parameters or iterators of surrounding SCoP loops.

6.2.2 SCoP Detection and Extraction of Polyhedra

A high-level diagram of framework components and their interactions is shown in Figure 6.1. The first component, described in this subsection, takes as input the Sage ASTs for all functions in the input RPU. Each function-level AST is analyzed to identify subtrees that satisfy the definition of SCoP described above. In addition to the subtree, the SCoP detection also identifies SCoP parameters and iterators. Once a proper subtree is identified, it is traversed to extract its polyhedral representation. This representation contains a set of statements (each one corresponding to \( \langle \text{Statement} \rangle \) from the conceptual grammar), a set of parameter names, and a set of array names. This representation is the output of the SCoP Detection / Extraction stage.

In the polyhedral representation of a SCoP, each statement is associated with a set of iterator names, a matrix encoding the polyhedron that is the statement’s iteration space, a matrix representing the array elements that are read by the statement, and a matrix representing the array elements that are written by the statement. Scalar variables are treated as a special type of array with a single element.

\(^3\)The PACE implementation should be able to handle general affine expressions in C code — e.g., \( 3 \times 1 - 1 \times 13 + (-5) \times j + (-(-4)) \). In all such expressions, variables and constants must be of C integer types [17, §6.2.5].
6.2.3 Polyhedral Dependence Analysis with Candl

A central concept of program optimization is to preserve the semantics of the original program through the optimization steps. Obviously, not all transformations, and hence not all affine schedules (i.e., orderings of statement instances), preserve the semantics for all programs. To compute a legal transformation, we resort to first extracting the data dependences expressed in a polyhedral representation. This information is later used to constrain the schedules to ensure that they respect the computed dependences. The polyhedral dependence analysis stage takes as an input the polyhedral representation of a SCoP, and extends its content with data dependence information.

Candl, the Chunky ANalyzer for Dependences in Loops, is an open-source tool for data dependence analysis of static control parts [18]. To capture all program dependences, Candl builds a set of dependence polyhedra, one for each pair of array references accessing the same array cell (scalars being a particular case of array), thus possibly building several dependence polyhedra per pair of statements. A point in a dependence polyhedron encodes a pair of iteration vectors for which the dependence may occur. Given the polyhedral representation of the input RPU, Candl outputs the polyhedral dependence graph. It is a multi-graph with one node per statement, and an edge $e^{R \rightarrow S}$ labeled with a dependence polyhedron $\mathcal{D}_{R,S}$, for each dependence.

6.2.4 Pluto Transformation Generator

The polyhedral representation allows the expression of arbitrarily complex sequences of loop transformations. The downside of this expressiveness is the difficulty of selecting an efficient optimization that includes tiling together with fusion, distribution, interchange, skewing, permutation and shifting [36, 56]. The Pluto transformation stage takes as an input the polyhedral representation enriched with dependence information, and outputs a modified polyhedral representation where the original statement schedules have been replaced by those computed by Pluto.

Pluto is an automatic transformation open-source tool that operates on the polyhedral representation [55]. It outputs schedules (combinations of transformations) to be later applied by the
code generator. Pluto performs transformations for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but is not limited to it. OpenMP-like parallel code for multicores can be automatically generated from sequential code. Outer, inner, or pipelined parallelization is achieved, besides register tiling and exposing inner parallel loops for subsequent vectorization (see A for details about code vectorization).

6.2.5 Polyhedral Code Generation with CLooG

Code generation is the final step of polyhedral optimization. This stage takes as an input the polyhedral representation of SCoP enriched with the schedules computed by Pluto, and outputs a code fragment in CLooG’s internal syntactic representation, CLAST. The open-source CLooG code generator [8, 21] applies the transformations specified by the affine schedules, and generates a CLAST abstract syntax tree corresponding to the transformed code.

6.2.6 Parametric Tiling

Tiling is a crucial transformation for achieving high performance, especially with deep multi-level memory hierarchies. The tiling phase will take place inside the code generation stage, that is, it will take as an input a polyhedral representation of SCoP enriched with the schedules computed by Pluto, and output a CLAST tree being the result of the transformations embodied in the schedules combined with tiling. Tiling is a well known technique for improving data locality and register reuse. It has received a lot of attention in the compiler community. However, the majority of work only addresses the tiling of perfectly nested loops. The few systems that can automatically generate tiled code for imperfectly nested loops require that tile sizes be compile-time constants. The PolyOpt system will incorporate parametric tiling capability, where tile sizes do not have to be compile-time constants. Parametric tiled code will be passed by the PAO to the RTS (as discussed in Sec. 5.3.8) to determine the best tile sizes for the target platform.

6.2.7 Translation to Sage ASTs

The final stage of PolyOpt consists in translating the CLAST representation into a Sage AST, and reinserting this Sage AST in the RPU in place of the original Sage subtree for the SCoP. The result of the code generation in CLooG is represented using the CLooG IR, which provides enough information to generate an equivalent Sage AST subtree. The resulting modified Sage AST is indistinguishable from the “normal” ASTs generated by Rose’s front end, and can be used as input to other components of the PAO system.

6.3 Method

6.3.1 SCoP Detection and Extraction of Polyhedra

Given a Sage AST, a bottom-up traversal is used to identify AST subtrees that correspond to SCoPs. Since SCoPs cannot be nested, as soon as a node breaks the SCoP definition then none of its ancestor can be in a SCoP. During the traversal, when several sibling subtrees satisfy the SCoP definition, an attempt is made to create a larger SCoP encompassing these subtrees. At the end of this process, there may be several disjoint SCoP detected. Each one is independently subjected to the processing steps described in this section.

For each top-level element of the SCoP, a bottom-up traversal is performed for the Sage AST rooted at that node. During the traversal, information about upper/lower loop bounds is collected (represented as vectors that encode the affine constraints). Similarly, vectors encoding the read/write accesses of array elements are constructed. When all children of a node have been traversed,
their data is combined as necessary, based on the context of the node. When the root node of the subtree is reached, all necessary information for each statement appearing in the subtree has been collected.

6.3.2 Polyhedral Dependence Analysis with Candl

Data dependence representation Two executed statement instances are in a dependence relation if they access the same memory cell and at least one of these accesses is a write operation. For a program transformation to be correct, it is necessary to preserve the original execution order of such statement instances and thus to know precisely the instance pairs in the dependence relation. In the algebraic program representation described earlier, it is possible to characterize exactly the set of instances in dependence relation in a synthetic way.

Three conditions have to be satisfied to state that an instance $\vec{x}_R$ of a statement $R$ depends on an instance $\vec{x}_S$ of a statement $S$. (1) They must refer to the same memory cell, which can be expressed by equating the subscript functions of a pair of references to the same array. (2) They must be actually executed, i.e. $\vec{x}_S$ and $\vec{x}_R$ have to belong to their corresponding iteration domains. (3) $\vec{x}_S$ is executed before $\vec{x}_R$ in the original program.

Each of these three conditions may be expressed using affine inequalities. As a result, exact sets of instances in dependence relation can be represented using affine inequality systems. The exact matrix construction of the affine constraints of the dependence polyhedron used in PolyOpt was formalized by Bastoul and Feautrier [9, 11].

```plaintext
for (i = 0; i <= n; i++) {
    s[i] = 0; // statement R
    for (j = 0; j <= n; j++)
        s[i] = s[i] + a[i][j] * x[j]; // statement S
}
```

Figure 6.2: matvect kernel

For instance, if we consider the matvect kernel in Figure 6.2, dependence analysis gives two dependence relations: $\delta_{R,S}$ for instances of statement $S$ depending on instances of statement $R$ (e.g., $R$ produces values used by $S$), and similarly, $\delta_{S,S}$.

For Figure 6.2, dependence relation $\delta_{R,S}$ does not mean that all instances of $R$ and $S$ are in dependence—that is, the dependence does not necessarily occur for all possible pairs of $\vec{x}_R$ and $\vec{x}_S$. Let $\vec{x}_R = (i_R)$ and $\vec{x}_S = (i_S, j_S)$. There is a dependence from $R$ to $S$ only when $i_R = i_S$. We can then define a dependence polyhedron, being a subset of the Cartesian product of the iteration domains, containing all values of $i_R$, $i_S$, and $j_S$ for which the dependence exists. We can write this polyhedron in matrix representation: the first line represents the equality $i_R = i_S$, the next two encode the constraint that vector $(i_R)$ must belong to the iteration domain of $R$ and similarly, the last four state that vector $(i_S, j_S)$ belongs to the iteration domain of $S$:

$$D_{R,S}:
\begin{bmatrix}
1 & -1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & -1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & -1 & 1 & 0
\end{bmatrix}
\cdot
\begin{bmatrix}
i_R \\
i_S \\
j_S \\
n \\
1
\end{bmatrix}
\geq
0$$

To capture all program dependences we build a set of dependence polyhedra, one for each pair...
6.3. METHOD

of array references accessing the same array cell (scalars being a particular case of array), thus possibly building several dependence polyhedra per pair of statements. The polyhedral dependence graph is a multi-graph with one node per statement. An edge $e^{R \rightarrow S}$ is labeled with a dependence polyhedron $D_{R,S}$, for all dependence polyhedra.

A dependence polyhedron is the most refined granularity to represent a dependence. However, for the cases where this precision is not needed it is easy to rebuild a more abstract and less detailed dependence information from the polyhedral dependence graph. For instance, one can generate a simple graph of dependent memory references, or rebuild the dependence distance vectors by extracting some properties of the dependence polyhedra.

**Dependence analysis in Candl** The Candl software was written by Bastoul and Pouchet. It implements the construction of the complete polyhedral dependence graph of a given static control part. The algorithm to compute all polyhedral dependences simply constructs the dependence polyhedron for each pair of references to the same array, for all program statements. The polyhedron is then checked for emptiness. If it is empty then there is no dependence between the two considered references. Otherwise there is a (possibly self) dependence between the two references.

6.3.3 Pluto Transformation Generator

The tiling hyperplane method [13, 14] is a model-driven technique that seeks to optimize a SCoP through transformations encompassing complex compositions of multi-dimensional tiling, fusion, skewing, interchange, shifting, and peeling.

**Representing optimizations** A transformation in the polyhedral framework captures in a single step what may typically correspond to a sequence of numerous textbook loop transformations [36]. It takes the form of a carefully crafted affine multidimensional schedule, together with iteration domain and/or array subscript transformations.

In the tiling hyperplane method, a given loop nest optimization is defined by a multidimensional affine schedule. Given a statement $S$, we use an affine form on the surrounding loop iterators $\vec{x}_S$. It can be written as

$$\Phi^S(\vec{x}_S) = C^S(\vec{x}_S)$$

where $C^S$ is a matrix of non-negative integer constants. The instance of $S$ defined by iteration vector $\vec{x}_S$ is scheduled at multidimensional date $\Phi^S(\vec{x}_S)$. Multidimensional dates can be seen as logical clocks: the first dimension corresponds to days (most significant), next one is hours (less significant), the third to minutes, and so on. Note that every static control program has a multidimensional affine schedule [34], and that any loop transformation can be represented in the polyhedral representation [70].

Let $\phi^S_i$ be the $i^{th}$ row of $C^S$. A row is an affine hyperplane on the iteration domain of $S$. For $S$ with $m_S$ surrounding loop iterators, let

$$\phi^S_i = [c^S_1 c^S_2 \ldots c^S_{m_S} c^S_0]$$

Here $c^S_i$ are integer constants; $c^S_0$ is the coefficient attached to the scalar dimension.

**The tiling hyperplane method** Intuitively, the tiling hyperplane method computes an affine multidimensional schedule [34] for the SCoP such that parallel loops are at the outer levels, and loops with dependences are pushed inside [13, 14], and at the same time, maximizing the number of dimensions that can be tiled. The method proceeds by computing the schedule level by level, from the outermost to the innermost. Specifically, affine hyperplanes with special properties are computed, one for each row of the scheduling matrix. Such specific hyperplanes are called tiling hyperplanes.
Computing valid tiling hyperplanes

Let $S = \{S_1, S_2, \ldots, S_n\}$ be the set of statements of the SCoP. Let $G = (V, E)$ be the data dependence graph for the original SCoP. $G$ is a multi-graph with $V = S$ and $E$ being the set of data dependence edges. Notation $e^{S_i \rightarrow S_j} \in E$ denotes an edge from $S_i$ to $S_j$, but we will often drop the superscript on $e$ for readability. For every edge $e \in E$ from $S_i$ to $S_j$, $D_{S_i, S_j}$ is the corresponding dependence polyhedron.

Tiling along a set of dimensions is legal if it is legal to proceed in fixed block sizes along those dimensions. This requires dependences to not be backward along those dimensions [41, 58, 13].

$\{\phi_{S_1}, \phi_{S_2}, \ldots, \phi_{S_k}\}$ is a legal (statement-wise) tiling hyperplane if and only if the following holds true for each dependence edge $e^{S_i \rightarrow S_j} \in E$:

$$\phi_{S_j}(\vec{x}_{S_j}) - \phi_{S_i}(\vec{x}_{S_i}) \geq 0, \quad \langle \vec{x}_{S_i}, \vec{x}_{S_j} \rangle \in D_{S_i, S_j} \quad (6.1)$$

Cost model for computing the tiling hyperplanes

There are infinitely many hyperplanes that may satisfy the above criterion. An approach that has proved to be simple, practical, and powerful has been to find those directions that have the shortest dependence components along them [13].

For polyhedral code, the distance between dependent iterations can always be bounded by an affine function of the SCoP parameters (represented as a $p$-dimensional vector $\vec{n}$).

$$\forall \langle \vec{x}_{S_i}, \vec{x}_{S_j} \rangle \in D_{S_i, S_j}, \delta_e(\vec{x}_{S_i}, \vec{x}_{S_j}) = \phi_{S_j}(\vec{x}_{S_j}) - \phi_{S_i}(\vec{x}_{S_i})$$

$$\forall \langle \vec{x}_{S_i}, \vec{x}_{S_j} \rangle \in D_{S_i, S_j}, \forall e \in E, \vec{u} \in \mathbb{N}^p, w \in \mathbb{N}, u.\vec{n} + w - \delta_e(\vec{x}_{S_i}, \vec{x}_{S_j}) \geq 0 \quad (6.2)$$

The legality and bounding function constraints from (6.1) and (6.2) respectively are cast into a formulation involving only the coefficients of $\phi$’s and those of the bounding function by application of the Farkas Lemma [34]. Coordinates of the bounding function are then used as the minimization objective to obtain the unknown $\phi$’s coefficients.

$$\text{minimize} \quad (u, w, \ldots, c_i, \ldots) \quad (6.3)$$

This cost function is geared towards maximal fusion. This allows to minimize communication and maximize locality on the given set of statements. The resulting transformation is a complex composition of multidimensional loop fusion, distribution, interchange, skewing, shifting and peeling. Finally, multidimensional tiling can be applied on all permutable bands.

Enabling vectorization

Due to the nature of the optimization algorithm, even within a local tile (L1) that is executed sequentially, the intra-tile loops that are actually parallel do not end up being outer in the tile: this goes against vectorization of the transformed source for which we rely on the native compiler. Also, the polyhedral tiled code is often complex for a compiler to further analyze and say, permute and vectorize. Hence, as part of a post-process in the transformation framework, a parallel loop is moved innermost within a tile, and annotations are used to mark the loop as vectorizable (see A for details about code vectorization). Similar reordering is possible to improve spatial locality that is not considered by our cost function due to the latter being fully dependence-driven. Note that the tile shapes or the schedule in the tile space is not altered by such post-processing.

6.3.4 Polyhedral Code Generation with CLooG

The code generation stage generates a scanning code of the iteration domains of each statement with the lexicographic order imposed by the schedule. Statement instances that share the same date are typically executed under the same loop, resulting in loop fusion. Scanning code is typically
6.3. METHOD

an intermediate, AST-based representation that can be translated to an imperative language such as C or FORTRAN.

For many years this stage has been considered to be one of the major bottlenecks of polyhedral optimization, due to the lack of scalability of the code generation algorithms. Eventually the problem was addressed by the work of Bastoul [8, 9] who proposed an extended version of Quilleré’s algorithm [57] that significantly outperformed previously implemented techniques such as the ones by Kelly et al. in the Omega framework [43] or by Griebl in the Loopo framework [37]. The only constraints imposed by the CLooG code generator are (1) to represent iteration domains with a union of polyhedra, and (2) to represent scheduling functions as affine forms of the iteration domain dimensions. This general setting removes previous limitations such as schedule invertibility [6].

Code generation time is a function of the number of statement domains and the depth of the loop structure to be generated. Polyhedral tiling can be performed directly with CLooG when using constant (i.e., scalar) tile sizes, by extending the dimensionality of the iteration domains. This approach significantly increases code generation time, because of the higher number of domain dimensions. In PolyOpt this problem is avoided by the parametric tiling approach: the domain dimension is not extended before using CLooG, but instead after the polyhedral code generation. Preliminary results demonstrating the strong scalability gain for code generation time can be found in [39].

The CLooG code generator is unanimously considered the state-of-the-art polyhedral code generator, as it implements the latest and most scalable algorithm for code generation [8]. Given the polyhedral representation of the SCoP together with the schedules computed by Pluto, it outputs a CLooG AST in an internal representation referred to as CLAST. This representation is then translated back into a Sage AST.

6.3.5 Translation to Sage ASTs

The translation of a CLooG AST to a Sage AST is based on a bottom-up traversal of the CLooG CLAST IR, which involves

- re-mapping of control structures and expressions
- mapping back to symbols that existed in the original program (e.g., names of arrays and parameters)
- introduction of new symbols in symbol tables (e.g., new iterators)
- rewriting of array index expressions and loop bounds in terms of the new iterators

6.3.6 Parametric Tiling

Before providing a detailed description of our approach to parametric tiling imperfectly nested loops, we first use a number of examples, figures and pseudocode fragments to explain the key ideas behind the approach.

We first begin by discussing the approach to generation of parametric tiles in the context of perfectly nested loops. We then discuss the conditions under which we can tile imperfectly nested loops, followed by a sketch of our approach to geometric separation of tiles for imperfectly nested loops.

**Parametric Tiling for a Single Statement Domain**

Consider the simple 2D perfectly nested loop shown in Figure 6.3(a). The perfect loop nest contains an inner loop $j$ whose bounds are arbitrary functions of the outer loop variable $i$. Consider a non-rectangular iteration space displayed in Figure 6.3(d), corresponding to the perfect loop nest in this
for (i=lb;i<ubi;i+=sti) {
    for (j=lbj(i);j<=ubj(i);j+=stj)
        S(i,j);
}

(a) A perfectly nested loop

/* tiled loop */
for (it=lb;i<ubi-(Ti-sti);it+=Ti) {
    ... (code tiled along j) ...
}
/* epilog loop */
for (i=it;i<ubi;i+=sti) {
    for (j=lbj(i);j<=ubj(i);j+=stj)
        S(i,j);
}

(b) Tiling the outermost loop i

for it
    [scan code to obtain lbv, ubv]
if (lbv < ubv) {
    [prolog j]
    [tiled j]
    [epilog j]
} else {
    [untiled j]
}
[epilog i]

(c) After tiling loops i and j

/* tiled i */
for (it=lb;i<ubi-(Ti-sti);it+=Ti) {
    /* scan code */
    lbv = MIN_INT;
    ubv = MAX_INT;
    for (i=it;i<it+(Ti-sti);i+=sti) {
        lbv = max(lbv, lbj(i));
        ubv = min(ubv, ubj(i));
    }
    if (lbv < ubv) {
        /* prolog */
        for (i=it;i<it+(Ti-sti);i+=sti) {
            for (j=lbj(i);j<ubv-stj;j+=stj)
                S(i,j);
            /* tiled */
            for (j=lbj(i);j<ubv-(Tj-stj);j+=Tj)
                for (i=it;i<it+(Ti-sti);i+=sti) {
                    for (j=jt;j<Tj-stj;j+=stj)
                        S(i,j);
                }
        }
    }
    else {
        /* untiled */
        for (i=it;i<it+(Ti-sti);i+=sti) {
            for (j=lbj(i);j<ubj(i);j+=stj)
                S(i,j);
        }
    }
[epilog i]

(d) Iteration space (tiled along i axis)

(e) Iteration space (tiled along i and j axes)

(f) Detailed parametric tiled code

Figure 6.3: Parametric tiling of a perfectly nested loop
example. Since loop $i$ is outermost, strip-mining or tiling this loop is straightforward and always legal (that is, to partition the loop $i$'s iteration space into smaller blocks whose size is determined by the tile size parameter $T_i$). Figure 6.3(d) shows the partitioning of the iteration space along dimension $i$. Figure 6.3(b) shows the corresponding code structure, with a first segment covering as many “full” tile segments along $i$ as possible (dependent on the parametric tile size $T_i$). The outer loop in the tiled code is the inter-tile loop that enumerates all tile origins. Following the full-tile segment is an epilog section that covers the remainder of iterations (to be executed untiled). The loop enumerates the points within the last incomplete group of outer loop iterations that did not fit in a complete $i$-tile of size $T_i$.

For each tiling segment along $i$, full tiles along $j$ are identified. For ease of explanation, we show a simple “explicit scanning” approach to finding the start of full tiles, but the actual implementation will compute it directly for affine loop bounds by evaluating the bound functions at corner points of the outer tile extents. The essential idea is that the largest value for the $j$-lower bound ($lbv$) is determined over the entire range of an $i$-tile and it represents the earliest possible $j$ value for the start of a full $ij$ tile. In a similar fashion, by evaluating the upper-bound expressions of the $j$ loop, the highest possible $j$ value ($ubv$) for the end of a full tile is found. If $lbv$ is greater than $ubv$, no full tiles exist over this $i$-tile range. In Figure 6.3(e), this is the case for the last two $i$-tile segments. For the first $i$-tile segment in the iteration space (the second vertical band in the figure, the first band being outside the polyhedral iteration space), $lbv$ equals $ubv$. For the next two $i$-tile segments, we have some full tiles, while the following $i$-tile segment has $ubv$ greater than $lbv$ but by a smaller amount than the tile size along $j$.

The structure of the tiled code is shown in abstracted pseudo-code in Figure 6.3(c), and with explicit detail in Figure 6.3(f). At each level of nesting, for a tile range determined by the outer tiling loops, the $lbv$ and $ubv$ values are computed. If $ubv$ is not greater than $lbv$, an untiled version of the code is used. If $lbv$ is less than $ubv$, the executed code has three parts: a prolog for $j$ values up to $lbv - stj$ (where $stj$ is the loop stride in $j$ dimension), an epilog for $j$ values greater than or equal to $jt$ (where $jt$ is the inter-tile loop iterator in $j$ dimension), and a full-tile segment in between the prolog and epilog, to cover $j$ values between the bounds. The code for the full-tile segment can be generated using a recursive procedure that traverses the levels of nesting. The detailed tiled code for this example is shown in Figure 6.3(f).

The separation of partial tiles and full tiles thanks to nested prolog-epilog regions leads to generating loops with a simple control (ie, the loop bounds have a low computational cost). The drawback is a potentially exponential code size. An alternative to control the code size explosion is to use complex min/max expressions into the loop bounds directly, such that a single loop in the target code implements several partial tile cases. There is a trade-off between 1) preventing code size explosion using the prolog-epilog approach which pollutes the instruction cache, and 2) generating smaller code size but then introducing a higher control overhead in computing the loop bounds. We plan to investigate this trade-off, to tune a parametric tiling code generation heuristic for full tile separation that would maximize performance.

Tiling of Multi-Statement Domains

The iteration-space view of legality of tiling for a single statement domain is expressed as follows: a hyperplane $H$ is valid for tiling if $Hd_i \geq 0$ for all dependence vectors $d_i$ [41]. This condition states that all dependences are either along the tiling hyperplane or enter it from the same side. This sufficient condition ensures that there cannot be cyclic dependences between any pair of tiles generated using families of hyperplanes that each satisfy the validity condition.

For a collection of polyhedral domains corresponding to a multi-statement program (from imperfectly nested loops), the generalization of the above condition is: a set of affine-by-statement functions $\phi$ (corresponding to each statement in the program) represents a valid tiling hyperplane
if $\phi_i(t) - \phi_s(s) \geq 0$ for each pair of dependences $(s, t)$ [14]. The affine-by-statement function $\phi$ maps each instance of each statement to a point in a dimension of a target iteration space. A set of linearly independent $\phi$ functions maps each instance of each statement into a point in the multidimensional target space. If each $\phi$ function satisfies the above generalized tiling condition, the multi-statement program can be rectangularly tiled in the transformed target iteration space. If only a (contiguous) subset of the $\phi$ functions satisfies the generalized tiling condition, tiles can be formed using families of hyperplanes from that subset.

Efficient code generation for multi-statement domains was a significant challenge until the Quilleré algorithm [57] was developed. Its implementation in CLooG is now widely used for generating code for multi-statement domains. The Pluto system uses CLooG for generating (non-parametric) tiled code for imperfectly nested loop programs. However, Pluto cannot generate parametric tiled code for imperfectly nested loops. Our approach to parametric tiling of imperfect loop-nests will combine the power of the Quilleré algorithm (in sorting and separating polyhedra corresponding to multiple-statement domains) with a geometric approach for separating tile, using the AST structure generated by the Quilleré algorithm for non-tiled imperfectly nested loop code generation.

First an input program is transformed to a target domain using scattering functions that satisfy the above generalized tiling condition. For this purpose, the scattering functions generated by the Pluto system are used, but any set of schedules that satisfy the generalized tiling condition can be used instead. The imperfectly nested loop structure generated by use of the Quilleré algorithm is scanned to generate the tiled code structure as described in the next subsection.
6.4. RESULTS

Geometric Separation of Tiles for Overlapping Statement Domains

Figure 6.4 illustrates the approach to geometric tile separation. The imperfectly nested loop $i$ considered in this example contains two inner loops with loop bounds that are functions of loop iterator $i$ and other global parameters such as tile sizes and input problem sizes. The Quilleré algorithm generates efficient (non-tiled) loop code for multi-statement polyhedral domains arising from imperfectly nested loops. Where feasible, the sorting of polyhedra within the Quilleré algorithm enables separation of statements in the point-wise (non-tiled) code. The key tiling question for this two-statement example is: if the two statements $S_1$ and $S_2$ have been separated out in the point-wise code by the Quilleré algorithm, under what conditions can we also separate out tiles corresponding to these two statements? Our answer to this question is to use the lower and upper bound values for the two statements, (computed in a similar manner to the perfect-nest example above) and exploit the fact that all dependences are lexicographically non-negative in all the tiling dimensions (due to satisfaction of the generalized tiling condition).

For the example shown in Figure 6.4(a), since $lbv_1$ is less than $ubv_1$, we have a separable set of tiles for $S_1$, and since $lbv_2$ is less than $ubv_2$, there are also separable tiles for $S_2$. The prolog of $S_2$ and epilog of $S_1$ need to be combined and interleaved to ensure satisfaction of any dependences between $S_1$ to $S_2$ or vice versa. The pseudocode in Figure 6.4(c) shows the different possible cases to be considered and the code corresponding to the four combinations.

6.4 Results

The PolyOpt subsystem prototype is expected to be complete by September 2010 (end of project’s Phase 1).
Chapter 7
AST-based Transformations in the Platform-Aware Optimizer

7.1 Introduction and Motivation

This chapter summarizes the design of AST-based transformations in the Platform-Aware Optimizer (PAO). As discussed in Chapter 5, each of these transformations will be followed by incremental re-analysis. These transformations complement the polyhedral transformation framework described in Chapter 6, both by performing transformations on regions of code that are ineligible for polyhedral transformation (non-SCoP regions) and by performing transformations that are not included in the polyhedral framework (such as data transformations, idiom recognition, scalar replacement, and loop-invariant redundancy elimination). AST-based transformations contribute to the overall goal of the PAO to automate selection of an appropriate set of high level transformations for a given platform as viewed through the lens of the platform-specific resource characteristics derived by the PACE RC tools.

Most of the non-polyhedral transformations in the PAO are extensions of classical high-level loop and data transformations introduced in previous work [71, 3, 61]. However, there has been relatively little attention paid in past work to the question of which transformations should be automatically selected for optimizing performance, especially for the wide set of transformations available in the PAO’s arsenal. Automatic selection of transformations is a critical issue for the PACE Compiler because the developers of the PACE Compiler do not know a priori the performance characteristics of the target platform. In PACE, the compiler must adapt to new platforms using measured values for performance-critical system characteristics, where a vendor compiler can be carefully tailored to match a known processor release.

An important aspect of high level transformations in the PAO that distinguishes them from many lower-level transformations in the TAO is that most high level transformations are reversible and, if improperly selected, can degrade performance just as effectively as they improve performance. For example, loop interchange can improve the cache locality of a loop nest with a poor loop ordering, but it can also degrade the performance of a well-tuned loop nest. In contrast, while the performance improvement obtained by traditional lower-level optimizations (e.g., operator strength reduction) can vary depending on the source program and target platform, such optimizations typically do not significantly degrade performance.

Our overall approach to address this issue is to leverage the separation of concerns mentioned in Chapter 5 among Legality Analysis, Profitability Analysis, and IR Transformation, and to use a quantitative approach to profitability analysis. The problem of selecting high level transformations is decomposed into different optimization problems that address the utilization of different classes of...
of hardware resources (e.g., memory hierarchy, inter-core parallelism, intra-core parallelism). The formulations of the optimization problems are based on quantitative cost models, which are built on measured characteristics of the target system and application characteristics that include measured context-sensitive profiles. Multiple transformations may be used to optimize a single class of hardware resources (e.g., loop interchange, tiling and fusion may all be used in tandem to improve memory hierarchy locality), and a single transformation may be employed multiple times for different resource optimizations (e.g., the use of loop unrolling to improve both register locality and instruction-level parallelism).

7.2 Functionality

As described in § 5.2, the PAO takes as input refactored program units (RPUs) generated by the Application-Aware Partitioner (AAP), and generates as output transformed versions of each RPU using a combination of polyhedral and AST-based transformations.

7.2.1 Input

The primary input for a single invocation of the AST-based transformer is the HIR (SAGE III IR) for an RPU, as generated by the AAP. Additional inputs (as shown in Figure 5.1) include compiler directives from the optimization plan, resource characteristics for the target platform, profile information with calling-context-based profile information for the source application, and TAO cost analysis feedback (Path 3 in Figure 5.2).

7.2.2 Output

As its output, the AST-based transformer produces a transformed HIR for the input RPU. The transformed code can be translated into either C source code or into the IR used in the Target-Aware Optimizer (TAO). This latter case uses the PAO–TAO IR translator, described in Chapter 8; the translator is also used in the PAO–TAO query mechanism, as shown in Figure 5.2.

7.3 Method

The overall structure of AST-based transformations in the PAO for a single function is as follows. Though not listed explicitly, the incremental program reanalysis described in § 7.3.6 is assumed to be performed after each transformation listed below. The transformations described below will be performed on all functions within the RPU, starting with entry functions (functions called from other RPU’s), and transitively traversing the call graph within the RPU.

1. Perform function inlining and path duplication within an RPU. This step goes beyond code duplication performed by the AAP, and is driven by context-sensitive and path-sensitive execution profiles obtained by the PACE Runtime System.

2. Perform canonical program analyses. As indicated in Chapter 5, these analyses include Global Value Numbering, Constant Propagation, and Induction Variable Analysis. This analysis information will be updated incrementally, whenever a transformation is performed by a later step.

3. Perform preprocessing transformations. The purpose of this step is to increase opportunities for subsequent polyhedral and non-polyhedral transformations. It will start with a clean-up phase that includes Unreachable Code Elimination, Dead Code Elimination, and to separate SCoP-compliant and non-SCoP-compliant statements into separate loop nests as far as possible. (SCoP stands for “Static Control Part”, and represents a loop nest that is amenable
to polyhedral transformations. See § 6.2.1.) It will also attempt to maximize the number of enclosing perfectly nested loops for each statement.

4. Identify SCoPs in each function, and invoke the PolyOpt component separately for each SCoP. As described in Chapter 6, the PolyOpt component performs a number of loop transformations on each SCoP including fusion, distribution, interchange, skewing, permutation, shifting and tiling, in addition to identifying vectorizable loops that are marked as such and passed to TAO for vectorization described in Appendix A.

5. Perform the following steps for each maximal non-SCoP loop nest in the IR

(a) Pattern-driven Idiom Recognition — if the loop nest is found to match a known library kernel (or can be adapted to make the match), then replace the loop nest with the appropriate library call. This transformation can be applied even to the SCoP-compliant loop nests, and it may even lead to a better code than polyhedral transformations. We will evaluate both alternatives. More details are given in § 7.3.1.

(b) Loop Privatization — create private per-iteration copies of scalar and array variables, when legal to do so.

(c) Locality optimization — use the measured characteristics of the target machine’s memory hierarchy (from the PACE RC tools) to select a set of interchange, tiling and fusion transformations to optimize locality (with support from other iteration-reordering loop transformations as needed, such as loop reversal and loop skewing).

(d) Parallelization of outermost loop nests — if the loop nest does not already have explicit OpenMP parallelism, use OpenMP to automatically create parallel loops at the outermost level, with loop coalescing for efficiency.

(e) Unrolling of innermost loop nests — use iterative feedback from the TAO (guided by measured processor characteristics) to select unroll factors for each innermost loop nest. This transformation can be applied even to the loops produced by the PolyOpt component. More details are provided in § 7.3.4.

(f) Scalar replacement — perform loop-carried and loop-independent scalar replacement of array and pointer accesses within the loop nest. More details are provided in § 7.3.5

(g) Commit all transformations and perform incremental reanalysis

6. Return the updated SAGE III IR to the compiler driver so that it can invoke the later steps of compilation, including the PAO→TAO IR translator.

### 7.3.1 Pattern-driven Idiom Recognition

In some cases, a computational kernel in the input application may be implemented by a platform-specific library such as BLAS call. If so, it is usually beneficial to replace the user-written kernel by call to the platform-specific library. However, in addition to recognizing opportunities for this transformation, it is important to factor in the cost of adaptation (e.g., additional copies).

For example, consider the code fragment in Figure 7.1. On one platform, the PAO might select a combination of tiling, interchange, unrolling, and scalar replacement as usual. Tile sizes are initialized using analytical cost model and updated by runtime, while the unroll factors are proposed by cost model and refined by feedback from TAO.

However, on a different platform, the PAO might recognize that the computation above can be implemented with two library calls (matrix multiply and transpose) that are available in optimized

---

1Transformations in the list will be applied to the non-SCoP loop nest. The PolyOpt framework applies some of these same transformations, such as loop tiling, to the SCoP loop nests, making it unnecessary for the PAO to apply them individually.
for(i = 0; i < n; i++){
    for (j = 0; j < n; j++){
        a[i,j] = 0;
        for (k = 0; k < n; k++){
            a[i,j] = a[i,j] + b[j,k] * c[k,i];
        }
    }
}

Figure 7.1: Matrix multiplication and transpose

form on that platform. The PAO will still explore transformations as in previous case (using system characterization values for this particular platform), but it may conclude that the cost of using library routines will be lower than the compiler-optimized version for values of n greater than some threshold.

7.3.2 Loop Tiling

Loop tiling is a critical optimization for effectively using the memory hierarchy on the target machine. In order to maximize cache usage, PAO will have to select the right combination of the tile size, unroll factor, and loops to interchange. Tile size will naturally depend on the measures values for cache size and associativity of the target platform from the PACE RC tools. The PAO will use an analytical model parameterized (Sections 6.2.6 and 6.3.6) by the estimates of the regions of the array that are accessed in the loop nest and the measured cache sizes to determine the initial tile size. That tile size will, in turn, be tuned at runtime using the online feedback-directed parameter selection facility of the PACE RTS (§ 10.3.4). The code for the parameterized version of the loop will be packaged for runtime tuning using the approach described in § 5.3.8. The non-SCoP loop nests that cannot be parameterized for tiling will still be tiled by the PAO using a simple static tiling approach.

Figure 7.2 shows the example from Figure 7.1, tiled with a BxB tile size across the i and j dimensions, as might be done to prepare for online tuning by the RTS.

for(i2 = 0; i2 < n; i2 += B){
    for(j2 = 0; j2 < n; j2 += B){
        a[i2,j2] = 0;
        for(k = 0; k < n; k++){
            for(i1 = i2; i1 < min(i2 + B - 1, n); i1++){
                for(j1 = j2; j1 < min(j2 + B - 1, n); j1++){
                    a[i1,j1] = a[i1,j1] + b[j1,k] * c[k,i1];
                }
            }
        }
    }
}

Figure 7.2: Matrix multiplication and transpose, tiled with a BxB tile size
7.3. METHOD

for(i2 = 0; i2 < n; i2 += B){
    for(j2 = 0; j2 < n; j2 += B){
        a[i2,j2] = 0;
        for(k = 0; k < n; k++){
            for(j1 = j2; j1 < min(j2 + B - 1, n); j1++){
                for(i1 = i2; i1 < min(i2 + B - 1, n); i1++){
                    a[i1,j1] = a[i1,j1] + b[j1,k] * c[k,i1];
                }
            }
        }
    }
}

Figure 7.3: Matrix multiplication and transpose, tiled with a BxB tile size, with i1 and j1 loops interchanged

7.3.3 Loop Interchange

Loop interchange is another important compiler transformation that can significantly improve the performance through improving locality and increasing the effect of loop tiling described above.

For example, in the tiled matrix multiplication and transpose example on Figure 7.2, the elements of a tile are accessed in a row-major order, while the arrays are stored in column-major order in Fortran. If the whole tile fits in cache and the arrays are lined up properly to avoid conflict misses, then the code on figure 7.2 should perform equally well regardless of the order of the i1 and j1 loops. If not, the performance can be improved by interchanging the two inner loops, as shown on Figure 7.3.

7.3.4 Unrolling of Nested Loops

Loop unrolling can significantly improve code performance by reducing the loop iteration overhead and increasing the size of the loop body, making further optimizations of the loop body more effective. However, excessive loop unrolling can create additional register pressure, which can have detrimental effect on performance if the register allocator is forced to spill some values to memory.

For each loop nest, PAO will generate multiple unroll configurations and invoke TAO to evaluate the code generated for each configuration using the PAO-TAO query interface described in §5.3.5. TAO’s answers to PAO’s queries (§9.3.4) will be then analyzed by PAO to select the best unroll configuration.

PAO will prune this search space sing an analytical cost model to compute the infeasible unroll configurations based on the register pressure of the unrolled loop and the measured number of registers available on the target platform. PAO will only evaluate the feasible unroll configurations.

Figure 7.4 shows an example of a search space for unroll configurations for the middle and outermost loops in a triple nested loop from Figure 7.1, on a hypothetical platform with 16 registers. Instead of searching the whole space of 380 unroll configurations, PAO will only evaluate 44 feasible unroll configurations.

7.3.5 Scalar Replacement

Scalar replacement is another classical optimization that will have a large impact on the performance of the code generated by the PAO. Scalar replacement reduces memory access, by rewriting the code so that the compiler can store a reused value in a register instead of in memory. Unfortu-
nately, this rewrite can both increase register pressure and reduce available parallelism. Thus, the PAO will need to strike the right balance between the potential for improvement and the potential for degradation by choosing carefully those array elements to be rewritten as scalar variables. This choice must work well within the tile size and unroll factors that the PAO has selected for the loop, as discussed earlier.

Figure 7.5 shows the code from Figure 7.1 where the array element \( a[i,j] \) has been replaced with a scalar sum.

7.3.6 Incremental Reanalysis

Incremental reanalysis in the PAO is supported by maintaining four auxiliary hierarchical structures on the HIR described in the following paragraphs — a Region Structure Tree (RST), a Region Control Flow Graph for each region, Region-SSA form for each region, and a Region Dependence Graph for each region. The partitioning of the input program into regions can be tailored to optimizations of interest, subject to the constraints defined below. A common partitioning is to place

```c
for (i = 0; i < n; i++){
    for (j = 0; j < n; j++){
        sum = 0;
        for (k = 0; k < n; k++){
            sum = sum + b[j,k] * c[k,i];
        }
        a[i,j] = sum;
    }
}
```

Figure 7.5: Matrix multiplication and transpose with scalar replacement of the \( a[i,j] \) element
7.3. METHOD

S1: \( x = 1; \)
S2: if \( y > M \) goto S12;
S3: \( i = 1; \)
S4: if \( i \geq M \) goto S12;
S5: if \( i = M \) goto S12;
S6: \( y = \text{invoke sqrt}(i); \)
S7: \( d0 = \text{arr}[i-1]; \)
S8: \( d1 = d0 / y; \)
S9: \text{arr}[i] = d1;
S10: \( i = i + 1; \)
S11: goto S4;
S12: \( y = x + 1; \)
S13: LoopRegionExit
S14: \( y = x + 1; \)

(a) Original Code

(b) Code with Region Labels

Figure 7.6: Example IR with Region Labels

each loop in a separate region (as in the Loop Structure Tree [61]) but other partitions are possible.

**Region Structure Tree** The Region Structure Tree represents the region nesting structure of the RPU being compiled. Each region node (R-node) of the RST represents a single-entry region of the original (flat) control flow graph (CFG) for the procedure or function, and each leaf node (L-node) of the RST corresponds to a node in that CFG. Hierarchical nesting of regions is captured by the parent-child relation in the RST. The RST can accommodate any partitioning of the CFG into hierarchical single-entry regions.

The root node of the RST represents the entire RPU being compiled. A non-root R-node represents a subregion of its parent’s region. An L-node represents a CFG node (basic block) that is contained within the region corresponding to the parent of the L-node. We impose three important constraints on legal region structures in a RST:

1. **Tree Structure** The nesting structure for regions must form a single connected tree (specifically, the RST), and there must be a one-to-one correspondence between leaf nodes in this tree and nodes in the input CFG. This constraint implies that if two regions \( r1 \) and \( r2 \) in the RST have a non-empty intersection, then it must be the case that either \( r1 \subseteq r2 \) or \( r2 \subseteq r1 \).

2. **Proper Containment** Each R-node must have at least one child in the RST that is an L-node. This implies that the region corresponding to a non-root R-node \( r \) must be properly contained within the region of its parent’s node parent(\( r \)) (because it will not contain at least one L-node that is a child of parent(\( r \))). Another consequence of this constraint is that there can be at most as many region nodes as there are nodes in the input CFG.

3. **Single-entry Regions** Each region must be a single-entry subgraph of the input CFG. In the (rare) event that the input CFG contains an irreducible subgraph (a strongly connected subgraph with multiple entries), then the entire irreducible subgraph must be included in a containing single-entry region.

Thus, there may be several legal RST’s for the same input CFG (though they will all have the same set of L-nodes). An R-node serves as a useful anchor for all information related to the region.
corresponding to the R-node. All of the region-local data structures are stored in the R-node for the region, including references to the Region Control Flow Graph (RCFG), and the Region-SSA form built upon the RCFG.

Figure 7.6.(a) shows an example intermediate representation, and its extension with region labels in Figure 7.6.(b). The original (flat) CFG for this example is shown in Figure 7.7; it uses the statement labels from Figure 7.6.(a). This CFG assumes the following: S4 performs the test of index variable i for the loop; S10 performs the increment of index variable i in each iteration (following a standard translation approach for for-loops); and the edge from S11 to S4 is the loop back-edge. Let us assume that the compiler selects a two-level hierarchical region partition of the CFG in which R2 = \{S4, S5, S6, S7, S8, S9, S10, S11\} is the inner region and R1 = \{S1, S2, S3, R2, S12\} is the outer region. The RST for this region partition is shown in Figure 7.8; again, it uses the labels from Figure 7.6.(a).

For annotating a region at the IR level, we add two statement labels (pseudo-instructions) into the IR code list — LoopRegionEntry and LoopRegionExit. Figure 1(b) shows the IR code with these new region labels (see S4 and S13). The LoopRegionEntry statement contains two sets, Use and Def, which maintain the SummaryReferences of the current region R2 (details on SummaryReferences will be discussed in the next section). The Use and Def sets for array variables can be made more precise by using Array SSA form [45] instead of SSA form.

**Region Control Flow Graph** For each R-node, R, in the RST, we have a region-level control flow graph, RCFG(R), that defines the local control flow for R’s immediate children in the RST (the immediate children may be L-nodes or R-nodes). RCFG(R) must contain a node corresponding to each node that is a child of R. RCFG(R) also contains two pseudo nodes: START and EXIT. The START node is the destination of all region entry branches. Since R must be a single-entry region, there is no loss in precision in using a single START node. All LCFG edges from START have the same destination: the region’s entry node. The EXIT node is the target of all region exit branches. In the example shown in Figure 7.6.(b), the START and EXIT nodes correspond to statements S4 and S13.

An edge e from X to Y within RCFG(R) represents control flow in region R. If edge e is a conditional branch (i.e. if there are multiple outgoing edges from X in RCFG(R)), then it also carries a label of the form (S, C) identifying condition C in IR statement S as the branch condition that

---

**Figure 7.7: Original Control Flow Graph**

**Figure 7.8: Region Structure Tree**
enabled execution to flow along $e$. See $(S5, T)$ in Figure 7.9 as an example. (Other labels have been omitted so as to reduce clutter.)

For each control flow exit from region $R$, an edge is inserted in $R$’s RCFG with target EXIT, and another edge is inserted in the RCFG of $R$’s RST parent from $R$ to the exit destination. (If multiple regions are being exited, then additional outer-region edges need to be inserted in enclosing regions).

Figure 7.9 shows the RCFG’s for regions $R1$ and $R2$ from the RST in Figure 7.8. The statements have been renumbered to match Figure 7.6.(b) and the graph for RCFG($R1$) contains a node for $R2$. In addition, note that the break statement is modeled as a premature loop exit edge from $S5$ in RCFG($R2$) with destination = EXIT; there is also an edge in RCFG($R1$) from node $R2$ to node $S14$. These two edges in RCFG($R2$) and RCFG($R1$) represent continuation of the same control flow through multiple levels of the RST when the if condition in IR statement $S5$ evaluates to true.

The algorithm for constructing the RCFG’s for a given RST starts from the root node of the given RST, and builds the RCFG for each R-node recursively. The RCFG contains $k + 2$ nodes, where $k$ is the number of children of the R-node. A tricky part of the algorithm lies in propagating control flow edges to ancestor regions. If $T_i$ and $T_j$ are siblings in the RST, then exactly one edge is created in their parent’s RCFG. Otherwise, additional edges are inserted in all ancestor RCFG’s up to and including LCA($T_i, T_j$).

Each R-node $R$ maintains a Summary Use set and Summary Def set (shown in Figure 7.6.(b)) which includes summary references for use by its parent region $R_p$ as follows:

- If $R$ contains at least one def of a variable $V$, then include a single summary def of variable $V$ in $R$’s Summary Def set.
- If $R$ includes at least one use of variable $V$, then include a single summary use of variable $V$ in $R$’s Summary Use set.

Summary references can be initially computed in a bottom-up traversal of the RST, and can then be updated incrementally. More precise summaries can be obtained by suppressing the inclusion of a summary use of variable $V$ in $R$ when there is no upwards exposed and downwards exposed uses [1] of $V$ in $R_p$, and by suppressing the inclusion of a summary def of variable $V$ in $R_p$ when $V$ can be renamed [30] to a local/private variable in region $R$.

**Region-SSA Construction** A Region-SSA form structure is built for each RCFG. As mentioned earlier, the pseudo-instructions LoopRegionEntry and LoopRegionExit are used to annotate the R-node in its parent region. To enable the interaction between parent region $R_p$ and child region $R_c$, each child region $R_c$ stores summary Use/Def sets. These sets enable $R_c$ to be treated uniformly like other statements in $R_p$’s scope (except that $R_c$ may potentially contain more defs and uses than normal statements). $R_c$ also maintains a Use map and a Def map which maintain the map
S4: Loop Region Entry:
   Use Set \((y_1, 0, arr_1, 0, i_1, 0)\) Def Set \((y_1, 1, i_1, 1)\)
   Use Map: \((y_2, 0 \mapsto y_1, 0, arr_2, 0 \mapsto arr_1, 0, i_2, 0 \mapsto i_1, 0)\)
   Def Map: \((y_2, 1 \mapsto y_1, 1, i_2, 1 \mapsto i_1, 1)\)
S5: \(i_2, 1 = \Phi(i_2, 0 #S4, i_2, 2 #S15);\)
S6: \(y_2, 1 = \Phi(y_2, 0 #S4, y_2, 2 #S15);\)
S7: if \(i_2, 1 \geq M\) goto S16;
S8: if \(i_2, 1 == M\) goto S16;
S9: \(y_2, 2 = \text{invoke} \sqrt{i_2, 1};\)
S10: \(i10 = i_2, 1 - 1;\)
S11: \(i11 = arr_2, 0[i10];\)
S12: \(i12 = i11 / y_2, 2;\)
S13: \(arr_2, 0[i_2, 1] = i12;\)
S14: \(i_2, 2 = i_2, 1 + 1;\)
S15: goto S5;
S16: Loop Region Exit

(a) Region-SSA for Region R2

S1: \(x_1, 0 = 1;\)
S2: if \(y_1, 0 \leq M\) goto S17;
S3: \(i_1, 0 = 1;\)
S4: Loop Region Entry
   Use Set \((y_1, 0, arr_1, 0, i_1, 0)\) Def Set \((y_1, 1, i_1, 1)\)
S16: Loop Region Exit
S17: \(y_1, 2 = \Phi(y_1, 0 #S2, y_1, 1 #S16);\)
S18: \(y_1, 3 = x_1, 1 + 1;\)

(b) Region-SSA for Region R1

Figure 7.10: Region-SSA Example

between the Summary References and corresponding local variables in \(R_c\). Construction of Region-SSA form for a given \(RST\) is accomplished by a bottom up traversal of the \(RST\). It includes variable maps between the summary reference variables and their corresponding region local variables.

Figure 7.10 shows the translated Region-SSA for the example shown in Figure 7.6.(b). To demonstrate the region, we split the IR into two parts corresponding to region \(R1\) (the procedure level region shown in Figure 7.10.(b)) and \(R2\) (the loop level region shown in Figure 7.10.(a)). The Use/Def variable maps are listed after the region entry label. Compared with standard SSA form, Region-SSA also maintains a variable mapping between the parent and child regions. Thus, Region-SSA construction for the parent region only needs to examine the child region’s summary references, and not its code.

Out-of-Region-SSA Transformation The algorithm for transforming out of Region-SSA form maintains the correctness of a value propagation between an R-node \(R\) and its parent by inserting a prologue and an epilogue into \(R\)’s region. For prologue creation, the assignment operations are created from all of the variables in \(R\)’s Use set to their corresponding variables in \(R\)’s Use map, and inserted in front of the region entry node. Similarly, all of the variables in \(R\)’s Def set should be assigned by their corresponding variables in \(R\)’s Def map for creating the epilogue.
7.3. METHOD

Incremental Update Algorithm  Given a region \( R \), a compiler transformation can insert/delete statements and insert/delete uses/defs of variables. For region-local (private) variables, only the current region needs to be reconstructed. For inserted/deleted variables in ancestor regions, we enumerate the scenarios and rules for identifying the regions that need to be reconstructed as follows (the identified regions are put into a reconstruction list):

- Insert use variable \( u_p \) (\( u_p \) is a variable in parent region \( R_p \) and \( u_p \notin R's \ Use \) set):
  1. add \( u_p \) into the \( R's \ Use \) set;
  2. create a corresponding variable \( u_c \) for using in \( R \);
  3. add the pair \( u_c \mapsto u_p \) into \( R's \ Use \) map;
  4. add \( R \) into the reconstruction list;

- Remove a use variable \( u_p \) (\( u_p \) is a variable in parent region and \( u_p \in R's \ Use \) set):
  1. remove \( u_p \) from \( Use \) set;
  2. remove \( u_p \) related pair from \( Use \) map;
  3. add \( R \) into the reconstruction list;

- Insert a def variable \( d_p \) at statement \( S \) (i.e. \( d_p \) is a variable in parent region \( R_p \) and \( d_p \notin R's \ Def \) set):
  1. add \( d_p \) into \( R's \ Def \) set;
  2. create a corresponding variable \( d_c \) for using in \( R \);
  3. add \( d_c \mapsto d_p \) into \( R's \ Def \) map;
  4. if \( d_p \notin Use \) set and \( S \) does not dominate \( R's \) entry node, then
     (a) add \( d_p \) into \( R's \ Use \) set
     (b) create a corresponding variable \( d_{init} \) and add \( d_{init}\mapsto d_p \) into \( R's \ Use \) map;
  5. add both \( R \) and \( R_p \) into the reconstruction list;

- Remove a def variable: \( d_p \) (i.e. \( d_p \) is a variable in parent region \( R_p \) and \( d_p \in R's \ Def \) set):
  1. remove the variable from the \( Def \) set;
  2. remove this variable related pair from \( Def \) map;
  3. add both \( R \) and \( R_p \) into the reconstruction list;

The scenarios and rules list above handle the updating between the parent and child regions. These rules can also be applied recursively for handling the ancestor/child case. Given those regions in the reconstruction list, the Region-SSA construction algorithm is called to rebuild Region-SSA form.
Chapter 8
The Rose to LLVM Translator

The Platform-Aware Optimizer is implemented on top of the Rose infrastructure, while the Target-Aware Optimizer is implemented on top of the LLVM infrastructure. Thus, PACE needs a translator from the SAGE III IR used in the PAO to the LLVM IR used in the TAO. This chapter describes PACE-cc, the tool that implements this translation.

8.1 Introduction

Figure 1.2 provides a high-level overview of the PACE system design. This chapter focuses on the design of PACE-cc, a translator from the SAGE III IR to the LLVM IR. The SAGE III IR is an abstract syntax tree (AST) representation produced by the Rose compiler and used by the Platform-Aware Optimizer for transformations. The Target-Aware Optimizer operates on the LLVM IR, a linear code in static single-assignment form (SSA).

8.1.1 Motivation

The Platform-Aware Optimizer is implemented on top of the Rose infrastructure, while the Target-Aware Optimizer is implemented on top of the LLVM infrastructure. Thus, PACE needs a translator from the Sage III IR used in Rose to LLVM’s IR. PACE-cc is the Sage→LLVM translator used to generate LLVM’s bitcode, which is fed as an input to the Target-Aware Optimizer.

A critical aspect of the PAO/TAO interaction is the PAO’s use of the TAO as an oracle for feedback on the performance of potential code transformations. The PAO produces Sage IR for synthetic functions, which represent transformed versions of selected user code fragments for which the PAO needs cost estimates (see 5.3.5). Here too a translation to LLVM’s IR is needed by the TAO. The PACE-cc translator also implements this translation (Path 2 in Figure 5.2).

8.2 Functionality

8.2.1 Input

The translator is invoked in two distinct situations: as part of the full compilation path or the LLVM backend path (see § 3.4), and as part of a PAO-to- TAO query. In the first case, illustrated by Path 1 in Figure 5.2, the compiler driver invokes the translator, after the driver has invoked the PAO and before it invokes the TAO. The input to the translator along Path 1 is an AST in SAGE III IR with auxiliary information, and compiler directives passed by the compiler driver. These directives include optimization directives, some of which are generated by the PAO and instruct and constrain the TAO in its code transformations (see 5.2.2). The auxiliary information includes profile data, and information about aliases and dependences. To aid in vectorization, the auxiliary information

Principal Contacts For This Chapter: Philippe Charles, pgc1@rice.edu, and Michael Burke, mbg2@rice.edu
may include alignment information and sets of memory accesses (bundles). See Appendix A for a
detailed description of the auxiliary information needed for vectorization.

In the second case, illustrated by Path 2 in Figure 5.2, the PAO invokes the translator and pro-
vides it with an AST in SAGE III IR form for the synthetic function that it has created for the query.
Auxiliary information accompanies the AST, as in the first case. On this path, the PAO invokes the
translator and the TAO.

8.2.2 Output

PACE-cc produces as output, along Path 1, the LLVM IR that corresponds to the input AST, along
with LLVM metadata that provides links to the SAGE III IR auxiliary information described above. In
that the PAO and the TAO will share a single address space, PACE-cc will give the TAO access to the
SAGE III IR auxiliary information by constructing a global table of pointers to it and passing table
indices to the TAO by means of the LLVM metadata facility.

PACE-cc produces as output, along Path 2, the LLVM IR that corresponds to the input AST for
the synthetic function, along with LLVM metadata that provides links to the SAGE III IR auxiliary
information described above. Once again, the communication between the PAO and the TAO will
be facilitated by constructing a global table of pointers to the SAGE III IR auxiliary information and
passing table indices to the TAO by means of the LLVM metadata facility.

8.3 Method

To perform a translation, PACE-cc makes two passes over the SAGE III IR with Pre/Post-order visitor
patterns provided by Rose.

In the first pass, the translator generates attributes, associated with AST nodes, as part of the
analysis necessary for mapping C constructs into LLVM constructs. Attributes are added to the
AST to process global declarations and constants; map the C types into corresponding LLVM types;
process local declarations; generate temporaries and labels.

In the second pass, LLVM code is generated. Each RPU is mapped into an LLVM module. First,
global variables are processed, followed by aggregate types and function headers. Finally, code is
generated for each function body in turn.

Due to incomplete (and in some cases, incorrect) semantic processing in the Rose compiler
or semantic differences between C and LLVM, additional semantic analyses must be performed in
PACE-cc. LLVM, unlike C, is strongly typed. All these semantic issues are resolved in the first pass of
the translator using the SAGE III IR persistent attribute mechanism, without transforming the AST.

For example, instead of supporting a type for Boolean values, C uses the integer type to repre-
sent them. Boolean values often occur in a SAGE III IR representation, for example, as the result of
an intermediate comparison operation. In the SAGE III IR, these Boolean values are represented as
integers. LLVM has a bit type to represent Boolean values. The translator has to extend the SAGE III
IR AST (with attributes) to include the proper casting between integer and bit values.

The Rose compiler’s semantic processing of pointer subtraction is incorrect. The subtraction
of two pointers yields a pointer value instead of an integer value. The translator corrects this error
with the persistent attribute mechanism. Other issues of type include:

• The sizeof operator, whose value is not always correctly computed in the Rose compiler and
  not provided at all for structure types.

• Structure storage mapping.

• Integer (integral) promotion/demotion is not performed for op= operation on integer types.

For a given RPU input file, the SAGE III IR AST constructed by Rose is a complete representation
of the file after preprocessing expansion. To avoid code bloat, including code duplication, we do
not generate code for extraneous program fragments that are imported from header files by the C preprocessor but are not relevant to the file being translated.

Thus, translation requires more than a simple pass over the AST. However, the SAGE III IR supports two traversal modes, both of which use the Pre/Post-order visitor pattern. Using these two traversals, PACE-cc can achieve the desired effect. We start with a complete traversal of the main input files. A function `traverseInputFiles(SgProject *)` traverses only AST components whose definition originated from a main (.c) input source file. While processing the elements in the main input files, we record the external elements, defined in imported header files, on which they depend. A function `traverse(SgNode *)` is given an arbitrary starting node in the AST and will traverse the subtree rooted at the node in question. After traversal of the main input files, we traverse the recorded external elements, and record the imported elements on which they depend. This process continues until there are no remaining imported elements.

Hence, a pass over the SAGE III IR AST consists of an initial call to `traverseInputFiles(SgProject *)` to process the elements in each main input file, followed by invocations to `traverse(SgNode *)` to import the needed elements defined in imported header files. These are the elements that the main file depends on, directly or indirectly.

To further avoid traversing duplicate AST representations emitted by the Rose compiler for certain features, we add a facility for short-circuiting a traversal at a given node during a visit.

### 8.4 Example

Consider the following C program:

```c
int add(int x, int y) { return x + y; }

int main(int argc, char *argv[]) {
    int x = 5,
    y = 6,
    z = add(x, y);
    printf("z = %i\n", z);
}
```

This program consists of a main program and a local function `add`. In addition to some basic declarations and initializations, the main program contains a call to `add` and accesses two global entities: the external function `printf` and the string constant "z = %i\n". PACE-cc begins the translation of this C file with the following LLVM declarations for the global entities:

```llvm
"\01LC0" = internal constant [8 x i8] c"z = %i\0A\00"
declare i32 @printf(...)```
The LLVM code generated by PACE-cc for the add function is:

```c
define i32 @add(i32 %x, i32 %y) nounwind {
  .entry:
    %.retval = alloca i32, align 4 ; storage for return value
    %x.addr = alloca i32, align 4 ; parameter x
    %y.addr = alloca i32, align 4 ; parameter y
    store i32 %x, i32* %x.addr ; store value of parm x in mutable copy of x
    store i32 %y, i32* %y.addr ; store value of parm y in mutable copy of y
    %.tmp0 = load i32* %x.addr ; load x
    %.tmp1 = load i32* %y.addr ; load y
    %.add2 = add i32 %.tmp0, %.tmp1 ; x + y
    store i32 %.add2, i32* %.retval ; store (x + y) in the return value
    %.tmp8 = load i32* %.retval ; load return value
    ret i32 %.tmp8 ; return the return value
}
```

The LLVM code generated by PACE-cc for the main function is:

```c
define i32 @main(i32 %argc, i8** %argv) nounwind {
  .entry:
    %.retval = alloca i32, align 4 ; storage for return value
    %argc.addr = alloca i32, align 4 ; parameter argc
    %argv.addr = alloca i8**, align 4 ; parameter argv
    %x = alloca i32, align 4 ; int x
    %y = alloca i32, align 4 ; int y
    %z = alloca i32, align 4 ; int z
    store i32 %argc, i32* %argc.addr ; store value of argc in mutable copy of argc
    store i8** %argv, i8*** %argv.addr ; store value of argv in mutable copy of argv
    store i32 5, i32* %x ; initialize x to 5
    store i32 6, i32* %y ; initialize y to 6
    %.tmp3 = load i32* %x ; load x
    %.tmp4 = load i32* %y ; load y
    %.call15 = call i32 (i32, i32)* @add(i32 %.tmp3, i32 %.tmp4) ; add(x, y)
    store i32 %.call15, i32* %z ; z = add(x, y)
    %.tmp6 = load i32* %z ; load z
    %.call17 = call i32 (i32,...)* ; call printf(..., z)
    @printf(i8 * getelementptr ([8 x i8]* @"\01LC0", i32 0, i32 0), i32 %.tmp6)
    store i32 0, i32* %.retval ; store return value of 0
    %.tmp9 = load i32* %.retval ; load return value;
    ret i32 %.tmp9 ; return the return value
}
```

Note that the code generated for these two functions has a similar structure: a header statement similar to the C header statement; a variable declaration to hold the return value of the function (if needed); declarations of mutable local variables for the formal parameters (if any); declarations for user-declared local variables (if any); code generated to initialize the local variables associated with the parameters (if any); initialization code generated for user-defined local variables (if any); code generated for each executable statement in the body of the function.
Chapter 9
The PACE Target-Aware Optimizer

The Target-Aware Optimizer (TAO) is a major component of the PACE compiler. The TAO tailors application code to the target system in two different ways. It optimizes the code to better match the microarchitectural details of the target system’s processors, as revealed by the PACE system’s resource-characterization tools. It also optimizes the code to better match the capabilities of the native compiler that will be used to produce executable code. The TAO has one other use: the Platform-Aware Optimizer (PAO) can invoke the TAO to obtain information about how a specific segment of code will translate onto the target processor.

9.1 Introduction

The PACE compiler includes three major optimization tools: the Application-Aware Partitioner (AAP), the Platform-Aware Optimizer (PAO), and the Target-Aware Optimizer (TAO). Figure 1.2 describes the relationships between these three tools as well as the relationships between the TAO and other parts of the PACE system, such as the Resource Characterization tool (RC), the Runtime System (RTS), and the Machine Learning tool (ML). This chapter describes the functionality and design of the TAO, along with its interfaces to the rest of the tools in the PACE system. The TAO builds upon the open-source LLVM compilation system.

9.1.1 Motivation

Target-aware optimization sits between the PAO and the black-box interfaces of the vendor compilers through which the PAO sees the underlying hardware system on both the full compilation path and the LLVM backend compilation path. The TAO generates versions of the PAO-transformed application source code tailored to individual compilers and processors. To accomplish this task, the TAO must consider performance at a near-assembly level of abstraction, perform resource-aware optimization, and then either map the results of that optimization back into source code for the vendor compilers or invoke an LLVM backend. Key aspects of the TAO include:

Resource-specific Tailoring  The TAO uses knowledge from resource characterization to tailor the code for specific targets. For example, the RC might discover the relative costs of a variety of operations, including addition, multiplication, division, load, store, and multiply-add. The TAO should rely on this information when performing operator strength reduction [2, 28] and algebraic reassociation [15, 23].

Target-specific Improvement  The TAO uses compiler-characterization knowledge to optimize the code to both take advantage of strengths and compensate for weaknesses in vendor compilers.

Principal Contacts For This Chapter: Linda Torczon, linda@rice.edu
For example, if characterization shows that a vendor compiler does a poor job of optimizing array address expressions, the TAO might emit low-level, pointer-style C code and optimize accordingly. If the vendor compiler can use a multiply-add operation, that fact should change the expression shapes generated by algebraic reassociation [15, 23]. Similarly, operator strength reduction should change its behavior based on the availability of addressing modes and autoincrement or decrement.

**Novel Optimizations** The TAO provides a location where we can insert new optimizations into the toolchain without modifying the vendor compilers. For example, we have already built a tree-height restructuring pass that reorders chains of arithmetic operations to expose additional ILP [29]. Inserting this optimization into the TAO made it uniformly available across all PACE supported targets through both the full compilation path and the LLVM backend compilation path.

**Evaluation for the PAO** The TAO lowers the code to a near-assembly level, where the mapping between specific code segments and the target hardware (as seen through the vendor compiler) is more visible. Thus, the TAO has a clearer picture of the match or mismatch between decisions made in the PAO and the target hardware. For example, the TAO can provide direct feedback to the PAO on register pressure or available ILP based on either introspective measurement of optimization effectiveness or evaluation of a model. That feedback should improve the PAO’s ability to tailor its transformations to the target hardware.

### 9.2 Functionality

The TAO takes as input an optimized, annotated code fragment represented in the LLVM intermediate representation (LLVM IR), which is produced by the PAO and the ROSE-to-LLVM translator; system-characterization, compiler-characterization, and configuration information provided by the PACE RC; optimization directives provided by the PAO and the ML; and queries from the PAO. The TAO operates in two distinct modes: it is invoked by the compiler driver to produce optimized code from the output of the PAO, and it is invoked by the PAO as an oracle to obtain information about various properties of transformed code fragments produced by the PAO. As shown in Fig-
9.3. METHOD

ure 3.2 and described in the “Target-Aware Optimizer” section on page 30, the TAO supports three distinct execution paths: an LLVM IR to assembly code translation on machines where the underlying LLVM compiler has a native backend, an LLVM IR to C translation, and a PAO query and response path.

9.2.1 Interfaces

Figure 9.1 shows the interfaces supported by the TAO. The TAO takes, as its primary input, a code fragment represented in the LLVM intermediate representation (LLVM IR). Auxiliary information may be tied to that LLVM IR fragment, including analysis results from the PAO and runtime performance information from the RTS. When invoked by the compiler driver, the TAO receives as input: the LLVM IR, metadata associated with the IR, and optimization directives produced by the PAO and/or the ML components. In this mode it operates as a compiler and produces, as its primary output, a translated version of the LLVM IR code fragment, expressed in either C code for a specific native compiler or as optimized LLVM IR for an LLVM backend. When invoked by the PAO, the TAO operates as an oracle and produces, as its primary output, a data structure containing responses to PAO queries. More detailed descriptions of the PAO–TAO query interface can be found in §5.3.5 and §9.3.4.

The TAO consumes resource- and compiler-characterization information produced by the PACE RC. It uses resource characteristics, defined as performance-related properties of the target system, to change the behavior of optimizations that it applies to the code being compiled. The TAO uses compiler characteristics, defined as properties related to the effectiveness of the native compiler, to change the shape of the C code that it generates for the native compiler. The TAO also relies on information from the configuration file for the target platform, which is provided by the system installer. (See §3.2.2 for details.) The interface for information provided by the RC is described in §2.3.2.

To produce C code, the TAO uses the C backend interface in LLVM (/lib/Target/CBackend). The LLVM open source development team produced the C backend; we will modify this interface so that it adapts its behavior to the properties of the native compiler, as recorded in the compiler-characterization information gathered by the PACE system.

9.3 Method

The following sections describe the PACE approach to four aspects of the TAO design: optimization in LLVM, vectorization, selecting optimization sequences, and producing answers to PAO queries.

9.3.1 Optimization in LLVM

When invoked by the compiler driver, the TAO is presented with user code expressed as one or more LLVM IR procedures; when invoked by the PAO, the TAO is presented with a fragment of encapsulated synthetic code expressed in the LLVM IR. Under both scenarios, the TAO will apply a sequence of optimizations to the IR form of the code; the sequence will consist of both existing optimizations from the LLVM framework and new optimization passes developed for the LLVM/TAO framework as part of PACE. The specific optimizations (and possibly their order) is dictated by concrete optimization directives contained in the optimization plan; the TAO takes these optimization directives as one of its inputs.

If the compiler driver directs the TAO to generate native code using an LLVM backend, the TAO optimizes the LLVM IR and then invokes the specified LLVM backend as in a normal LLVM compilation. If the compiler driver directs the TAO to generate C code, the TAO will follow optimization by a series of passes that reshape the code for effective use by the native compiler. These transformations will control the expression of the LLVM IR fragment in C. They will adjust the number
of simultaneously live values, the use of specific C data structures, and the forms used to express parallel loops.

**Optimization Directives** Each invocation of the TAO uses optimization directives produced by the PAO and/or the ML components (§ 5.3.7). The precise details of the optimization directives, including what they specify and how they express that information, are the subject of ongoing discussion. Our goal is for the optimization directives to be capable of expressing both specific sequences of optimization, as in *perform algebraic reassociation followed by operator strength reduction*, and high-level goals, such as *optimize to minimize code and data space*.

The TAO bases its processing of the application’s code on the content of the optimization directives. Those directives might, in turn, come from the ML as the distillation of prior experience. They might come from the PAO, based on the PAO’s analysis of the code. Further, the TAO can use compiler characteristics provided by RC, such as properties related to the effectiveness of the native compiler, as a basis for modifying compiler directives. In an extreme case, the end user might produce distinct optimization directives to exert direct control over the TAO’s behavior.

If no optimization directives are provided, the TAO will use a generic optimization plan as its default. Thus, the optimization directives play a critical role in adapting the compiler’s behavior to a specific application and, to a lesser extent, to a specific target system.\(^1\) For this scheme to work, PACE must correlate information about specific optimization plans, applications, and their resulting performance. To simplify this process, the TAO will embed a concrete representation of its optimization plan in the code that it produces. For more detail, see § 3.2.2.

**Transformations** The open-source LLVM compiler system already includes a large set of optimization passes that implement a substantial set of transformations. The TAO will both build on the existing LLVM code base and use pre-existing passes from LLVM.

1. Some LLVM passes will be used without change. For example, LLVM includes a set of passes that eliminate “dead” code or data. While unified algorithms are available that would reduce the internal complexity of dead code elimination in LLVM, the existing passes are functional and effective. Since neither target-system characteristics nor application characteristics factor into dead code and data elimination, PACE will use those passes without modification.

2. Some LLVM passes will be modified to use data produced by other components in the PACE system. Much of our work in the TAO will focus on this task. PACE produces three major kinds of information that are of interest to the TAO: characterization information produced by the PACE RC tool, auxiliary information passed into the TAO from the PAO, and optimization directives as described earlier in this section.

**Using Characterization Data:** Figure 9.2 shows the characteristics measured by PACE in Phase 1 that the TAO uses. Figure 9.2 also lists some of the applications for that data in the TAO’s transformations. We will modify the existing transformations, as appropriate, to use this data. We have begun to assess the ways that the TAO can use characterization data in individual transformations; this assessment is an ongoing process that will continue into Phase 2 of the project.

**Using IR Auxiliary Information:** In PACE, the TAO always runs after the AAP and the PAO. This enforced order means that the TAO can rely on results from analyses performed in the AAP and the PAO that are relevant to the TAO. The PAO will pass analysis results to the TAO as auxiliary information to the LLVM IR; the ROSE-to-LLVM translator will map the auxiliary information to the LLVM IR while translating the ROSE IR to LLVM IR. This auxiliary information

---

\(^1\)We anticipate that most of the application-independent target adaptation occurs as a result of resource characterization and the use of characterization-driven optimizations.


### Memory System Characteristics

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>I-Cache Size</td>
<td>Comparison against code size for feedback to loop unrolling</td>
</tr>
</tbody>
</table>

### Processor Characteristics

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Operations in Flight</td>
<td>Computation of ILP for feedback to PAO, as well as input to the query backend for scheduling</td>
</tr>
<tr>
<td>Operation Latencies</td>
<td>Algebraic reassociation, operator strength reduction, as well as input to the query backend for scheduling</td>
</tr>
</tbody>
</table>

### Native Compiler Characteristics

<table>
<thead>
<tr>
<th>Characteristic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Live values</td>
<td>Adjusting register pressure in the C code produced by TAO, as well as input to the query backend for register allocation and scheduling.</td>
</tr>
<tr>
<td>Array versus Pointer Addressing</td>
<td>Code shape decisions in the C backend</td>
</tr>
<tr>
<td>Parallel versus Sequential Loops</td>
<td>Code shape decisions in the C backend</td>
</tr>
<tr>
<td>Speedup from Multiply-Add</td>
<td>Code shape decisions in the C backend</td>
</tr>
</tbody>
</table>

Figure 9.2: PACE Phase 1 Characteristics Used by the TAO

...may include aliasing information, dependence information, and runtime profile information (derived by the RTS and mapped onto the code by the AAP and PAO). See § 5.2.2 for a description of the auxiliary information produced by the PAO and § 8.2.2 for a description of the mapping process.

3. Some LLVM passes will be modified to improve their effectiveness. We have begun to study the effectiveness of optimization in LLVM with the goal of identifying weaknesses in specific optimization passes in the existing LLVM code base. We will modify the transformations to address those measured weaknesses, whether they are implementation issues or algorithmic issues. In addition, the construction of the query backend (see § 9.3.4) may necessitate extensive modification to the register allocator and instruction scheduler.

To understand optimization effectiveness in LLVM, we have used the NULLSTONE compiler benchmark suite\(^2\) to compare LLVM's performance against other compilers, such as gcc and icc. The NULLSTONE analysis has identified several weak spots. We are expanding our study to use other benchmarks; we are using tools from the RTS to assess performance of the resulting code.

4. Finally, we will implement some novel transformations in LLVM. These transformations will target specific opportunities identified by our analysis of LLVM’s effectiveness, or opportunities created by the transformations used in the PAO.\(^3\) Our preliminary studies have identified several potential additions, including a pass that performs algebraic reassociation of expressions and a value-range propagation and optimization pass. We have already built a tree-height restructuring pass that reorders chains of arithmetic operations to expose additional ILP [29].

\(^2\) NULLSTONE is a registered trademark of the Nullstone Corporation. The NULLSTONE compiler performance suite is a proprietary product that we are using as part of our design process to assess compiler strengths and weaknesses. We intend to use the suite as part of our internal regression testing, as well. The NULLSTONE code is not part of PACE nor is it required to install or use the final system.

\(^3\) In practice, many optimization algorithms contain implicit assumptions about the properties of the code being compiled. The PAO transformations will, almost certainly, create code that contains optimization opportunities that appear rarely, if ever, in code written by humans.
We anticipate that this activity will involve a combination of implementing known algorithms from the literature and inventing new algorithms. We will focus on finding effective solutions to the underlying performance problems.

For each transformation pass included in the TAO, we will examine the question of how that pass can use resource-characterization information, application performance data, and analytical knowledge derived in the PAO. We anticipate that the results of that investigation will lead to further improvements.

9.3.2 Vectorization

When the PAO invokes the TAO for generating short SIMD vector code, the PAO passes the TAO the LLVM IR of a function; alignment and aliasing information, including data dependence information; and bundle information, which describes memory accesses to consecutive memory locations, as hints. The LLVM IR of the function contains metadata that indicates the innermost loop body that needs to be vectorized in the TAO. This innermost loop body is made amenable to vectorization in the PAO by using several compiler transformations such as loop peeling, loop unrolling, and loop fusion. Such transformations preserve the validity of the data dependence information in the PAO.

The TAO performs vectorization before any other standard compiler transformation is applied. It builds a dependence graph for each basic block of the annotated innermost loop nest using the dependence information from PAO, and performs a dynamic-programming-based vector code generation using bundles. The dynamic programming uses a cost-based model to determine optimal vector code for the input LLVM IR. Register pressure is well-integrated into the cost-based model.

The details of how the PAO invokes the TAO and the vectorization algorithm is provided in Appendix A.

9.3.3 Selecting Optimization Sequences

Choosing which optimizations to apply is a critical part of the design process for any optimizing compiler. In PACE, the selection of specific transformations has several components. First, the AAP, the PAO, and the TAO address different concerns; this separation of concerns leads to some division in the set of transformations that the various tools will implement. (See §3.5 for the division of transformations among the AAP, the PAO, and the TAO.) Second, the PAO and ML may suggest specific optimization directives for a given compilation. Third, the RTS will provide the compiler with information about application performance that can inform and guide optimization decisions in the TAO.

Finally, in the TAO, the results of the RC’s characterizing the native compiler should inform both the selection of specific optimizations and the decisions made about how to shape the code that the TAO generates. For example, if the TAO discovers that some inline substitution in the AAP has made a function too long for the native compiler to handle, it can modify the AAP optimization plan so that the parameters that govern the AAP’s decision algorithm for inlining or, perhaps, specify that the AAP should not inline that procedure. Phase 2 will expand the the set of compiler properties that the RC tools measure, enabling this sort of modification of the optimization plan by the TAO (§2.2.3).

External Guidance The TAO accepts external guidance on optimization in the form of optimization directives. The TAO is responsible for the optimization plan mechanism and the implementation, but not for the generation of optimization directives. Directives may be generated by the PAO and/or the ML. In an extreme case, an end user might create a custom optimization plan to precisely control the process. The TAO may modify compiler directives based on native compiler
characteristics provided by RC.

The TAO may elect to implement some parts of its optimization plan with transformations provided in the native compiler. If compiler characterization shows, for example, that the native compiler has a strong redundancy elimination pass, then the TAO may omit its own redundancy elimination pass.

The TAO may also receive external guidance in the form of performance information from the RTS, passed into the TAO from the PAO as auxiliary information to the LLVM IR form of the code. Performance information can influence optimization, ranging from decisions about path frequencies and code motion through the placement of advisory prefetch operations. As we renovate optimizations in LLVM, we will look for opportunities to apply runtime performance information to improve code quality.

**Targeting the Native Compiler**  A critical aspect of the TAO’s mission is to shape the optimized source code produced for the input application so that the vendor compiler can produce efficient code for it. The TAO derives its understanding of the strengths and weaknesses of the vendor compiler from the compiler-characterization results produced by the PACE RC. From the TAO perspective, those tools measure effective capacities of low-level resources available to compiled C code. Examples include the number of simultaneously live values that the compiler can sustain without spilling and the amount of instruction-level parallelism that it can use productively. The compiler characterization tools also discern code shape issues that improve the performance of the code generated by the native compiler. Examples include support for multiply-add instructions and the performance tradeoff between array addressing expressions and pointers.

The TAO effort will explore using compiler characterization information to tailor the C code that it generates for the native compiler. For example, if the vendor compiler can use a multiply-add operation, that fact may change the expression shapes generated by algebraic reassociation [15, 23]. In particular, if the native compiler expects the multiply and add instructions to appear in a particular order, the TAO will need to generate the instructions in the correct order. If compiler characterization shows that a vendor compiler does a poor job of optimizing array address expressions, the TAO might emit low-level, pointer-style C code and optimize accordingly. Similarly, operator strength reduction should change its behavior based on the availability of addressing modes and autoincrement or decrement.

### 9.3.4 Producing Answers to PAO Queries

When the PAO invokes the TAO as an oracle, the PAO passes the TAO a synthetic code fragment encapsulated in a function; standard PAO–TAO auxiliary information, including profile and alias information; and a query data structure requesting particular information. The synthetic function will consist of a code region that contains a single loop nest, a loop body, or an entire function.

The TAO produces, as its primary output, an updated query data structure containing metric information on the synthetic code that it would have compiled. Examples of PAO queries include requests for an estimate of register pressure, critical-path length in a code region, or ILP. Details related to the types of queries that the PAO will generate can be found in §5.3.5.

When responding to queries from the PAO, the TAO may provide additional feedback to the PAO on the effectiveness of the transformed code produced by the PAO. For example, if the TAO scheduler detects that there is too little ILP in the PAO-provided synthetic code fragment, the TAO will inform the PAO that there is insufficient ILP when it responds to the PAO’s original query. If the TAO finds that register pressure is too high in a loop, it will inform the PAO; the PAO will know to transform the code in a way that reduces the demand for registers. If the TAO identifies a loop that would benefit from software pipelining, it will inform the PAO; the PAO may remove control flow to support the decision to software pipeline the loop.
We are exploring two possible strategies for producing responses to PAO queries: a query back-end that uses measured resource characteristics to model the target architecture and produce answers to PAO queries based on the resulting code, and a low-cost scheme that estimates answers to PAO queries without performing code generation. We will study the efficiency and accuracy of the two approaches as well as the quality of the final code generated by the two approaches. We will use these results to determine whether we should select a single approach or use some combination of the two approaches to produce answers to PAO queries.

**Using a Query Backend to Generate Responses** When the TAO is invoked to answer specific queries from the PAO (as described in § 5.3.5), it will optimize the code just as if it were generating code for the synthetic functions. If the PAO does not provide optimization directives, the TAO will use its default optimization plan to imitate the native compiler optimization. After optimization, the TAO will invoke a generic backend designed specifically for the query synthetic function.

The query backend will use a simple RISC ISA, such as the SimpleScalar ISA, parameterized by the system characteristics measured by the PACE RC. The query backend will generate code for the ISA and directly measure characteristics of the generated code, such as the machine code size in bytes. The query backend will also measure the code generator’s behavior in an introspective way. (For example, the register allocator will record how much spill code it generates and the instruction scheduler will record the schedule density and efficiency.) These direct and introspective measures form the basis for replies to the set of queries that the TAO supports.

**Using Low-Cost Estimates to Generate Responses** The prototype version of the PAO/TAO query interface will compute the following low-cost estimates for each synthetic function: (1) \textsc{maxlive} to estimate register pressure; (2) \textsc{spillcost} of the generated code; (3) critical-path length estimate; (4) cost of SIMDizing the synthetic function. The \textsc{maxlive} information is combined with the RC characteristics for the register file of the underlying architecture to determine the \textsc{spillcost} of the generated code using a simple register allocation algorithm such as graph coloring. Another non-trivial point to consider is for the native compiler to take the exact same decision of spilling as that of TAO while generating code for the synthetic function. This involves communicating spill decisions to the native compiler either using register attributes for non-spill variables or by moving spilled locals to aliased memory locations.

The critical-path (CP) length estimate is an indication of the amount of instruction-level parallelism available in the synthetic function. For straight-line and acyclic code regions in the synthetic function, CP is determined using a dependency graph of the LLVM IR instructions. Each such instruction is associated with a cost stored in the RC. A simple depth-first traversal of the dependency graph yields the critical-path length estimate of the synthetic function. For code regions with multiple control flow paths, we can use either of the following approaches: (1) compute critical path length for each control flow path and weight them based on profile information; (2) control dependence based critical path length estimate. Approach (1) needs accurate profile information to limit the combinatorial explosion of the number of control flow paths.\footnote{We are still deciding on the technique for dealing with cyclic code regions and software pipelining.}

SIMDization is an important optimization performed in the PAO. The PAO would like to know the cost of current high-level SIMDization performed for the synthetic function and if possible, would like to get a feedback on any improved SIMDization using code shaping and SIMD code selection. This analysis in the TAO requires cost estimates for various vector instructions and the length of the vector unit from the RC unit. Our approach in the TAO is to build the data dependence graph and perform a vector code selection algorithm based on these costs.

\footnote{We are currently exploring the feasibility of basing the query backend on LLVM’s Alpha native code generator. The Alpha backend was designed to be compatible with the SimpleScalar simulator.}
The above described estimates are computed in an efficient manner in terms of time and space. The results are accumulated in a data structure and fed back to PAO when the TAO is invoked in-core for a synthetic function.
Chapter 10

The PACE Runtime System

The principal role of the PACE Runtime System (RTS) is to gather performance measurements of a program execution to support compile-time feedback-directed optimization and online selection of parameters, such as tile sizes and scheduling policies. At a minimum, the RTS uses an interval timer to measure time consumed in various parts of a program. By identifying costly regions in a program, the RTS can direct the PACE Compiler where to focus optimization. If hardware performance counters are available, RTS uses them to gather additional information about resource consumption and inefficiency; such information provides detailed insight into opportunities for improving performance on a target platform. This information can help the PACE Compiler identify appropriate optimizations needed to improve performance.

10.1 Introduction

The purpose of the PACE Runtime System (RTS) is to measure the performance of program executions with three aims: to help identify important program regions worthy of intensive optimization, to provide data to support feedback directed optimization, and to provide a harness that supports measurement-driven online parameter selection. Here, we describe the functionality and design of RTS, along with its interfaces to other components in the PACE system. The performance monitoring infrastructure of RTS builds upon Rice’s HPCTOOLKIT performance tools [59]—open-source software for measurement and analysis of application performance.

10.1.1 Motivation

With each generation, microprocessor-based computer systems have become increasingly sophisticated with the aim of delivering higher performance. With this sophistication comes behavioral complexity. Today, nodes in microprocessor-based systems are typically equipped with one or more multicore microprocessors. Individual processor cores support additional levels of parallelism typically including pipelined execution of multiple instructions, short vector operations, and simultaneous multithreading. In addition, microprocessors rely on deep multi-level memory hierarchies for reducing latency and improving data bandwidth to processor cores. At the same time, sharing at various levels in the memory hierarchy makes the behavior of that hierarchy less predictable at compile time.

As the complexity of microprocessor-based systems has increased, it has become harder for applications to achieve a significant fraction of peak performance. Attaining high performance requires careful management of resources at all levels. To date, the rapidly increasing complexity of microprocessor-based systems has outstripped the capability of compilers to map applications onto them effectively. In addition, the memory subsystems in microprocessor-based sys-
systems are ill suited to data-intensive computations that voraciously consume data without significant spatial or temporal locality. Achieving high performance with data-intensive applications on microprocessor-based systems is particularly difficult and often requires careful tailoring of an application to reduce the impedance mismatch between the application’s needs and the target platform’s capabilities.

To improve the ability of the PACE Compiler to map applications onto modern microprocessor-based systems, the PACE RTS will collect detailed performance measurements of program executions to determine both where optimization is needed and what problems are the most important targets for optimization. With detailed insight into an application’s performance shortcomings, the PACE Compiler will be better equipped to select and employ optimizations that address them.

10.2 Functionality

Figure 1.2 shows the major components of the PACE system and the interfaces between them; in that figure, the RTS components appear in blue. Figure 10.1 shows the inputs and outputs of the PACE RTS. The RTS will provide support for guiding online and offline optimization. This support comes in several forms:

- Runtime monitoring of metrics that can be measured using timers and/or hardware performance counters.
- Attribution of metrics to static and dynamic program contexts.
- A framework for providing performance profile information to (1) the AAP to support application partitioning, and (2) the machine learning tools and the PAO to support offline feedback-directed optimization.
- A framework for runtime parameter selection based on measured metrics.

The measurement subsystem of the RTS monitors the performance of an executable in machine-code form. There are two ways in which the measurement subsystem can be used: it can be statically linked into an executable at program build time, or for dynamically-linked executables, it can be pre-loaded into the application’s address space at launch time. In either case, when the program is launched, the measurement subsystem is initialized, environment variables

![Figure 10.1: PACE Runtime System inputs and outputs.](image-url)
10.2. FUNCTIONALITY

set by a measurement script are read to determine what to monitor, and then execution begins with monitoring enabled.

10.2.1 Interfaces

There are several interfaces between the RTS and the rest of the PACE system.

- An RTS measurement script will drive application characterization by the measurement subsystem. The script will repeatedly execute an application to survey performance metrics that will be used to guide compilation.

- The RTS measurement subsystem will interpose itself between the application and the operating system on the target platform to intercept program launch and termination, creation and destruction of threads and processes, setup of signal handlers, signal delivery, loading and unloading of dynamic libraries, and MPI initialization/finalization.

- A profiler associated with the measurement subsystem will analyze binary measurement data recorded by the measurement system and produce call tree profiles in XML form that will be read by the Application-Aware Partitioner (§4.3.1).

- A performance analyzer associated with the PACE RTS will digest performance profile data in XML format and provide an XML file that contains high-level quantitative and qualitative guidance to the Platform-Aware Optimizer about resource consumption, costs, and inefficiencies.

10.2.2 Input

The components of the PACE RTS receive several kinds of inputs from other parts of the PACE system.

Measurement script. An RTS measurement script will drive application characterization by repeatedly executing an application under control of the measurement subsystem to survey performance metrics that will be used to guide compilation. Inputs to the measurement script are an application in machine code form and a specification of a test input for the program (arguments, input files, etc.). What characteristics to measure will be derived from the hardware counters available on the target platform. If a characteristic is to be measured using asynchronous sampling, the RTS will choose an appropriate period for sampling the characteristic. The compiler drive provides a default measurement script in the application’s working directory.

Online feedback-directed optimizer. The RTS will include a harness to support online feedback-directed optimization. During compilation, the Platform-Aware Optimizer (PAO) may determine that certain parameters may benefit from runtime optimization (Section 5.3.8). The PAO will present the RTS with a closure that contains an initial parameter tuple, a specification of the bounds of the parameter tuple space, a generator function for exploring the parameter tuple space, and a parameterized version of the user’s function to invoke with the closure containing the parameter tuple and other state.

10.2.3 Output

The RTS measurement subsystem will produce a raw profile XML document that will associate static and/or dynamic contexts (which can include call paths, procedures, loops, and line numbers in source files) annotated with measured values of performance metrics, including call counts. It stores the raw profile document in an appropriate subdirectory of the application’s working directory. The RTS performance analysis subsystem will augment the raw profile XML document with
The RTS performance analysis subsystem will register the name of the executable, the time of a run, and the location of the performance information produced by the RTS with the PACE Compiler and the PACE Machine Learning tools using callbacks provided by each of these subsystems. Within the PACE Compiler, call tree profiles collected by RTS are used by the Application-Aware Partitioner to guide the process of creating refactored program units (§4.3.1). RTS performance profiles will also be used by the Application-Aware Partitioner (§ 4.6.9) and Platform-Aware Optimizer (§ 5.3.9) to support feedback-directed changes to the application’s optimization to improve memory hierarchy utilization by adjusting data layouts (e.g. adding inter-variable or intra-variable padding; transposing arrays) and adjusting the code shape as necessary.

10.3 Methods

10.3.1 Measurement

The PACE Runtime System must accurately measure and attribute the performance of fully optimized applications. It is important to have an accurate measurement approach that simultaneously exposes low-level execution details while avoiding systematic measurement error, either through large overheads or through systematic dilation of execution. For this reason, the PACE RTS will build upon Rice’s HPCTOOLKIT performance tools [59] as the basis of its measurement subsystem. The measurement subsystem will record profiles in a collection of files in a compact binary form that associates metric values with the static and/or dynamic contexts (identified by machine-code addresses) where the metrics were measured. Below, we outline the methods used for measuring application performance.

Asynchronous sampling. HPCTOOLKIT primarily uses asynchronous sampling rather than instrumentation to measure performance. Asynchronous sampling uses a recurring event trigger to send signals to the program being profiled. When the event trigger occurs, a signal is sent to the program. A signal handler then records the context where the sample occurred. The recurring nature of the event trigger means that the program counter is sampled many times, resulting in a histogram of program contexts. Asynchronous sampling can measure and attribute detailed performance information at a fine grain accurately as long as (1) code segments are executed repeatedly, (2) the execution is sufficiently long to collect a large number of samples, and (3) the sampling frequency is uncorrelated with a thread’s behavior. Under these conditions, the distribution of samples is expected to approximate the true distribution of the costs that the event triggers are intended to measure.

Event triggers. Different kinds of event triggers measure different aspects of program performance. Event triggers can be either asynchronous or synchronous. Asynchronous triggers are not initiated by direct program action. HPCTOOLKIT initiates asynchronous samples using either an interval timer or hardware performance counter events. Hardware performance counters enable HPCTOOLKIT to statistically profile events such as cache misses and issue-stall cycles. Synchronous triggers, on the other hand, are generated via direct program action. One example of an interesting event for synchronous profiling is lock acquisition; one can measure the time per call to look for lock contention.

Call path profiling. Experience has shown that comprehensive performance analysis of modern modular software requires information about the full calling context in which costs are incurred. The calling context for a sample event is the set of procedure frames active on the call stack at the time the event trigger fires. We refer to the process of monitoring an execution to record the calling
contexts in which event triggers fire as call path profiling.

When synchronous or asynchronous events occur, the measurement subsystem records the full calling context for each event. A calling context is a list of instruction pointers, one for each procedure frame active at the time the event occurred. The first instruction pointer in the list is the program address at which the event occurred. The rest of the list contains the return address for each active procedure frame. Rather than storing the call path independently for each sample event, we represent all of the call paths for events as a calling context tree (CCT) [5]. In a calling context tree, the path from the root of the tree to a node corresponds to a distinct call path observed during execution; a count at each node in the tree indicates the number of times that the path to that node was sampled.

Exposing calling patterns. Besides knowing the full calling context for each sample event, it is useful to know how many unique calls are represented by the samples recorded in a calling context tree. This information enables a developer interpreting a profile to determine whether a procedure in which many samples were taken was doing a lot of work in a few calls or a little work in each of many calls. This knowledge in turn determines where optimizations should be sought: in a function itself or its call chain. To collect edge frequency counts, we increment an edge traversal count as the program returns from each stack frame active when a sample event occurred. We do this by having the trampoline function increment a “return count” for the procedure frame marked by the sentinel as it returns. A detailed description of this strategy can be found in our prior work [35].

Coping with fully optimized binaries. Collecting a call path profile requires capturing the calling context for each sample event. To capture the calling context for a sample event, the measurement must be able to unwind the call stack at any point in a program’s execution. Obtaining the return address for a procedure frame that does not use a frame pointer is challenging since the frame may dynamically grow (space is reserved for the caller’s registers and local variables; the frame is extended with calls to alloca; arguments to called procedures are pushed) and shrink (space for the aforementioned purposes is deallocated) as the procedure executes. To cope with this situation, we developed a fast, on-the-fly binary analyzer that examines a routine’s machine instructions and computes how to unwind a stack frame for the procedure [66]. For each address in the routine, there must be a recipe for how to unwind the call stack. Different recipes may be needed for different intervals of addresses within the routine. Each interval ends in an instruction that changes the state of the routine’s stack frame. Each recipe describes (1) where to find the current frame’s return address, (2) how to recover the value of the stack pointer for the caller’s frame, and (3) how to recover the value that the base pointer register had in the caller’s frame. Once we compute unwind recipes for all intervals in a routine, we memoize them for later reuse.

To apply our binary analysis to compute unwind recipes, we must know where each routine starts and ends. When working with applications, one often encounters partially stripped libraries or executables that are missing information about function boundaries. To address this problem, we developed a binary analyzer that infers routine boundaries by noting instructions that are reached by call instructions or instructions following unconditional control transfers (jumps and returns) that are not reachable by conditional control flow.

HPCTOOLKIT ’s use of binary analysis for call stack unwinding has proven to be very effective, even for fully optimized code [66]. At present, HPCTOOLKIT provides binary analysis for stack unwinding on the x86_64, Power, and MIPS architectures. On architectures for which HPCTOOLKIT lacks a binary analyzer for call stack unwinding, where available we will use libunwind [53], a multi-platform unwinder that uses information recorded by compilers to unwind the call stack. libunwind currently supports ARM, IA64, x86, x86_64, MIPS, and PowerPC architectures.
Flat profiling. On some platforms, support for call stack unwinding might not be available. On these platforms, the measurement subsystem will use simpler profiling strategy and collect only program counter histograms without any information about calling context. This form of profiling is referred to as flat profiling. Even such simple profiling can quantitatively associate costs with program regions, which can serve to guide a compiler as to where optimization is most important.

Maintaining control over parallel applications. To manage profiling of an executable, HPC-TOOLKIT intercepts certain process control routines including those used to coordinate thread/process creation and destruction, signal handling, dynamic loading, and MPI initialization/finalization. To support measurement of unmodified, dynamically linked, optimized application binaries, HPCTOOLKIT uses the library preloading feature of modern dynamic loaders to preload a profiling library as an application is launched. With library preloading, process control routines defined by HPCTOOLKIT are called instead of their default implementations. For statically linked executables, HPCTOOLKIT provides a script that arranges to intercept process control routines at link time by using linker wrapping—a strategy supported by modern linkers.

Handling dynamic loading. Modern operating systems such as Linux enable programs to load and unload shared libraries at run time, a process known as dynamic loading. Dynamic loading presents the possibility that multiple functions may be mapped to the same address at different times during a program’s execution. During execution, the measurement subsystem ensures that all measurements are attributed to the proper routine in such cases by dividing an execution into intervals during which no two load modules map to overlapping regions of the address space.

10.3.2 Profile Analysis

For measurements to be useful, they must be correlated with important source code abstractions. Profiles collected by the measurement subsystem will be digested by hpcprof, a tool that will correlates measured metrics with static and dynamic contexts at the source code level. hpcprof produces a profile XML document that associates static and/or dynamic contexts (which can include call chains, procedures, loops, and line numbers in source files) annotated with measured metric values. Here, we briefly outline the methods used by hpcprof to correlate profile data with static and dynamic application contexts.

Correlating performance metrics with optimized code Measurements are made with reference to instruction addresses in executables and shared libraries; it is necessary to map measurements back to the program source for them to be of much use. To associate sample-based performance measurements with the static structure of fully optimized binaries, we need a mapping between object code and its associated source code structure. HPCTOOLKIT’s hpcstruct constructs this mapping using binary analysis; we call this process recovering program structure.

hpcstruct focuses its efforts on recovering procedures and loop nests, the most important elements of source code structure. To recover program structure, hpcstruct parses a load module’s machine instructions, reconstructs a control flow graph, combines line map information with interval analysis on the control flow graph in a way that enables it to identify transformations to procedures such as inlining and account for transformations to loops [66].

Several benefits naturally accrue from this approach. First, HPCTOOLKIT can expose the structure of and assign metrics to what is actually executed, even if source code is unavailable. For example, hpcstruct’s program structure naturally reveals transformations such as loop fusion and scalarized loops implementing Fortran 90 array notation. Similarly, it exposes calls to compiler support routines and wait loops in communication libraries of which one would otherwise

1Without line map information, hpcstruct can still identify procedures and loops, but is not able to account for inlining or loop transformations.
be unaware. hpstruct’s function discovery heuristics expose distinct logical procedures within stripped binaries.

**Identifying scalability bottlenecks in parallel programs**  By using differential analysis of call path profiles collected by the measurement subsystem, the RTS will pinpoint and quantify scalability bottlenecks in SPMD parallel programs [22, 68]. Using a technique we call *blame shifting*, one can attribute precise measures of lock contention, parallel idleness, and parallel overhead in multithreaded programs [65, 67]. Combining call path profiles with program structure information, HPCTOOLKIT can quantify these losses and attribute them to the full calling context in which they occur.

**10.3.3 Analyzing Measurements to Guide Feedback-directed Optimization**

Identifying performance problems and opportunities for tuning often requires synthesizing performance metrics from two or more hardware performance counters. In general, our plan is to calculate and attribute *wasted cycles* associated with various features in a program.

We can measure or estimate exposed memory latency from hardware performance counters. Using instruction-based sampling support in AMD Opterons [32], one can measure the memory latency observed by an instruction directly. On systems that support only event-based sampling, we plan to estimate memory latency by multiplying numbers of cache misses at each level by their measured latency. When hardware counters permit, we plan to estimate exposed memory latency by combining measurements of total latency with measures of memory parallelism made with other hardware counters. We plan to measure and attribute the cost of pipeline stalls due to integer operations, floating point operations, and mispredicted branches. We will estimate total delay due to mispredicted branches in each context by multiplying the number of mispredicted branches by the delay each one causes. We will also compute instruction balance measures that will show the ratios of memory accesses, integer operations, branches, and floating point operations.

These metrics will highlight opportunities for improving efficiency that can be targeted by feedback-directed optimization in the PACE Platform-Aware Optimizer.

**10.3.4 Online Feedback-directed Parameter Selection**

The RTS will provide a harness to be used for online feedback-directed parameter selection. This harness can be used to select parameter settings for tilings and select among code variants. As input to this harness, the PACE Platform-Aware Optimizer will provide a closure (§5.3.8) that includes the following information:

- A function that represents a parameterized region of application code. This code takes as input the closure.
- A parameter tuple that represents the current parameter setting. Initially, this tuple will contain the PAO’s best estimate of the optimal parameter settings.
- The bounds of the parameter tuple space that needs to be searched.
- A generator function that takes as inputs (1) the current parameter tuple, (2) a map from parameter tuples to a vector of metrics that represent observed performance, and (3) the bounds of the parameter tuple space. The generator function will return the next parameter tuple, which may be the same as the current parameter tuple.
- A set of performance metrics that will be used to assess the goodness of a particular parameterization of a code region. Metrics may include time and perhaps hardware performance counter measures.
• Inputs other than the parameter tuple needed by the region of parameterized code.

• A flag that indicating whether or not this is the first use of this closure.

• A map between parameter tuples and runtime performance metrics. This map may be initially empty, or it may be partially filled in with information from the knowledge base.

The RTS will provide a harness for online feedback-directed optimization that uses this closure in the following way. If this is not the first invocation of the harness, the generator function will be invoked with the current parameter tuple and a map from tuples to a vector of measured metrics. The generator function will determine the next parameter tuple to try if the current parameter tuple is not satisfactory. The harness will arrange to measure the performance metrics specified. The harness will then call the parameterized application code using the current parameter tuple. The measured performance metrics for this tuple will be added to a map of tuples to metric vectors.

We expect to code a standard library of generator functions. Some generator functions may be as simple as an exhaustive search of the parameter space. Others may perform a sophisticated exploration of the parameter space using algorithms such as direct search, hill climbing, or other optimization techniques. In our design, the nature of the generator functions and the representation for a parameter tuple is of no consequence to the RTS harness, which merely needs to be able to invoke the provided components in the aforementioned manner. For that reason, we expect to use the same harness to perform online feedback-directed optimization for a multiplicity of purposes, including selection of tiling and scheduling parameters.

Results of the online feedback-directed optimization will be recorded in the application’s working directory, where they will be accessible by the PACE Machine Learning tools to help improve both the initial parameter tuple and the parameter spaces suggested by the PAO, and accessible by the compiler to improve its subsequent optimizations of the same code.

10.4 Results

The core of the PACE RTS measurement subsystem based on HPCToolkit is operational. Call path profiling using binary analysis is supported on Linux systems based on x86_64, Power32, Power64, and MIPS processors. Using libunwind, the measurement subsystem will soon support call path profiling on ARM, IA64, x86 processors. The HPCToolkit-based measurement infrastructure uses the PAPI library for hardware performance counter measurements. On older Linux kernels, PAPI is supported using either the Perfmon2 or Perfctr kernel patches. As of Linux kernel 2.6.32, PAPI uses the built-in perf events drivers, which are enabled by default.

The measurement subsystem can collect performance profiles for timer and hardware performance counter events. hpcprof digests profiles from the measurement subsystem and assembles them into a profile XML document. The PACE Application-Aware Partitioner reads call tree profile XML documents produced by hpcprof and uses them to construct refactored program units.
Chapter 11
Machine Learning in PACE

11.1 Introduction - Machine Learning for Compiler Optimization

11.1.1 Motivation

The central objective of the PACE project, which is to provide portable performance across a wide range of new and old systems, and to reduce the time required to produce high-quality compilers for new computer systems, can greatly be helped by machine learning.

As an example, consider a problem in the PACE context: Given a program, a target system and a compiler, predict a good compiler configuration, such as a list of compiler flag settings which yields fast execution for the program. We shall refer to this problem as the “flag-setting problem”. The selection of optimizations is part of the PACE compiler optimization plan; in particular, the generation of optimization directives (§ 3.2.3). The selection of optimizations that yields fast execution (optimum performance, in general) depends on the characteristics of the target system, the characteristics of the program being compiled, and the characteristics of the compiler. The relationship between the flag settings and the performance can be viewed as a relationship among points in a multidimensional space, spanned by the variables which characterize the program being compiled, the target system, the compiler flag settings and the performance.

To address this problem, a human designer uses past experience by remembering and applying a list of compiler flag settings used for similar programs encountered before; or by constructing a good list of settings based on trial runs of the program of interest. Thus the success of the designer depends on the ability to remember past experience, on the ability to distill, abstract, and generalize knowledge from past experience, and on the ability to spot patterns in the complex multidimensional space of non-linear interactions. This, in itself, is a formidable task. Furthermore, all this experience and knowledge might become irrelevant if the target system changes, and it would involve massive effort to re-acquire the relevant knowledge to be able to use the compiler effectively in a new target system. This is the central problem that the PACE project seeks to remedy. To remedy this problem, automation is needed to effectively and efficiently characterize the platform interactions: the interactions between programs, target systems, and compilers and use this characterization to optimize these interactions.

Machine learning aims to develop models of such complex relationships by learning from available data (past experience or from controlled experiments). The learned models facilitate discovery of complex patterns and recognition of patterns of known characteristics, in huge, unorganized high-dimensional parameter spaces, thereby making optimization tasks tractable and aiding in intelligent decision making.

The machine learning group of the PACE effort is concerned with developing techniques to...
learn from the complex multidimensional data spaces that characterize the often non-linear interactions between programs, target system, and compiler optimizations. The result of the learning—the knowledge, captured in learned models of relevant optimization scenarios—can then be deployed and used in a variety of PACE related tasks such as compile-time program optimization (for speed, for memory usage, etc.), or for resource characterization. Moreover, with certain machine learning techniques, the models deployed after initial satisfactory off-line training could learn continuously in a run-time environment. This not only enables their use as oracles but allows ongoing improvement of their knowledge based on run-time feedback about optimization success.

11.1.2 Prior Work

Machine learning for compiler optimization is a relatively new area, with much unexplored territory. The following is a summary of what has been accomplished in the past ten years, showing some demonstrable but not dramatic performance improvements. This leaves significant opportunities for further advances in this area. Prior work can roughly be divided into two categories, machine learning for optimization and machine learning to characterize platform interactions.

Machine learning for compiler optimization

Stephenson et al. use genetic programming (genetic algorithms applied specifically to programs) to determine priority functions used in compiler optimizations [64]. Priority functions are used extensively in compiler optimization heuristics. For example, in instruction scheduling algorithms, priority functions are used to assign priorities to instructions which in turn determine the instruction schedule (in general, the order of resource allocation.) When compared to hand-tuned priority functions used in the Trimaran compiler, a program-specific priority function for hyperblock formation yields an average improvement of about 25% in running time for the SpecInt, SpecFP, and Mediabench benchmark suites. A program agnostic priority function yields about 9% improvement on the average. Further discussion on the applicability of genetic algorithms to compiler optimization can be found in § 11.3.3. Cavazos et al. have used logistic regression, which is a technique to compute statistical correlation, to determine method-specific optimization settings [20] in Jikes RVM for a set of benchmarks drawn from SPECjvm, SPECjbb and DaCapo suites. The authors report improvements in execution time ranging from an average of 4% over −O0 optimization level with a corresponding improvement of 5% in total running time (the sum of the program execution time and the JVM), to a 29% (and 0%) improvement over −O2.

A similar approach has been used for the SPEC 95 FP, SPEC 2000 FP and INT, Polyhedron 2005, and MiBench benchmarks in the EKOPath compiler [19]. Average improvement of 17% in running time over all benchmarks over -Ofast setting (the highest optimization setting in the EKOPath compiler) has been reported. Agakov et al. construct Markov models to predict the effectiveness of optimizations and use this to inform an iterative search to determine good optimization sequences. This approach yields about 33% improvement in running time on a TI processor, after 5 rounds of searching whereas random search yields only about 32% improvement even after 50 rounds of searching.

Machine learning to characterize platform interactions

Cooper et al. and Almagor et al. [24, 4] characterized the space of compiler optimizations and its impact on the performance. The authors report that randomly evaluating 4 neighbors (the 4 most similar sequences) of a given optimization sequence yields more than 75% probability of finding a better optimization sequence. Furthermore, 13% of local minima are within 2% of the best possible performance and about 80% of local minima are between 2% and 2.6% of the best possible performance, making descent algorithms with random restarts an ideal candidate to search for good
11.2. FUNCTIONALITY

optimization sequences. Joshi et al. attempt to use target system independent metrics to group similar programs from a benchmark suite [42]. The aim is to determine a representative subset of programs.

The reader is referred to “Survey of Machine Learning for Compilers” by the PACE machine learning group in the Rice PACE repository for a more thorough survey and comments on the strengths and weaknesses of each of these works.

The need for further development

This surveyed body of work demonstrates that machine learning can be successfully used to specialize compilers to new architectures (by tuning priority functions, for example). Though performance improvements have been reported, the effectiveness of machine learning itself has not been documented in most cases. For example, in the context of compiler optimization [19], it is not clear whether performance improvements arise from good decisions made by effective learning or from choosing randomly from a list of pre-filtered flag settings known to yield good performance. Joshi et al. achieve poor results in platform characterization. For example, representative programs (as determined by their technique) have an average cache miss rate which is about 40% more than the average cache miss rate of the entire benchmark suite. Thus further development is needed to (1) separate and quantify the effectiveness of the learning process itself and (2) to adopt more sophisticated machine learning techniques with the aim of effecting more dramatic performance increase in compiler optimizations.

11.2 Functionality

11.2.1 What Machine Learning Will Accomplish

Machine learning will be used to effectively and efficiently characterize the complex interaction

![Figure 11.1: An overview of the PACE system](image)

Machine learning will be used to effectively and efficiently characterize the complex interaction
between program characteristics, target system characteristics and compiler characteristics. This will be useful for solving problems encountered in several PACE tasks. As shown in Figure 11.1, which is an annotated overview of the PACE system presented in Figure 1.2 (§ 1.2.1), machine learning (ML) engines (marked ML1, ML2, ML3, ML4 and MLxx in rectangular boxes) are envisioned to help with the tasks in the Platform Aware optimizer (PAO), the Target Aware Optimizer (TAO) and the Run Time System (RTS). These engines correspond to the four PACE tasks identified in § 11.2.2 as likely to benefit from machine learning. From the point of view of the Run Time System the relevant ML engines will supplement the generator function (described in § 10.2.2) to help with the tasks of the RTS such as online feedback-directed parameter selection (described in § 10.3.4).

These ML engines will be provided data about the target system characteristics 1⃝, program characteristics 2⃝ and compiler characteristics 3⃝, by the subsystems where these circled numbers are indicated. Thus each of the circled numbers correspond to an arrow from the corresponding PACE subsystem to the ML subsystem.

Machine learning is data-driven therefore, the availability of known instances is essential. For example, revisiting the flag setting problem, machine learning can be used to learn the relationship between the program being compiled, the target system, the compiler flag settings and the performance from known instances. Briefly, as shown in Figure 11.2, a mapping $Y = f(X)$ exists from elements of an input space $X$ (the program, compiler and target system characteristics) to elements of an output space $Y$ (the compiler flag settings), where $f$ is unknown. The role of machine learning is to construct a model based on known instances (known input-output pairs or labeled training data), which approximates the mapping $f$ as well as possible, based on the quality of the training instances. In the context of supervised learning, assuming the availability of a set $X_{\text{labeled}}$ of known input-output pairs, elements from a subset $X_{\text{labeled}}^{\text{training}} \subset X_{\text{labeled}}$ are used by the machine learning system to construct a model by adjusting model parameters so that a good approximation of the actual mapping $f$ is learned. A good learning process results in good generalization of the learned model, i.e., the model will make good predictions for patterns which were not part of the training set $X_{\text{labeled}}^{\text{training}}$. The learned model is then used for predicting a list of compiler flag settings for good performance for new programs that will be encountered by the PACE system.

11.2.2 Optimization Tasks Identified for Machine Learning

Four tasks have been identified as likely candidates to benefit from machine learning. The corresponding envisioned machine learning engines are indicated in Figure 11.1 in rectangular boxes labeled ML1 through ML4. In the context of compiler optimization, we use the term "good performance" to mean performance, in terms of execution time, code size or some other metric, which is reasonably close to the optimal performance or is a dramatic improvement over the baseline (unoptimized) performance.

1. Determination of tile size to optimize performance of a nested loop (ML1 in Figure 11.1)
2. Determination of compiler flag settings for good performance of a program (ML2 in Figure 11.1)
3. Prediction of program performance based on program characteristics (ML3 in Figure 11.1)
4. Determination of a good sequence of compiler optimizations for good performance of a program (ML4 in Figure 11.1)

For each of these tasks, the input variables (input features that make up the input feature vectors) will include descriptors of the target system, descriptors of the program, and the compiler, while output variables (output features) may be program performance indicators, compiler flag settings, or optimization sequences, as dictated by the particular ML task. The input and output variables can and will vary across different versions of models - typically models of progressive
levels of complexity - for a task. For example, an initial, simple version of ML Task 1 (prediction of good tile sizes) may be designed for a single hardware platform in order to establish data need and baseline success without the complication of multiple platforms. In this case system characteristics need not be described in the input feature vector since they would be the same for all inputs. Once a simple model is shown to make acceptable predictions, we can proceed to set up a more complex model by including system characteristics such as cache sizes, line sizes, associativity, etc., in the input feature vector. The previously studied simple model will also help estimate data need for training a more complex model. Another reason for varying input and output features through different models for the same ML Task is to test the descriptive power of different sets of variables which may characterize the same properties. (For example, both the number of cache misses and the number of stall cycles can characterize the same aspect of the memory subsystem performance.) Selection of variables is guided by the accumulated experience of compiler experts, both within and outside the PACE teams, and may require separate models to work with non-overlapping sets of variables recommended by different expert groups. For these reasons, in this design document we are providing sets of typical variables that will likely be used, in various combinations, throughout a number of models that we will develop for each of the ML1 - ML4 Tasks. The specific set of variables for each model will be decided at the time a particular model is considered, and will often depend on the outcome of experiments with a previous model. We are including one specific feature set, as an example, for our first concrete model for Task ML1, at the end of §11.2.2. Working lists of relevant variables, determined by PACE team members as well as adopted from literature, are maintained in the PACE Owl space in PACE Resources/Machine Learning/Data Source/Variables for ML.xlsx file and will be revised as we accumulate experience.

Variables which capture the relevant target system characteristics will be obtained from the resource characterization (RC) subsystem of the PACE system. The target system characteristics, indicated as 1⃝ in Figures 11.1 and 11.3, for which measurement methodologies have been built so far are listed in §2.2.3. Program characteristics indicated as 2⃝ in Figures 11.1 and 11.3 will be obtained from the PAO subsystem and the RTS. Compiler characteristics, indicated as 3⃝ in Figures 11.1 and 11.3, will be obtained from the TAO subsystem and the Native Compiler (NC).
On a more general level we should point out that selection of appropriate input and output variables that describe causes and consequences needs to be done in two contexts. The first is a determination and listing of variables that potentially carry important information for the given problem. Such variables must be provided by compiler experts based on their understanding of how to represent relevant properties of programs, target systems, etc., and on their experience with program optimization. ML experiments should start with using as complete subsets of these expert-recommended variables as possible, and as appropriate for models of progressively increasing levels of complexity.

Once satisfactory prediction performance is achieved with ML for a given (version of a) task, we have a baseline of how well a model can perform when using all possible expert-recommended variables (pertinent to the given version of a task). The number of these variables, however, can be very large even if we do not count alternative variables for the description of similar causes (e.g., program properties). This is the second context in which variable selection should be considered, now as a subselection from the set of variables that were used to obtain the baseline results. Obviously, elimination of variables must not result in decline of prediction quality. Deselection of variables can be done with various dimensionality reduction approaches. Dimensionality reduction approaches that involve a transformation of the feature space, such as Principle Components Analysis (PCA), make it difficult or impossible to relate the transformed variables to the known meaningful quantities described by the original variables. Therefore, approaches that can assess the relative importances of the dimensions (variables) in the original feature space are much more advantageous. Approaches for the assessment of the relative importances of variables can be divided into two groups also from another point of view. The majority of available techniques make a determination with no regard to a known analysis objective. For example, in view of a known classification goal the important variables may be vastly different from those determined without taking this goal into account. PCA, for example, would eliminate data based on statistical significance (as derived from the magnitude of the eigenvalues). However, this may eliminate the information needed to separate small classes, or to separate classes with slight differences, meaningful for the given problem. Furthermore, linear techniques, and techniques that use low-order statistics only, may miss relevant variations in the data. For all these reasons, non-linear techniques that can also take classification goals into account should be preferred for the determination of rel-
evant variables, in the case of complicated data such as we have in PACE. One recent technique is relevance learning, published originally as GRLVQ (Generalized Relevance Learning Vector Quantization, [38]) and further developed specifically for high-dimensional data (GRLVQ Improved, [46]). These are two of very few available methods that jointly optimize a classification goal and the relevance weighting of the variables (see overview in [46]). GRLVQ(I) are non-linear machine learning approaches. We now describe the four machine learning tasks in greater detail.

**Determine tile size to maximize performance of a nested loop**

Given a nested loop in a program, the tile size that minimizes the average cost of memory access for data accesses from the loop, yields the best possible performance for the loop. Thus tile sizes that yield good performance can be determined by predicting the average cost of memory access corresponding to several instances of tile sizes and selecting a good instance. The selection of good tile sizes by the machine learning engine illustrated in Figures 11.1 and 11.3 and marked “ML1”, would be helpful in program optimization tasks in the PAO as well as in the RTS where run time decisions on tile sizes in parametrized tiled code could be performed.

The average cost of memory access is the result of complex interaction between the memory hierarchy of the target system and the loop that uses a specific tile size. To illustrate key variables and their complex interaction, we use a vastly simplified example, but we emphasize that machine learning can be used for much more complex cases. (In fact, the whole point of machine learning is to be able to derive, from known examples, such complicated models of input / output relationships that cannot be given by closed formulae or easy-to-describe rules)

Consider a nested loop in Code Fragment A that accesses elements from a matrix of size $N \times N$.

A.1. For $i = 1$ to $N$
A.2. For $k = 1$ to $M$
A.3. For $j = 1$ to $N$
A.4. $\text{Matrix}[i,j]$
A.5. End For
A.6. End For
A.7. End For

**Code Fragment A: Untiled Loop Nest**

This loop can be made efficient if the elements Matrix[i,j] accessed in line A.4 can be cached and reused. However, traversing the entire length of the matrix before reusing the elements of a row might be inefficient, since the entire row of the matrix might not fit into the cache. Hence, one method of making the code efficient could be to transform the code to process the matrix “tile by tile” such that each tile fits into the cache and is completely processed before moving to the next tile. The corresponding code might look like this:

B.1. For $\text{Length} = 0$ to $N$/tile_length - 1
B.2. For $\text{Breadth} = 0$ to $N$/tile_breadth - 1
B.3. For $k = 1$ to $M$
B.4. For $i = \text{Length} \times \text{tile_length} + 1$ to $\text{Length} \times \text{tile_length} + \text{tile_length}$
B.5. For $j = \text{Breadth} \times \text{tile_breadth} + 1$ to $\text{Breadth} \times \text{tile_breadth} + \text{tile_breadth}$
B.6. $\text{Matrix}[i,j]$
B.7. End For
B.8. End For
B.9. End For
B.10. End For
B.11. End For

**Code Fragment B: Parametrically Tiled Loop Nest**

In the equivalent code in Code Fragment B, the iteration space of the loop is divided into “tiles”. Lines B.1 and B.2 loop over the first tile, second tile, third tile, ..., $T^{th}$ tile. Lines B.4 and B.5 visit
the points in the iteration space of a given tile. A tile size that is too small would lead to poor performance, since the loop body may not benefit from prefetching. A tile which accesses a piece of the matrix that is too big to fit into the cache, may cause misses in the cache adding memory overhead to the loop code.

- **Target system characteristics** (for each level of the memory hierarchy) such as
  1. The size of the cache, \( L_1 \text{ cache size} \) for the L1 cache
  2. The size of a cache line, \( L_1 \text{ line size} \) for the L1 cache
  3. The associativity of the cache, \( L_1 \text{ associativity} \) for the L1 cache
  4. The replacement policy of the cache, \( L_1 \text{ replacement} \) for the L1 cache

- **Program characteristics** such as
  5. The size of array(s) along each dimension, \( \text{size}_{i,j} \) for the \( j^{th} \) dimension of array \( i \)
  6. The index expression for each dimension \( j \) of the array \( i \), \( \text{expr}_{i,j} \)
  7. The loop iteration range of loop \( i \), \( \text{range}_{i} \)
  8. The size of padding for each dimension \( j \) of the array \( i \), \( \text{padding}_{i,j} \)
  9. The number of loop nesting levels, \( n\text{-nesting} \)

- **Compiler characteristics** such as
  5. For every loop level \( i \) the tile size, \( \text{tile size}_{i} \)
  6. Row or Column major layout, \( \text{layout} \)

Given these variables as input and corresponding execution time (as proxy for cost of memory access) for known instances in parameter regions which do not yield poor performances in an obvious manner, the machine learning system will build (learn) models that characterize this complex interaction. Thus, the learned model can be used for rapid search through the parameter space of (reasonable and not obviously disadvantageous) tile sizes to predict the execution time without having to run the program.

It may seem surprising that we predict execution time corresponding to an input tile size and post process rather than the intuitive approach of predicting good tile sizes directly. This is because several tiles might yield the same execution time and therefore the mapping from execution time to tile size, which is a one-to-many mapping would be difficult, if not impossible to learn in a supervised learning framework (as depicted in Figure 11.2). In contrast the many to one mapping of tile sizes to execution time can be learned. We describe this design decision again in § 11.3.1 and the more general philosophy.

Finally, we give a concrete example of a specific version of the model for ML1 task along with the specific variables we use in the training of that model. This model is the first and simplest version of the ML1 task where the target system is kept constant and therefore we do not need variables to characterize the target system. To describe program characteristics in this case we chose tile size (\( \text{tile size}_{i} \)) the number of accesses and misses in the first and second levels of the data cache respectively (\( \text{L1CDA}, \text{L1DCM}, \text{L2DCA}, \text{L2DCM} \)), the number of accesses and misses in the TLB (\( \text{TLBDA}, \text{TLBDM} \)), and the number of vector instructions which have been executed (\( \text{VECINS} \)) as elements of the input feature vector to predict execution time. The use of execution time as proxy is based on expert opinion that the execution time is a linear function of the average cost of memory access. Likewise, ignoring the effects of vectorization, instruction-level parallelism, out of order execution etc. is based on expert opinion that these aspects do not affect the process of learning the mapping between tile sizes and execution time. Based on the understanding of the descriptive power of the variables included in this simple model, more variables may be considered in a subsequent more complex model. Concretely, the subsequent model we plan will include variables which characterize the effectiveness of hardware prefetch strategy. We think that this will
improve the accuracy of predictions and will help generalize our model across loop bodies and across target systems. The added variables would be the average number of memory references (in a single iteration of the innermost loop) that could and could not be prefetched by the target system \((n_{PF}, n_{NPF})\). The reason for developing our model in an incremental fashion is to separate and understand the various aspects of the interaction between the program and the target system as well as to get a good grasp on the amount of training data required for good model building.

**Determine selection of compiler flag settings for good performance of a program**

Typically a compiler has several flags which turn optimizations on or off, set parameters for various optimizations, and so forth. For example, the flag -finline-functions-called-once, requests the gcc compiler to inline all functions which are called only once. Given a program, a target system and a compiler, one problem is to determine a list of flag settings which produces compiled code with good performance. In the PACE context, the setting of such flags and parameters is part of the generation of optimization directives and parameters for the optimization plan (§ 3.2.3).

The number of choices given \(k\) flags is typically exponential in \(k\). The metric of the quality of the compiled code could be the execution time of the code or the size of the compiled code. In PACE, such flags are passed from the PAO to the TAO as directives (5.2.2). The machine learning engine marked as “ML2” in the Figures 11.1 and 11.3 will assist the PAO in selecting flags for good application performance.

The complexity of this problem arises from the fact that typical compilers have tens to hundreds of flags with an ever larger number of combinations of these flag settings. Furthermore, the effectiveness of specific optimizations depends on the interaction between the characteristics of the program, the target machine and other optimizations performed by the compiler. For example, function inlining may be beneficial, harmful or have no impact on the performance depending on

1. The effect on the instruction cache
2. The effect on the register pressure
3. The effect on other optimizations like constant propagation, common sub expression elimination etc.

Thus the optimal list of compiler flag settings is influenced by

- **Target system characteristics such as**
  1. The characteristics of the memory hierarchy of the target system described above
  2. The size of each type of register file, for example, \(int\) \(reg\) \(size\) for the integer register file size, \(float\) \(reg\) \(size\) for the floating point register file size and so on
  3. The number of each type of functional unit, \(FP\) \(mul\) \(num\) for the number of floating point multipliers, for example
  4. The length of the pipeline, \(pipeline\) \(length\)
  5. The penalty for branch misprediction, \(miss\) \(predict\) \(penalty\) in number of cycles
  6. ···

- **Program characteristics such as**
  5. The dynamic instruction ratio for each type of instruction \(i\), \(dynamic\) \(inst\) \(ratio_i\)
  6. The static instruction ratio for each type of instruction \(i\), \(static\) \(inst\) \(ratio_i\)
  7. The ratio of backward branches to total number of branches, \(forward\) \(branch\) \(ratio\)
  8. The average rate of branch mispredictions, \(branch\) \(mispredict\) \(ratio\)
  9. ···
• Compiler characteristics such as
  5. Callee vs. caller saved registers, *calling convention*
  6. · · ·

By learning from known instances of the mapping between the list of variables which correspond to the characteristics enumerated above and the list of desired flag settings, the desired list of flag settings for a new program will be determined by machine learning. The desired list of flag setting is that list which achieves performance reasonably close to the optimal performance of the compiled code.

**Predict program performance based on program characteristics**

Consider the following scenario where there are two target systems \(A\) and \(B\) whose characteristics are known. For a set \(S\) of programs, the execution characteristics are known for each of the programs in \(S\) on the target system \(A\). For a subset \(S' \subset S\) of programs, the execution characteristics are known for the execution on the target system \(B\). By learning from the execution characteristics of all programs on \(A\) and the execution characteristics of some of the programs on \(B\), the machine learning system will be used to predict the performance of a program \(P \in S \setminus S'\) when \(P\) is executed on the target system \(B\). This engine, ML3 in Figures 11.1 and 11.3 will aid the application characterization task of the PACE system where predicted application performance (and performance of parts of applications such as procedures and loop bodies) serve as an indicator of application bottlenecks. This engine will also aid the RTS system where predicted application performance can serve as a basis for decisions regarding where and when to apply run time optimizations.

**Determine a good sequence of compiler optimizations for good performance of a program**

In typical compilers, not only can optimizations be turned on or off, the *order* in which various optimizations are applied and the number of times they are applied can be controlled as well. For example, optimizations such as dead code removal, common sub-expression elimination, constant propagation and inlining may be performed in an arbitrary order for an arbitrary number of times. Thus one frequently encountered problem is to determine the *sequence* of compiler optimizations to perform to yield good performance, where each optimization may be applied zero or more times. In the PACE context, this problem corresponds to item 5 in the optimization plan (§ 3.2.3).

We distinguish between the task described in § 11.2.2 (ML2) and the task described here (ML4) as follows: In ML2, the task is to determine a selection of flag settings with no implied order of optimizations while in ML4 the problem is to determine a sequence of optimizations which yields good performance. These sequences can be of arbitrary length with possible repetition of optimizations. The corresponding learning engine is marked as "ML4" in Figures 11.1 and 11.3. Of the four tasks that have been identified in this section, this task is the least defined and least understood due to issues elaborated in § 11.3.2. Consequently the accomplishment of this task carries higher uncertainty than that of the other tasks.

The issues involved in effective learning, different machine learning approaches and the challenges associated with applying machine learning in the PACE context are discussed in the next section.

### 11.3 Methodology

#### 11.3.1 Abstraction of PACE Problems For Machine Learning

We developed a framework for expressing compiler optimization problems as machine learning tasks. This is illustrated by the schematics in Figure 11.4 for the specific problem of tile size deter-
mination, described under § 11.2.2. The input and output feature spaces, shown for the general case in Figure 11.2, are determined by what feature(s) we want to learn from what other features. This is explained below through the specific example of determination of optimum tile size for a given loop.

The optimal tile size depends on several characteristics of the target system, the program, and the compiler, such as the number of distinct references made to the elements of the matrix and the spatial relationship of these references and their interaction with target system characteristics (as discussed in § 6.3.6).

Thus the input feature space which describes the multi-dimensional space of these variables could include the variables listed on page 108. The performance of a loop body with a particular tile size may be quantified using the average cost of memory access in cycles for each of the memory access in the loop body. In this case, this is the (single) dependent variable (a single output feature) that we want to be able to predict. This set of input and output variables span the multidimensional space which is the feature space for this task. Vectors in this feature space are called feature vectors and instances of known corresponding input - output feature vectors form the input-output pairs which will be used to train a supervised machine learning algorithm (as shown in Figure 11.2).

The total execution time of a loop body is a linear function of the average cost of memory access in most circumstances known to us. Therefore, we can use the execution time as a proxy in ML predictions. Specifically, we assume that the following factors do not distort the linear relationship:

1. Instruction-level parallelism
2. Out of order execution
3. Branch prediction accuracy
4. Compiler optimizations such as constant propagation and strength reduction
5. Other target system artifacts such as accuracy of prefetching

![Figure 11.4: Machine learning schematics for the tile size optimization problem](image)

We note that it would seem more intuitive for this particular example to predict the tile size (use tile size as the output feature) and include the execution time in the inputs, but the execution time (and average cost of memory access of the loop body) has a one-to-many mapping to tile sizes, which is hard if not impossible to learn. The many-to-one mapping from tile sizes to the execution time (and the average cost of memory access of the loop body) can be learned. From predicted performances the favorable set of tile sizes can be filtered quickly by simple post-processing.

---

3 Factors such as vectorization will have an impact on execution time, though the impact will most likely be the same across different tile sizes. This ensures that the linear relationship between the average cost of memory access and the total execution time across tile sizes is not distorted. There will be corner cases, such as one of the dimension of the tile size being 1, where vectorization might have a dramatically less effect on performance but we ignore or filter out such corner cases.
The design of a feature space is a significant effort. It may require multiple phases beyond the initial abstraction exercise. For example, it should be ensured that variables relevant to the problem are captured in the input feature vector, else machine learning (and any learning) will be ineffective. Given a feature space, instances of the feature vector for which the target output features are known - labeled samples - should be generated or acquired. These constitute the training data from which the machine learner learns a model of the relationship between target system characteristics, program characteristics, compiler characteristics, and performance. The relevant features may not always be known in advance. If the learned model performs poorly one reason can be that some important feature has not been taken into account, which may warrant revision of the feature space, which in turn will necessitate a repeat of the learning experiments.

11.3.2 Challenges From a Machine Learning Point Of View

Compiler optimization involves a large number of variables in the input space (several dozens at least), and often also in the output space (dozens to over a hundred compiler flags, for example). The variety of complex interactions among the features results in a large number of patterns of behavior each of which requires a different optimization sequence to increase performance. This creates a learning task to map multi-variate inputs to multi-variate outputs, both potentially high dimensional, and to delineate many classes (or precisely distinguish many degrees of some quantity such as execution time or tile size). The number of machine learning paradigms capable of dealing with such complexity of learning, is limited, and even the capable ones may not have been demonstrated on quite as ambitious tasks as those envisioned in PACE. Our experience from prior work [51, 69, 60, 48, 72, 73] with excellent machine learning performance on data that represent some of these or similar challenges (in a different application domain) will be utilized in this project. An additional challenge is that the variables in the PACE feature spaces are often mixed (disparate) types. This makes it hard to express or assess their relative importance, which in turn brings in issues of scaling and measures, both important for the success of machine learning. We are bringing considerable experience to PACE on this subject as well (e.g., [46] and references therein).

The specific ML technique for a particular ML task will depend on the nature of the task (regression, classification, clustering), the required resolution / precision of the prediction, the expected complexity of the mapping from input to output feature space, the dimensionality of the feature space, the amount and quality of training data available, the prediction capabilities of the given ML technique, and the computational expense.

Both supervised and unsupervised learning schemes will be used: supervised learning for regression (function approximation, prediction of continuous variables), or for classification, and unsupervised learning for clustering. Candidate learning approaches are discussed in some detail under §11.3.3.

The impact of training data on machine learning

For learning complicated relationships among features, typically a large number of labeled patterns is needed for training, which may not exist or may be hard to acquire. A careful design of training data is critical, in any case, to ensure sufficient number of labeled samples and appropriate coverage and distribution over the problem space, for adequate representation. The availability and the time needed to generate training data is also an important aspect to be considered.

To test the performance of the learned model test samples are used. Test samples are labeled samples which are known to the model developers but not used for the training of the model, and which are set aside for the evaluation of the model’s performance on data that the model has not learned from. The extent to which a learned model can make good predictions for unseen samples (samples outside the training set) is called the generalization capability. Producing models with good generalization capability is the main objective of machine learning. Sampling theories
prescribe the number of test samples necessary for statistically significant assessment of the generalization capability. The requisite number can be very high for problems involving many classes and high-dimensional feature vectors.

The quality of the training data is also important. Noise and mislabeling are frequent adverse effects. Noisy data may require more samples to learn well, especially where classes are close in the feature space. Incorrect labeling can confuse the learner and decrease its performance. Careful evaluation of the noise situation, and verification of truth labels is imperative, and may take a few iterations, since the combined effect of noise and class proximities are usually not known; and incorrect labeling sometimes is only discovered from the learning itself. The above may necessitate revision of the training data and repeating of the learning experiment a few times in order to converge on an effective learned model.

Alternative to supervised machine learning: clustering

Clustering, a major type of unsupervised machine learning (Figure 11.5) is of fundamental importance, for two reasons. One is that good manifold learning that precisely maps the structure of the feature space enables discoveries of pattern groupings, and relationships among them. For example, we may discover previously not known program or compiler behaviors, or interactions between them. Another reason is that knowledge of the cluster structure can greatly assist in achieving subsequent accurate supervised classification and regression, by enabling fine discrimination of classes with subtle (but consistent) differences. Examples of these in earlier work (from a different application domain), where the feature space comprised hundreds of input features and up to several dozens of classes, include [48, 51, 60, 48, 72, 73].

Another use of clustering that will very likely have a significant role in PACE tasks, is the following. When labeled training data are scarce, we can cluster the feature vectors, take some summary descriptors (such as the averages) of the clusters as the typical behavior of the members of the clusters, and develop expert treatment for these cluster representatives. Then the members of each cluster can be expected to benefit from the same treatment. New feature vectors (at run time, for example) can be assigned to existing clusters by the trained model thereby indicating what treatment should be applied. This is illustrated in Figure 11.6, where the input feature vectors consist of descriptors of the target system, the program to be compiled, and the performance of the program.
with default compiler settings, and the resulting clusters represent categories of program behaviors. Through the (off-line) post processing indicated by the black rectangles the clusters can be labeled for treatment with appropriate optimization sequences developed for the discovered clusters. Bundled together, the clustering engine and the canned optimization sequences can serve as a run-time oracle-and-optimization unit.

Figure 11.6: Unsupervised machine learning scheme for compiler flag optimization

11.3.3 Candidate Machine Learning Approaches

Neural networks

The methods we will be applying are all non-linear techniques, as—with perhaps a few exceptions—based on prior knowledge we anticipate problem spaces with complex, convoluted relationships among variables, and non-linearly separable classes.

Based on past experience with data sets and problems similar in their nature to those expected in the PACE context, neural computation is high on our candidate list. Neural approaches have demonstrably performed better on learning from such data than a number of other, well known, machine learning algorithms [12, 47, 40]. One of us (EM) has been developing neural modules under NASA funding for clustering (unsupervised learning) and classification (supervised learning) of high-dimensional (hyperspectral) imagery, which has similarities with PACE data in dimensionality, considerable number of classes, scarce training samples, complex class/cluster structure. We also have experience with using neural computation for function approximation where the domain has hundreds of dimensions [72, 73].

We want to point out here that by “neural network” we do not imply “Back Propagation Neural Network (BPNN)”. While our models may include, in simple cases, BPNNs, for complex cases we anticipate using more sophisticated and more robust neural architectures that were developed specifically for complicated high-dimensional data as mentioned above.

In a nutshell, these more robust approaches involve learning the structure of the data manifold first, in an unsupervised manner, and storing that knowledge in the form of a neural map (Self-Organizing Map, SOM) and related descriptors derived from the neural map. SOMs are adaptive
11.3. METHODOLOGY

vector quantizers (VQs) that place prototype vectors in the data space for optimal matching of the data distribution. SOMs have a unique property among VQs: they also represent the topology of the input data space by organizing (indexing) the quantization prototypes according to the similarity relations of the data. SOMs mimic the biological neural maps observed in various areas of the cerebral cortex. Neural maps form in response to input stimuli (data) and organize the stimuli on the 2-dimensional surface of the cortex while preserving the topological relations (the manifold structure of the input data). This facilitates fast and precise retrieval of patterns. Since SOM learning is unsupervised, the entire available data set can be used for its training, thus the SOM can form its own view of all available details of the manifold structure. A trained SOM can subsequently be snapped into a feed-forward, fully connected supervised network as its hidden layer, where the output layer is trained using the outputs of the SOM as inputs, and using a simple Delta rule (rather than the complicated BPNN rule). During supervised learning by this “SOM-hybrid network, the hidden SOM layer pre-screens the input patterns by indicating where they belong on the manifold. This helps the output layer learn to categorize the patterns into classes based on training labels - fast and precisely. Applications of this network architecture are described in a number of our previous works. An overview with references is given in [49].

This SOM-hybrid neural network is much easier to train than a BPNN network (is not prone to getting stuck in local minima). It has several other advantages over a BPNN network, as well as over other learning algorithms. The knowledge of the underlying SOM about the manifold structure - which is independent of the users knowledge, provided for supervised training as a set of training labels - makes it resistant to learning inconsistent labels. The topology preserving prototype based representation of the manifold by the SOM hidden layer allows good quality supervised learning from smaller number of training labels, and enables very precise discrimination of classes whose feature vectors may have slight but meaningful differences for the given application. The price to pay for all these advantages is in the training and interpretation of a high quality SOM. Issues relevant to this have been studied extensively, and tools developed, by the Merényi group. A recent book chapter summarizing related details is [52].

Neural networks have special significance in classification of feature vectors with disparate variables because (supervised) neural networks may be the only way to automatically derive - as part of the learning - appropriate scaling for the mixed types of variables.

Neural computing has also been used, successfully, for assessing the relative importance of the input features, for the purpose of eliminating non-contributing features. While this is quite difficult to do with traditional statistical approaches [12] some supervised neural network methods can naturally produce the weighting based on the training samples. Further, learning of the relevances of the features can be done in a joint optimization for a given classification goal [38, 46].

However, the extremely high number of flag settings (output features, or classes), for example, exceeds previous experience, presents “firsts” and unknowns, which make compiler optimization and uncharted territory, to be approached with cautious optimism and with a commitment to further research.

**Genetic algorithms**

Genetic Algorithms are the natural choice for some PACE tasks, as earlier work by PACE investigators [27] (and other groups) demonstrated. In particular, the task of finding good order of (good) compiler flag settings involves (potentially) variable length feature vectors, which would be handled poorly by most other machine learners. Genetic Algorithms could also be used to do a fine-grained search in the vicinity of a solution proposed by a different ML algorithm (on a finer grid than the training data for the other ML algorithm was generated), to explore whether significantly better solution may exist. The drawback of Genetic Algorithms is, however, that they do not have a memory (do not build a model), therefore they cannot exploit experience gained from past in-
stances (they have to evaluate each instance through a new search).

Other possibilities

Markov Models and Hidden Markov Models have already been applied successfully by us (LC and KP) in earlier research for the prediction of pre-fetching [44]. This software is already part of our arsenal and will be used in the Run-Time subsystem of PACE (box MLxx in Figures 11.1 and 11.3).

11.3.4 Productivity metric for Machine Learning

The performance of Machine Learning for PACE should be assessed on two levels. On the higher level, the overall effectiveness of ML will be measured by the quality of advice the ML models will give to the various subsystems (as shown in Figure 11.1) for their decision making. The exact way (or ways) to best characterize this effectiveness will be determined in the course of the development of the PACE system. However, the metric of overall effectiveness should be some combination of (1) the improvement in program performance and (2) decrease in the time needed to achieve the optimization. The ingredients for creating a meaningful metric, in any case, will come from run-time recording of performance improvements in program executions (or in optimization effort) a result of ML advice to the PACE compiler. Below we discuss some details of how these measurements can be done.

Quantifying the improvement in program performance

First, we discuss how the improvement in program performance may be quantified. This can be characterized with two different baselines: (a) the performance of the optimized program (running time, code size, memory footprint etc.) with the performance of the unoptimized program as the baseline. (b) the performance of the optimized program with the best possible performance of the program as the baseline. The improvement of the performance of the optimized program over the unoptimized program can be quantified in a relatively straightforward manner by measuring the performance of the unoptimized and optimized versions of the program. Since the performance of the unoptimized program would have been measured in any case to help drive decision making and adaptation in the various subsystems of the PACE compiler, we do not expect this comparison to incur significant additional resources (time, instrumentation effort etc).

When the performance of the optimized program is compared with the best possible performance of the program as the baseline, it characterizes the amount of optimization opportunity discovered by the ML subsystem. For small program regions and for a small set of optimizations, the baseline can be determined by searching the set of all possible optimization decisions. However, for most practical scenarios involving large programs and a large set of possible optimizations, determining the baseline could prove difficult. In this case, several alternative strategies may be adopted such as

1. Comparing the decisions of the ML engine with those of a human expert. This has several advantages—in particular, not only can the performance of the human expert-optimized and ML-optimized programs be compared, but the nature of decisions such as the flags that are set by the human expert and the ML engine, could yield valuable insights for the human expert as well as for the design of the ML models.

2. Generating synthetic program regions with known optimal (and therefore known baseline) performance. For example, a program with a synthetically generated sequence of instructions whose critical path length and instruction latencies are known, may be used to study the effectiveness of an instruction scheduler.

3. Using search strategies which either build on the ML decisions or are independent of the ML decisions to determine if program performance can be improved dramatically. For example,
genetic algorithms could be used to search the neighborhood of the decisions made by a different ML approach to determine if better solutions exist

Quantifying the decrease in time needed to achieve optimizations

The decrease in time needed to achieve optimization can be quantified under two categories. The first is the reduction in the time needed to perform optimization decisions when the time needed for non-ML (but automated) approach is the baseline. For example, the time taken for the task of determining good tile sizes by (the non-ML approach of) searching, can be compared to the time taken by a trained ML engine to perform the same task. The second is the reduction in time needed when the time needed by a human expert to adapt the compiler to a new computer system and/or to perform optimization decisions is taken as the baseline. In both these comparisons, the time needed for the initial one-time training of the ML engine should be considered in some amortized manner.

Before evaluating the overall effectiveness of ML models for PACE, we must, however, measure their performance on a lower level first. The purpose of this is to ensure that the ML engines are well trained for the particular tasks, with the available data. As with any machine learning technique, a supervised model’s prediction capability needs to be assessed by both a) verification of the learning success on the training data; and b) evaluation of the prediction accuracy on test data (data not used for the training of the model but drawn from the same data distribution), as described in §2.1. Moreover, the reliability (consistency) of the model must be evaluated by building a number of separate models through “jackknifing” (or cross-validation). This means that training and test data sets for each model are obtained through repeated shuffling of the available labeled data and splitting randomly into training and test data sets. The variance of the prediction of the resulting models on the respective test data sets should be small for the result to be credible. Only when trained models are found excellent in their generalization capability (i.e., in their prediction on test data) can one assume that an ML technique is providing advice based on what it derived, by learning, about the relationship between input and output variables. Consequently, only in this case can we attribute any observed improvement in program or compiler performance to Machine Learning.

11.3.5 Infrastructure

One of us (EM) has been developing neural learning modules under funding from the Applied Information Systems Research program of NASA’s Science Mission Directorate for clustering and classification of high-dimensional (hyperspectral) data, which have similarities with PACE data in dimensionality, large number of classes, scarce training samples, complex class/cluster structure. These learning and data visualization engines have been used to support science investigations in earth and space science, as well as in medicine [50, 60, 33]. The software developed for these applications will be modified and augmented appropriately to interface with PACE data and used for experiments implementing the machine learning tasks outlined in §11.2.2. We need to jump an initial hurdle of software migration, from Sparc / Solaris to X86 / Linux environment, forced by Sun Microsystem’s phasing out of the Sparc architecture. While it is a non-trivial exercise, we soon will have a full time professional Research Programmer Analyst on board (interviews are ongoing) to help accomplish this task, as well as the necessary further developments. Experiments to develop learning engines and learned models will be conducted in our existing Sparc / Solaris environment, parallel to the migration effort.
11.4 Conclusions

The machine learning component of PACE is in the design phase, lagging behind the other components by about 6 months on purpose to allow proper connection to already developed ideas and structures by the Resource Characterization, Platform Aware Optimization, and Run-Time groups. In consultation with these groups, we have defined an abstract framework for the connection points, input and output variables (the feature space) and the types of learning engines for machine learning tasks that are most likely to benefit the PACE system.

We are close to finishing the design of training data and data collection plan for the first problem we want to target with machine learning, the tile size optimization to be used in the PAO (§ 11.2.2). We anticipate starting learning experiments in the summer of 2010.

Machine learning for compiler optimization is in its infancy, with much unexplored potential - and potentially with some hard surprises that could require development of new approaches. From what we have researched, combined with our previous experience, we are cautiously optimistic that we will be able to develop several effective machine learning components for the PACE system.
Appendix A
Automatic Vectorization in the PACE Compiler

The Platform-Aware Optimizer (PAO) analyzes loops for their vectorizability. If the PAO determines that a loop is vectorizable, it marks this loop as such and performs analysis and transformations to enable vectorization. If the target supports short vector instructions (for example Altivec, VMX, or SSE) and either an LLVM backend producing native code or the C backend is used, the Rose to LLVM translator transfers this information to LLVM IR. The vectorization pass in the TAO uses the analysis information supplied by the PAO to replace scalar LLVM operations by LLVM vector operations, where a cost model determines if it is beneficial to do so. This document describes the interfaces between the components involved and the TAO pass responsible for vectorization.

A.1 Overview

Vectorization in the PACE compiler is performed as a cooperative effort between the PAO and the TAO. The PAO analyzes whether loops are amenable for vectorization, optionally performs transformations to enable vectorization, and performs several analysis passes on the code to generate information that is needed by the TAO’s vectorization pass. The TAO’s vectorization pass requires the following types of information:

- **Alias information** describes which array or pointer names may alias the same memory location.
- **Alignment information** describes which memory accesses (loads, stores) are aligned with respect to the vector unit alignment requirement.
- **Bundles** describe memory accesses to consecutive memory locations, produced by unrolling the loop.
- **Memory dependence information** describes dependences between loads and stores. The PAO builds a dependence graph.

The PAO annotates SAGE III IR with pointers to the above information data structures. For example, when the PAO performs the dependence analysis pass, it builds a dependence graph data structure and annotates each SAGE III IR array load with a pointer to the node representing that load in the dependence graph. The Rose to LLVM translator takes the annotated SAGE III IR as input.
APPENDIX A. AUTOMATIC VECTORIZATION IN THE PACE COMPILER

Figure A.1: Vectorization interplay between PAO, Rose-to-LLVM Translator, and TAO

Figure A.1: Vectorization interplay between PAO, Rose-to-LLVM Translator, and TAO

and translates the annotated SAGE III IR to LLVM IR, transforming the pointer information into LLVM metadata. The vectorization pass in the TAO uses alias, alignment, dependence, and bundle information and transforms scalar LLVM instructions to vector LLVM instructions by performing a cost model-based instruction selection algorithm that chooses between scalar and vector instructions.

Figure A.1 illustrates the interplay between the three components.

A.2 Functionality

The PAO recognizes loops that are amenable to vectorization, either because they have statements that can be unrolled and then replaced by vector instructions, or because the loop body contains instructions that can be vectorized. The following text refers to the component in the PAO that recognizes vectorizable loops as the PAO vectorizer. The PAO vectorizer marks vectorizable loops as such in the SAGE III IR.

The vectorization pass in the TAO operates on straight line code. The PAO vectorizer tries to generate a longer block of straight line code using the PAO’s loop unrolling component. The loop unrolling component marks accesses to consecutive memory locations, which have been replicated by unrolling, as bundles. For example consider the statement $a[i] =$ in a loop body, which, after the loop unroller has unrolled it one time, results in two memory accesses $a[i] =$ and $a[i+1] =$. The loop unroller stores the two stores as a bundle. The PAO uses the data dependence analyzer to build the memory dependence graph among memory accesses in the loop body.

After unrolling, the PAO performs alias and alignment analysis, and annotates the SAGE III IR with links to the analysis data (including the dependence graph and bundles).

The PAO vectorizer may perform several loop optimizations to enable vectorization. For example, it can perform loop peeling to align memory accesses with respect to the vector unit alignment requirement. It may perform loop fusion to enable more opportunities for vectorization, or it may perform loop interchange to bring the loop carried dependence to the innermost loop. By moving a loop carried dependence to the innermost loop and unrolling the loop, the dependence with respect to the vector unit length becomes loop independent. The straight line code vectorization might then find enough opportunities to vectorize the code, so that vectorization is still profitable (see figure A.2 for an example).

The process of vectorization in the PAO is illustrated in figure A.3.

During translation of SAGE III IR to LLVM IR, the Rose to LLVM translator converts the pointers to the analysis information in the SAGE III IR to links attached as metadata to LLVM’s instructions. It embeds the alignment information directly in metadata instead of accessing it through a link.

Next, the TAO runs its vectorization pass, which examines the loops marked as vectorizable and performs straight line vectorization on them. The vectorization pass uses the analysis information provided by the PAO to guide the instruction selection algorithm that replaces some scalar LLVM
for(i in 4 ... 4000)
for(y in 0 ... 4000)
a[y][i] = d[i] * c
    = a[y][i+1] + a[y][i];

(\pao{} interchanges and unrolls) =>

for(y in 0 ... 4000)
for(i in 4...4000, +2)
a[y][i] = d[i] * c
c[y][i] = a[y][i+1] + a[y][i];
a[y][i+1]= d[i+1] * c
c[y][i+1]= a[y][i+2] + a[y][i+1];

(TAO selects vector instructions) =>

for(y in 0 ... 4000)
for(i in 4 .... 4000, +2)
tmp:0:1 = d[i:0:1] * c:0:1
tmp2 = a[y][i+1]
tmp3 = a[y][i+2]
tmp4:0:1 = pack [tmp2:tmp3]
a[y][i:0:1]= tmp:0:1
c[y][i:0:1] = tmp4:0:1 + tmp:0:1

---

Figure A.2: Loop interchange to enable vectorization

instructions by vector instructions. It incorporates instruction cost information provided by the RC. In addition to costs for scalar operations, the cost of vector instructions is also needed.

It is impossible to measure the cost of various vector operations from portable C programs and at this point in compilation, it is known that the PACE compiler uses LLVM with either a native or the C backend. So RC measures the cost of vector instructions through LLVM. For example, RC measures the cost of doing a vector fadd by timing LLVM IR. At a later time the vectorization algorithm in the TAO may also incorporate the number of available registers to guide the instruction selection process.

The following two sections describe the input to the TAO’s vectorization pass and the output it generates.

A.2.1 Input

The vectorization pass in the TAO uses the following inputs to perform replacement of scalar operations by vector operations in straight line code.

Scalar LLVM code The TAO accesses the PAO analysis information as LLVM IR with LLVM metadata that encodes the analysis information or contains links to the analysis data structures.

Vectorizable LLVM IR loops The PAO marks an innermost loop as vectorizable by annotating the SAGE III IR AST node representing the back edge of the loop. The Rose to LLVM translator transfers this annotation to the corresponding LLVM IR jump instruction using the metadata tag “!vec !1”.

APPENDIX A. AUTOMATIC VECTORIZATION IN THE PACE COMPILER

Figure A.3: Vectorization in the PAO

!1 = metadata ! { i1 true } ; ...
for.inc10:
  %incr15 = add i32 %i.0, 1
  br label %for.cond8, !vec !1 ; back edge

The vectorization pass in the TAO uses this information to guide which code blocks it should vectorize.

**Alias information** The PAO performs alias analysis and stores the sets of aliased variables. It annotates the loads and stores in SAGE III IR involving those variables with pointers to these sets. The Rose to LLVM translator enables the TAO to access those sets through LLVM metadata links. If the set only contains one variable name, a null link is stored in the metadata instead of a link to the set. The metadata tag used is !aliasinfo.

!2 = metadata ! { i64 0x0 } ; Singleton set ; ...
%elemaddr= getelementptr [2000 x float]* %array, i32 0, i32 %i
%val = load float* %elemaddr, !aligned !1, ...
%elem2addr= getelementptr [2000 x float]* %array, i32 0, i32 %iplusone
%val2 = load float* %elemaddr, !aligned !0, ...

**Alignment information** Vector instructions involving memory accesses on certain architectures require the access to be aligned at a specific byte boundary. For example, SSE requires memory loads and stores to be 16 byte aligned. Otherwise the programmer must use more expensive unaligned memory move instructions. The PAO tries to generate loops that contain mostly aligned memory accesses, for example, by peeling a loop. The PAO annotates memory accesses (loads, stores), whether they are aligned or not. The TAO accesses this information as LLVM metadata by using the tag !aligned.

!0 = metadata ! { i1 false }
!1 = metadata ! { i1 true }
; Code: ... = a[i]; ... = a[i+i];
%elemaddr= getelementptr [2000 x float]* %array, i32 0, i32 %i
%val = load float* %elemaddr, !aligned !1, ...
%elem2addr= getelementptr [2000 x float]* %array, i32 0, i32 %iplusone
%val2 = load float* %elemaddr, !aligned !0, ...
A.2. FUNCTIONALITY

Memory dependence analysis  The PAO generates a dependence graph for memory accesses in the loop. It annotates the SAGE III IR memory accesses with pointers to the nodes in the dependence graph that represent these memory accesses. The Rose to LLVM translator provides access to this information through LLVM metadata by using the tag !dep.

%; Oxaffe is the address of the dependency graph node
!3 = metadata ! { i64 Oxaffe }
;
%elemaddr= getelementptr [2000 x float]* %array, i32 0, i32 %i
%val = load float* %elemaddr, !dep !3, ...

Bundles  When the PAO unrolls a loop it replicates array accesses. For every array access that it replicates to a contiguous memory location, it builds a tuple that contains all the replicated array accesses. For example, if a for loop contains read accesses to a[i] and the loop is unrolled four times, the PAO builds a bundle data structure that contains the load of (a[i], a[i+1], ..., a[i+3]). Bundles are tuples, so the position is significant. The PAO annotates memory access nodes in SAGE III IR with pointers to their corresponding bundle data structure. The Rose to LLVM translator provides access to the pointers through the LLVM metadata tag !bun. The metadata associated with the tag contains not only the pointer but also the index in the bundle tuple. Bundles simplify finding consecutive memory accesses during the vectorization pass in the TAO.

!4 = metadata ! { i64 0xf00b, i32 0 } ; 0xf00b is the pointer to the bundle
!5 = metadata ! { i64 0xf00b, i32 1 } ; 1 is the position in the tuple
;
; Code: ... = a[i]; ... = a[i+1];
%elemaddr= getelementptr [2000 x float]* %array, i32 0, i32 %i
%val = load float* %elemaddr, !bun !4, ...
%elem2addr= getelementptr [2000 x float]* %array, i32 0, i32 %iplusone
%val2 = load float* %elemaddr, !bun !5, ...

Resource characterization information  The vectorization algorithm needs the cost of scalar and vector instructions as input to perform instruction selection. It also needs the width of the vector unit to generate vector types of the right size. For example, if the vector length is 128 bits, the vectorization path will try to replace scalar double instructions by instructions of the vector <2 x double> type.

A.2.2 Output

The vectorization pass replaces scalar instructions by vector instructions if the cost analysis has determined it is beneficial to do so.

a0 = load double* %a1ptr, i32 %i, !bun !4, !aligned !1,
a1 = load double* %a1ptr, i32 %iplus1, !bun !4, !aligned !1, ...
b0 = fadd double %a0, %val1
b1 = fadd double %a0, %val2

The TAO vectorization pass would replace the previous code by the following vectorized version.

%valvec.0 = insertelement <2 x double> zeroinitializer, double %val1, i32 0
%valvec = insertelement <2 x double> %valvec.0, double %val2, i32 1
a0 = load <2 x double>* %a1ptr, i32 %i, align 16, !bun !4, !aligned !1,
b0 = fadd <2 x double> %a0, %valvec

Note that the pass put the two scalar values %val1, %val2 into a vector register and that it has annotated the memory load with the LLVM align specification. That alignment specification is
necessary so that a native backend will emit an aligned memory move instead of an unaligned one, resulting in better performance.

A.3 Method

Generating good quality vector code for a straight-line piece of IR fragment is paramount to the performance of the program in processors that support a short vector SIMD unit. As stated in prior work, the process of vector code generation can either be easy or cumbersome. As an example of the former, the compiler can find consecutive memory operations, combine them, and subsequently combine their dependent statements until no more instructions can be combined. As an example of the latter, the compiler can use depth-first search, backtracking, and branch-and-bound techniques to find the best possible way of combining operations to generate vector code. In this document we propose a different approach to automatic vector code generation that is driven by a cost model. The cost model guides the vector code generation steps and prunes many search paths that would lead to suboptimal vector code generation. The cost model is combined with a dynamic programming technique to evaluate the best cost for a sequence of IR instructions.

A.3.1 Dynamic Programming

Each TAO IR instruction has an associated cost\(^1\). As stated earlier in this chapter, the dependence information is readily available from PAO. Using this dependence information, we build a dependence graph at the TAO IR instruction level. A dependence node is an IR instruction and a dependence edge \(a \rightarrow b\) implies that \(b\) is dependent on \(a\). Such a dependence graph is made single sink by adding synthetic nodes as needed. We propose to use two cost metrics: (1) \(scost\): cost of evaluating an operation in scalar fashion; (2) \(vcost\) is the cost of evaluating some operations in a vector fashion – the number of such operations can be determined by the vector length of the underlying machine\(^2\).

Our proposed algorithm starts a bottom-up pass of the dependence graph to compute the costs of evaluating various operations in the dependence graph in both scalar and vector fashion, choosing the minimum cost along the traversal path. The overall cost of the sink node denotes the overall cost of generating vector code. A second top-down pass of the dependence graph is made to identify those operations that need to be evaluated in scalar fashion and those operations that need to be evaluated in vector fashion. Finally, the vector code for the dependence graph is automatically generated by making a bottom-up pass.

The above algorithm needs to pay special attention to dependence graphs that are DAGs rather than trees. Several approaches have been proposed in the literature to break a DAG into trees and then compute the overall cost of each tree. We can easily adapt our proposed method to such scenarios.

The complexity of the above algorithm is bounded by three passes over the dependence graph.

\(^1\)The cost of each IR instruction is computed relative to the cost of an integer-add IR operation.

\(^2\)RC provides such information to TAO.
Bibliography


