Exploring the Optimization Space of Dense Linear Algebra Kernels

Qing Yi (1)    Apan Qasem (2)
(1) University of Texas at San Antonio    (2) Texas State University
Abstract. Dense linear algebra kernels such as matrix multiplication have been used as benchmarks to evaluate the effectiveness of many automated compiler optimizations. However, few studies have looked at collectively applying these transformations and parameterizing them for external search. In this paper, we take a detailed look at the optimization space of three dense linear algebra kernels. We use a transformation scripting language (POET) to implement each kernel-level optimization as applied by ATLAS. We then extensively parameterize these optimizations from the perspective of a general-purpose compiler and use a standalone empirical search engine to explore the optimization space using several different search strategies. Our exploration of the search space reveals key interactions among several transformations that compilers must consider to approach the level of efficiency obtained through manual tuning of kernels.
1 Introduction
Compiler optimizations have often targeted the performance of linear algebra kernels such as matrix multiplication. While the issues involved in optimizing these kernels have been extensively studied, difficulties remain in effectively combining and parameterizing the necessary set of optimizations to consistently achieve the level of portable high performance that computational specialists achieve manually through low-level C or assembly programming. As a result, user applications must invoke high-performance domain-specific libraries such as ATLAS [10] to achieve a satisfactory level of efficiency. ATLAS produces extremely efficient linear algebra kernels through a combination of domain- and kernel-specific optimization, hand-tuned assembly, and an automated empirical tuning system that uses direct timing to select the best implementations of various performance-critical kernels.

Libraries such as ATLAS are necessary because native compilers can rarely provide a similar level of efficiency, either because they lack domain-specific knowledge about the input application or because they cannot fully address the massive complexity of modern architectures. To improve the effectiveness of conventional compilers, many empirical tuning systems have been developed in recent years [1, 4, 8, 9, 11, 14]. These systems have demonstrated that search-based tuning can significantly improve the efficacy of many compiler optimizations. However, most research in this area has
focused on individual optimizations or on a relatively small set of them. Few have collectively parameterized an extensive set of optimizations and investigated the interactions among them. One impediment to parameterizing a large class of transformations is that, as yet, we have no standard representation for parameterized optimizations and their search spaces. Because of this, most tuning systems consist of highly specialized code optimizers, performance evaluators, and search engines. Hence, exploring a larger search space comes with the burden of implementing additional parameterized transformations. This extra overhead has, to some degree, limited the size of the search space investigated by any one tuning system.

In this paper, we describe a system in which we interface a parameterized code transformation engine with an independent search engine. We show that by leveraging the complementary strengths of these two modules we are able to explore the search space of a large collection of transformations. While most compilers understand well the collection of code optimizations required to achieve high performance, it is often the details of combining and collectively applying the optimizations that determine the overall efficiency of the optimized code. In our previous work [12], we used POET, a transformation scripting language, to implement all the kernel-level optimizations applied by ATLAS for three dense linear algebra kernels: gemm, gemv, and ger. That work achieved performance comparable, and sometimes superior, to that of the best hand-written assembly within ATLAS [12]. The previous work, however, utilized kernel-specific knowledge obtained from ATLAS when applying the optimizations and when searching for the best-performing kernels. In this paper, we extensively parameterize the optimization spaces from the perspective of a general-purpose compiler.
We then use an independent search engine to explore this parameter space to better understand the delicate interplay between transformations. Such interactions must be considered by a compiler when combining and orchestrating different program transformations to approach a level of efficiency similar to that achieved by ATLAS. The contributions of this paper include:
1. parameterization of an extensive set of optimizations that a general-purpose compiler must consider to achieve high performance;
2. integration of an independent search engine with a program transformation engine;
3. empirical exploration of the search space that reveals key interactions among optimizations.
2 Related Work
A number of successful empirical tuning systems provide efficient library implementations for important scientific domains, such as those for dense and sparse linear algebra [5, 3], signal processing [6, 7], and tensor contraction [2]. POET [12] targets general-purpose applications beyond the reach of these domain-specific generators and complements them by providing an efficient transformation engine that can make existing libraries more readily portable to different architectures.
Several general-purpose autotuning tools can iteratively re-configure well-known optimizations according to performance feedback of the optimized code [1, 4, 8, 9, 11, 14]. Both POET and the parameterized search engine described in this paper can be easily integrated with many of these systems. POET supports existing iterative compilation frameworks by providing an output language for parameterizing code optimizations for empirical tuning.

The work by Yotov et al. [13] also studied the optimization space of the gemm kernel in ATLAS. However, their work used the ATLAS code generator to produce optimized code. Because the ATLAS code generator is carefully designed by computational specialists based on both architecture- and kernel-specific knowledge, the optimization space they investigated does not represent the same degrees of freedom that a general-purpose compiler must face. In contrast, we use the POET language to parameterize the different choices typically faced by general-purpose compilers and investigate the impact and interactions of these optimization choices.
3 Orchestration of Optimizations
We used a transformation scripting language named POET to implement an extensive set of optimizations necessary to achieve the highest level of efficiency for several ATLAS kernels [12]. As shown in Fig. 1, a POET transformation engine includes three components: a language interpreter, a transformation library, and a collection of front-end definitions which specialize the transformation library for different programming languages such as C, C++, FORTRAN, or Assembly.

Fig. 1. POET transformation engine

The transformation engine takes as input an optimization script from an analyzer (in our case, a developer) and a set of parameter configurations from a search driver. An optimized code is output as the result, which is then empirically tested and measured by the search driver until a satisfactory implementation is found. For more details, see [12].

The optimization scripts that we developed for the ATLAS kernels have been parameterized to reflect the degrees of freedom that a model-driven compiler must face when orchestrating general-purpose optimizations. The optimizations focus on the efficient management of registers and arithmetic operations and are invoked by higher-level routines in ATLAS after cache-level optimizations have already been applied by ATLAS. Fig. 2 shows the reference implementations of various ATLAS kernels. The corresponding higher-level routines perform identi-
Fig. 2. Reference implementations of ATLAS kernels (excerpt)

    void ATL_USERMM(int M, int N, int K, double alpha,
                    const double *A, int lda,
                    const double *B, int ldb,
                    double beta, double *C, int ldc) {
      int i, j, l;
      for (j=0; j

    void ATL_dgemvT(int M, int N, double alpha,
                    const double *A, int lda,
                    const double *X, int incX,
                    double beta, double *Y, int incY) {
      int i, j;