Princeton University

School of Engineering & Applied Science

Analysis and Optimization Techniques for Massively Parallel Processors

Wenhao Jia
Engineering Quadrangle J401
Wednesday, October 8, 2014 - 3:00pm to 4:30pm

Heterogeneous parallelism has emerged as a widespread computing paradigm over the past decade. In particular, massively parallel processors such as graphics processing units (GPUs) have become the prevalent throughput computing element in heterogeneous systems. However, GPUs are difficult to program and design, primarily because of developers' general unfamiliarity with GPUs, a lack of hardware resource virtualization, and inadequate tailoring of conventionally designed components (such as general-purpose caches) to the massively parallel nature of GPU workloads.
To overcome these challenges, this thesis proposes statistical analysis techniques and software and hardware optimizations that improve the performance, power efficiency, and programmability of GPUs. These proposals make it easier for programmers and designers to produce optimized GPU software and hardware designs.
The first part of the thesis describes how statistical analysis can help users explore a GPU software or hardware design space with performance or power as the metric of interest. In particular, two fully automated tools are developed and presented: Stargazer, which is based on linear regression, and Starchart, which is based on regression trees. They help identify important design parameters and automate design decisions, while reducing design exploration time by a factor of 300 to 3000 compared with exhaustive approaches.
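As a rough illustration of how a regression tree can rank design parameters, the sketch below finds the single split that most reduces the variance of measured runtime over a handful of sampled design points. It is a minimal sketch and not Starchart's actual implementation: the function names, the parameter names (`block_size`, `unroll`), and the runtime values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(samples, params):
    """Return (parameter, threshold, error reduction) of the best single split.

    samples: list of (config dict, measured runtime) pairs.
    params:  names of the design parameters to consider.
    """
    total = sse([t for _, t in samples])
    best = (None, None, 0.0)
    for p in params:
        levels = sorted({cfg[p] for cfg, _ in samples})
        # Candidate thresholds lie midway between adjacent observed levels.
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [t for cfg, t in samples if cfg[p] <= thr]
            right = [t for cfg, t in samples if cfg[p] > thr]
            reduction = total - sse(left) - sse(right)
            if reduction > best[2]:
                best = (p, thr, reduction)
    return best


# Hypothetical sampled design points; the runtimes are made up for illustration.
samples = [
    ({"block_size": 64, "unroll": 1}, 10.0),
    ({"block_size": 64, "unroll": 4}, 9.0),
    ({"block_size": 256, "unroll": 1}, 4.0),
    ({"block_size": 256, "unroll": 4}, 3.0),
]

param, threshold, reduction = best_split(samples, ["block_size", "unroll"])
print(param, threshold, reduction)
```

In this toy data set the split on `block_size` removes almost all of the runtime variance, so it would be reported as the more important parameter. A full regression tree would apply the same step recursively to each resulting partition.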
The second part of the thesis proposes two compile-time algorithms that automatically make two key GPU software design decisions: cache configuration and thread block size selection. The first algorithm uses instruction-level memory access pattern analysis and cache control to raise the performance benefit of caching from 5.8% to 18%. The second algorithm runs programs with thread counts estimated to trigger performance saturation, reducing GPU core resource usage by 27-62% while improving performance by 5-10%.
Finally, the third part of the thesis proposes and evaluates the memory request prioritization buffer (MRPB). MRPB uses request reordering to reduce cache thrashing and cache bypassing to reduce resource stalls, improving performance by a factor of 1.3 to 2.7 and easing GPU programming.
In summary, using GPUs as an example, the high-level statistical tools and the more focused software and hardware studies presented in this thesis demonstrate how to use automation techniques to effectively improve the performance, power efficiency, and programmability of emerging heterogeneous computing platforms.