Princeton University

School of Engineering & Applied Science

Architectural Support for Large-scale Shared Memory Systems

Yaosheng Fu
Engineering Quadrangle J401
Tuesday, September 12, 2017 - 3:00pm to 4:30pm

Modern CPUs, GPUs, and data centers are being built with more and more cores, and many popular workloads will require even more hardware parallelism in the future. Shared memory is a popular parallel programming model with many advantages, but it has historically been difficult to scale to a large number of cores or nodes.
This thesis addresses two key challenges of large-scale shared memory systems: scalability and fault tolerance. To study these challenges, it first develops a parallel simulator named PriME to simulate shared memory systems at scale. It then introduces Coherence Domain Restriction (CDR), a cache coherence framework that sidesteps traditional scalability challenges and enables systems to scale to thousands of cores within a manycore chip or millions of cores across an entire data center. For fault tolerance, the thesis develops both a software-centric solution based on resilient memory operations (REMO) and a hardware-centric solution based on a fault-tolerant cache coherence framework (FTCC). In sum, this thesis demonstrates that shared memory systems have the potential to achieve scalability and fault tolerance comparable to those of current cluster-based designs while retaining other benefits such as ease of programming and efficient memory accesses.