Princeton University

School of Engineering & Applied Science

Architectural Support for Large-scale Shared Memory Systems

Speaker: Yaosheng Fu
Location: Engineering Quadrangle J401
Date/Time: Tuesday, September 12, 2017 - 3:00pm to 4:30pm

Abstract
Modern CPUs, GPUs, and data centers are being built with more and more cores, and many popular workloads will require even more hardware parallelism in the future. Shared memory is a popular parallel programming model with many advantages, but it has historically been difficult to scale to a large number of cores or nodes.
 
This thesis addresses two key challenges of large-scale shared memory systems: scalability and fault tolerance. To that end, it first develops a parallel simulator named PriME to simulate shared memory systems at scale. It then introduces Coherence Domain Restriction (CDR), a cache coherence framework that sidesteps traditional scalability challenges and enables systems to scale to thousands of cores within a manycore chip or millions of cores across an entire data center. For fault tolerance, this thesis develops both a software-centric solution with resilient memory operations (REMO) and a hardware-centric solution with a fault-tolerant cache coherence framework (FTCC). In sum, this thesis demonstrates that shared memory systems have the potential to achieve scalability and fault tolerance comparable to those of current cluster-based designs while retaining other benefits such as ease of programming and efficient memory accesses.