Cost Performance Comparison of NUMA x COMA

Research Members: Zheng Zhang (graduated), Marcelo Cintra, and Josep Torrellas.
The main objective of this study is to develop a deep understanding of the cost-performance tradeoffs of the two main architectural alternatives for scalable shared-memory multiprocessors: NUMA (Non-Uniform Memory Access) and COMA (Cache-Only Memory Architecture). Understanding how both architectures behave for different types of applications can lead to improvements in both and to the development of novel architectural alternatives.
To understand the differences in performance between the NUMA-RC (NUMA with a remote cache) and COMA organizations, we introduce a simple model of the effect of memory accesses. We first classify the memory accesses of an application into four types: local, remote-cold, remote-coh, and remote-conf. Local accesses are those satisfied by the local memory hierarchy, which includes the primary and secondary caches, the local attraction memory in COMA, and the remote cache and local memory in NUMA-RC. All other accesses are remote, and are classified by why they miss in the local memory hierarchy: remote-cold accesses miss because the processor is accessing a memory line for the first time, remote-coh accesses miss because of data sharing, and remote-conf accesses miss because of overflow in the local memory hierarchy.
For a given application, the total number of accesses is the same in COMA and in NUMA-RC. Furthermore, since remote-cold and remote-coh accesses are largely intrinsic to the application, they do not change across architectures either. The number of remote-conf accesses, however, varies across architectures. Consequently, if one of the architectures has fewer remote-conf accesses than the other, the difference shows up as additional local accesses, and vice versa. Clearly, we want to minimize the number of remote-conf accesses because they are much more expensive than local accesses.
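The accounting in this model can be illustrated with a minimal sketch. It is not the model used in the study: the class and function names, the access counts, and the two latency values are all hypothetical, chosen only to show that when the total, remote-cold, and remote-coh counts are fixed, every remote-conf access removed becomes a local access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessProfile:
    """Access counts for one application on one architecture (hypothetical type)."""
    local: int
    remote_cold: int
    remote_coh: int
    remote_conf: int

    @property
    def total(self) -> int:
        return self.local + self.remote_cold + self.remote_coh + self.remote_conf


def with_remote_conf(base: AccessProfile, remote_conf: int) -> AccessProfile:
    """Profile of the same application on another architecture.

    The total and the remote-cold / remote-coh counts are held fixed, so any
    change in remote-conf accesses is absorbed by the local accesses.
    """
    local = base.total - base.remote_cold - base.remote_coh - remote_conf
    if remote_conf < 0 or local < 0:
        raise ValueError("remote_conf is out of range for this application")
    return AccessProfile(local, base.remote_cold, base.remote_coh, remote_conf)


def average_cost(p: AccessProfile, local_cycles: float, remote_cycles: float) -> float:
    """Average cycles per access, assuming one flat latency per access class.

    A single remote latency is a simplifying assumption of this sketch.
    """
    remote = p.remote_cold + p.remote_coh + p.remote_conf
    return (p.local * local_cycles + remote * remote_cycles) / p.total


# Made-up counts: the same application on two architectures.
arch_a = AccessProfile(local=9000, remote_cold=200, remote_coh=300, remote_conf=500)
arch_b = with_remote_conf(arch_a, remote_conf=100)

print(arch_b.local)                                 # 400 more local accesses than arch_a
print(average_cost(arch_a, local_cycles=10, remote_cycles=100))
print(average_cost(arch_b, local_cycles=10, remote_cycles=100))
```

With these made-up numbers, the architecture with fewer remote-conf accesses has the lower average cost per access, because each access it converts from remote-conf to local trades an expensive access for a cheap one.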
A general qualitative comparison between NUMA-RC and COMA can be based on three main cost-performance metrics:

Number of Remote Conflict Misses
Processor Stall Time per Access
Design Complexity

The figure below shows a qualitative comparison of both architectures along these three axes.