Software-based COMA

Research Members: Sujoy Basu & Josep Torrellas.

The goal of this project is to propose a lower-cost alternative to hardware-intensive COMA (cache-only memory architecture) machines. Our approach is to use a page as the unit of allocation in main memory, which avoids having to design main memory as a hardware cache. The earliest known work that takes this approach is the Simple COMA project.

The main problem with this approach is that bringing data into local memory requires allocating space for the entire page containing that data. If the application has poor spatial locality, only a small part of each allocated page is actually used, which leads to memory fragmentation. Consequently, the page fault frequency increases and the performance of the application suffers. We have a solution that reduces memory fragmentation and cuts down on the frequency of page faults. This work is in progress.
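The fragmentation effect described above can be illustrated with a small calculation. The sketch below is not part of the project; the function name, the 4096-byte page size, and the 64-byte line size are assumptions chosen only for illustration. It counts how many pages a set of touched addresses forces into local memory and what fraction of the allocated lines is actually used.

```python
def page_utilization(touched_addresses, page_size=4096, line_size=64):
    """Return (pages allocated, fraction of allocated lines actually touched).

    Hypothetical model: touching any address allocates its whole page.
    """
    lines_by_page = {}  # page number -> set of line indices touched in that page
    for addr in touched_addresses:
        page = addr // page_size
        line = (addr % page_size) // line_size
        lines_by_page.setdefault(page, set()).add(line)

    lines_per_page = page_size // line_size
    used = sum(len(lines) for lines in lines_by_page.values())
    allocated = len(lines_by_page) * lines_per_page
    return len(lines_by_page), (used / allocated if allocated else 0.0)


# Poor spatial locality: 100 accesses, each landing in a different page.
sparse = [i * 4096 for i in range(100)]
# Good spatial locality: 100 accesses to consecutive lines.
dense = [i * 64 for i in range(100)]

sparse_result = page_utilization(sparse)  # 100 pages, 1/64 of allocated lines used
dense_result = page_utilization(dense)    # 2 pages, 100/128 of allocated lines used
```

Under these assumed sizes, the same number of accesses occupies 100 pages in the sparse case and only 2 in the dense case, which is why poor spatial locality fills local memory quickly and raises the page fault rate.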

Reliability is a major concern when designing any large-scale shared-memory multiprocessor. We have made our design fault-tolerant by protecting pages in memory with their own firewall. The firewall prevents wild writes originating in a faulty node from corrupting the memory of another node. Isolating faults in this manner ensures that only applications that use the faulty node are affected by it. This work is in progress.
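The text above does not say how the firewall is implemented. As a rough illustration of the idea only, the sketch below assumes that each page carries a set of node identifiers permitted to write to it and that a write from any other node is rejected. The class names, method names, and the permission-set representation are all hypothetical and should not be read as the project's actual design.

```python
class PageFirewall:
    """Hypothetical per-page write filter; not the project's actual design."""

    def __init__(self):
        self._allowed = {}  # page number -> set of node ids allowed to write

    def grant(self, page, node):
        """Allow `node` to write to `page`."""
        self._allowed.setdefault(page, set()).add(node)

    def allows(self, page, node):
        """Return True if `node` may write to `page`."""
        return node in self._allowed.get(page, set())


class NodeMemory:
    """Local memory of one node, guarded by a firewall (illustrative only)."""

    def __init__(self, firewall):
        self.firewall = firewall
        self.pages = {}  # page number -> stored value (stand-in for page contents)

    def remote_write(self, page, src_node, value):
        """Apply a write from `src_node` only if the firewall permits it."""
        if not self.firewall.allows(page, src_node):
            return False  # wild write blocked; memory is left unchanged
        self.pages[page] = value
        return True


firewall = PageFirewall()
firewall.grant(page=7, node=1)  # only node 1 may write page 7
memory = NodeMemory(firewall)

ok = memory.remote_write(page=7, src_node=1, value="good")       # accepted
wild = memory.remote_write(page=7, src_node=2, value="garbage")  # rejected
```

In this sketch the rejected write from node 2 leaves page 7 untouched, so a fault in node 2 can only affect pages that node 2 was already allowed to write.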