Disaggregated memory (DM) is currently a hot topic in systems research, and distributed large-capacity memory clearly requires system-level reliability strategies. Replication has long been the default choice and has a large body of related work, including recent ones like SWARM@SOSP’24, but erasure coding (EC) is also an option. This article lists existing EC+DM works.
Notes on offloading erasure coding to NICs.
Notes on offloading memory data movement to DMA or DSA engines.
Prefetching to hide memory access latency (CPU stalls): what to prefetch, when to prefetch, and where to place the prefetched data.
In this article, we will list several papers on local NVM/PM fault tolerance.
QoS (LB) on persistent memory systems to avoid interference.
Problem: Because of how RDMA NICs (RNICs) are implemented, they do not provide a remote persistent flush primitive. Data from a client's one-sided write first lands in the volatile cache on the RNIC, and the RNIC sends the ACK back before the data is written to PM. As a result, a power loss after the ACK can easily break remote data persistence.
LogECMem uses a hybrid of in-place updates and parity logging (PL) for parity updates.
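To make the two update styles concrete, here is a minimal sketch that assumes a single XOR parity over equal-length chunks. It is not LogECMem's implementation; the class names `InPlaceParity` and `LoggedParity` are hypothetical and chosen only for illustration. Both styles rely on the same delta, `old XOR new`: the in-place variant applies it to the parity immediately, while the logging variant appends it to a log and folds it in later.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_parity(chunks):
    """Single XOR parity over all data chunks (assumed equal length)."""
    return reduce(xor_bytes, chunks)


class InPlaceParity:
    """Hypothetical helper: apply each parity delta immediately."""

    def __init__(self, chunks):
        self.parity = xor_parity(chunks)

    def update(self, old: bytes, new: bytes) -> None:
        # parity_new = parity_old XOR (old XOR new)
        self.parity = xor_bytes(self.parity, xor_bytes(old, new))


class LoggedParity:
    """Hypothetical helper: append deltas to a log, merge them later."""

    def __init__(self, chunks):
        self.parity = xor_parity(chunks)
        self.log = []

    def update(self, old: bytes, new: bytes) -> None:
        # Only the delta is recorded; the parity chunk itself is not touched.
        self.log.append(xor_bytes(old, new))

    def merge(self) -> bytes:
        # Replay the logged deltas into the parity, then truncate the log.
        for delta in self.log:
            self.parity = xor_bytes(self.parity, delta)
        self.log.clear()
        return self.parity


data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
in_place = InPlaceParity(data)
logged = LoggedParity(data)

new_chunk = b"\xff\x00"
in_place.update(data[1], new_chunk)
logged.update(data[1], new_chunk)
data[1] = new_chunk

# Parity recomputed from scratch, used as the reference value.
expected = xor_parity(data)
```

The trade-off in this sketch is that the in-place variant keeps the parity current on every write, while the logging variant makes each update a cheap append but requires a merge before the parity can be used for recovery.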
Learned index + PM. APEX: A High-Performance Learned Index on Persistent Memory[1]
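For readers new to learned indexes, the general idea can be sketched as follows. This is a generic single-model illustration, not APEX's design, and it ignores persistence entirely; the class `LinearLearnedIndex` is a hypothetical name. A linear model predicts a key's position in a sorted array, and a bounded binary search around the prediction corrects the model's error.

```python
import bisect


class LinearLearnedIndex:
    """Hypothetical single-model learned index over a static sorted array."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest prediction error seen on the indexed keys; bounds the search.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        # Predicted position, clamped to valid array indices.
        pos = int(round(self.slope * key + self.intercept))
        return max(0, min(len(self.keys) - 1, pos))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        p = self._predict(key)
        lo = max(0, p - self.max_err)
        hi = min(len(self.keys), p + self.max_err + 1)
        # Binary search only inside the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


keys = [3, 8, 15, 16, 23, 42, 57, 91]
index = LinearLearnedIndex(keys)
found = index.lookup(23)    # position of 23 in the sorted key array
missing = index.lookup(24)  # 24 was never inserted
```

The error bound is what keeps lookups correct: every indexed key lies within `max_err` positions of its prediction, so the search window always contains it.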
Industry work on using a DRAM+PM architecture as a cache (from Facebook and Twitter).