(OSDI '20) Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache

2020/11/20 System

paper notes on XStore [1] from SJTU, which uses a learned cache (aka learned index [2]) as the client-side cache for a distributed DRAM KVS.

background

  • ~7 RTTs for a GET on a 100M-key KVS when traversing the tree index with one-sided READs -> slow
  • a big index table is bad for client-side caching
    • high miss rate
    • costs clients’ memory

learned index: a machine-learning approach that approximates the cumulative distribution function (CDF) of the keys in a sorted index, which saves both the search cost of traversing the tree index and the storage cost of its nodes/parameters.
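
A minimal sketch of that idea (my own toy code, not the paper's): fit a linear model to the key -> position mapping of a sorted array, record its worst-case prediction error, and only search within the resulting guaranteed range.

```python
import numpy as np

# Toy learned index: approximate the CDF of a sorted key array with one linear
# model and remember the worst-case error, so lookups only scan a bounded range.
keys = np.sort(np.random.randint(0, 1_000_000, size=100_000))
positions = np.arange(len(keys))

# Least-squares fit: position ≈ a * key + b
a, b = np.polyfit(keys, positions, deg=1)
err = positions - (a * keys + b)
err_min, err_max = int(np.floor(err.min())), int(np.ceil(err.max()))

def lookup(key):
    p = int(a * key + b)
    lo = max(p + err_min, 0)
    hi = min(p + err_max, len(keys) - 1)
    # If the key exists, its true position is guaranteed to lie in [lo, hi].
    i = lo + np.searchsorted(keys[lo:hi + 1], key)
    return i if i <= hi and keys[i] == key else None
```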

XStore

process:

  • UPDATE, INSERT, …: server-centric -> two-sided RPC
  • GET: client-centric -> O(1) READs with the learned cache

learned cache:

  • works like a learned index
  • replaces index traversal with computation
  • a much smaller model than the index tree, which is good for client caching

A learned index works on a sorted & static array (which also supports range queries).

GET steps:

  1. key => model => a guaranteed position range (bounding where the key, and thus its value address, can be)
  2. the client fetches all keys in this range with one READ and locates the true key (see the sketch after this list)
  3. another READ for the corresponding value
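
Roughly, the client code looks like the sketch below (hypothetical rdma_read helper; the fixed-size, contiguous key/value layout is an assumption to keep the example short — the real client also goes through the translation table described later):

```python
# Sketch of the two-READ GET path. rdma_read(addr, length) is a hypothetical
# helper; model.predict_range returns the guaranteed position range as in the
# learned-index sketch above.
KEY_SIZE, VALUE_SIZE = 8, 64

def get(key, model, key_base, value_base):
    lo, hi = model.predict_range(key)               # guaranteed range from the learned cache
    raw = rdma_read(key_base + lo * KEY_SIZE,       # READ 1: fetch every candidate key at once
                    (hi - lo + 1) * KEY_SIZE)
    candidates = [int.from_bytes(raw[i:i + KEY_SIZE], "little")
                  for i in range(0, len(raw), KEY_SIZE)]
    if key not in candidates:
        return None                                 # key does not exist
    pos = lo + candidates.index(key)
    return rdma_read(value_base + pos * VALUE_SIZE, VALUE_SIZE)  # READ 2: the value
```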


benefits:

  • 1 RTT for lookup
  • small memory cost

BUT index trees are not static!

server side

the server side maintains a hybrid structure: B+Tree + learned model

  • virtual leaf nodes absorb inserts
  • keys & values are stored separately but each contiguously; a value is read using the offset of its key
  • leaf nodes store logical addresses, which are sorted only logically, not physically
  • a 2-level XModel (NN + LR) predicts a (logically sorted) position range from a key
  • insertions make the physical array unsorted: a translation table (TT) maps logical addresses to physical ones, and clients cache it partially (see the TT sketch below)
  • the model uses RMI [2] to reduce retraining cost
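
The TT can be pictured as an array indexed by logical leaf number; the fields below (physical address plus an incarnation counter for detecting splits) are my guess at a minimal layout, not the paper's exact encoding.

```python
from dataclasses import dataclass

@dataclass
class TTEntry:
    phys_addr: int     # remote address of the leaf node
    incarnation: int   # bumped by the server when the leaf splits

class TranslationTable:
    """Maps a logical leaf number to its physical location; clients cache it partially."""
    def __init__(self, entries):
        self.entries = entries          # indexed by logical leaf number

    def resolve(self, logical_leaf):
        return self.entries[logical_leaf]
```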


model

  • the top level assigns keys to sub-models (see the two-level sketch below)
  • each sub-model predicts a position with min & max error bounds; a sub-model owns its own TT and logical address space
  • because sub-models are small and simple (LR), training is fast
    • ~8 μs to retrain one sub-model
    • ~8 s to train the entire cache
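
A rough sketch of the two-level structure, assuming both levels are plain linear regressions (the paper's top level may be a small NN) and omitting the per-sub-model TT:

```python
import numpy as np

class SubModel:
    """Same LR-with-error-bounds idea as the earlier sketch, owning one key range."""
    def fit(self, keys, pos):
        self.a, self.b = np.polyfit(keys, pos, deg=1)
        err = pos - (self.a * keys + self.b)
        self.err_min, self.err_max = int(np.floor(err.min())), int(np.ceil(err.max()))
        return self

    def predict_range(self, key):
        p = int(self.a * key + self.b)
        return p + self.err_min, p + self.err_max

class TwoLevelModel:
    """The top level only dispatches a key to a sub-model, so a single sub-model
    can be retrained in isolation when its key range changes (the ~8 us case)."""
    def __init__(self, keys, n_sub=4):
        keys, pos = np.sort(keys), np.arange(len(keys))
        chunks = np.array_split(np.arange(len(keys)), n_sub)
        self.pivots = [int(keys[c[0]]) for c in chunks]   # dispatch by key range
        self.subs = [SubModel().fit(keys[c], pos[c]) for c in chunks]

    def predict_range(self, key):
        i = max(0, np.searchsorted(self.pivots, key, side="right") - 1)
        return self.subs[i].predict_range(key)
```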

Index Tree

XTree uses an HTM-based concurrent B+Tree [3][4] to ensure atomicity with high performance.

I know little about HTM, so better go read §5.3 in the paper (and note [4] below)…

UPDATE:

  • traverse the XTree and update the value in place, so the index does not change
  • optimization: update messages carry the position predicted by the client’s learned cache, which reduces the server’s traversal load

INSERT & DELETE:

  • a normal B+Tree moves KV pairs around to keep them sorted.
  • XTree does not keep keys sorted within a leaf node; since a READ fetches the whole leaf node anyway, this does not affect reads.
  • DELETE sets a flag and rewrites the slot instead of changing the index
  • when a node splits, increment its incarnation number (a counter that lets clients detect the inconsistency) and retrain the affected sub-model

Given the operations above, a stale TT entry still maps correctly as long as the access does not touch a node that has split.

Optimization tricks:

  • speculative execution: clients can still try to read through a stale TT entry, since about half of the old data remains in that node after a split; on a miss, follow the right-link to the new sibling (sketch below). Good for insert-dominated workloads.
  • model expansion, taken from XIndex [5], handles a growing KV count.
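
A sketch of the speculative read, assuming hypothetical read_leaf / right_link helpers and the TT entry layout sketched earlier:

```python
def speculative_get(key, tt_entry):
    """Optimistically use a possibly stale TT entry; fall back to the right-link on a miss."""
    leaf = read_leaf(tt_entry.phys_addr)        # entry may predate a split
    if key in leaf.keys:                        # about half of the old keys stay in this node
        return leaf.values[leaf.keys.index(key)]
    if leaf.right_link is not None:             # miss: the key may have moved to the new sibling
        nxt = read_leaf(leaf.right_link)
        if key in nxt.keys:
            return nxt.values[nxt.keys.index(key)]
    return None                                 # give up: refresh the local model/TT (or fall back to RPC)
```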

scale out

Partitioning follows [6]; boundary-key augmentation is also needed at partition boundaries.
Clients keep one learned cache per server and use a particular partitioning function to choose the server for a key.

When a SCAN crosses many servers:

  • collect leaf nodes from the different servers, serially?
  • one RDMA_READ per server (see the sketch below)
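
A guess at how the partitioning function and the per-server caches fit together (the range partitioning, the boundary values, and the range_read helper are all assumptions, not the paper's scheme):

```python
import bisect

# Hypothetical range partitioning: SERVER_BOUNDS[i] is the smallest key owned by
# server i, and the client keeps one learned cache per server (caches[i]).
SERVER_BOUNDS = [0, 1_000_000, 2_000_000, 3_000_000]

def server_of(key):
    return bisect.bisect_right(SERVER_BOUNDS, key) - 1

def scan(start_key, count, caches):
    """A SCAN that crosses servers: ask each owning server in turn (serially here),
    one READ batch per server, using that server's own learned cache."""
    results, sid = [], server_of(start_key)
    while len(results) < count and sid < len(SERVER_BOUNDS):
        results += caches[sid].range_read(start_key, count - len(results))  # hypothetical helper
        if sid + 1 < len(SERVER_BOUNDS):
            start_key = SERVER_BOUNDS[sid + 1]   # continue from the next server's lower bound
        sid += 1
    return results[:count]
```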

read can also use that partitioning func? i don’t get it

client side

clients need to cache 2 components:

  1. the model
  2. the translation table (which takes most of the space)

Operations:

  • GET
    • use the XModel to predict a position range (mostly within a single leaf node with $N$ slots)
    • get the real (physical) addresses via the TT
    • read the candidate keys with one RDMA_READ per leaf, issued together with RDMA doorbell batching (see the sketch after this list)
    • scan the fetched keys to check whether the key is valid/existent, then read the value with another RDMA_READ
    • if the keys or TT entries turn out to be invalid, update the local model and TT
  • SCAN
    • look up the related leaf nodes to determine the remote addresses
    • read the related leaf nodes, keys and values together, with RDMA_READ
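
A sketch of how a predicted range that spans several leaves turns into doorbell-batched READs; rdma_post_read, rdma_ring_doorbell and wait_for are hypothetical stand-ins for the verbs wrapper, and N slots per leaf is an assumption.

```python
N, KEY_SIZE = 16, 8     # assumed slots per leaf node and fixed key size

def read_candidate_keys(lo, hi, tt):
    """Fetch the candidate keys for predicted positions [lo, hi]. When the range
    spans several leaves, the READs are posted together and the doorbell is rung once."""
    reads = []
    for leaf in range(lo // N, hi // N + 1):       # logical leaves covered by the range
        entry = tt.resolve(leaf)                   # logical leaf -> physical address (cached TT)
        reads.append(rdma_post_read(entry.phys_addr, N * KEY_SIZE))
    rdma_ring_doorbell(reads)                      # one doorbell for the whole batch
    return [wait_for(r) for r in reads]            # completions carry the key slots of each leaf
```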


A learned index uses monotonic models to determine key existence.
So non-existent keys can cause problems:
  a model cannot predict in an area it was not trained on, e.g. LR0 cannot handle LR0(10)

solution: “data augmentation”:
  train each model with a boundary key, so the model predicts the correct boundary, e.g. LR1(19) => (18, 20) (toy example below)

(not the typical data augmentation from ML…)
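
A toy illustration of the boundary-key trick with made-up numbers, reusing the LR-with-error-bounds idea from earlier:

```python
import numpy as np

# Sub-model LR1 owns keys [12, 14, 16, 18] at positions [6, 7, 8, 9]; key 20 at
# position 10 belongs to the next sub-model but is added as a boundary key.
keys = np.array([12, 14, 16, 18, 20])   # last entry is the augmented boundary key
pos  = np.arange(6, 11)
a, b = np.polyfit(keys, pos, deg=1)
err = pos - (a * keys + b)

def predict_range(key):
    p = a * key + b
    return int(np.floor(p + err.min())), int(np.ceil(p + err.max()))

# A GET for the non-existent key 19 now maps into [9, 10] (the slots of 18 and 20)
# instead of falling outside LR1's trained range.
print(predict_range(19))
```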

eval

  • model size: 500K LR sub-models (14 B × 500K ≈ 7 MB)
  • sensitive to the workload’s key distribution (more complex -> lower performance)
  • compared with tree-index caching, the cache size drops from 600 MB to 150 MB
  • with only 1% of the cache size, perf drops by only ~20%


Benefits come from removing the server CPU bottleneck and from reduced network cost. The advantage is bigger for uniform workloads (under skew the hot data is easy to cache, and the real workload…)

limitations

  • fixed key length
    • use a hash to encode variable-length keys

      just like fatcache

  • model selection: LR is not so good, though it is cheap to retrain
    • struggles with complex data distributions

      just a trade-off between accuracy & cost

references

  1. Wei, Xingda, Rong Chen, and Haibo Chen. “Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache.” 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.
  2. Kraska, Tim, et al. “The case for learned index structures.” Proceedings of the 2018 International Conference on Management of Data. 2018.
  3. Wang, Zhaoguo, et al. “Using restricted transactional memory to build a scalable in-memory database.” Proceedings of the Ninth European Conference on Computer Systems. 2014.
  4. Hardware Transactional Memory (HTM)

    A hardware transactional memory system typically adds a transactional cache to each processor core and transaction-related instructions to the CPU instruction set. While a transaction executes, the data it accesses is buffered in the transactional cache, and a modified cache-coherence protocol detects conflicts between transactions. If a conflict occurs, all modified data in the corresponding transactional cache is discarded and the transaction re-executes; if there is no conflict, the data in the transactional cache is written back to memory when the transaction ends.
    Because HTM is implemented entirely in hardware, its performance is much better than software transactional memory, but hardware resources are limited, so HTM cannot handle large transactions (a large transaction modifies a lot of shared data, and the write set can exceed the size of the per-core transactional cache). To guarantee that every transaction can still complete, commercial HTM systems add a serial fallback path, i.e. a lock: transactions that HTM cannot finish are executed serially. This loses some performance but keeps functionality complete.
    Two key implementation points: 1. use the per-core transactional cache to hold the data updated during a transaction (classic example: TCC); 2. use the cache-coherence protocol to detect conflicting accesses to shared data between transactions (classic example: LogTM).
    from https://zhuanlan.zhihu.com/p/151425608

  5. Tang, Chuzhe, et al. “XIndex: a scalable learned index for multicore data storage.” Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2020.
  6. Ziegler, Tobias, et al. “Designing distributed tree-based index structures for fast RDMA-capable networks.” Proceedings of the 2019 International Conference on Management of Data. 2019.
