1. Introduction
The development of high-performance non-volatile memories such as phase change memory (PCM), the non-volatile dual in-line memory module (NVDIMM), and 3D XPoint has enabled memory-based storage that is far faster than motor-driven storage such as hard disk drives. Among these, a memory bus-connected storage (MBS), such as NVDIMM-N, is attached directly to the memory bus and exhibits the same performance characteristics as dynamic random access memory (DRAM). Owing to its performance and non-volatility, the MBS is becoming an attractive alternative to high-performance server-side storage in human-centric computing [1-3], edge computing, the Internet of Things (IoT) [4], and cloud computing. When MBSs are installed in a non-uniform memory access (NUMA) server, however, the access latency varies with the distance between the accessing node and the node where the MBS is installed. For example, Fig. 1 shows benchmarking results for a 4-node NUMA system: the idle access latency differs across nodes, and the throughput of an MBS installed at node 0 varies accordingly.
As shown in Fig. 1(a), the overall throughput can be improved if access to the MBS is localized. Moreover, when MBSs are installed in only one node of a NUMA system, local access is possible only from that node; all other nodes must access the MBSs remotely, which can degrade application performance. If, instead, MBSs are scattered over different NUMA nodes and the storage space provided by each MBS is carefully managed, file processing performance can be maximized. Efficiently managing the storage space of MBSs distributed across NUMA nodes is therefore a key challenge in improving the performance of applications running on NUMA servers.
Fig. 1. Access latency and throughput over a 4-node NUMA system: (a) random write on the MBS installed at node 0 (block sizes 4 kB and 128 kB) and (b) NUMA idle latencies (in nanoseconds) from nodes {0,1,2,3} to node 0.
Although several research efforts have proposed new non-volatile memory (NVM) file systems and optimization techniques [5-7], few studies have focused on improving file processing performance with MBSs in a NUMA environment. Moreover, techniques for improving the performance of DRAM-based NUMA systems are not directly applicable to MBSs, because file data must be stored persistently and file consistency must be guaranteed. Linux currently supports MBSs by mounting them as block devices. In a NUMA system, the MBSs installed in different nodes can be mounted as separate block devices or combined into a single logical block device using the Linux logical volume manager. However, this approach cannot exploit the performance characteristics of the MBS, because it is not NUMA-aware and the overhead of the traditional block I/O software stack is considerable.
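For reference, the conventional setup described above can be reproduced with the standard LVM2 tools. The commands below are illustrative only: they assume that each MBS is exposed as a /dev/pmemN block device, which depends on the platform and namespace configuration, and the volume group, logical volume, and mount point names are placeholders.

# Combine the MBSs of two nodes into one logical volume (not NUMA-aware).
pvcreate /dev/pmem0 /dev/pmem1
vgcreate mbs_vg /dev/pmem0 /dev/pmem1
lvcreate -l 100%FREE -n mbs_lv mbs_vg
# Put a conventional file system on top and mount it.
mkfs.ext4 /dev/mbs_vg/mbs_lv
mount /dev/mbs_vg/mbs_lv /mnt/mbs

Because such a volume hides which physical node backs a given block, the file system cannot localize writes, and every request still traverses the block I/O stack.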
This paper presents the design and implementation of a high-performance logical volume manager for MBSs, called MBS-LVM. MBS-LVM is lightweight in the sense that it removes unnecessary software layers and allows applications to access MBSs directly. It consolidates the address space of each MBS into a single address space and allocates storage space from the local MBS whenever possible, thereby improving write performance on a NUMA system. MBS-LVM was implemented in the Linux kernel, and its write performance was evaluated by porting it into tmpfs, a memory-based file system in Linux. The benchmarking results show that the modified tmpfs using MBS-LVM performs almost twenty times better than the original tmpfs on a NUMA server with four nodes.
The rest of this paper is organized as follows. Section 2 presents the design of MBS-LVM and discusses implementation issues, Section 3 presents the benchmarking results, and Section 4 concludes the paper.
2. Design and Implementation
2.1 Design Overview
MBS-LVM is composed of three components, as shown in Fig. 2: the MBS cluster manager, the MBS monitor, and the MBS space manager. The MBS cluster manager virtualizes the address space of each MBS and combines them into a single address space. The MBS monitor checks whether the space used by each MBS exceeds a pre-defined threshold of its capacity and determines the target MBS with the lowest access latency among the candidates. The MBS space manager allocates storage space from the target MBS determined by the MBS monitor. The implementation of each component is described below.
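The division of labor can be summarized with the following C-style interfaces. The names are illustrative and do not come from the actual implementation; they only mirror the responsibilities described above.

#include <stddef.h>

/* MBS cluster manager: consolidates per-node MBS ranges into one address space. */
int   mbs_cluster_build(void);            /* scan BIOS-reported regions, build the cluster */

/* MBS monitor: tracks per-node usage and picks the target MBS for a caller. */
int   mbs_monitor_target(int caller_nid); /* returns the node ID of the target MBS */

/* MBS space manager: allocates and frees space from the chosen MBS (buddy system). */
void *mbs_alloc(int target_nid, size_t size);
void  mbs_free(void *addr, size_t size);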
2.2 MBS Cluster Manager
To consolidate the address spaces of the MBSs installed in different NUMA nodes into a single address space, the MBSs must first be organized as a cluster. The MBS cluster manager builds this cluster from the installation information reported by the basic input/output system (BIOS), using the sparsemem memory model of the Linux kernel. During the booting process, the BIOS delivers the information on each device installed in a DIMM slot to the kernel, which can then distinguish MBS from DRAM through the e820 map. The Linux kernel manages each contiguous range of installed DIMMs in a memblock_region structure. To configure the MBS cluster, the installation information of the MBS, such as NVDIMM-N, is recorded in a memblock structure through its memblock_region entries. To distinguish the MBS from conventional DRAM, a new memblock_type and an MBS zone are added, as shown in Fig. 3; a simplified sketch of this bookkeeping follows the figure.
Fig. 3. MBS zone, MBS cluster, and region for MBS.
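The following userspace C sketch models what the cluster manager records for each BIOS-reported MBS range. It mirrors the kernel's memblock_region bookkeeping but uses hypothetical names and made-up addresses, and it omits the sparsemem and zone plumbing entirely.

#include <stdint.h>
#include <stdio.h>

#define MAX_MBS_REGIONS 8

/* Simplified analogue of a memblock_region entry for one contiguous MBS range. */
struct mbs_region {
    uint64_t base;   /* physical base address reported by the BIOS (e820 map) */
    uint64_t size;   /* size of the contiguous NVDIMM-N range                 */
    int      nid;    /* NUMA node that hosts the DIMM slot                    */
};

/* Simplified analogue of a dedicated memblock_type holding only MBS ranges. */
struct mbs_cluster {
    struct mbs_region regions[MAX_MBS_REGIONS];
    int               cnt;
    uint64_t          total_size;   /* capacity of the consolidated address space */
};

/* Register one MBS range detected during boot. */
static int mbs_cluster_add(struct mbs_cluster *c, uint64_t base, uint64_t size, int nid)
{
    if (c->cnt >= MAX_MBS_REGIONS)
        return -1;
    c->regions[c->cnt++] = (struct mbs_region){ .base = base, .size = size, .nid = nid };
    c->total_size += size;
    return 0;
}

int main(void)
{
    struct mbs_cluster cluster = { 0 };

    /* Example: 16 GB of MBS per node on a 4-node system (addresses are made up). */
    for (int nid = 0; nid < 4; nid++)
        mbs_cluster_add(&cluster, 0x100000000ULL * (nid + 1), 16ULL << 30, nid);

    printf("MBS cluster: %d regions, %llu GB in a single address space\n",
           cluster.cnt, (unsigned long long)(cluster.total_size >> 30));
    return 0;
}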
2.3 MBS Space Manager
The MBS space manager allocates storage space from the MBS with the lowest access latency. Preliminary experiments confirmed that file processing performance improves when the local MBS is used as much as possible. The MBS space manager manages the storage space of each MBS with a buddy system and allocates space from the target MBS determined by the MBS monitor.
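A minimal sketch of this per-MBS buddy allocation is given below. It keeps one free list per order for a single MBS, splits larger blocks on demand, and, for brevity, omits coalescing on free and error handling; the structure and names are illustrative rather than taken from the actual implementation.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT 12             /* 4 kB allocation granularity          */
#define MAX_ORDER  10             /* largest block: 4 kB << 10 = 4 MB     */

/* A free block inside one MBS, identified by its byte offset. */
struct free_block {
    struct free_block *next;
    uint64_t           offset;
};

/* Per-MBS buddy state: one free list per order. */
struct mbs_buddy {
    struct free_block *free_list[MAX_ORDER + 1];
};

static void push(struct mbs_buddy *b, int order, uint64_t offset)
{
    struct free_block *n = malloc(sizeof(*n));
    n->offset = offset;
    n->next = b->free_list[order];
    b->free_list[order] = n;
}

/* Allocate a block of 2^order pages from one MBS; returns a byte offset, or -1
 * if this MBS has no block large enough (the monitor then picks another node). */
static int64_t buddy_alloc(struct mbs_buddy *b, int order)
{
    int k = order;
    while (k <= MAX_ORDER && b->free_list[k] == NULL)
        k++;
    if (k > MAX_ORDER)
        return -1;

    struct free_block *n = b->free_list[k];
    b->free_list[k] = n->next;
    uint64_t offset = n->offset;
    free(n);

    while (k > order) {           /* split down, keeping the upper buddy free */
        k--;
        push(b, k, offset + ((uint64_t)1 << (PAGE_SHIFT + k)));
    }
    return (int64_t)offset;
}

int main(void)
{
    struct mbs_buddy node0 = { { NULL } };

    push(&node0, MAX_ORDER, 0);   /* seed with one 4 MB block at offset 0 */
    printf("alloc 4 kB  -> offset %lld\n", (long long)buddy_alloc(&node0, 0));
    printf("alloc 16 kB -> offset %lld\n", (long long)buddy_alloc(&node0, 2));
    return 0;
}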
2.4 MBS Monitor
The MBS monitor periodically checks each NUMA node and classifies it as either a FatNode or a CandidateNode. CandidateNode is the set of nodes whose MBSs still have enough free space, whereas FatNode is the set of nodes whose MBSs have no free space and thus cannot be selected as candidates. CandidateNode is maintained as a sorted linked list whose head is the MBS node with the lowest access latency. If files are deleted from an MBS, the corresponding node in the FatNode list can move back to the CandidateNode list. As shown in Fig. 4, this scheme guarantees that the local MBS remains the target MBS as long as it has enough free space. When the local MBS is full, the remote MBS with the lowest access latency is chosen as the target.
Fig. 4. Steps for deciding the target MBS: (a) local allocation and (b) allocation from the node with the lowest access latency.
When a write operation is invoked, the write function asks the MBS monitor to determine the target MBS with the lowest access latency. The MBS monitor returns the ID of the target MBS, and the write function then asks the MBS space manager to allocate storage space in that MBS. The procedure is summarized in Algorithm 1 and sketched in code after it.
Algorithm 1. MBS Monitor.
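Algorithm 1 itself is not reproduced here; the C sketch below captures the target-selection step it describes. All names, thresholds, and latency values are illustrative, and the latency-sorted CandidateNode list of the paper is replaced by a simple linear scan for brevity.

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES  4
#define THRESHOLD 90              /* % of MBS capacity regarded as "full" (illustrative) */

struct mbs_node {
    int nid;
    int used_pct;                 /* updated periodically by the MBS monitor             */
    int latency[NR_NODES];        /* latency[i]: access latency from node i (ns)         */
};

static struct mbs_node nodes[NR_NODES];

static bool has_space(const struct mbs_node *n)
{
    return n->used_pct < THRESHOLD;   /* otherwise the node belongs to FatNode */
}

/* Pick the target MBS for a writer running on caller_nid: prefer the local MBS,
 * otherwise the candidate with the lowest access latency from the caller. */
static int mbs_monitor_target(int caller_nid)
{
    if (has_space(&nodes[caller_nid]))
        return caller_nid;                        /* local allocation, Fig. 4(a)         */

    int best = -1;
    for (int i = 0; i < NR_NODES; i++) {
        if (i == caller_nid || !has_space(&nodes[i]))
            continue;                             /* skip FatNode members                */
        if (best < 0 ||
            nodes[i].latency[caller_nid] < nodes[best].latency[caller_nid])
            best = i;                             /* lowest-latency candidate, Fig. 4(b) */
    }
    return best;                                  /* -1 means every MBS is full          */
}

int main(void)
{
    /* Made-up usage and latency numbers for illustration only. */
    int lat_from_node0[NR_NODES] = { 80, 130, 120, 130 };
    for (int i = 0; i < NR_NODES; i++) {
        nodes[i].nid = i;
        nodes[i].used_pct = 10;
        nodes[i].latency[0] = lat_from_node0[i];
    }
    nodes[0].used_pct = 95;       /* the local MBS of node 0 has crossed the threshold */
    printf("target MBS for a writer on node 0: node %d\n", mbs_monitor_target(0));
    return 0;
}

The returned node ID is then handed to the MBS space manager, which allocates the requested space from that MBS.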
3. Evaluation
Testbed: To evaluate the performance of MBS-LVM, a NUMA system with four CPUs (10 cores each) running Linux was used. The MBSs were emulated using DRAM via a Linux kernel boot parameter. Each node was configured with 16 GB of MBS, forming a 64 GB MBS cluster in total. A new file system, called mbsfs, was also implemented by modifying Linux tmpfs so that it stores and reads files directly from the MBS cluster.
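The paper does not name the kernel parameter used for the emulation. One common way to emulate persistent memory with DRAM is the memmap boot option (memmap=<size>!<start>), which reserves a DRAM range as an emulated pmem region; the snippet below is therefore only an assumption, and the start addresses would have to be chosen inside each node's physical address range.

# Illustrative /etc/default/grub entry: reserve 16 GB of DRAM starting at
# physical address 16 GB as emulated persistent memory (one such entry per
# node would be needed for a 4-node setup).
GRUB_CMDLINE_LINUX="memmap=16G!16G"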
Workload: Both synthetic and real workloads were used to evaluate MBS-LVM. In the synthetic tests, the throughput and scalability of mbsfs were compared with those of tmpfs using the fio benchmark while varying the block size (from 1 kB to 2 MB) and the number of nodes (from 1 to 4); the number of threads was set to 40. To measure the scalability of MBS-LVM, a strong-scaling test, a parallel-processing scalability evaluation technique for high-performance computing systems, was used. For the real workload, the YCSB (Yahoo! Cloud Serving Benchmark) was run against MongoDB; this benchmark generates a heavy I/O load and is widely used to evaluate NoSQL databases. The generated workload was workloada (read and write ratio of 50:50) with 2 million records.
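The exact fio options are not listed in the paper; a job file along the following lines reproduces the described random-write test (40 threads, block size swept across runs), with the target directory standing in for the mount point of mbsfs or of the tmpfs baseline.

[global]
# 40 writer threads, buffered random writes, 1 GB of file data per thread.
ioengine=sync
rw=randwrite
numjobs=40
size=1G
# Illustrative mount point of the file system under test.
directory=/mnt/mbsfs
group_reporting=1

[randwrite-4k]
# The block size was varied from 1 kB to 2 MB across runs; 4 kB shown here.
bs=4k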
Fig. 5. Performance comparison with random write and YCSB workloada (read and write ratio of 50:50): (a) random write throughput, (b) strong-scalability test, (c) YCSB (workloada) throughput (higher is better), and (d) YCSB (workloada) average latency (lower is better).
Result: Fig. 5 compares the performance of mbsfs and tmpfs using fio random writes and the MongoDB YCSB benchmark. In the synthetic tests with random writes, shown in Fig. 5(a) and (b), mbsfs outperformed tmpfs by about 20 times and scaled well up to 4 nodes. This is because MBS-LVM allocates write space from the MBS with the lowest access latency, whereas tmpfs pre-allocates a large amount of memory and serves all write requests from it; in that case, remote access is unavoidable whenever a writing thread runs on a different NUMA node. Placing storage and thread on the same node improves performance, while placing them on different nodes degrades it due to remote access latency. It is also worth noting that the throughput varied with the block size, with mbsfs performing best at the 4 kB block size, since I/O operations are often optimized for the page size (i.e., 4 kB) used by the operating system.

In the real workload tests using the MongoDB YCSB benchmark with a read and write ratio of 50:50, shown in Fig. 5(c) and (d), mbsfs outperformed tmpfs by up to 10% in overall throughput and even showed better read performance than tmpfs. Since the write operations preceded the read operations in the benchmark, read latency is minimized when the threads that performed the writes read the data back from the same NUMA node. The performance is expected to improve further as the write ratio increases.
4. Conclusion
This paper proposed MBS-LVM, a logical volume manager designed to improve the performance of write operations on NUMA-based servers. MBS-LVM virtualizes the address space of each MBS and combines them into a single address space. By allocating storage space from the local MBS node as much as possible, MBS-LVM improves the performance of applications running on NUMA systems. Its effectiveness was demonstrated by porting it into tmpfs, a memory-based file system in Linux. Benchmarking with both synthetic and real workloads showed that the new file system (mbsfs) using MBS-LVM outperformed the traditional tmpfs by up to twenty times. While MBS-LVM provides substantial benefits for write performance, it does not guarantee higher read performance than existing methods; future research will be conducted to improve its read performance as well.
Acknowledgement
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. NRF-2017R1D1A1B03032763).