As SSD prices continue to drop, many tech enthusiasts and enterprises are considering all-SSD Ceph storage pools in pursuit of higher performance. Getting that performance out of Ceph, however, depends heavily on choosing the right SSD. In this article, we explore how to select an appropriate SSD for Ceph.
Why is SSD Selection So Important?
Currently, consumers can purchase high-capacity NVMe SSDs at relatively low prices, which has attracted many tech enthusiasts and cost-conscious business owners to invest in all-SSD Ceph solutions. However, choosing an SSD based solely on price may not yield the expected performance.
Ceph Write Characteristics
When we talk about Ceph, it is essential to understand how it handles write operations. Ceph writes are transactional: each write is acknowledged only after it has been written to all of the relevant OSDs and synchronized to disk via fsync(). This also means Ceph cannot rely on the drive's volatile write cache; it expects every write to actually land on the NAND before it is considered complete.
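To get a feel for this access pattern outside of Ceph, you can approximate it with plain dd. The sketch below is only an illustration, not a proper benchmark: it writes 4 KiB blocks with O_DIRECT and forces each block to be durable before issuing the next one, roughly the pattern Ceph imposes on its OSD devices. The device name is a placeholder, and writing to it will destroy its contents.
# Illustration only: 4 KiB direct writes, each one synchronized before the next.
# Replace /dev/nvmeXn1 with a scratch disk you can safely overwrite.
dd if=/dev/zero of=/dev/nvmeXn1 bs=4k count=10000 oflag=direct,dsync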
This write behavior is a major challenge for some consumer-grade SSDs: a drive may advertise 80,000 IOPS of random-write performance yet deliver only 500-1,000 IOPS under this workload.
In the test results below, you can see that with fsync enabled, the Samsung 980 manages only about 600 IOPS:
fio -ioengine=libaio -name=test -filename=/dev/nvme0n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=2046KiB/s][w=511 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3054244: Wed Oct 18 08:03:48 2023
write: IOPS=574, BW=2297KiB/s (2352kB/s)(33.7MiB/15001msec); 0 zone resets
slat (nsec): min=1804, max=27462, avg=4744.29, stdev=1192.14
clat (nsec): min=1082, max=72217, avg=16433.32, stdev=2111.01
lat (nsec): min=12904, max=76816, avg=21177.61, stdev=2867.60
clat percentiles (nsec):
| 1.00th=[13248], 5.00th=[14912], 10.00th=[15040], 20.00th=[15296],
| 30.00th=[15424], 40.00th=[15680], 50.00th=[15808], 60.00th=[16064],
| 70.00th=[16512], 80.00th=[17792], 90.00th=[18560], 95.00th=[19584],
| 99.00th=[22912], 99.50th=[25984], 99.90th=[31616], 99.95th=[43264],
| 99.99th=[72192]
bw ( KiB/s): min= 2040, max= 2440, per=100.00%, avg=2305.38, stdev=85.23, samples=29
iops : min= 510, max= 610, avg=576.34, stdev=21.31, samples=29
lat (usec) : 2=0.01%, 20=96.45%, 50=3.51%, 100=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=1108, max=6437, avg=1735.07, stdev=240.93
sync percentiles (usec):
| 1.00th=[ 1188], 5.00th=[ 1565], 10.00th=[ 1582], 20.00th=[ 1631],
| 30.00th=[ 1663], 40.00th=[ 1680], 50.00th=[ 1696], 60.00th=[ 1713],
| 70.00th=[ 1778], 80.00th=[ 1827], 90.00th=[ 1893], 95.00th=[ 2024],
| 99.00th=[ 2638], 99.50th=[ 2966], 99.90th=[ 4047], 99.95th=[ 5473],
| 99.99th=[ 6456]
cpu : usr=0.29%, sys=0.89%, ctx=25839, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8615,0,8615 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2297KiB/s (2352kB/s), 2297KiB/s-2297KiB/s (2352kB/s-2352kB/s), io=33.7MiB (35.3MB), run=15001-15001msec
Disk stats (read/write):
nvme0n1: ios=46/17115, merge=0/0, ticks=10/14751, in_queue=29383, util=99.46%
Power Loss Protection: A Feature You Can't Ignore
Power Loss Protection (PLP) ensures that data sitting in the SSD's cache can still be written to NAND during an unexpected power outage, typically by drawing on on-board capacitors. With PLP, the cache effectively behaves like non-volatile storage, so the controller can acknowledge fsync() immediately, confident that the data will reach the NAND even if power is suddenly lost.
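PLP is a hardware feature, so the vendor's data sheet is the authoritative place to confirm it; the drive does not expose a single "has PLP" flag. As a rough hint, though, you can ask an NVMe drive whether it advertises a volatile write cache, and for SATA/SAS drives look up the exact model (the device names below are examples):
# NVMe: the vwc field in the identify-controller data describes the volatile write cache (requires nvme-cli).
nvme id-ctrl /dev/nvme0 | grep -i vwc
# SATA/SAS: smartctl prints the exact model so you can check its data sheet.
smartctl -i /dev/sda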
Compare that with an SSD that has Power Loss Protection (a Kioxia CD6), which averages about 36.6k IOPS in the same test:
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144MiB/s][w=36.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3720075: Wed Oct 18 17:04:25 2023
write: IOPS=36.6k, BW=143MiB/s (150MB/s)(2144MiB/15001msec); 0 zone resets
slat (usec): min=3, max=1828, avg= 5.84, stdev= 3.94
clat (nsec): min=880, max=1772.2k, avg=17675.95, stdev=5502.60
lat (usec): min=14, max=1867, avg=23.52, stdev= 6.86
clat percentiles (usec):
| 1.00th=[ 16], 5.00th=[ 17], 10.00th=[ 18], 20.00th=[ 18],
| 30.00th=[ 18], 40.00th=[ 18], 50.00th=[ 18], 60.00th=[ 18],
| 70.00th=[ 18], 80.00th=[ 19], 90.00th=[ 19], 95.00th=[ 20],
| 99.00th=[ 24], 99.50th=[ 28], 99.90th=[ 50], 99.95th=[ 69],
| 99.99th=[ 151]
bw ( KiB/s): min=134970, max=167288, per=100.00%, avg=146491.97, stdev=6017.07, samples=29
iops : min=33742, max=41822, avg=36622.90, stdev=1504.33, samples=29
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=96.92%, 50=2.95%
lat (usec) : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=2, max=1790, avg=20.43, stdev= 5.76
sync percentiles (usec):
| 1.00th=[ 17], 5.00th=[ 20], 10.00th=[ 20], 20.00th=[ 20],
| 30.00th=[ 20], 40.00th=[ 20], 50.00th=[ 21], 60.00th=[ 21],
| 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 22], 95.00th=[ 22],
| 99.00th=[ 28], 99.50th=[ 32], 99.90th=[ 55], 99.95th=[ 74],
| 99.99th=[ 159]
cpu : usr=19.77%, sys=40.01%, ctx=549142, majf=0, minf=24
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,548737,0,548736 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=143MiB/s (150MB/s), 143MiB/s-143MiB/s (150MB/s-150MB/s), io=2144MiB (2248MB), run=15001-15001msec
Disk stats (read/write):
nvme2n1: ios=0/542994, merge=0/0, ticks=0/7889, in_queue=7890, util=99.36%
Even the SATA-based Intel DC S3500 performs far better than the Samsung 980:
fio -ioengine=libaio -name=test -filename=/dev/sda -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=34.0MiB/s][w=8712 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3172: Thu Oct 19 04:58:04 2023
write: IOPS=8713, BW=34.0MiB/s (35.7MB/s)(511MiB/15001msec); 0 zone resets
slat (nsec): min=5381, max=60223, avg=5695.98, stdev=404.63
clat (usec): min=35, max=289, avg=49.89, stdev=15.41
lat (usec): min=44, max=295, avg=55.59, stdev=15.42
clat percentiles (usec):
| 1.00th=[ 41], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42],
| 30.00th=[ 43], 40.00th=[ 43], 50.00th=[ 44], 60.00th=[ 47],
| 70.00th=[ 49], 80.00th=[ 53], 90.00th=[ 69], 95.00th=[ 82],
| 99.00th=[ 113], 99.50th=[ 127], 99.90th=[ 182], 99.95th=[ 206],
| 99.99th=[ 265]
bw ( KiB/s): min=34672, max=34992, per=100.00%, avg=34877.79, stdev=57.17, samples=29
iops : min= 8668, max= 8748, avg=8719.45, stdev=14.29, samples=29
lat (usec) : 50=74.10%, 100=24.24%, 250=1.64%, 500=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=81, max=432, avg=108.51, stdev=32.18
sync percentiles (usec):
| 1.00th=[ 87], 5.00th=[ 88], 10.00th=[ 89], 20.00th=[ 90],
| 30.00th=[ 91], 40.00th=[ 92], 50.00th=[ 96], 60.00th=[ 100],
| 70.00th=[ 110], 80.00th=[ 124], 90.00th=[ 149], 95.00th=[ 163],
| 99.00th=[ 265], 99.50th=[ 297], 99.90th=[ 355], 99.95th=[ 367],
| 99.99th=[ 392]
cpu : usr=1.86%, sys=14.13%, ctx=392131, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,130710,0,130709 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=34.0MiB/s (35.7MB/s), 34.0MiB/s-34.0MiB/s (35.7MB/s-35.7MB/s), io=511MiB (535MB), run=15001-15001msec
Disk stats (read/write):
sda: ios=0/258623, merge=0/0, ticks=0/13669, in_queue=20496, util=99.21%
How to Test SSD Performance?
A good way to confirm whether an SSD is suitable for Ceph is to run a performance test. Below is a recommended test command:
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
The -fsync=1 option issues an fsync() after every write, which reflects how Ceph actually operates.
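To see how much a drive depends on its volatile cache, it can also be instructive to run the same workload twice, once with the flush and once without, and compare the results. As with the other tests in this article, pointing -filename at a raw device is destructive, so use a scratch disk or a test file:
# 1) A flush after every write, the pattern Ceph enforces.
fio -ioengine=libaio -name=sync_test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
# 2) The same workload without the flush, letting the drive use its cache.
fio -ioengine=libaio -name=cache_test -filename=/dev/nvme2n1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
# A large gap between the two usually means the drive leans on a volatile cache that Ceph will not let it use.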
Conclusion: Best Strategies for Choosing SSDs
Based on the discussion above, when choosing an SSD suitable for Ceph, you can follow these two basic principles:
- Enterprise NVMe > Enterprise SATA/SAS >>>>>> Consumer NVMe/SATA/SAS.
- SSDs should feature power-loss protection.
By making your selection based on these principles, you can ensure that your Ceph environment achieves optimal performance.
Appendix
Testing Environment
Samsung 980 1TB
- CPU: AMD Epyc 7413
- RAM: 8 x 32 GB DDR4 3200 RDIMM
- Kernel: 6.1.0-9-amd64
Kioxia CD6 3.84TB
- CPU: 2 x Ampere Altra Q80-30
- RAM: 4 x 32 GB DDR4 3200 RDIMM
- Kernel: 6.1.0-12-arm64
Intel DC S3500 1.6 TB
- CPU: AMD Epyc 7302P
- RAM: 4 x 32 GB DDR4 2933 RDIMM
- Kernel: 6.2.16-15-pve