As SSD prices continue to fall, many enthusiasts and enterprises are exploring all-SSD Ceph storage pools in pursuit of higher performance. Achieving that performance, however, depends heavily on selecting the right SSD. In this article, we explore how to choose SSDs that are well suited for Ceph.
Why is SSD selection so important?
Today, consumers can buy high-capacity NVMe SSDs at relatively low prices, which tempts many technology enthusiasts and cost-conscious enterprises to go all-in on SSD-based Ceph. However, choosing SSDs on price alone rarely delivers the expected performance.
Ceph's unique write behavior
When evaluating Ceph, it's essential to understand how it handles writes. All Ceph write operations are transactional: each write must be acknowledged by every participating OSD and synced to disk via fsync() before it is considered complete. In other words, Ceph does not rely on the disk's volatile cache; it expects data to be committed to NAND.
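This write-then-sync pattern is easy to reproduce outside of Ceph. As a rough illustration (the file path is just an example, and dd is far cruder than fio), oflag=dsync makes every write synchronous, much like Ceph's commit path; divide the reported throughput by 4 KiB to estimate sync IOPS:
dd if=/dev/zero of=/tmp/ceph-sync-test bs=4k count=1000 oflag=direct,dsync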
This write pattern poses a major challenge for many consumer-grade SSDs. A drive advertised at 80,000 IOPS may achieve only 500–1,000 IOPS under synchronized writes.
The benchmark below shows that with fsync enabled, a Samsung 980 achieves only around 600 IOPS.
fio -ioengine=libaio -name=test -filename=/dev/nvme0n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=2046KiB/s][w=511 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3054244: Wed Oct 18 08:03:48 2023
write: IOPS=574, BW=2297KiB/s (2352kB/s)(33.7MiB/15001msec); 0 zone resets
slat (nsec): min=1804, max=27462, avg=4744.29, stdev=1192.14
clat (nsec): min=1082, max=72217, avg=16433.32, stdev=2111.01
lat (nsec): min=12904, max=76816, avg=21177.61, stdev=2867.60
clat percentiles (nsec):
| 1.00th=[13248], 5.00th=[14912], 10.00th=[15040], 20.00th=[15296],
| 30.00th=[15424], 40.00th=[15680], 50.00th=[15808], 60.00th=[16064],
| 70.00th=[16512], 80.00th=[17792], 90.00th=[18560], 95.00th=[19584],
| 99.00th=[22912], 99.50th=[25984], 99.90th=[31616], 99.95th=[43264],
| 99.99th=[72192]
bw ( KiB/s): min= 2040, max= 2440, per=100.00%, avg=2305.38, stdev=85.23, samples=29
iops : min= 510, max= 610, avg=576.34, stdev=21.31, samples=29
lat (usec) : 2=0.01%, 20=96.45%, 50=3.51%, 100=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=1108, max=6437, avg=1735.07, stdev=240.93
sync percentiles (usec):
| 1.00th=[ 1188], 5.00th=[ 1565], 10.00th=[ 1582], 20.00th=[ 1631],
| 30.00th=[ 1663], 40.00th=[ 1680], 50.00th=[ 1696], 60.00th=[ 1713],
| 70.00th=[ 1778], 80.00th=[ 1827], 90.00th=[ 1893], 95.00th=[ 2024],
| 99.00th=[ 2638], 99.50th=[ 2966], 99.90th=[ 4047], 99.95th=[ 5473],
| 99.99th=[ 6456]
cpu : usr=0.29%, sys=0.89%, ctx=25839, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8615,0,8615 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2297KiB/s (2352kB/s), 2297KiB/s-2297KiB/s (2352kB/s-2352kB/s), io=33.7MiB (35.3MB), run=15001-15001msec
Disk stats (read/write):
nvme0n1: ios=46/17115, merge=0/0, ticks=10/14751, in_queue=29383, util=99.46%
Power loss protection: An essential feature
An SSD's power-loss protection (PLP) guarantees that writes the drive has already accepted will reach NAND even if power is cut unexpectedly, typically thanks to onboard capacitors that keep the controller powered long enough to flush its cache. This makes the drive's cache effectively non-volatile, so it can acknowledge fsync() almost immediately instead of stalling on a full flush to flash.
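Before benchmarking, you can check how a drive presents its write cache to the OS. As a quick sanity check (device names here are examples; adjust to your system), the sysfs attribute below shows whether the kernel treats the cache as write-back, and nvme-cli's VWC field reports whether a volatile write cache is present at all:
cat /sys/block/nvme0n1/queue/write_cache
nvme id-ctrl /dev/nvme0n1 | grep -i vwc
Note that some PLP drives still report a volatile cache, so the decisive test remains the fsync benchmark itself.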
The same benchmark on a power-loss-protected SSD (Kioxia CD6) tells a very different story: it averages 36.6k IOPS, roughly sixty times the Samsung 980.
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144MiB/s][w=36.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3720075: Wed Oct 18 17:04:25 2023
write: IOPS=36.6k, BW=143MiB/s (150MB/s)(2144MiB/15001msec); 0 zone resets
slat (usec): min=3, max=1828, avg= 5.84, stdev= 3.94
clat (nsec): min=880, max=1772.2k, avg=17675.95, stdev=5502.60
lat (usec): min=14, max=1867, avg=23.52, stdev= 6.86
clat percentiles (usec):
| 1.00th=[ 16], 5.00th=[ 17], 10.00th=[ 18], 20.00th=[ 18],
| 30.00th=[ 18], 40.00th=[ 18], 50.00th=[ 18], 60.00th=[ 18],
| 70.00th=[ 18], 80.00th=[ 19], 90.00th=[ 19], 95.00th=[ 20],
| 99.00th=[ 24], 99.50th=[ 28], 99.90th=[ 50], 99.95th=[ 69],
| 99.99th=[ 151]
bw ( KiB/s): min=134970, max=167288, per=100.00%, avg=146491.97, stdev=6017.07, samples=29
iops : min=33742, max=41822, avg=36622.90, stdev=1504.33, samples=29
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=96.92%, 50=2.95%
lat (usec) : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=2, max=1790, avg=20.43, stdev= 5.76
sync percentiles (usec):
| 1.00th=[ 17], 5.00th=[ 20], 10.00th=[ 20], 20.00th=[ 20],
| 30.00th=[ 20], 40.00th=[ 20], 50.00th=[ 21], 60.00th=[ 21],
| 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 22], 95.00th=[ 22],
| 99.00th=[ 28], 99.50th=[ 32], 99.90th=[ 55], 99.95th=[ 74],
| 99.99th=[ 159]
cpu : usr=19.77%, sys=40.01%, ctx=549142, majf=0, minf=24
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,548737,0,548736 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=143MiB/s (150MB/s), 143MiB/s-143MiB/s (150MB/s-150MB/s), io=2144MiB (2248MB), run=15001-15001msec
Disk stats (read/write):
nvme2n1: ios=0/542994, merge=0/0, ticks=0/7889, in_queue=7890, util=99.36%
Even a SATA-based enterprise SSD such as the Intel DC S3500 dramatically outperforms the Samsung 980, averaging about 8.7k IOPS in the same test.
fio -ioengine=libaio -name=test -filename=/dev/sda -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=34.0MiB/s][w=8712 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3172: Thu Oct 19 04:58:04 2023
write: IOPS=8713, BW=34.0MiB/s (35.7MB/s)(511MiB/15001msec); 0 zone resets
slat (nsec): min=5381, max=60223, avg=5695.98, stdev=404.63
clat (usec): min=35, max=289, avg=49.89, stdev=15.41
lat (usec): min=44, max=295, avg=55.59, stdev=15.42
clat percentiles (usec):
| 1.00th=[ 41], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42],
| 30.00th=[ 43], 40.00th=[ 43], 50.00th=[ 44], 60.00th=[ 47],
| 70.00th=[ 49], 80.00th=[ 53], 90.00th=[ 69], 95.00th=[ 82],
| 99.00th=[ 113], 99.50th=[ 127], 99.90th=[ 182], 99.95th=[ 206],
| 99.99th=[ 265]
bw ( KiB/s): min=34672, max=34992, per=100.00%, avg=34877.79, stdev=57.17, samples=29
iops : min= 8668, max= 8748, avg=8719.45, stdev=14.29, samples=29
lat (usec) : 50=74.10%, 100=24.24%, 250=1.64%, 500=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=81, max=432, avg=108.51, stdev=32.18
sync percentiles (usec):
| 1.00th=[ 87], 5.00th=[ 88], 10.00th=[ 89], 20.00th=[ 90],
| 30.00th=[ 91], 40.00th=[ 92], 50.00th=[ 96], 60.00th=[ 100],
| 70.00th=[ 110], 80.00th=[ 124], 90.00th=[ 149], 95.00th=[ 163],
| 99.00th=[ 265], 99.50th=[ 297], 99.90th=[ 355], 99.95th=[ 367],
| 99.99th=[ 392]
cpu : usr=1.86%, sys=14.13%, ctx=392131, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,130710,0,130709 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=34.0MiB/s (35.7MB/s), 34.0MiB/s-34.0MiB/s (35.7MB/s-35.7MB/s), io=511MiB (535MB), run=15001-15001msec
Disk stats (read/write):
sda: ios=0/258623, merge=0/0, ticks=0/13669, in_queue=20496, util=99.21%
How to test SSD performance?
To verify whether an SSD is truly suitable for Ceph, the best approach is to benchmark it yourself. A suggested command is shown below. Be aware that it writes directly to the raw device and will destroy any data on it:
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
Here -fsync=1 forces an fsync() after every write, matching Ceph's transactional write model, -direct=1 bypasses the page cache, and -bs=4k with -iodepth=1 measures worst-case single-queue random-write latency.
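As a complementary check (not part of the methodology above, and the device path is again an example), you can drop -fsync=1 and raise the queue depth to see the drive's peak parallel random-write throughput. A drive that shines here but collapses in the fsync test is exactly the kind to avoid for Ceph:
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -direct=1 -bs=4k -iodepth=128 -rw=randwrite -runtime=15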
Conclusion: Best strategy for choosing SSDs
Based on the above discussion, when selecting an SSD for Ceph, you can follow these two basic principles:
- Enterprise-grade NVMe > Enterprise-grade SATA/SAS >>>>> Consumer-grade NVMe/SATA/SAS.
- The SSD has built-in power-loss protection.
Follow these two principles and your Ceph environment will be well positioned to deliver the performance you expect.
Appendix
Test Environment
Samsung 980 1TB
- CPU: AMD Epyc 7413
- RAM: 8 x 32GB DDR4 3200 RDIMM
- kernel: 6.1.0-9-amd64
Kioxia CD6 3.84TB
- CPU: 2 x Ampere Altra Q80-30
- RAM: 4 x 32GB DDR4 3200 RDIMM
- kernel: 6.1.0-12-arm64
Intel DC S3500 1.6TB
- CPU: AMD Epyc 7302P
- RAM: 4 x 32GB DDR4 2933 RDIMM
- kernel: 6.2.16-15-pve