
Atomic Test And Set Of Disk Block Returned False For Equality

Technical Review: Atomic Test-and-Set Returning False (Equality Failure)

service can enter a degraded state, making the host unresponsive to management commands.

How to Fix It

Environment:

10-node Ceph cluster, BlueStore backend, NVMe-over-Fabrics.

Error: OSD logs repeated: bluestore/StupidAllocator.cc: atomic test and set of disk block 0x4a20b returned false for equality.

Root cause: A network partition caused two OSDs to believe they held the same allocation bitmap lock. The storage array (NVMe target) correctly rejected the second OSD’s compare-and-write.

Fix: Reduced osd_heartbeat_grace from 20s to 5s, enabled faster fencing, and implemented retry logic with jitter.

Set:

If it matches (equality), the host updates the block with its own signature to claim ownership.
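The test and set steps can be sketched as a single atomic operation. This is a toy in-memory model, not the storage array's real interface: the `BlockDevice` class, the `atomic_test_and_set` method, the 16-byte block size, and the all-zero "free" marker are all assumptions made for illustration. A Python lock stands in for the atomicity the array would provide.

```python
import threading


class BlockDevice:
    """Toy in-memory stand-in for a shared disk (hypothetical, for illustration)."""

    def __init__(self, num_blocks, block_size=16):
        self._blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Models the array's guarantee that compare and write happen as one step.
        self._lock = threading.Lock()

    def atomic_test_and_set(self, lba, expected, new):
        """Write `new` to block `lba` only if it currently equals `expected`.

        Returns True on equality (write applied), False otherwise.
        """
        with self._lock:
            if self._blocks[lba] != expected:
                return False  # the "returned false for equality" case
            self._blocks[lba] = new
            return True

    def read(self, lba):
        with self._lock:
            return self._blocks[lba]


FREE = bytes(16)  # assumption: an all-zero block means "unowned"
disk = BlockDevice(num_blocks=4)

host_a = b"host-a".ljust(16, b"\0")
host_b = b"host-b".ljust(16, b"\0")

# Both hosts read the block as free, then race to claim it.
a_won = disk.atomic_test_and_set(0, FREE, host_a)  # block still free: claim succeeds
b_won = disk.atomic_test_and_set(0, FREE, host_b)  # block now holds host_a: rejected
```

In this sketch the second caller gets False because its expected value is stale, which is the behavior the error message reports: the block no longer contains what the caller thought it did.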

If false is returned consistently and the caller retries immediately each time, it ends up in a spin-loop.
  • CPU Waste: On a single-core system, the spinning thread consumes the CPU time that the lock holder needs in order to run and release the lock.
  • Cache Coherency Traffic: On multi-core systems, every test-and-set attempt is a write, even when it returns false. Each attempt pulls the cache line holding the lock variable into the spinning core and invalidates it on the others, so repeated failures generate heavy coherency traffic. This is commonly called cache-line bouncing (or, loosely, "cache thrashing").
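One common way to avoid a tight spin-loop is to bound the number of attempts and wait a randomized, growing interval between them. The following is a minimal sketch under stated assumptions: `acquire_with_backoff`, its parameters, and the default delays are hypothetical names and values chosen for illustration, and `try_acquire` stands for whatever operation returns True or False (for example, an atomic test-and-set call).

```python
import random
import time


def acquire_with_backoff(try_acquire, max_attempts=8, base_delay=0.01,
                         max_delay=1.0, sleep=time.sleep, rng=random.random):
    """Call try_acquire() until it returns True or attempts run out.

    Between failed attempts, sleep for a random time in [0, cap), where cap
    doubles each attempt up to max_delay (exponential backoff with full jitter).
    """
    for attempt in range(max_attempts):
        if try_acquire():
            return True
        if attempt + 1 < max_attempts:  # no point sleeping after the final failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
    return False


# Usage: a stand-in operation that fails three times, then succeeds.
outcomes = iter([False, False, False, True])
delays = []  # record the requested sleeps instead of actually sleeping
ok = acquire_with_backoff(lambda: next(outcomes), sleep=delays.append)
```

The jitter matters when several contenders fail at the same moment: without it they would all retry in lockstep and collide again. Returning False after a bounded number of attempts lets the caller escalate (for example, to fencing or an error path) instead of spinning indefinitely.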