Why Veltrix Thought It Could Buy Its Way Out of a Distributed Lock Problem

Blizine Admin

Lillian Dube
Posted on May 31

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Hytale runs a server-side treasure hunt engine that must hand out unique rewards every second during global events. Each reward is a non-fungible item, so we needed strict linearizable ordering: the same treasure ID must never be emitted twice, even if the cluster partitions. The business rule was simple: no duplicate keys, no manual recovery, and a 5 ms p99 latency SLA.

Redis Cluster gave us eventual consistency within the slot shard, but it could not do cross-slot linearizable writes. When the cluster rebalanced, even for a second, requests started to race, and we saw duplicate quest keys in prod logs. That violated the spec, and we had to backfill 14,000 duplicate items in the account database.

What actually broke was not Redis itself; it was the optimistic assumption that Redis Cluster could behave like a single atomic register under partial failure. The client library redislock-py was retrying with exponential backoff, but without a fencing token, two clients could both believe they'd won the lock and emit the same treasure ID. The error we chased for two days was "MISCONF Redis is configured to save RDB snapshots, but the replica is too slow to persist", which masked the real race: two processes incrementing the same counter under split-brain.

What We Tried First (And Why It Failed)

First fix: shard the writes per realm so each key is single-slot. We rolled out a realm-to-slot mapping, but the mapping table grew to 120 MB and had to live in client memory. Any realm rebalance still forced a full client rollout, and we hit a bug in hytale-realm-client where the in-memory map was stale after a ZooKeeper re-election, leading to "slot not served" errors 3.2% of the time.

Second fix: use the Redlock algorithm inside the game service. We pulled the redis-py Redlock implementation and ran it against a 9
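
To make the fencing-token gap from the first section concrete, here is a minimal in-memory sketch in Python. It is not the code from the incident, and it does not talk to Redis: `LockService`, `TreasureStore`, `StaleTokenError`, and `emit_treasure` are hypothetical names invented for illustration. The idea it shows is that the lock service hands out a strictly increasing token with every grant, and the store that emits treasure IDs refuses any write carrying a token older than one it has already seen.

```python
import threading


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hypothetical stand-in for a lock service that issues fencing tokens."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._token = 0

    def acquire(self):
        # Each grant gets a strictly larger token, even if an earlier
        # holder still believes its lease is valid.
        with self._mutex:
            self._token += 1
            return self._token


class TreasureStore:
    """Hypothetical store that checks the fencing token on every write."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._highest_token = 0
        self._next_id = 0
        self.issued = []

    def emit_treasure(self, token):
        with self._mutex:
            # Reject writers whose lock grant has been superseded.
            if token < self._highest_token:
                raise StaleTokenError(f"token {token} < {self._highest_token}")
            self._highest_token = token
            self._next_id += 1
            self.issued.append(self._next_id)
            return self._next_id


locks = LockService()
store = TreasureStore()

token_a = locks.acquire()  # client A wins the lock, then stalls
token_b = locks.acquire()  # A's lease expires; client B wins the lock

first = store.emit_treasure(token_b)  # B writes first

try:
    store.emit_treasure(token_a)  # A wakes up and still thinks it holds the lock
    stale_rejected = False
except StaleTokenError:
    stale_rejected = True
```

In this sketch the check lives in the store, not in the client, so a client that wrongly believes it still holds the lock cannot emit a treasure ID. A real deployment would need the token check and the write to be atomic in whatever system owns the treasure counter, which this in-process example only imitates with a mutex.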

📰Dev.to — dev.to
