The Universal Primitive - How CAS Became the Foundation of Concurrent Programming

February 3, 2026 · 31 min read

Software Engineer, AWS

This blog post is inspired by the first 6 chapters of The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit.

Imagine building a distributed counter that must handle millions of updates per second across dozens of threads. Traditional locks serialize access, creating bottlenecks. You need something better: a way for threads to coordinate without blocking, without deadlocks, without the performance collapse that comes with contention. This isn't just a performance optimization problem; it's a fundamental question about what synchronization primitives are actually necessary. Can we build wait-free concurrent data structures? Which hardware instructions must processors provide? The answer, discovered through decades of theoretical work, reveals that one primitive, Compare-And-Swap (CAS), is universal.

Modern processors have dozens of cores, and our applications must leverage them all. From distributed databases processing millions of transactions per second to real-time analytics engines crunching streaming data, concurrent programming has moved from a specialized skill to a fundamental requirement. Yet building correct concurrent systems remains notoriously difficult: race conditions lurk in seemingly innocent code, deadlocks emerge from complex lock hierarchies, and performance bottlenecks appear where we least expect them.

The real challenge isn't just making threads cooperate; it's understanding which synchronization primitives actually give us the power we need. Over decades, computer scientists developed countless approaches: Peterson's algorithm for mutual exclusion, Lamport's bakery algorithm for fairness, sophisticated lock implementations with elaborate protocols. But a deeper question remained unanswered: Are these atomic building blocks fundamentally different in power? Can some primitives solve problems that others simply cannot? More critically for systems engineers: If we're designing hardware or choosing synchronization mechanisms for a new platform, which primitives must we provide?

The answer fundamentally changed how we think about concurrent programming: Compare-And-Swap (CAS) is universal. This isn't marketing hyperbole: it's a mathematically proven property. Any concurrent object that can be specified sequentially can be implemented in a wait-free manner using CAS. From simple locks to complex data structures, from blocking algorithms to non-blocking ones, CAS provides sufficient power to construct them all. This universality explains why every modern processor, from ARM to x86, from mobile chips to data center CPUs, provides CAS or its equivalent as a fundamental instruction.

But what exactly is CAS? At its core, Compare-And-Swap is an atomic operation that reads a memory location, compares it to an expected value, and only updates it to a new value if the comparison succeeds. In Java, this is exposed through classes like AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

// CAS operation: atomically compare and update
AtomicInteger counter = new AtomicInteger(0);  // Initialize to 0

// Thread 1: Try to increment from 0 to 1
int expected = 0;        // What we expect the current value to be
int newValue = 1;         // What we want to set it to
boolean success = counter.compareAndSet(expected, newValue);
// If counter was 0, it's now 1 and success = true
// If counter was already changed by another thread, success = false

// Thread 2 (concurrent): Also tries to increment
int myExpected = 0;
int myNewValue = 1;
boolean mySuccess = counter.compareAndSet(myExpected, myNewValue);
// Only one thread will succeed - CAS guarantees atomicity

The key insight is that compareAndSet executes atomically: it reads the current value, compares it to expected, and only updates to newValue if they match. If another thread modified the value between the read and write, the operation fails and returns false, allowing the thread to retry. This atomicity, the guarantee that the comparison and update happen as a single, indivisible operation, is what makes CAS powerful enough to build universal constructions.

Understanding this journey from basic mutual exclusion to universal constructions isn't just academic: it's the foundation for reasoning about concurrent systems at scale.

In this post, we'll explore:

Early mutual exclusion algorithms (Peterson's, Bakery) that revealed the limitations of read/write operations
Formal definitions of correctness (linearizability) and progress conditions that enable rigorous reasoning
The consensus hierarchy that measures primitive power and reveals fundamental limitations
Universal constructions that prove CAS can implement any concurrent object wait-free

The Synchronization Problem

Mutual exclusion is the foundational problem in concurrent programming, and early solutions revealed both the possibility and the inherent limitations of using only read/write operations. When multiple threads access shared resources, we need mechanisms to ensure critical sections execute atomically: one thread at a time. The pioneering algorithms from the 1960s through 1980s demonstrated that mutual exclusion could be achieved using only memory reads and writes, but at a cost that foreshadowed deeper theoretical constraints.

Peterson's algorithm elegantly solved mutual exclusion for two threads using just two flags and a turn variable. Each thread signals its intent to enter the critical section, then yields priority to the other thread. The beauty lies in its simplicity: no special hardware instructions required, just careful ordering of reads and writes. Yet this simplicity masks a critical limitation: it only works for two threads. Extending Peterson's approach to n threads requires exponentially complex tournament trees, and even then, threads must actively spin while waiting, burning CPU cycles.

Here's Peterson's algorithm implemented in Java, showing how mutual exclusion is achieved using only read/write operations:

class PetersonsLock {
    // Flags indicating each thread's desire to enter critical section
    private volatile boolean[] flag = new boolean[2];
    // Turn variable: which thread should yield priority
    private volatile int turn;
    
    // Thread 0 calls this to acquire the lock
    public void lock0() {
        flag[0] = true;           // Signal intent to enter
        turn = 1;                 // Give priority to thread 1
        // Wait while thread 1 wants to enter AND it's thread 1's turn
        while (flag[1] && turn == 1) {
            // Busy-wait: spin until condition is false
            Thread.yield();       // Hint to scheduler (optional)
        }
        // Now in critical section
    }
    
    // Thread 1 calls this to acquire the lock
    public void lock1() {
        flag[1] = true;           // Signal intent to enter
        turn = 0;                 // Give priority to thread 0
        // Wait while thread 0 wants to enter AND it's thread 0's turn
        while (flag[0] && turn == 0) {
            Thread.yield();
        }
        // Now in critical section
    }
    
    // Thread 0 releases the lock
    public void unlock0() {
        flag[0] = false;          // Signal we're done
    }
    
    // Thread 1 releases the lock
    public void unlock1() {
        flag[1] = false;          // Signal we're done
    }
}

The algorithm works through careful coordination: each thread sets its flag to true (indicating desire to enter), then sets turn to favor the other thread. The thread waits (spins) only if both threads want to enter AND it's the other thread's turn. This ensures mutual exclusion: at most one thread can be in the critical section. However, notice the while loop: threads must continuously check the condition, consuming CPU cycles. This busy-waiting is the blocking behavior that weaker primitives force upon us.

Why this matters in practice: In real systems, busy-waiting wastes CPU cycles that could be used for productive work. If a thread holding a lock is preempted or runs slowly, all waiting threads spin uselessly, consuming power and reducing overall throughput. This is why blocking algorithms can perform poorly under contention: they're vulnerable to priority inversion, convoying (where slow threads delay fast ones), and wasted CPU cycles. Modern systems need synchronization mechanisms that provide progress guarantees even when some threads are slow or crash.

Lamport's bakery algorithm generalized the solution to n threads by drawing inspiration from a bakery's ticket system. Each thread takes a number, and threads enter in numerical order. This achieved both safety (mutual exclusion) and fairness (first-come, first-served), making it a significant theoretical advance. The algorithm's elegance comes at the price of complexity: comparing ticket numbers requires careful handling of ties, and threads must scan all other threads' tickets before entering. More critically, like Peterson's algorithm, bakery forces threads into busy-waiting: they're blocked not by sleeping, but by continuously checking conditions.

What these algorithms collectively demonstrate is profound: mutual exclusion is achievable with read/write operations alone, but the resulting solutions are inherently blocking. Whether through busy-waiting spins in Peterson's protocol or ticket-checking loops in bakery, threads cannot make progress independently. They must wait, they must check, they must coordinate through shared memory locations that require constant polling. This blocking nature isn't a flaw in the algorithms: it's a fundamental consequence of the weakness of read/write operations themselves, a limitation that would prove mathematically inevitable.

But to prove that limitation mathematically, and to understand which primitives are truly necessary, we need formal definitions. What does it mean for a concurrent algorithm to be "correct"? How do we characterize different levels of blocking? These questions require precise answers before we can establish the hierarchy of primitive power.

Defining Correctness

Before we can evaluate synchronization mechanisms or prove algorithms correct, we must first define what "correct" actually means in concurrent systems. In sequential programming, correctness is straightforward: given the same inputs, your function produces the expected output. But in concurrent programming, operations overlap in time, multiple threads interleave their actions unpredictably, and the same sequence of function calls can produce different results depending on timing. Without a formal definition of correctness, we're left arguing subjectively about whether an implementation "works" or debating whether a test failure represents a real bug or just unfortunate timing.

Linearizability provides the gold standard: a concurrent execution is correct if it appears equivalent to some sequential execution, where each operation takes effect instantaneously at some point between its invocation and response. This "linearization point" gives us a powerful mental model: despite the chaos of concurrent operations overlapping in time, we can reason about them as if they happened one at a time in some valid order. A concurrent queue is correct if it behaves like a sequential FIFO queue, just with operations atomically "snapping" into place at their linearization points. Critically, linearizability is compositional: if each individual object in your system is linearizable, the entire system is linearizable. This compositionality is what makes large-scale concurrent systems tractable: you can reason about components independently without worrying about how their combination might violate correctness.

A quick example makes this concrete. Suppose three threads hit a queue concurrently:

Time -------------------------------------------->

T1:  |--- enqueue(5) ---|
T2:       |--- enqueue(7) ---|
T3:            |-- dequeue() --|

These operations overlap, but linearizability says we can pick one instant within each operation's interval where it "takes effect." One valid linearization order: enqueue(5), then enqueue(7), then dequeue() → 5. The queue behaves as if those three calls happened sequentially in that order. Verifying correctness reduces to: does some valid sequential ordering exist that respects the real-time constraints?

Beyond correctness, we need to characterize how much blocking we're willing to tolerate, which progress conditions formalize into a precise hierarchy. Wait-freedom is the strongest guarantee: every thread completes its operation in a bounded number of steps, regardless of what other threads do, even if they crash or run arbitrarily slowly. Lock-freedom weakens this slightly: at least one thread always makes progress, though individual threads might starve. Obstruction-freedom weakens further: a thread makes progress if it eventually runs without interference. At the bottom sits traditional blocking synchronization using locks, where threads can wait indefinitely. This hierarchy isn't just theoretical taxonomy: it has direct performance implications. Wait-free algorithms never stall on slow threads, making them ideal for real-time systems. Lock-free algorithms avoid deadlock and convoying but may starve individual threads. Blocking algorithms are simpler to write but vulnerable to priority inversion, deadlock, and performance collapse under contention.

The progress conditions form a clear hierarchy from strongest to weakest guarantees:

Progress Condition	Guarantee	Notes
Wait-Freedom	Every thread completes in bounded steps	Even if others crash or are slow
Lock-Freedom	At least one thread always makes progress	Individual threads may starve
Obstruction-Freedom	Thread makes progress if it runs alone	May block under contention
Blocking (Locks)	Threads can wait indefinitely	Deadlock, convoying possible

Each level weakens the guarantee: wait-freedom promises per-thread progress, lock-freedom promises system-wide progress, obstruction-freedom promises progress only when uncontended, and blocking makes no progress guarantees. This hierarchy helps us choose the right progress condition for our use case: real-time systems need wait-freedom, while many high-performance systems can tolerate lock-freedom's potential starvation.

These definitions, linearizability for correctness and progress conditions for liveness, form the vocabulary that makes rigorous reasoning about concurrent systems possible. Without linearizability, we couldn't formally state what it means for a concurrent hash table or queue to be "correct." Without progress conditions, we couldn't distinguish between a lock-free algorithm that guarantees system-wide progress and a wait-free algorithm that guarantees per-thread progress. More importantly, these definitions set up the critical questions that follow: Can we achieve wait-freedom with just read/write operations? Do different atomic primitives offer different guarantees? The precision of these definitions enables the mathematical proofs and impossibility results that come next.

Armed with these definitions, we can now ask the fundamental question: Are all synchronization primitives equally powerful, or do some offer capabilities that others simply cannot provide? The answer, discovered through the consensus problem, reveals a strict hierarchy that explains why modern processors provide CAS.

Primitive Power Hierarchy

Not all atomic operations are created equal: some primitives are fundamentally more powerful than others, capable of solving problems that weaker primitives cannot. We've seen that read/write operations suffice for mutual exclusion through algorithms like Peterson's and Bakery. We've defined correctness through linearizability and characterized blocking through progress conditions. But a critical question remains: Are the primitives we choose merely a matter of convenience and performance, or do they fundamentally determine what's algorithmically possible? Can we achieve wait-free synchronization with read/write registers alone, or do we need stronger hardware support?

Atomic registers establish the baseline by exploring what they can and cannot achieve. Atomic registers, memory locations supporting atomic read and write operations, form the weakest primitive in our hierarchy. They demonstrate their power through atomic snapshots, a technique that allows multiple registers to be read "simultaneously" in a consistent state despite concurrent updates. An atomic snapshot reads all registers atomically, giving a consistent view even if other threads are modifying them. Multi-reader, multi-writer registers can be constructed from single-writer registers using techniques like atomic snapshots, proving that certain concurrent abstractions are achievable with patient engineering. Yet throughout these constructions, a pattern emerges: algorithms using only registers require threads to help each other, retry operations, and fundamentally cannot guarantee that every thread completes in bounded steps. The constructions work, but they're complex, and they hint at fundamental limitations lurking beneath the surface.

The consensus problem and its associated hierarchy make these limitations precise. Consensus is deceptively simple: n threads each propose a value, and they must all agree on one of the proposed values. It's the atomic commitment problem at the heart of distributed systems, the "all or nothing" decision that underlies everything from database transactions to leader election.

Critically, consensus is different from mutual exclusion. While Lamport's bakery algorithm demonstrates that mutual exclusion can be solved for n threads using only read/write operations (albeit with blocking), consensus is a fundamentally harder problem. Mutual exclusion ensures only one thread accesses a resource at a time: it's about exclusion. Consensus requires all threads to agree on a single value: it's about agreement. More importantly, the consensus number measures a primitive's ability to solve consensus wait-free, not just solve it with blocking. While read/write operations can achieve mutual exclusion for many threads through blocking algorithms, they cannot achieve wait-free consensus for even two threads.

Here's a concrete example of the consensus problem:

// Consensus problem: n threads propose values, all must agree on one
// Example scenario:
//   Thread 1 proposes: "Alice"
//   Thread 2 proposes: "Bob"  
//   Thread 3 proposes: "Alice"
// All threads must agree on either "Alice" or "Bob" (one of the proposed values)
// This is harder than mutual exclusion because it requires agreement, not just exclusion

Herlihy's breakthrough insight was that consensus serves as a measuring stick for primitive power. Every synchronization primitive has a "consensus number": the maximum number of threads for which it can solve consensus wait-free. Read/write registers have consensus number 1 (they can't even solve two-thread consensus wait-free). Test-and-set and swap have consensus number 2. Compare-and-swap, along with Load-Linked/Store-Conditional, have consensus number infinity: they can solve consensus for any number of threads.

Here's how CAS solves 2-thread consensus, demonstrating its power:

import java.util.concurrent.atomic.AtomicReference;

class Consensus {
    private AtomicReference<Object> decision = new AtomicReference<Object>(null);
    
    /**
     * Solve consensus for 2 threads using CAS.
     * Each thread proposes a value, all agree on the first one to succeed.
     */
    public Object decide(Object proposed) {
        // Try to set decision to our proposed value
        // Only the first thread succeeds; others see non-null and return that value
        if (decision.compareAndSet(null, proposed)) {
            // We won! Our value is the decision
            return proposed;
        } else {
            // Another thread already decided; we agree with their choice
            return decision.get();
        }
    }
}

// Usage:
// Thread 1: result = consensus.decide("Alice")  // Might return "Alice" or "Bob"
// Thread 2: result = consensus.decide("Bob")     // Returns same value as Thread 1
// Both threads now have the same result - consensus achieved!

This simple implementation shows why CAS has consensus number infinity: it can solve consensus for any number of threads by ensuring only one thread's proposal wins, and all others agree with that winner.

The consensus number tells us the maximum number of threads for which a primitive can solve the consensus problem wait-free. Primitives with higher consensus numbers are strictly more powerful: they can solve problems that weaker primitives cannot. This isn't just a performance difference; it's a fundamental computational limitation.

Here's a comparison of how different primitives stack up:

Primitive	Consensus Number	Can Solve Mutual Exclusion?	Wait-Free Consensus?	Notes
Read/Write	1	Yes (blocks)	No	Lamport's algorithm works but requires blocking/busy-waiting
Test-and-Set	2	Yes	Yes (2 threads max)	Limited to 2 threads for wait-free consensus
CAS	∞	Yes	Yes	Universal - can solve wait-free consensus for any number of threads

The impossibility result is what makes this hierarchy mathematically rigorous rather than empirical observation: you cannot solve wait-free consensus for two or more threads using only read/write registers. This isn't a statement about clever algorithms we haven't discovered yet: it's a fundamental impossibility proven through valency arguments and careful reasoning about execution schedules. No matter how ingenious your algorithm, no matter how many registers you use or how cleverly you structure them, you cannot build a wait-free consensus protocol for two threads with read/write operations alone. This explains why Peterson's and Bakery algorithms must block: the blocking isn't a design choice, it's a mathematical necessity given their primitive operations. If you want wait-free synchronization for multiple threads, you need primitives with higher consensus numbers. The hierarchy isn't about performance optimization: it's about what's computationally possible.

Why hardware designers care: When designing a processor, you face a fundamental question: which atomic instructions should you provide? The consensus hierarchy provides a clear answer: if you want software to be able to build wait-free concurrent algorithms, you must provide primitives with consensus number infinity (like CAS). Without CAS, certain classes of problems are literally impossible to solve wait-free. This isn't a matter of performance: it's a matter of computational capability. This is why every modern processor architecture converged on providing CAS or its equivalent: they recognized that weak primitives fundamentally limit what software can achieve.

These insights fundamentally reframe how we think about hardware synchronization support. Processors don't provide compare-and-swap just because it's faster than building complex protocols with reads and writes: they provide it because certain problems are literally impossible to solve wait-free without it. The consensus hierarchy explains why modern architectures converged on CAS-like instructions: they recognized that weak primitives fundamentally limit what software can achieve. This sets up the final revelation: primitives with infinite consensus numbers aren't just powerful: they're universal.

But universality is a bold claim. Does having consensus number infinity mean CAS can solve consensus for many threads, or does it mean something more profound? The universal construction theorem provides the answer: CAS doesn't just solve consensus: it can implement any concurrent object whatsoever.

Universal Solution

The consensus hierarchy revealed a gap between primitives, but primitives with infinite consensus numbers bridge that gap completely. We know read/write registers cannot solve consensus for multiple threads. We know test-and-set gets us to two threads but no further. We know compare-and-swap has consensus number infinity. But infinity is a strange claim: does it simply mean "works for arbitrarily many threads," or does it mean something more profound? The answer comes with mathematical precision: objects that solve consensus for n threads are universal for n threads. They can implement any concurrent object whatsoever.

The universal construction provides the explicit algorithm: given any sequential specification of an object and a consensus primitive, you can build a wait-free concurrent implementation. The construction is elegant in its directness. Maintain a log of operations applied to the object. When a thread wants to perform an operation, it proposes that operation as the "next" one to apply. Threads use consensus to agree on which operation wins. The winner's operation gets appended to the log and applied to the object state. All threads can then compute the result by replaying the log. Repeat for the next operation. This isn't an optimization or a special case: it's a fully general construction that works for any object you can specify sequentially: queues, stacks, hash tables, counters, priority queues, or objects not yet invented.

The universal construction algorithm proceeds as follows:

Operation proposal: When a thread invokes an operation, it creates an operation descriptor containing the operation type and arguments, then proposes this descriptor as the next entry in the shared operation log.
Consensus decision: All threads concurrently proposing operations participate in a consensus protocol. The consensus primitive guarantees that exactly one proposal wins: this is the operation that will be applied next.
Log append: The winning operation descriptor is atomically appended to the shared log. This log serves as the linearization order: operations appear in the order they were decided by consensus.
State reconstruction: Each thread independently replays the log from the beginning, applying each operation sequentially to reconstruct the current object state. Since all threads see the same log, they compute identical states.
Result computation: Threads compute the operation's return value by examining the reconstructed state. For read operations, this is straightforward. For write operations, the result may depend on the state after applying the operation.
Completion: The thread returns the computed result. Since consensus is wait-free (each thread completes in bounded steps), and log replay is deterministic, the entire operation completes wait-free.

The key insight is that consensus serializes operations (establishing a total order), while log replay ensures all threads compute consistent results without requiring explicit coordination beyond the consensus protocol itself.

Here's a simplified example of how the universal construction builds a concurrent queue. The sequential specification is straightforward: a queue supports enqueue(item) and dequeue() operations that follow FIFO order.

// Simplified universal construction for a queue
class UniversalQueue<T> {
    private List<Operation> log = new ArrayList<>();  // Operation log
    private Queue<T> state = new LinkedList<>();      // Sequential state
    
    // Consensus object to decide next operation
    private Consensus<Operation> consensus = new Consensus<>();
    
    public void enqueue(T item) {
        // Propose enqueue operation
        Operation op = new EnqueueOp(item);
        
        // Use consensus to decide if this operation wins
        Operation winner = consensus.decide(op);
        
        // Append winner to log
        synchronized(log) {
            log.add(winner);
        }
        
        // All threads replay log to compute current state
        replayLog();
    }
    
    public T dequeue() {
        Operation op = new DequeueOp();
        Operation winner = consensus.decide(op);
        
        synchronized(log) {
            log.add(winner);
        }
        
        replayLog();
        
        // Return result based on final state
        return state.poll();  // Simplified - actual implementation tracks results
    }
    
    private void replayLog() {
        // Replay all operations to compute current state
        state.clear();
        for (Operation op : log) {
            op.apply(state);
        }
    }
}

This demonstrates the universal construction pattern: operations are proposed, consensus decides the winner, the log grows, and all threads independently compute results. While this simplified version has performance limitations (everyone replays the entire log), optimized versions use techniques like helping and early termination. The key insight is that CAS-based consensus makes this construction wait-free: every thread completes in bounded steps regardless of others' behavior.

Why software engineers benefit: The universal construction provides a systematic recipe for building concurrent objects. Instead of inventing clever tricks for each data structure, you can apply the universal construction to any sequential specification. While optimized implementations often outperform the universal construction, it serves as a correctness proof: if CAS can build it wait-free using the universal construction, then optimized wait-free implementations are possible. This gives you confidence when designing concurrent systems: you know that CAS provides sufficient power to build whatever you need. Real-world systems like Java's ConcurrentHashMap use sophisticated CAS-based algorithms that outperform the universal construction, but the universality theorem guarantees that such implementations exist.

What makes this truly universal is that it guarantees wait-freedom: every thread completes its operation in a bounded number of steps. No thread waits for locks. No thread spins checking conditions. No thread can be blocked by slower threads or crashed threads. Each thread proposes, participates in consensus, computes the result, and completes, all in predictable, bounded time. This is the theoretical ideal of concurrent programming: the responsiveness of sequential code combined with the scalability of parallel execution. The construction proves that wait-freedom isn't some unattainable dream requiring clever tricks for each data structure: it's a systematic consequence of having consensus objects.

This universality extends beyond wait-free algorithms to encompass the entire space of concurrent programming. The construction can implement locks themselves: mutual exclusion becomes just another concurrent object built atop consensus. It can implement semaphores, barriers, read-write locks, any synchronization primitive we've discussed or will invent. More subtly, while the universal construction produces wait-free implementations, consensus objects can also be used to build lock-free or even blocking implementations with different performance tradeoffs. The point isn't that CAS forces you into wait-free algorithms; it's that CAS gives you the power to choose. With weaker primitives like read/write registers, certain algorithmic approaches are simply impossible. With CAS, every approach becomes possible.

Performance Implications

Understanding when CAS helps versus when it might hurt is crucial for practical system design. CAS excels in low-contention scenarios where threads rarely conflict: operations typically succeed on the first attempt, providing excellent performance without the overhead of lock acquisition. CAS also shines when you need progress guarantees: wait-free and lock-free algorithms built with CAS never deadlock and provide stronger liveness guarantees than traditional locks.

However, CAS has trade-offs. Under high contention, CAS can suffer from cache line bouncing: multiple threads repeatedly modifying the same memory location cause expensive cache coherence traffic. In extreme cases, a simple lock might perform better because it serializes access and reduces cache misses. The retry loops in CAS-based algorithms can also waste CPU cycles when many threads compete, though at least one thread always makes progress (lock-freedom).

The choice between wait-free, lock-free, and blocking approaches depends on your requirements:

Wait-free: Best for real-time systems where every thread must complete in bounded time, even if others crash. Higher overhead but strongest guarantees.
Lock-free: Good for high-performance systems where deadlock is unacceptable but some starvation is tolerable. Better scalability than locks under contention.
Blocking (locks): Simplest to reason about and often fastest under high contention due to reduced cache traffic. Vulnerable to deadlock and priority inversion.

Modern systems often use hybrid approaches: CAS for hot paths with low contention, locks for high-contention scenarios, and sophisticated lock-free data structures (like Java's ConcurrentHashMap) that combine multiple techniques.

Here's a practical example: a lock-free counter implemented using CAS. This demonstrates CAS's power in a simple, concrete form:

import java.util.concurrent.atomic.AtomicInteger;

class LockFreeCounter {
    // CAS-based counter: no locks, lock-free increment
    private AtomicInteger value = new AtomicInteger(0);
    
    /**
     * Increment the counter atomically using CAS.
     * This is lock-free: at least one thread makes progress, but retries may be unbounded.
     */
    public void increment() {
        int current;
        do {
            // Read current value
            current = value.get();
            // Try to update: CAS(current, current+1)
            // If another thread changed value, this fails and we retry
        } while (!value.compareAndSet(current, current + 1));
        // Loop exits when CAS succeeds (we won the race)
    }
    
    /**
     * Get the current counter value.
     * This is a simple read, always wait-free.
     */
    public int get() {
        return value.get();
    }
    
    /**
     * Decrement the counter atomically using CAS.
     * Same pattern as increment: retry until CAS succeeds.
     */
    public void decrement() {
        int current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current - 1));
    }
}

The key pattern is the CAS loop: read the current value, attempt to update it, and retry if another thread modified it in between. This is lock-free (at least one thread makes progress) but not wait-free, as retries are unbounded: a thread could theoretically retry indefinitely if other threads keep modifying the value. Contrast this with Peterson's algorithm, which requires busy-waiting and blocking. CAS gives us the power to build non-blocking algorithms that scale under contention.

The practical impact explains the hardware landscape we inhabit today. Every modern processor architecture, from x86's CMPXCHG to ARM's LDREX/STREX to RISC-V's LR/SC to SPARC's CAS, provides compare-and-swap or its equivalent precisely because universality isn't just theoretical elegance: it's engineering necessity. When designing a processor, you could provide dozens of specialized atomic instructions for different data structures. Or you could provide one universal primitive and let software build everything else. The consensus hierarchy proved that some primitives are fundamentally insufficient. The universality theorem proved that CAS is fundamentally sufficient. This is why CAS became the assembly language of concurrency: not through committee decision or vendor preference, but through mathematical inevitability. If your hardware provides consensus objects, your software can build anything. And that "anything" includes both the sophisticated lock-free algorithms powering high-performance systems and the simple, correct locks that make everyday programming tractable.

Key Takeaways

Before we conclude, let's summarize the essential insights:

CAS is universal: Any concurrent object that can be specified sequentially can be implemented wait-free using CAS. This isn't just convenient: it's mathematically proven.
Consensus number measures primitive power: Every synchronization primitive has a consensus number: the maximum number of threads for which it can solve consensus wait-free. Higher consensus numbers mean strictly more powerful primitives.
Wait-free consensus for 2+ threads is impossible with read/write alone: This impossibility result explains why Peterson's and Bakery algorithms must block: it's not a design choice, it's a mathematical necessity.
Modern processors provide CAS because it's necessary, not just convenient: Hardware designers recognized that certain problems are literally impossible to solve wait-free without CAS-like primitives. This is why every modern architecture converged on CAS.
The universal construction provides a systematic approach: Rather than inventing clever tricks for each data structure, the universal construction gives us a general recipe for building wait-free concurrent objects from consensus primitives.
Performance trade-offs matter: CAS excels under low contention but can suffer from cache line bouncing under high contention. The choice between wait-free, lock-free, and blocking approaches depends on your specific requirements.

Conclusion

The journey from Peterson's algorithm to universal constructions isn't just a historical progression: it's a logical proof that unfolded over decades. Each step builds on the previous, moving from concrete examples to abstract principles, from intuitive algorithms to mathematical impossibility results, and finally to the profound realization that one primitive can serve as the foundation for all concurrent programming.

This theoretical foundation has profound practical implications. When you reach for a concurrent data structure library like Java's java.util.concurrent, you're benefiting from algorithms built on CAS. When you debate lock-free versus locked implementations, you're weighing trade-offs that the consensus hierarchy makes precise. When you evaluate whether your architecture provides adequate synchronization support, you're applying insights that explain why every modern processor provides CAS.

For practicing engineers, understanding the consensus hierarchy provides a framework for making informed decisions:

Choose CAS-based algorithms when you need progress guarantees and can tolerate some retry overhead
Understand the limitations of read/write operations: they can solve mutual exclusion but require blocking
Recognize that CAS universality means you can build any concurrent object, but optimized implementations often outperform the universal construction
Appreciate why hardware matters: processors provide CAS not as a convenience, but as a necessity for certain classes of problems

As concurrent systems continue to scale, from multi-core processors to distributed systems spanning continents, the principles established by the consensus hierarchy remain foundational. CAS isn't just another instruction in the processor's repertoire. It's the universal building block that makes modern concurrent systems possible, and understanding why it's universal helps us build better systems for the future.

Bonus: Implementing a Lock with CAS

To make the universality of CAS concrete, let's implement a simple spin lock using only CAS operations. This demonstrates how CAS can build the fundamental synchronization primitive, mutual exclusion, that we started with.

A lock needs to track whether it's currently held. We'll use an AtomicInteger where 0 means unlocked and 1 means locked. The lock() method must atomically check if the lock is 0 and set it to 1 if so. The unlock() method simply sets it back to 0.

import java.util.concurrent.atomic.AtomicInteger;

public class CASLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = unlocked, 1 = locked
    
    /**
     * Acquire the lock by atomically transitioning from unlocked (0) to locked (1).
     * Spins until successful - this is lock-free (at least one thread makes progress)
     * but not wait-free (a thread may spin indefinitely).
     */
    public void lock() {
        // Keep trying until we successfully change state from 0 to 1
        while (!state.compareAndSet(0, 1)) {
            // Lock is held by another thread - spin (busy-wait)
            // In production, you might add Thread.yield() or exponential backoff
        }
        // We successfully acquired the lock
    }
    
    /**
     * Release the lock by setting state back to unlocked (0).
     * This is wait-free - always completes in one step.
     */
    public void unlock() {
        // Simply set state back to 0
        // No CAS needed - only the lock holder calls unlock()
        state.set(0);
    }
    
    /**
     * Try to acquire the lock without blocking.
     * Returns true if lock was acquired, false otherwise.
     */
    public boolean tryLock() {
        return state.compareAndSet(0, 1);
    }
}

How it works:

The lock() method uses CAS in a retry loop: it attempts to atomically change the state from 0 (unlocked) to 1 (locked). If another thread already holds the lock, the CAS fails (because state is already 1), and the thread retries. Only one thread can successfully transition from 0 to 1, ensuring mutual exclusion.

Why this matters:

This implementation demonstrates CAS's power in a concrete way. We've built mutual exclusion, the problem Peterson's algorithm solved with read/write operations, using CAS. Unlike Peterson's algorithm, this lock:

Works for any number of threads (not just two)
Uses a single memory location (not multiple flags and turn variables)
Is simpler to understand and reason about

However, this is a spin lock: threads busy-wait when the lock is held. In practice, production locks combine CAS with OS-level blocking primitives (like futex on Linux) to avoid wasting CPU cycles. But the core mechanism, using CAS to atomically transition between states, remains the same.

The deeper insight:

This lock implementation is lock-free (at least one thread always makes progress) but not wait-free (individual threads may spin indefinitely). To build a wait-free lock, you'd need more sophisticated techniques, but the universality theorem guarantees such implementations exist: CAS provides sufficient power to build them.

This simple example illustrates why CAS is universal: if you can build locks with CAS, and locks can build any synchronization primitive, then CAS can build anything. The universal construction provides the general recipe; this lock is a concrete, practical example of CAS's power.

References

Herlihy, M., & Shavit, N. (2012). The Art of Multiprocessor Programming (Revised First Edition). Morgan Kaufmann. Amazon
Herlihy, M. (1991). Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(1), 124-149. This paper introduced the concept of wait-freedom and the universal construction.
Herlihy, M. (1991). Impossibility and universality results for wait-free synchronization. Proceedings of the seventh annual ACM symposium on Principles of distributed computing, 276-290. This paper established the consensus hierarchy and universality results.
Java java.util.concurrent package: Real-world implementations of CAS-based concurrent data structures. Documentation
Peterson, G. L. (1981). Myths about the mutual exclusion problem. Information Processing Letters, 12(3), 115-116. The original Peterson's algorithm paper.
Lamport, L. (1974). A new solution of Dijkstra's concurrent programming problem. Communications of the ACM, 17(8), 453-455. The bakery algorithm.

The Synchronization Problem​

Defining Correctness​

Primitive Power Hierarchy​

Universal Solution​

Performance Implications​

Key Takeaways​

Conclusion​

Bonus: Implementing a Lock with CAS​

References​