Firecracker Internals

· 21 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

If you've used AWS Lambda or Fargate, your code ran inside Firecracker. Not a container. Not a traditional VM. A microVM - a lightweight virtual machine that boots in ~125 milliseconds, adds about 5 MiB of memory overhead per microVM, and provides the hard security boundary of hardware virtualization.

Firecracker was open-sourced by AWS in 2018, and the NSDI '20 paper revealed the engineering decisions behind it. But most engineers interact with it indirectly - through Lambda invocations or Fargate tasks - without understanding what's happening underneath.

This article is a deep dive into Firecracker's internals. We'll walk through the full virtualization stack - from KVM ioctls to VirtIO virtqueues - and build a working microVM from scratch along the way. The goal is to give you a mental model of how modern lightweight virtualization actually works: not just what it is, but why each design decision was made.

S3 Is Not an Object Store. It's a Consensus Store.

· 14 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

Distributed coordination requires locks, leader election, and consistent configuration. The standard approach: deploy etcd or ZooKeeper, or provision a DynamoDB table with conditional expressions. Teams treat S3 as file storage.

Object stores now support consensus primitives.

Google Cloud Storage (GCS) has supported conditional writes via generation-match preconditions since its early releases. Azure Blob Storage has supported If-Match / If-None-Match on ETags for just as long. S3 was the last of the three to adopt them. In May 2024, materializedview.io identified the gap: "S3 has no compare-and-swap operation, something every single other competitor has."

S3 closed the gap in two releases. If-None-Match shipped in August 2024. Full If-Match Compare-And-Swap (CAS) shipped in November 2024. AWS described the feature as "reliably offloading compare and swap operations to S3."

All three major object stores now support CAS. Herlihy proved in 1991 that CAS is a universal primitive: a single operation sufficient to build a wait-free implementation of any concurrent data structure.

Every major object store provides a universal coordination primitive. Most teams have not recognized this.
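To make the two conditional-write modes concrete, here is a minimal sketch. It uses an in-memory stand-in rather than a real S3, GCS, or Azure client, so `FakeObjectStore`, `PreconditionFailed`, `try_acquire_lock`, and `increment_counter` are all hypothetical names. The ETag is modelled as a simple version counter, which is an assumption for illustration only. What it shows is the shape of the logic: If-None-Match creates an object only if it does not exist, and If-Match overwrites only if nobody else wrote first.

```python
from typing import Optional


class PreconditionFailed(Exception):
    """The conditional write lost the race (real stores answer HTTP 412)."""


class FakeObjectStore:
    """Single-threaded, in-memory stand-in for an object store (hypothetical)."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, str]] = {}
        self._version = 0  # assumption: ETag modelled as a version counter

    def get(self, key: str) -> Optional[tuple[bytes, str]]:
        """Return (body, etag), or None if the key does not exist."""
        return self._objects.get(key)

    def put(self, key: str, body: bytes,
            if_none_match: Optional[str] = None,
            if_match: Optional[str] = None) -> str:
        current = self._objects.get(key)
        # If-None-Match: * means "create only if the object is absent".
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        # If-Match: <etag> means "overwrite only if the ETag is unchanged".
        if if_match is not None and (current is None or current[1] != if_match):
            raise PreconditionFailed(key)
        self._version += 1
        etag = f"v{self._version}"
        self._objects[key] = (body, etag)
        return etag


def try_acquire_lock(store: FakeObjectStore, key: str, owner: str) -> bool:
    """Create-if-absent: only the first caller gets the lock object."""
    try:
        store.put(key, owner.encode(), if_none_match="*")
        return True
    except PreconditionFailed:
        return False


def increment_counter(store: FakeObjectStore, key: str) -> int:
    """Compare-and-swap loop: read, compute, write back only if unchanged."""
    while True:
        current = store.get(key)
        try:
            if current is None:
                store.put(key, b"1", if_none_match="*")
                return 1
            body, etag = current
            new_value = int(body.decode()) + 1
            store.put(key, str(new_value).encode(), if_match=etag)
            return new_value
        except PreconditionFailed:
            continue  # someone else wrote first; re-read and retry
```

Against a real store the retry loop would look the same, with the precondition carried in the request headers and a failed precondition reported by the service instead of by a local exception.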

AWS EC2: What's Running Underneath?

· 17 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

Every developer who's worked with AWS has launched an EC2 instance. You pick an instance type, choose an AMI, SSH in, and deploy your app. Somewhere in the back of your mind, you know there's virtualization happening. But that's where most people stop thinking about it.

Here's what might surprise you: when AWS launched EC2 in August 2006, every instance ran on Xen - an open-source Type 1 bare-metal hypervisor originally created by Ian Pratt and Keir Fraser at the University of Cambridge in 2003. Then, starting around 2017 with the C5 instance family, AWS began migrating to Nitro: a custom platform built on KVM, which is usually classified as a Type 2 hosted hypervisor. In the textbook hierarchy, Type 1 sits closer to hardware and is considered superior. So why would AWS move down a tier?

The answer is that the Type 1 vs Type 2 distinction is misleading. What actually matters is where I/O is handled. And Nitro solved that problem in dedicated hardware, making the hypervisor classification almost irrelevant.

EC2 Instance Types: A Complete Guide to Choosing the Right Compute

· 19 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

Every workload on AWS starts with a choice: which EC2 instance type? Pick wrong and you overpay for idle resources or starve your application. AWS offers hundreds of instance types across families, generations, and sizes. This guide breaks down the naming convention, walks through every family, explains the Nitro system, and covers the newer Flex instances.

Inside Flink's Control Plane: How Apache Pekko Powers the RPC Layer

· 21 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

Flink's distributed components must communicate constantly. TaskManagers report task state changes to the JobMaster. The JobMaster requests slots from the ResourceManager. Dispatchers serve REST API queries about job status. All these components access shared state, particularly the ExecutionGraph. Traditional multi-threading with locks would invite race conditions, deadlocks, and unmaintainable code. Flink solves this by adopting the Actor Model through the Akka/Pekko framework. Each component processes all requests on a single thread through a FIFO mailbox. This design rules out concurrent access to a component's state by architecture, not by locks.
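The single-threaded mailbox idea can be sketched in a few lines. This is not Flink or Pekko code: `TaskStateActor`, `tell`, `stop`, and `run_demo` are hypothetical names, and Python's thread-safe `queue.Queue` stands in for the mailbox. The point is that the actor's own state is read and modified by exactly one thread, so it needs no lock of its own even when many senders post messages concurrently.

```python
import queue
import threading


class TaskStateActor:
    """Hypothetical actor: owns its state and mutates it on one thread only."""

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()  # FIFO mailbox
        self._task_states: dict[str, str] = {}      # touched only by _run
        self._updates = 0                           # touched only by _run
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, task_id: str, new_state: str) -> None:
        """Called from any thread: enqueue a message, never touch state."""
        self._mailbox.put((task_id, new_state))

    def stop(self) -> tuple[dict[str, str], int]:
        """Enqueue a stop marker, wait for the drain, return final state."""
        self._mailbox.put(None)
        self._thread.join()
        return dict(self._task_states), self._updates

    def _run(self) -> None:
        # Messages are handled strictly one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            task_id, new_state = message
            self._task_states[task_id] = new_state
            self._updates += 1  # unguarded read-modify-write, safe here


def run_demo(senders: int = 4, messages_per_sender: int = 1000):
    actor = TaskStateActor()

    def send(sender_id: int) -> None:
        for i in range(messages_per_sender):
            actor.tell(f"task-{sender_id}-{i}", "RUNNING")

    threads = [threading.Thread(target=send, args=(s,)) for s in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return actor.stop()
```

The trade-off is that every request to a component is serialized, so a slow message handler delays everything queued behind it.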

The Universal Primitive - How CAS Became the Foundation of Concurrent Programming

· 31 min read
Arun Lakshman Ravichandran
Software Engineer, AWS

This blog post is inspired by the first 6 chapters of The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit.

Imagine building a shared counter that must handle millions of updates per second across dozens of threads. Traditional locks serialize access, creating bottlenecks. You need something better: a way for threads to coordinate without blocking, without deadlocks, without the performance collapse that comes with contention. This isn't just a performance optimization problem; it's a fundamental question about what synchronization primitives are actually necessary. Can we build wait-free concurrent data structures? Which hardware instructions must processors provide? The answer, discovered through decades of theoretical work, reveals that one primitive, Compare-And-Swap (CAS), is universal.
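As a rough illustration of the counter scenario, here is the classic CAS retry loop. Python exposes no hardware CAS instruction, so the hypothetical `AtomicInt` class emulates one: its internal lock only stands in for the atomicity a CPU instruction would provide and is not part of the algorithm. `cas_increment` and `run_demo` are likewise illustrative names.

```python
import threading


class AtomicInt:
    """Emulated CAS cell (assumption: the lock models hardware atomicity)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._guard = threading.Lock()

    def load(self) -> int:
        return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Atomically set the value to `new` only if it still equals `expected`."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def cas_increment(cell: AtomicInt) -> int:
    """Read, compute, attempt the swap; retry if another thread got in first."""
    while True:
        observed = cell.load()
        if cell.compare_and_swap(observed, observed + 1):
            return observed + 1


def run_demo(threads: int = 8, increments: int = 1000) -> int:
    cell = AtomicInt()

    def worker() -> None:
        for _ in range(increments):
            cas_increment(cell)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return cell.load()
```

With a real hardware CAS, this loop is lock-free rather than wait-free: some thread always makes progress, but an individual thread may retry an unbounded number of times under contention. Wait-free constructions need additional machinery on top of CAS.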