Epoll vs. io_uring in Linux

Author's Note: I want to share the journey of how I ended up researching asynchronous I/O options on Linux. It all started last year when my students and I developed TinyGate, a basic reverse proxy server.

Initially, TinyGate was a simple, worker-based implementation. While it functioned correctly, it wasn't designed for high performance—it was primarily an educational exercise. I was quite proud of creating a tool that felt "production-ready," but my students had higher ambitions. They were frustrated that our architecture had inherent bottlenecks that prevented it from competing with industry giants like haproxy or nginx.

This pushed us to dive deep into the internals of those high-performance tools to understand how to minimize overhead. Our evolution looked like this:

~~Version 1: Basic Worker-based (Slow)~~
~~Version 2: epoll-based (Significant boost, but still trailing)~~
Version 3: io_uring-based (Full rewrite for maximum efficiency)

Today, I will break down the two Linux I/O interfaces behind versions 2 and 3: epoll and io_uring.

The Legacy of `epoll`

When I first began Linux development, epoll was the gold standard. For a long time, it was essentially the only viable choice for scalable, event-driven network I/O on Linux.

The fundamental issue with epoll is its heavy reliance on system calls (syscalls). It operates on a readiness model: it notifies you that an I/O operation is possible, but it doesn't do the work for you. You must still manually invoke read() or write().

The Overhead Problem

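Every time a syscall is made, the CPU must transition from user mode to kernel mode and back. Strictly speaking this is a mode switch rather than a full context switch between processes, but it is commonly called a context switch, and it is far from free.

Assuming the worst case of one event per epoll_wait call, the total syscall cost can be represented as: $\text{Total Cost} = \text{epoll\_ctl (once)} + (\text{epoll\_wait} + \text{read/write}) \times \text{events}$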

When handling thousands of concurrent connections, these context switches create massive overhead.
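To make the formula above concrete, here is a small arithmetic sketch. The function names are my own, and the batched variant rests on an assumption: that every io_uring_enter call submits and reaps one full batch. It ignores setup calls and is a back-of-the-envelope model, not a benchmark.

```c
#include <assert.h>

/* Readiness model: one epoll_ctl for setup, then an epoll_wait plus a
 * read/write for every event (worst case: one event per epoll_wait). */
long epoll_syscalls(long events) {
    assert(events >= 0);
    return 1 + 2 * events;
}

/* Batched completion model, ASSUMING each io_uring_enter call both
 * submits and reaps `batch` operations. Setup calls are ignored. */
long batched_syscalls(long events, long batch) {
    assert(events >= 0 && batch > 0);
    return (events + batch - 1) / batch; /* ceiling division */
}
```

Under these assumptions, 10,000 events cost 20,001 syscalls in the readiness model and 313 with a batch size of 32. Real workloads land somewhere in between, since epoll_wait can also return several events per call.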

The Arrival of `io_uring`

Introduced in 2019 (roughly 17 years after epoll), io_uring changed the game. Instead of a readiness model, it uses a completion model. It doesn't tell you when you can do I/O; it tells you when the I/O is already finished.

How it Works

io_uring utilizes two ring buffers shared between the application and the kernel:

Submission Queue (SQ): Where the app posts requests.
Completion Queue (CQ): Where the kernel posts results.

![Architecture Diagram Placeholder: A diagram showing two circular buffers (SQ and CQ) with arrows moving from User Space to Kernel Space and back]
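The two queues described above share one idea: a fixed-size array plus a head index and a tail index, each advanced by exactly one side. The toy below is my own simplified, single-threaded illustration of that idea. The names (toy_ring, toy_ring_push, toy_ring_pop) are hypothetical, and the real SQ and CQ live in memory mapped from the kernel and rely on memory barriers that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Hypothetical, simplified ring. Not the kernel's actual layout. */
struct toy_ring {
    uint32_t head;                      /* advanced only by the consumer */
    uint32_t tail;                      /* advanced only by the producer */
    int entries[RING_ENTRIES];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int toy_ring_push(struct toy_ring *r, int value) {
    assert(r != NULL);
    if (r->tail - r->head == RING_ENTRIES)
        return -1;
    r->entries[r->tail & RING_MASK] = value; /* mask wraps the index */
    r->tail++;
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int toy_ring_pop(struct toy_ring *r, int *out) {
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return -1;
    *out = r->entries[r->head & RING_MASK];
    r->head++;
    return 0;
}
```

In the submission queue the application plays the producer and the kernel the consumer. In the completion queue the roles are reversed. Because each index has a single writer, neither side needs a lock to exchange entries.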

While you typically call io_uring_enter() to notify the kernel to process the SQ, a single call can submit and reap batches of operations. This drastically reduces the number of context switches.

Pro Tip: For those seeking near-zero syscalls, IORING_SETUP_SQPOLL creates a dedicated kernel thread that polls the submission queue automatically. Note: This consumes more CPU, because the thread busy-polls while work keeps arriving and only goes to sleep after a configurable idle period.

Side-by-Side Comparison

| Feature | `epoll` | `io_uring` |
| --- | --- | --- |
| Model | Readiness (Can I read?) | Completion (I have read!) |
| Syscall Frequency | High (Per operation) | Low (Per batch) |
| Kernel Boundary | Crossed frequently | Crossed rarely |
| Complexity | Moderate | Higher (Architectural shift) |
| Availability | Ancient (Kernel 2.5.44+) | Modern (Kernel 5.1+) |

On modern systems, there is rarely a reason to choose epoll over io_uring. The shift effectively moves the heavy lifting from the application layer into the kernel.

Implementation Examples

Below are examples of how to handle stdin using both methods. For io_uring, I've used liburing (the helper library) to keep the code readable.

1. The `epoll` Approach

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <stdlib.h>

#define MAX_EVENTS 8

int main() {
    // Create the epoll instance
    int epoll_fd = epoll_create1(0);
    if (epoll_fd == -1) {
        perror("epoll_create1");
        return 1;
    }

    // Register stdin (STDIN_FILENO)
    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    // Block until an event occurs
    int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
    if (n == -1) {
        perror("epoll_wait");
        return 1;
    }

    // Perform the actual I/O via a separate syscall
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == STDIN_FILENO) {
            char buf[256];
            ssize_t count = read(STDIN_FILENO, buf, sizeof(buf));
            if (count == -1) {
                perror("read");
                return 1;
            }
            printf("read %zd bytes\n", count);
        }
    }

    close(epoll_fd);
    return 0;
}

Analysis: This requires epoll_ctl for setup, then a pair of epoll_wait and read for every single event.

2. The `io_uring` Approach

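#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void) {
    struct io_uring ring;
    char buf[256];

    // Initialize the ring buffer. liburing returns -errno rather than
    // setting errno, so errors are reported with strerror(), not perror().
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Prepare a READ operation (IORING_OP_READ needs kernel 5.6 or newer)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "io_uring_get_sqe: submission queue is full\n");
        io_uring_queue_exit(&ring);
        return 1;
    }
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);

    // Submit the request to the kernel
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Wait for the completion event
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        return 1;
    }

    // cqe->res holds the byte count, or -errno if the read itself failed
    if (cqe->res < 0)
        fprintf(stderr, "read: %s\n", strerror(-cqe->res));
    else
        printf("Read completed with %d bytes\n", cqe->res);

    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}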

Analysis: The application simply describes the work it wants done and submits it. The kernel handles the execution and notifies the app upon completion.
