Epoll vs. io_uring in Linux
Author's Note: I want to share the journey of how I ended up researching asynchronous I/O options on Linux. It all started last year when my students and I developed TinyGate, a basic reverse proxy server.
Initially, TinyGate was a simple, worker-based implementation. While it functioned correctly, it wasn't designed for high performance—it was primarily an educational exercise. I was quite proud of creating a tool that felt "production-ready," but my students had higher ambitions. They were frustrated that our architecture had inherent bottlenecks that prevented it from competing with industry giants like haproxy or nginx.
This pushed us to dive deep into the internals of those high-performance tools to understand how to minimize overhead. Our evolution looked like this:
- Version 1: Basic worker-based (slow)
- Version 2: epoll-based (significant boost, but still trailing)
- Version 3: io_uring-based (full rewrite for maximum efficiency)
Today, I will break down these two Linux interfaces for asynchronous I/O: epoll and io_uring.
The Legacy of epoll
When I first began Linux development, epoll was the gold standard. For a long time, it was essentially the only viable choice for scalable, event-driven I/O on Linux.
The fundamental issue with epoll is its heavy reliance on system calls (syscalls). It operates on a readiness model: it notifies you that an I/O operation is possible, but it doesn't do the work for you. You must still manually invoke read() or write().
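To make the two-step nature of the readiness model concrete, here is a minimal sketch. The helper name read_when_ready is hypothetical (it is not part of any library), and creating an epoll instance per call is done only to keep the example short; a real server would create one instance and reuse it.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable, then read from it.
 * Returns the byte count from read(), or -1 on any failure. */
ssize_t read_when_ready(int fd, char *buf, size_t len) {
    assert(buf != NULL && len > 0);

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    /* Step 1: epoll only reports that fd is ready; no data is copied. */
    struct epoll_event ready;
    int n = epoll_wait(epfd, &ready, 1, -1);

    /* Step 2: a second syscall performs the actual transfer. */
    ssize_t got = -1;
    if (n == 1 && (ready.events & EPOLLIN))
        got = read(fd, buf, len);

    close(epfd);
    return got;
}
```

Called on the read end of a pipe that already holds two bytes, this should return 2. Note that epoll works with pipes, sockets, and terminals, but epoll_ctl() rejects regular files.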
The Overhead Problem
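Every syscall forces the CPU to switch from user mode to kernel mode and back. This is cheaper than a full context switch between processes, but it is far from free.
In the simplest case, each I/O event costs two syscalls: one epoll_wait() to learn that the descriptor is ready, and one read() or write() to move the data. A single epoll_wait() can report several ready descriptors at once, but every one of them still needs its own read() or write().
When handling thousands of concurrent connections, these transitions add up to significant overhead.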
The Arrival of io_uring
Introduced in 2019 (roughly 17 years after epoll), io_uring changed the game. Instead of a readiness model, it uses a completion model. It doesn't tell you when you can do I/O; it tells you when the I/O is already finished.
How it Works
io_uring utilizes two ring buffers shared between the application and the kernel:
- Submission Queue (SQ): Where the app posts requests.
- Completion Queue (CQ): Where the kernel posts results.
![Architecture Diagram Placeholder: A diagram showing two circular buffers (SQ and CQ) with arrows moving from User Space to Kernel Space and back]
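The two rings described above can be pictured with a toy model. The sketch below is not the kernel's actual layout: struct toy_ring and its functions are names I made up for illustration, and it is single-threaded, so it leaves out the memory barriers that a real ring shared between the application and the kernel needs. What it does show is the core idea of one side advancing a tail index and the other side advancing a head index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u  /* must be a power of two so masking works */

/* Toy model only: not the kernel's struct layout. */
struct toy_ring {
    uint32_t head;                 /* advanced by the consumer */
    uint32_t tail;                 /* advanced by the producer */
    uint64_t slots[RING_ENTRIES];
};

void toy_ring_init(struct toy_ring *r) {
    assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0);
    r->head = 0;
    r->tail = 0;
}

/* Producer side: returns false when the ring is full. */
bool toy_ring_push(struct toy_ring *r, uint64_t value) {
    if (r->tail - r->head == RING_ENTRIES)
        return false;
    r->slots[r->tail & (RING_ENTRIES - 1)] = value;
    r->tail++;  /* publishing the entry is just an index bump */
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *out) {
    if (r->head == r->tail)
        return false;
    *out = r->slots[r->head & (RING_ENTRIES - 1)];
    r->head++;
    return true;
}
```

In this picture, the application is the producer on the SQ and the consumer on the CQ, while the kernel plays the opposite roles. Because both sides only read and write shared memory, posting a request or collecting a result does not by itself require a syscall.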
While you typically call io_uring_enter() to tell the kernel to process the SQ, a single call can submit a whole batch of operations and wait for their completions. This drastically reduces the number of user/kernel transitions.
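A rough way to see the effect of batching is to count syscalls under a simplified model. The two functions below are my own back-of-the-envelope sketch, not measurements: they assume every epoll_wait() returns a fixed number of ready descriptors, and that every io_uring_enter() both submits a full batch and collects its completions.

```c
#include <assert.h>

/* Readiness model: one epoll_wait() per wakeup, plus one read() or
 * write() for every ready descriptor. */
unsigned long epoll_syscalls(unsigned long events, unsigned long ready_per_wait) {
    assert(ready_per_wait > 0);
    unsigned long waits = (events + ready_per_wait - 1) / ready_per_wait;
    return waits + events;
}

/* Completion model (assumed): one io_uring_enter() per batch of operations. */
unsigned long uring_syscalls(unsigned long ops, unsigned long batch) {
    assert(batch > 0);
    return (ops + batch - 1) / batch;
}
```

Under these assumptions, 10,000 events with 8 ready descriptors per wakeup cost 11,250 syscalls with epoll, while 10,000 operations in batches of 32 cost 313 calls with io_uring. Real workloads will not be this tidy, but the model shows why the per-operation syscall is the part that hurts.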
Pro Tip: For those seeking near-zero syscalls, IORING_SETUP_SQPOLL creates a dedicated kernel thread to poll the submission queue automatically. Note: This consumes more CPU as the thread spins.
Side-by-Side Comparison
| Feature | epoll | io_uring |
|---|---|---|
| Model | Readiness (Can I read?) | Completion (I have read!) |
| Syscall Frequency | High (Per operation) | Low (Per batch) |
| Kernel Boundary | Crossed frequently | Crossed rarely |
| Complexity | Moderate | Higher (Architectural shift) |
| Availability | Ancient (Kernel 2.5.44+) | Modern (Kernel 5.1+) |
On modern kernels where io_uring is available and enabled, there is rarely a performance reason to choose epoll over io_uring for new code. The shift effectively moves the heavy lifting from the application layer into the kernel.
Implementation Examples
Below are examples of how to handle stdin using both methods. For io_uring, I've used liburing (the helper library) to keep the code readable.
1. The epoll Approach
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <stdlib.h>

#define MAX_EVENTS 8

int main(void) {
    // Create the epoll instance
    int epoll_fd = epoll_create1(0);
    if (epoll_fd == -1) {
        perror("epoll_create1");
        return 1;
    }

    // Register stdin (STDIN_FILENO)
    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    // Block until an event occurs
    int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
    if (n == -1) {
        perror("epoll_wait");
        return 1;
    }

    // Perform the actual I/O via a separate syscall
    for (int i = 0; i < n; i++) {
        if (events[i].data.fd == STDIN_FILENO) {
            char buf[256];
            ssize_t count = read(STDIN_FILENO, buf, sizeof(buf));
            printf("read %zd bytes\n", count);
        }
    }

    close(epoll_fd);
    return 0;
}
Analysis: This requires epoll_ctl for setup, then a pair of epoll_wait and read for every single event.
2. The io_uring Approach
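#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>
#include <stdlib.h>

int main(void) {
    struct io_uring ring;
    char buf[256];

    // Initialize the ring buffer. liburing returns -errno rather than
    // setting errno, so we use strerror() instead of perror().
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Prepare a READ operation (IORING_OP_READ needs kernel 5.6+)
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "io_uring_get_sqe: submission queue is full\n");
        return 1;
    }
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);

    // Submit the request to the kernel
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        return 1;
    }

    // Wait for the completion event
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        return 1;
    }

    // cqe->res holds the byte count, or -errno if the read itself failed
    if (cqe->res < 0)
        fprintf(stderr, "read: %s\n", strerror(-cqe->res));
    else
        printf("Read completed with %d bytes\n", cqe->res);

    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}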
Analysis: The application simply describes the work it wants done and submits it. The kernel handles the execution and notifies the app upon completion.