Computed goto for efficient dispatch tables (2012)
Computed goto for efficient dispatch tables (2012)
While exploring the Python source code, specifically within the bytecode VM implementation found in Python/ceval.c, I noticed a compelling comment regarding the use of a GCC extension known as computed gotos [1].
Intrigued, I decided to build a basic example to compare the performance of a traditional switch statement against a computed goto within the context of a simple Virtual Machine (VM).
Defining a Simple Bytecode VM
To be clear, the "VM" in this scenario is a Bytecode Interpreter. Essentially, it is a loop that iterates through a sequence of instructions and executes them sequentially.
Rather than using the massive PyEval_EvalFrameEx function from Python (which is thousands of lines long), I've designed a minimal VM. Its state consists of a single integer, and it supports a few basic manipulation instructions. Despite its simplicity, the architecture mirrors that of professional VMs.
VM Instruction Set
The VM supports the following operations:
-
OP_HALT: Stop execution. -
OP_INC: Increment value. -
OP_DEC: Decrement value. -
OP_MUL2: Multiply by 2. -
OP_DIV2: Divide by 2. -
OP_ADD7: Add 7. -
OP_NEG: Negate the value.
Implementation via switch
The following code represents a standard C implementation:
#define OP_HALT 0x0
#define OP_INC 0x1
#define OP_DEC 0x2
#define OP_MUL2 0x3
#define OP_DIV2 0x4
#define OP_ADD7 0x5
#define OP_NEG 0x6
int interp_switch ( unsigned char * code, int initval) {
int pc = 0 ;
int val = initval;
while ( 1 ) {
switch (code[pc++]) {
case OP_HALT: return val;
case OP_INC: val++; break ;
case OP_DEC: val--; break ;
case OP_MUL2: val *= 2 ; break ;
case OP_DIV2: val /= 2 ; break ;
case OP_ADD7: val += 7 ; break ;
case OP_NEG: val = -val; break ;
default: return val;
}
}
}
In this model, an infinite loop processes the instruction stream, and the switch block determines the action based on the opcode. While the pc (program counter) currently moves linearly, adding flow-control instructions would be straightforward. Compilers typically optimize switch statements using a lookup table for jumps.
Computed Gotos
There is a GCC extension that can potentially outperform the standard switch. Computed gotos combine two specific capabilities:
- Label Addressing: The ability to assign a label to a pointer.
void * labeladdr = somelabel; - Indirect Jumping: The ability to
gotoa variable expression.goto *table[pc];
For those familiar with assembly, this is simply a high-level exposure of the "jump through a register" instruction common in modern CPUs.
Implementation via Computed Goto
Here is the same VM rewritten using this extension [2]:
int interp_cgoto ( unsigned char * code, int initval) {
/* The indices of labels in the dispatch_table are the relevant opcodes */
static void * dispatch_table[] = {
do_halt, do_inc, do_dec, do_mul2, do_div2, do_add7, do_neg
};
#define DISPATCH() goto *dispatch_table[code[pc++]]
int pc = 0 ;
int val = initval;
DISPATCH();
while ( 1 ) {
do_halt: return val;
do_inc: val++; DISPATCH();
do_dec: val--; DISPATCH();
do_mul2: val *= 2 ; DISPATCH();
do_div2: val /= 2 ; DISPATCH();
do_add7: val += 7 ; DISPATCH();
do_neg: val = -val; DISPATCH();
}
}
Benchmarking and Analysis
Using random opcodes, I found that the computed goto version was approximately faster than the switch version. While results vary based on data, CPython reports a speed increase of , which aligns with my findings.
Comparison Summary
| Feature | switch Statement | Computed Goto |
|---|---|---|
| C Standard | Standard C | GCC Extension |
| Jump Logic | Centralized jump point | Distributed jump points |
| Overhead | Includes bounds checking | Minimal overhead |
| Performance | Baseline | faster (in tests) |
Why is Computed Goto Faster?
The performance gain stems from two primary factors:
1. Reducing Per-Iteration Overhead
The switch statement performs more work per opcode.
Switch Workflow:
- Execute operation.
- Perform bounds check.
- Jump via table based on
code[pc].
Computed Goto Workflow:
- Execute operation.
pc++.- Jump via table based on
code[pc].
You might assume the bounds check is only there because of the , but that is incorrect. The C99 standard mandates this safety:default case
If no converted case constant expression matches and there is no default label, no part of the switch body is executed.
Because the compiler must guarantee this behavior, it generates a bounds check regardless of whether a default label exists.
2. Branch Prediction
Modern CPUs utilize deep pipelines to maintain efficiency. Branches can disrupt this flow, which is why CPUs use branch predictors.
(The diagram above illustrates the centralized bottleneck of a switch statement)
The computed goto approach allows the CPU to predict the next jump more effectively because the jump occurs at the end of each specific instruction handler rather than at one single central point.

Bonus Material
For the low-level enthusiasts, I have included annotated disassembly of both functions (compiled with -O3) further down in the original post.
