Computed goto for efficient dispatch tables (2012)

While exploring the Python source code, specifically within the bytecode VM implementation found in Python/ceval.c, I noticed a compelling comment regarding the use of a GCC extension known as computed gotos [1].

Intrigued, I decided to build a basic example to compare the performance of a traditional switch statement against a computed goto within the context of a simple Virtual Machine (VM).

Defining a Simple Bytecode VM

To be clear, the "VM" in this scenario is a Bytecode Interpreter. Essentially, it is a loop that iterates through a sequence of instructions and executes them sequentially.

Rather than using the massive PyEval_EvalFrameEx function from Python (which is thousands of lines long), I've designed a minimal VM. Its state consists of a single integer, and it supports a few basic manipulation instructions. Despite its simplicity, the architecture mirrors that of professional VMs.

VM Instruction Set

The VM supports the following operations:

Implementation via `switch`

The following code represents a standard C implementation:

#define OP_HALT 0x0
#define OP_INC 0x1
#define OP_DEC 0x2
#define OP_MUL2 0x3
#define OP_DIV2 0x4
#define OP_ADD7 0x5
#define OP_NEG 0x6

int interp_switch ( unsigned char * code, int initval) {
    int pc = 0 ;
    int val = initval;
    while ( 1 ) {
        switch (code[pc++]) {
            case OP_HALT: return val;
            case OP_INC: val++; break ;
            case OP_DEC: val--; break ;
            case OP_MUL2: val *= 2 ; break ;
            case OP_DIV2: val /= 2 ; break ;
            case OP_ADD7: val += 7 ; break ;
            case OP_NEG: val = -val; break ;
            default: return val;
        }
    }
}

In this model, an infinite loop processes the instruction stream, and the switch block determines the action based on the opcode. While the pc (program counter) currently moves linearly, adding flow-control instructions would be straightforward. Compilers typically optimize switch statements using a lookup table for jumps.

Computed Gotos

There is a GCC extension that can potentially outperform the standard switch. Computed gotos combine two specific capabilities:

Label Addressing: The ability to assign a label to a pointer. void * labeladdr = somelabel;
Indirect Jumping: The ability to goto a variable expression. goto *table[pc];

For those familiar with assembly, this is simply a high-level exposure of the "jump through a register" instruction common in modern CPUs.

Implementation via Computed Goto

Here is the same VM rewritten using this extension [2]:

int interp_cgoto ( unsigned char * code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes */
    static void * dispatch_table[] = { 
        do_halt, do_inc, do_dec, do_mul2, do_div2, do_add7, do_neg
    };

    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0 ;
    int val = initval;

    DISPATCH();
    while ( 1 ) {
        do_halt: return val;
        do_inc: val++; DISPATCH();
        do_dec: val--; DISPATCH();
        do_mul2: val *= 2 ; DISPATCH();
        do_div2: val /= 2 ; DISPATCH();
        do_add7: val += 7 ; DISPATCH();
        do_neg: val = -val; DISPATCH();
    }
}

Benchmarking and Analysis

Using random opcodes, I found that the computed goto version was approximately $\approx 25\%$ faster than the switch version. While results vary based on data, CPython reports a speed increase of $15\text{--}20\%$ , which aligns with my findings.

Comparison Summary

Feature	`switch` Statement	Computed Goto
C Standard	Standard C	GCC Extension
Jump Logic	Centralized jump point	Distributed jump points
Overhead	Includes bounds checking	Minimal overhead
Performance	Baseline	$\approx 25\%$ faster (in tests)

Why is Computed Goto Faster?

The performance gain stems from two primary factors:

1. Reducing Per-Iteration Overhead

The switch statement performs more work per opcode.

Switch Workflow:

Execute operation.
Perform bounds check.
Jump via table based on code[pc].

Computed Goto Workflow:

Execute operation.
pc++.
Jump via table based on code[pc].

~~You might assume the bounds check is only there because of the default case~~, but that is incorrect. The C99 standard mandates this safety:

If no converted case constant expression matches and there is no default label, no part of the switch body is executed.

Because the compiler must guarantee this behavior, it generates a bounds check regardless of whether a default label exists.

2. Branch Prediction

Modern CPUs utilize deep pipelines to maintain efficiency. Branches can disrupt this flow, which is why CPUs use branch predictors.

(The diagram above illustrates the centralized bottleneck of a switch statement)

The computed goto approach allows the CPU to predict the next jump more effectively because the jump occurs at the end of each specific instruction handler rather than at one single central point.

Bonus Material

For the low-level enthusiasts, I have included annotated disassembly of both functions (compiled with -O3) further down in the original post.