80386 Early Start Memory Access

Small Things Retro — Computing and gaming experiments by nand2mario Published: June 23, 2026

When Intel engineered the 80386, they implemented a clever mechanism for hiding memory latency, known as Early Start.

Rather than waiting for an instruction to reach its own memory micro-op, the 386 starts the memory access for the upcoming instruction during the final cycle of the current one. That early work covers the effective address (EA) calculation, segment relocation, and the start of the bus cycle.

Performance Evolution

The z386 FPGA core (released in May) initially utilized original 386 microcode but lacked this Early Start capability. After a month of refinements and additional optimizations, z386 now competes with ao486 performance levels.

Benchmark Comparison

Benchmark	z386 0.1 (May)	z386 0.4 (June)	ao486
Core Doom (FPS)	16.6	23.0	21.0
3DBench (16-bit)	33.7	44.5	43.8
Landmark	147	170	204

Key Takeaways:

Original Doom (max settings) saw a $\approx 39\%$ increase ($16.6 \rightarrow 23.0$), surpassing ao486.
The 16-bit 3DBench now slightly edges out ao486.
Since the board clock remained constant at 85 MHz, these gains are purely the result of reducing the CPI (Cycles Per Instruction).
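
As a quick check of the figures above, the relative gains can be recomputed from the table. This is only a minimal sketch: the dictionary layout and the helper name `speedup` are made up for illustration, and it assumes each score is proportional to execution speed and that the instruction count is unchanged between versions, so that at a fixed clock the score ratio is the inverse of the CPI ratio.

```python
# Scores copied from the benchmark table: (z386 0.1, z386 0.4, ao486).
scores = {
    "Doom (FPS)": (16.6, 23.0, 21.0),
    "3DBench (16-bit)": (33.7, 44.5, 43.8),
    "Landmark": (147, 170, 204),
}


def speedup(old: float, new: float) -> float:
    """Return new/old, i.e. how many times faster the new score is."""
    return new / old


for name, (v01, v04, ao486) in scores.items():
    s = speedup(v01, v04)
    # Assumption: score is proportional to speed, so at a constant clock
    # the effective CPI shrinks by the same factor s.
    print(f"{name}: {100 * (s - 1):.1f}% faster, "
          f"CPI ratio about {1 / s:.2f}, "
          f"{v04 / ao486:.2f}x the ao486 score")
```

Under these assumptions the Doom row reproduces the roughly 39% figure quoted above, and the Landmark row shows where ao486 still leads.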

Optimization Goals

Implement Early Start logic
Reduce per-instruction cycle counts to $\le$ original 80386
Optimize memory pipeline
Tighten store queue latency

Understanding "Early Start"

As detailed in Slager's 1986 ICCD paper, "Performance Optimizations of the 80386", Early Start allows the CPU to overlap the end of one instruction with the beginning of the next.

Consider the microcode for an ALU instruction with a memory operand (the m,r form, e.g., ADD [mem], reg):

; ADD/OR/ADC/SBB/AND/SUB/XOR m,r
04A  EFLAGS - FLAGSB FLGSBA RD        <- Memory read starts here!
04B  DLY
04C  OPR_R - TMPB WRITE_RESULT JMP UNL
04D  TMPB SRCREG +- |^

Crucially, micro-instruction 04A triggers the RD (Read) signal immediately. No prior micro-instruction is used to calculate the effective address or handle segment limits.

Execution Example

Take the following sequence:

add eax, 16
mov ebx, [eax+4]

The overlap works as follows:

Cycle	Instruction	Action
1	`add eax, 16`	ALU computes $EAX + 16$; asserts `RNI` (Run Next Instruction).
2	`add eax, 16`	Write-back: result saved to `EAX`. Early Start window: the CPU peeks at `MOV`, forwards the new `EAX`, computes $EA = EAX + 4$, relocates, and issues `RD`.
3	`mov ebx, [eax+4]`	`019`: `RD` microcode executes.
4	`mov ebx, [eax+4]`	`01A`: `DLY` (waiting for data), then write to `OPR_R`.
5	`mov ebx, [eax+4]`	`01B`: `RNI` asserted.
6	`mov ebx, [eax+4]`	`01C`: `OPR_R` moved to `EBX`.

Hazards and the POPAD Bug

The primary challenge is the data hazard: the previous instruction might be writing to a register that the next instruction needs for its address calculation in the same cycle.

To solve this, a forwarding network is used to ensure the Early Start logic sees the most recent value before it is officially committed to the register file.
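
The effect of the bypass can be illustrated with a small behavioral model of the `add eax, 16` / `mov ebx, [eax+4]` pair. This is a hedged sketch, not the 386's or z386's actual logic: `read_forwarded`, the `pending` tuple, and the starting value of `EAX` are all hypothetical names and numbers chosen for the example.

```python
from typing import Dict, Optional, Tuple

MASK32 = 0xFFFFFFFF


def read_forwarded(regs: Dict[str, int], name: str,
                   pending: Optional[Tuple[str, int]]) -> int:
    """Read a register for Early Start.

    If the finishing instruction is about to write `name`, return that
    not-yet-committed result instead of the register file contents.
    """
    if pending is not None and pending[0] == name:
        return pending[1]   # forwarded value
    return regs[name]       # committed value


regs = {"eax": 0x1000, "ebx": 0}

# add eax, 16 is in its final cycle: the result exists, but the
# register file still holds the old EAX.
pending = ("eax", (regs["eax"] + 16) & MASK32)

# mov ebx, [eax+4]: effective address without and with forwarding.
ea_stale = (regs["eax"] + 4) & MASK32
ea_forwarded = (read_forwarded(regs, "eax", pending) + 4) & MASK32

print(hex(ea_stale), hex(ea_forwarded))
```

Without the bypass, the early address would be computed from the old `EAX` and the load would go to the wrong location; with it, the address matches what a fully sequential execution would produce.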

Warning (the POPAD bug): The 386DX had a flaw in this forwarding network. When a POPAD is immediately followed by an instruction using [EAX+...], the Early Start mechanism forwards an incorrect value.

Implementing Early Start in z386

In z386, instructions move through a lifecycle defined by two key events:

i_pop: The cycle the instruction is fetched from the prefetch queue (this is the RNI delay slot of the previous instruction).
i_first: The first cycle of the instruction's own microcode.

Early Start occurs at i_pop. The effective address and linear address are computed combinationally.
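
For readers less familiar with x86 address generation, the two combinational steps can be modeled roughly as follows. This is a sketch under stated assumptions: it covers only 32-bit addressing and an expand-up segment, and the function names `calc_ea`, `relocate`, and `within_limit` are invented for this example rather than taken from z386.

```python
MASK32 = 0xFFFFFFFF


def calc_ea(base: int, index: int, scale: int, disp: int) -> int:
    """Effective address: base + index*scale + displacement, modulo 2**32."""
    assert scale in (1, 2, 4, 8)
    return (base + index * scale + disp) & MASK32


def relocate(seg_base: int, ea: int) -> int:
    """Segment relocation: linear address = segment base + EA, modulo 2**32."""
    return (seg_base + ea) & MASK32


def within_limit(ea: int, size: int, seg_limit: int) -> bool:
    """Limit check for an expand-up segment: the last byte accessed
    must not lie beyond the segment limit."""
    return ea + size - 1 <= seg_limit


# Example values (arbitrary): [base + index*4 + 4] in a segment based at 0x20000.
ea = calc_ea(base=0x1010, index=2, scale=4, disp=4)
linear = relocate(seg_base=0x0002_0000, ea=ea)
ok = within_limit(ea, size=4, seg_limit=0xFFFF)
print(hex(ea), hex(linear), ok)
```

In hardware these steps form a single chain of adders and comparators evaluated within one cycle, which is why the path becomes timing-critical.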

The Logic

The decoder identifies the base, index, and displacement. The following logic (simplified) handles the bypass:

// Calculate Early Effective Address
wire [31:0] ea_early = calc_ea_core(
    fwd_onehot_gpr(ea_dec_base_sel_r), 
    fwd_onehot_gpr(ea_dec_index_sel_r), 
    ...
);

// Forwarding logic for partial register writes (simplified).
// fwd_kind is a placeholder name for the width of the forwarded write.
case (fwd_kind)
    FWD_BLO: fwd_onehot_gpr = {cur[31:8], dest_value[7:0]};   // byte write to AL
    FWD_W:   fwd_onehot_gpr = {cur[31:16], dest_value[15:0]}; // word write to AX
    default: fwd_onehot_gpr = dest_value;                     // dword write to EAX
endcase

The ea_early value is then stored in ea_reg at i_pop, making it available for the load/store microcode at i_first. This mirrors the 80386's hardwired EA generator and intentionally reproduces the POPAD bug to maintain architectural parity.
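
The three forwarding cases above amount to merging the pending result into the current register value at the right width. A minimal Python model of that merge, assuming only the byte-low, word, and dword cases shown (the `Fwd` enum and `forward_gpr` are illustrative names, not z386 identifiers), might look like this:

```python
from enum import Enum


class Fwd(Enum):
    BLO = "byte write to the low byte, e.g. AL"
    W = "word write, e.g. AX"
    D = "dword write, e.g. EAX"


def forward_gpr(cur: int, dest_value: int, kind: Fwd) -> int:
    """Return the register value Early Start should see.

    cur        -- value currently held in the register file
    dest_value -- result about to be written by the finishing instruction
    kind       -- width of that pending write
    """
    if kind is Fwd.BLO:
        # Keep bits 31..8 of the old value, take bits 7..0 of the new one.
        return (cur & 0xFFFFFF00) | (dest_value & 0xFF)
    if kind is Fwd.W:
        # Keep bits 31..16 of the old value, take bits 15..0 of the new one.
        return (cur & 0xFFFF0000) | (dest_value & 0xFFFF)
    # Full 32-bit write: the new value replaces the old one entirely.
    return dest_value & 0xFFFFFFFF


cur, new = 0x12345678, 0xAABBCCDD
for kind in Fwd:
    print(kind.name, hex(forward_gpr(cur, new, kind)))
```

The partial cases matter because an instruction such as a byte move into `AL` changes only part of a register that the next instruction may use whole as an address base.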

Timing Challenges

The path from the forwarded GPR $\rightarrow$ Effective Address $\rightarrow$ Segment Relocation $\rightarrow$ Linear Address is a timing hotspot. It took multiple iterations to make this logic fit within the 85 MHz clock budget (a period of roughly 11.8 ns).

Further Memory Optimizations

While Early Start provided a $\approx 9\%$ boost, reaching a $>30\%$ improvement required shortening the memory pipeline, specifically by tightening the store queue.

Typically, CPUs use a store queue to buffer writes to memory to avoid stalling. z386 had a 3-entry queue, but the interface was too conservative, wasting a cycle. By releasing the DLY (delay) micro-op earlier...