⚠️ Critical Bug: Codex Logging May Destroy Local SSDs

Reported by: @1996fanrui
Issue Reference: openai/codex #28224
Labels: CLI, bug, performance

🚨 The Problem: Excessive Disk I/O

The Codex application is currently suffering from a severe logging bug where it continuously writes massive amounts of data to a local SQLite feedback database. The affected files are located at:

~/.codex/logs_2.sqlite
~/.codex/logs_2.sqlite-wal
~/.codex/logs_2.sqlite-shm

📉 Impact on Hardware Endurance

The volume of data being written is unsustainable for consumer-grade hardware. Based on real-world observation, a machine with 21 days of uptime recorded approximately 37 TB of writes to the primary SSD.

Extrapolating that rate to a full year: $\text{Annual Write Volume} = \left( \frac{37\text{ TB}}{21\text{ days}} \right) \times 365\text{ days} \approx 643\text{ TB/year}$

Warning: Many consumer SSDs are rated for roughly $600\text{ TBW}$ (terabytes written). At this rate, the software could exhaust the drive's warranted write endurance in about 340 days, less than one year.
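The extrapolation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this report. It assumes the observed write rate stays constant and that the full 37 TB is attributable to the logging behavior, which the report suggests but does not isolate. The function and variable names are illustrative.

```python
# Figures quoted in the report above.
OBSERVED_TB = 37.0       # TB written to the SSD during the observation window
UPTIME_DAYS = 21.0       # length of the observation window in days
ENDURANCE_TBW = 600.0    # rough consumer SSD endurance rating cited above


def daily_write_rate(total_tb: float, days: float) -> float:
    """Average TB written per day over the observation window."""
    return total_tb / days


def days_to_exhaust(endurance_tb: float, total_tb: float, days: float) -> float:
    """Days until the endurance rating is reached at a constant write rate."""
    return endurance_tb / daily_write_rate(total_tb, days)


annual_tb = daily_write_rate(OBSERVED_TB, UPTIME_DAYS) * 365
exhaust_days = days_to_exhaust(ENDURANCE_TBW, OBSERVED_TB, UPTIME_DAYS)

print(f"Projected writes: {annual_tb:.1f} TB/year")
print(f"Time to reach {ENDURANCE_TBW:.0f} TBW: {exhaust_days:.0f} days")
```

Swapping in the TBW rating of a specific drive gives a rough exhaustion estimate for that model.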

🔍 Evidence Analysis

Evidence 1: The "Churn" Gap

While the database file size remains relatively small, the internal counters reveal a massive amount of churn (data being written and then deleted).

| Metric | Value |
| --- | --- |
| Current File Size | `1.2 GiB` |
| Currently Retained Rows | `506,149` |
| Total Allocated Row IDs | `5,543,677,486` |

The number of allocated row IDs is roughly 11,000 times the number of rows currently stored. If the pruned rows were about the same size as the retained ones (about 2.5 KiB each, from 1.2 GiB spread across 506,149 rows), more than 10 TB of data has been cycled through the logs, even before considering write amplification from indexes, page rewrites, and filesystem overhead.
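The churn estimate can be checked from the three numbers in the table. This sketch rests on two assumptions that the report does not verify: row IDs are allocated sequentially, so the highest ID approximates the total number of rows ever inserted, and deleted rows had the same average on-disk footprint as the rows still retained.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Values from the table above.
file_size_bytes = 1.2 * GIB
retained_rows = 506_149
allocated_row_ids = 5_543_677_486

# How many rows were ever inserted for each row that still exists.
churn_ratio = allocated_row_ids / retained_rows

# Average on-disk footprint per retained row (includes index and page overhead).
avg_row_bytes = file_size_bytes / retained_rows

# Assumption: pruned rows were the same size as retained rows on average.
estimated_total_bytes = avg_row_bytes * allocated_row_ids
estimated_tib = estimated_total_bytes / TIB

print(f"Churn ratio: {churn_ratio:,.0f}x")
print(f"Average row footprint: {avg_row_bytes:,.0f} bytes")
print(f"Estimated data cycled: {estimated_tib:.1f} TiB")
```

Because this is a lower bound on logical data only, the physical writes reaching the SSD are likely to be higher.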

Evidence 2: Log Level Distribution

The bulk of the write volume is driven by low-priority telemetry.

Distribution by Level:

TRACE: 70.7% (~732.5 MiB)
INFO: 25.7% (~266.5 MiB)
DEBUG: 3.0% (~30.6 MiB)
WARN: 0.6% (~5.9 MiB)

Primary Offenders (Target + Level):

codex_api::endpoint::responses_websocket (TRACE) $\rightarrow$ 527.4 MiB
codex_otel.log_only (INFO) $\rightarrow$ 141.2 MiB
codex_otel.trace_safe (INFO) $\rightarrow$ 121.2 MiB
log (TRACE) $\rightarrow$ 97.4 MiB
codex_client::transport (TRACE) $\rightarrow$ 60.1 MiB

Note: Filtering out TRACE logs and the specific OpenTelemetry INFO categories would eliminate approximately 96% of the log volume.
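The 96% figure follows directly from the per-level and per-target sizes listed above. A short sketch of the arithmetic, using only those reported values:

```python
# Retained log volume by level, in MiB (from the distribution above).
level_mib = {"TRACE": 732.5, "INFO": 266.5, "DEBUG": 30.6, "WARN": 5.9}

# The two OpenTelemetry INFO targets named in the note, in MiB.
otel_info_mib = {
    "codex_otel.log_only": 141.2,
    "codex_otel.trace_safe": 121.2,
}

total_mib = sum(level_mib.values())
removed_mib = level_mib["TRACE"] + sum(otel_info_mib.values())
removed_fraction = removed_mib / total_mib

print(f"Total volume:   {total_mib:.1f} MiB")
print(f"Filtered out:   {removed_mib:.1f} MiB")
print(f"Reduction:      {removed_fraction:.1%}")
```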

📝 Log Sample Analysis

High-Frequency `TRACE` Logs

These logs capture repetitive system events and WebSocket internals.

// Inotify events (extremely frequent)
mask: OPEN, name: Some("ld.so.cache") 37,982x TRACE log: inotify event: ...
mask: OPEN, name: Some("locale.alias") 23,843x TRACE log: inotify event: ...
mask: OPEN, name: Some("passwd") 3,639x TRACE log: inotify event: ...

// WebSocket/Tokio internals
tokio-tungstenite checkout /src/compat.rs:131 AllowStd.with_context 3,505x TRACE log: ...
tokio-tungstenite checkout /src/lib.rs:245 WebSocketStream.with_context 3,362x TRACE log: ...
tokio-tungstenite checkout /src/compat.rs:154 Read.read 3,356x TRACE log: ...

Dominant `INFO` Logs

These consist primarily of mirrored OpenTelemetry events.

843x INFO codex_client::custom_ca: using system root certificates...
334x INFO codex_otel.trace_safe: session_loop{thread_id= redacted }:submission_dispatch...
333x INFO codex_otel.log_only: session_loop{thread_id= redacted }:submission_dispatch...

⚡ Write Amplification

The actual disk pressure is higher than the "retained" database size suggests. In a brief 15-second window, the following was observed:

| Metric | Before | After | Delta |
| --- | --- | --- | --- |
| Retained Rows | 681,774 | 681,774 | 0 |
| Max Row ID | 5,003,347,015 | 5,003,383,226 | +36,211 |

This shows that the system wrote about 36,000 rows in 15 seconds (roughly 2,400 rows per second) while pruning at the same rate, so the retained row count stayed flat and the database size remained stable.
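A measurement like the one above can be taken by sampling the row count and the highest row ID twice and comparing the results. The sketch below demonstrates the idea against an in-memory SQLite database. The table name `logs` and its columns are hypothetical, since this report does not state the schema of `logs_2.sqlite`, and the approach assumes row IDs only ever increase.

```python
import sqlite3


def sample(conn: sqlite3.Connection, table: str) -> tuple[int, int]:
    """Return (retained_rows, max_rowid) for the given table."""
    row = conn.execute(
        f'SELECT COUNT(*), COALESCE(MAX(rowid), 0) FROM "{table}"'
    ).fetchone()
    return row[0], row[1]


def insert_rate(before: tuple[int, int], after: tuple[int, int], seconds: float) -> float:
    """Rows inserted per second, inferred from growth of the max row ID."""
    return (after[1] - before[1]) / seconds


# Demo: a hypothetical table that is written to and pruned at the same rate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)")
conn.executemany("INSERT INTO logs (body) VALUES (?)", [("x",)] * 100)
before = sample(conn, "logs")

conn.executemany("INSERT INTO logs (body) VALUES (?)", [("y",)] * 50)
conn.execute("DELETE FROM logs WHERE id <= 50")  # prune the 50 oldest rows
after = sample(conn, "logs")

print(before, after)  # retained count is unchanged, max row ID has grown

# The same calculation applied to the figures reported above.
reported_rate = insert_rate((681_774, 5_003_347_015), (681_774, 5_003_383_226), 15)
print(f"Reported insert rate: {reported_rate:,.0f} rows/s")
```

A flat retained count alongside a growing maximum row ID is the signature of churn: the file size looks healthy while the write volume is not.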

✅ Proposed Resolution Path

Disable TRACE level logging by default.
Filter out high-frequency inotify and tokio-tungstenite events.
Reduce the verbosity of codex_otel mirror logs.
Implement a more efficient log rotation or sampling strategy to reduce SSD wear.
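The first three steps amount to a level threshold plus a target deny-list applied before a record reaches the database. The sketch below illustrates that logic in Python purely for clarity. It is not the Codex implementation, and every name in it (`should_persist`, `MIN_LEVEL`, `MUTED_INFO_TARGETS`) is hypothetical. It also mutes the two `codex_otel` INFO targets entirely, which is a stricter choice than "reduce the verbosity".

```python
# Severity ordering, lowest to highest.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}

# Hypothetical default threshold: anything below DEBUG (i.e. TRACE) is dropped.
MIN_LEVEL = "DEBUG"

# Hypothetical deny-list: INFO records from these targets are not persisted.
MUTED_INFO_TARGETS = ("codex_otel.log_only", "codex_otel.trace_safe")


def should_persist(level: str, target: str) -> bool:
    """Decide whether a log record should be written to the database."""
    if LEVELS[level] < LEVELS[MIN_LEVEL]:
        return False
    if level == "INFO" and target.startswith(MUTED_INFO_TARGETS):
        return False
    return True


records = [
    ("TRACE", "codex_api::endpoint::responses_websocket"),
    ("TRACE", "log"),
    ("INFO", "codex_otel.log_only"),
    ("INFO", "codex_client::custom_ca"),
    ("WARN", "codex_otel.trace_safe"),
]
kept = [r for r in records if should_persist(*r)]
print(kept)
```

Dropping TRACE by threshold also covers the inotify and tokio-tungstenite events sampled earlier, since they are logged at that level.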