A Decade of ClickHouse in the Open Source Ecosystem

On June 15, 2016, ClickHouse was officially released to the open-source community. Fast forward ten years, and it has evolved into the premier open-source analytical database, boasting a community of over 2,000 contributors.

The Philosophy of "Building in the Open"

Not all open-source projects are created equal. There is a spectrum of transparency and collaboration that can be categorized into distinct levels:

Level	Designation	Description	Examples
0	Read-Only	Code is public for archival purposes, but stagnant.	`Doom`, `MS-DOS`
1	Public Stream	Updates occur in public repos, but external help isn't sought.	Various proprietary-led projects
2	Passive Acceptance	Contributions are taken, but the process is opaque.	Semi-open projects
3	Fully Transparent	Open guidelines, trackers, roadmaps, CI, and support.	ClickHouse

ClickHouse strives to be the gold standard for Level 3. If you aspire to build a new database, the ClickHouse codebase and its operational workflows can serve as a blueprint.

"I always write the code so everyone can learn from it—by keeping it modular, orthogonal, and well-documented."

To ensure accessibility, complex concepts are explained within the comments from the ground up. The goal is to eliminate the need for developers to constantly pivot to Wikipedia, textbooks, or AI tools.

The Frontier of Software Engineering

ClickHouse is more than a database; it is a living laboratory for C++ development. It covers both the cutting-edge and the fundamental:

The Exciting: Implementation of C++23 features.
The Essential: Build systems and rigorous CI/CD.
The Disciplined: Strict code review practices.
The Modern: Integration of AI.

It serves as a playground for performance optimization. Developers are encouraged to submit Pull Requests as experiments, even if they aren't intended for merging. Whether it's a new hash table, a novel compression library, or a unique memory allocator, the project provides a high-scrutiny environment to test these ideas.

The roadmap isn't just for production features; it explicitly includes sections for ~~boring~~ experimental, weird, and ridiculous ideas.

Recognition and Collaboration

ClickHouse believes in crediting every hand that touches the code. Recognition is found in:

The public changelog.
The built-in `system.contributors` table.

The team proactively helps contributors finish incomplete features. Even if a total rewrite is necessary, the original author is credited because the intent and the use case are what truly drive innovation.

The Pre-Open Source Era

Prototypes and Early Commits

The journey began on May 29, 2009. The very first commit wasn't a feature, but a performance fix. Alexey was frustrated by the slowness of standard libc functions: `localtime`, `mktime`, and `gmtime`.

At the time, ClickHouse was a side experiment during the development of a web analytics platform (similar to Google Analytics). The original architecture looked like this:

MySQL: Stored pre-aggregated reports.
C++: Handled data processing and custom structures where MySQL failed.

The Struggle with Scale

The primary challenge was the relentless growth of data. The system faced a strict real-time constraint: $\text{Processing Time} \le \text{Data Arrival Window (5 minutes)}$
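To see why this inequality matters, here is a minimal sketch of how a backlog builds up once a batch takes longer than its window. The function name `simulate_lag` and the per-batch timings are made up for illustration; only the five-minute window comes from the text.

```python
def simulate_lag(processing_minutes, window_minutes=5.0):
    """Return the backlog (in minutes) after each batch has been processed.

    A new batch arrives every `window_minutes`; processing batch i takes
    `processing_minutes[i]`.
    """
    lag = 0.0
    history = []
    for cost in processing_minutes:
        # An overrun adds to the backlog, spare time pays it down,
        # and the backlog can never drop below zero.
        lag = max(0.0, lag + cost - window_minutes)
        history.append(lag)
    return history


# Hypothetical timings: three slow batches followed by three fast ones.
print(simulate_lag([6, 6, 6, 4, 4, 4]))  # [1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
```

Three batches that each overrun by one minute leave a three-minute backlog, and it takes three equally large savings to clear it again.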

If the processing lagged, a delay accumulated, forcing a frantic search for creative solutions—often deployed on the same day. This era was defined by rapid experimentation:

Testing LZO and QuickLZ for compression.
Experimenting with storing HyperLogLog structures inside MySQL BLOB fields.
Studying event-loop servers over weekends.
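For readers unfamiliar with HyperLogLog, the sketch below shows why it fits in a BLOB at all: the whole structure is a small, fixed-size array of one-byte registers that can be merged by taking the maximum of each pair. This is a simplified illustration, not the code from that era. The function names, the register count (`P = 10`), and the use of SHA-1 as the hash are all assumptions, and no MySQL calls are involved.

```python
import hashlib
import math

P = 10          # assumed precision: 2**10 registers
M = 1 << P      # number of registers, one byte each


def new_sketch():
    return bytearray(M)


def add(sketch, item):
    # Take 64 bits of a hash; the top P bits select a register.
    h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    # Rank = position of the leftmost 1-bit in the remaining 54 bits.
    rank = (64 - P) - rest.bit_length() + 1
    sketch[idx] = max(sketch[idx], rank)


def merge(a, b):
    # The union of two sets is the register-wise maximum.
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        return M * math.log(M / zeros)
    return raw


visitors = new_sketch()
for i in range(10000):
    add(visitors, f"user-{i}")

blob = bytes(visitors)   # 1024 bytes, ready for a binary column
print(len(blob), round(estimate(bytearray(blob))))
```

With these settings the estimate typically lands within a few percent of the true count, while the stored value stays at 1024 bytes no matter how many items were added.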

While stabilizing the pipeline, Alexey explored product enhancements, such as:

Heat maps for page clicks.
Click maps using DOM positions.
3D Click Maps (created for April Fools' using Flash and anaglyph colors).

The Pivot to On-the-Fly Aggregation

The ultimate goal shifted from pre-aggregated reports to on-the-fly aggregation of structured logs. The idea was to store raw data and aggregate it while the user waited for the page to load.
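The difference is easiest to see on a toy example. In the sketch below, the records, the column names, and the helper `count_by` are all hypothetical; the point is only that grouping and filtering are chosen when the question is asked, rather than baked into a report in advance.

```python
from collections import Counter

# Hypothetical raw log: one record per page view, kept unaggregated.
EVENTS = [
    {"date": "2009-05-29", "site_id": 1, "page_id": 10, "browser": "Firefox"},
    {"date": "2009-05-29", "site_id": 1, "page_id": 10, "browser": "Opera"},
    {"date": "2009-05-29", "site_id": 1, "page_id": 11, "browser": "Firefox"},
    {"date": "2009-05-30", "site_id": 2, "page_id": 20, "browser": "Firefox"},
]


def count_by(rows, group_by, where=lambda row: True):
    """Count rows per group at query time by scanning the raw records."""
    counts = Counter()
    for row in rows:
        if where(row):
            counts[tuple(row[col] for col in group_by)] += 1
    return dict(counts)


# Two different questions answered from the same raw data.
print(count_by(EVENTS, ["page_id"], lambda r: r["date"] == "2009-05-29"))
# {(10,): 2, (11,): 1}
print(count_by(EVENTS, ["browser"]))
# {('Firefox',): 3, ('Opera',): 1}
```

A pre-aggregated report could answer only the first question if "views per page per day" had been planned ahead; the second would need a new report and a reprocessing run.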

Several existing solutions were tested but failed to handle the massive load of $100 \times 10^9$ records per day across 500 columns:

Extensions: Infobright, InfiniDB.
Standalone DBs: Vertica, MonetDB, LucidDB.

The First Custom Prototype

Eventually, a custom prototype was built with the following specs:

Storage: One binary file per column, per day, per website (requiring XFS to handle billions of files).
Compression: Lightweight.
Update Cycle: Once daily with a few hours of lag.
Interface: An XML-based API for filters, grouping, and sorting.
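The storage layout in the specs above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the directory scheme (`site/day/column.bin`), the 64-bit integer encoding, and the helper names. The original on-disk format is not described in the text.

```python
import array
import tempfile
from pathlib import Path


def column_path(root, site_id, day, column):
    # One file per (website, day, column); the exact scheme is assumed.
    return Path(root) / str(site_id) / day / f"{column}.bin"


def write_column(root, site_id, day, column, values):
    path = column_path(root, site_id, day, column)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Store the values as raw 64-bit signed integers.
    path.write_bytes(array.array("q", values).tobytes())


def read_column(root, site_id, day, column):
    data = array.array("q")
    data.frombytes(column_path(root, site_id, day, column).read_bytes())
    return data.tolist()


with tempfile.TemporaryDirectory() as root:
    write_column(root, 42, "2009-05-29", "page_id", [10, 10, 11])
    print(read_column(root, 42, "2009-05-29", "page_id"))  # [10, 10, 11]
```

A query that touches two columns reads only two files, which is the appeal of the design. The cost is that the file count grows as websites × days × columns, which is how the total reaches the billions mentioned above.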

<!-- Conceptual example of the early XML query API -->
<query>
    <columns>
        <column>site_id</column>
        <column>page_id</column>
    </columns>
    <aggregate>count</aggregate>
    <filter>date = '2009-05-29'</filter>
</query>
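As a rough idea of what a server would do with such a request, the sketch below parses the conceptual query above into a plain dictionary using Python's standard `xml.etree.ElementTree`. The element names come from the example; the function `parse_query`, the output keys, and the restriction to a single `column = 'value'` filter are assumptions.

```python
import xml.etree.ElementTree as ET

QUERY_XML = """
<query>
    <columns>
        <column>site_id</column>
        <column>page_id</column>
    </columns>
    <aggregate>count</aggregate>
    <filter>date = '2009-05-29'</filter>
</query>
"""


def parse_query(xml_text):
    """Turn the conceptual XML query into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    # Only the simple "column = 'value'" filter form is handled here.
    column, _, value = root.findtext("filter", default="").partition("=")
    return {
        "group_by": [c.text.strip() for c in root.findall("columns/column")],
        "aggregate": root.findtext("aggregate", default="").strip(),
        "filter": (column.strip(), value.strip().strip("'")),
    }


print(parse_query(QUERY_XML))
# {'group_by': ['site_id', 'page_id'], 'aggregate': 'count', 'filter': ('date', '2009-05-29')}
```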

The most grueling task of this phase was the "unaggregation" process—extracting historical data from MySQL to populate this new structured format.
