Ten years of ClickHouse in open source
A Decade of ClickHouse in the Open Source Ecosystem
On June 15, 2016, ClickHouse was officially released to the open-source community. Fast forward ten years, and it has evolved into the premier open-source analytical database, boasting a community of over 2,000 contributors.
The Philosophy of "Building in the Open"
Not all open-source projects are created equal. There is a spectrum of transparency and collaboration that can be categorized into distinct levels:
| Level | Designation | Description | Examples |
|---|---|---|---|
| 0 | Read-Only | Code is public for archival purposes, but stagnant. | Doom, MS-DOS |
| 1 | Public Stream | Updates occur in public repos, but external help isn't sought. | Various proprietary-led projects |
| 2 | Passive Acceptance | Contributions are taken, but the process is opaque. | Semi-open projects |
| 3 | Fully Transparent | Open guidelines, trackers, roadmaps, CI, and support. | ClickHouse |
ClickHouse strives to be the gold standard for Level 3. If you are aspiring to architect a new database, the ClickHouse codebase and its operational workflows serve as a primary blueprint.
"I always write the code so everyone can learn from it—by keeping it modular, orthogonal, and well-documented."
To ensure accessibility, complex concepts are explained within the comments from the ground up. The goal is to eliminate the need for developers to constantly pivot to Wikipedia, textbooks, or AI tools.
The Frontier of Software Engineering
ClickHouse is more than a database; it is a living laboratory for C++ development. It covers both the cutting-edge and the fundamental:
- The Exciting: Implementation of C++23 features.
- The Essential: Build systems and rigorous CI/CD.
- The Disciplined: Strict code review practices.
- The Modern: Integration of AI.
It serves as a playground for performance optimization. Developers are encouraged to submit pull requests as experiments, even if they aren't intended for merging. Whatever the idea, the project provides a high-scrutiny environment to test it.
The roadmap isn't just for production features; it explicitly includes sections for boring, experimental, weird, and ridiculous ideas.
Recognition and Collaboration
ClickHouse believes in crediting every hand that touches the code. Recognition is found in:
- The public changelog.
- The internal `system.contributors` table.
The team proactively helps contributors finish incomplete features. Even if a total rewrite is necessary, the original author is credited because the intent and the use case are what truly drive innovation.
The Pre-Open Source Era
Prototypes and Early Commits
The journey began on May 29, 2009. The very first commit wasn't a feature, but a performance fix. Alexey was frustrated by the slowness of standard libc functions: `localtime`, `mktime`, and `gmtime`.
At the time, ClickHouse was a side experiment during the development of a web analytics platform (similar to Google Analytics). The original architecture looked like this:
- MySQL: Stored pre-aggregated reports.
- C++: Handled data processing and custom structures where MySQL failed.
The Struggle with Scale
The primary challenge was the relentless growth of data. The system faced a strict real-time constraint: if processing lagged, a delay accumulated, forcing a frantic search for creative solutions, often deployed on the same day.
This era was defined by rapid experimentation:
- Testing `LZO` and `QuickLZ` for compression.
- Experimenting with storing `HyperLogLog` structures inside MySQL `BLOB` fields.
- Studying event-loop servers over weekends.
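To make the HyperLogLog experiment concrete, here is a minimal sketch of a cardinality estimator whose state is a flat byte string, which is what makes it storable in a `BLOB` field. It is not the original implementation: the class name, the SHA-1 hash, the precision of 12 bits, and the one-byte header are all assumptions chosen for illustration, and the database write itself is left out.

```python
import hashlib
import math
from typing import Optional


class HyperLogLog:
    """Illustrative HyperLogLog: 2**p one-byte registers (hypothetical layout)."""

    def __init__(self, p: int = 12, registers: Optional[bytes] = None):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(registers) if registers is not None else bytearray(self.m)
        if len(self.registers) != self.m:
            raise ValueError("register count must be 2**p")

    def add(self, value: str) -> None:
        # Take a 64-bit hash: the top p bits pick a register,
        # the remaining bits determine the rank.
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        width = 64 - self.p
        index = h >> width
        rest = h & ((1 << width) - 1)
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLog") -> None:
        # The union of two sketches is the element-wise maximum of registers.
        if other.p != self.p:
            raise ValueError("precision mismatch")
        for i, rank in enumerate(other.registers):
            if rank > self.registers[i]:
                self.registers[i] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw

    def to_blob(self) -> bytes:
        # One header byte for the precision, then the raw registers.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_blob(cls, blob: bytes) -> "HyperLogLog":
        return cls(p=blob[0], registers=blob[1:])


sketch = HyperLogLog()
for i in range(10_000):
    sketch.add(f"visitor-{i}")

blob = sketch.to_blob()                  # bytes, ready to bind to a BLOB column
restored = HyperLogLog.from_blob(blob)   # what a later report query would load
```

The appeal of this shape is that two stored sketches can be merged without revisiting the raw events, so unique-visitor counts for a week can be assembled from seven daily blobs.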
While stabilizing the pipeline, Alexey explored product enhancements, such as:
- Heat maps for page clicks.
- Click maps using DOM positions.
- 3D Click Maps (created for April Fools' using Flash and anaglyph colors).
The Pivot to On-the-Fly Aggregation
The ultimate goal shifted from pre-aggregated reports to on-the-fly aggregation of structured logs. The idea was to store raw data and aggregate it while the user waited for the page to load.
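The difference between the two approaches can be shown with a toy example. The event fields and function names below are hypothetical; the point is only that a pre-aggregated report fixes its shape in advance, while a scan over raw rows can answer a question nobody planned for.

```python
from collections import Counter

# Hypothetical raw event log: one record per page view.
EVENTS = [
    {"site_id": 1, "page_id": 10, "browser": "firefox", "date": "2009-05-29"},
    {"site_id": 1, "page_id": 11, "browser": "opera", "date": "2009-05-29"},
    {"site_id": 1, "page_id": 10, "browser": "firefox", "date": "2009-05-30"},
    {"site_id": 2, "page_id": 20, "browser": "firefox", "date": "2009-05-29"},
    {"site_id": 2, "page_id": 20, "browser": "opera", "date": "2009-05-29"},
]


def preaggregate(events):
    """Pre-aggregation: the report shape (views per site per day) is fixed up front."""
    return Counter((e["site_id"], e["date"]) for e in events)


def aggregate_on_the_fly(events, group_by, where=lambda e: True):
    """On-the-fly: scan the raw rows at query time with any filter and grouping."""
    return Counter(tuple(e[c] for c in group_by) for e in events if where(e))


report = preaggregate(EVENTS)

# A question the pre-aggregated report cannot answer: views per browser on one day.
by_browser = aggregate_on_the_fly(
    EVENTS,
    group_by=["browser"],
    where=lambda e: e["date"] == "2009-05-29",
)
```

The cost of the flexibility is that every query touches raw data, which is why scan speed became the central concern.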
Several existing solutions were tested but failed to handle the massive load of records per day across 500 columns:
- Extensions: Infobright, InfiniDB.
- Standalone DBs: Vertica, MonetDB, LucidDB.
The First Custom Prototype
Eventually, a custom prototype was built with the following specs:
- Storage: One binary file per column, per day, per website (requiring `XFS` to handle billions of files).
- Compression: Lightweight.
- Update Cycle: Once daily with a few hours of lag.
- Interface: An XML-based API for filters, grouping, and sorting.
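The storage layout in the list above can be sketched as follows. The directory scheme, the `.bin` suffix, and the choice of 64-bit unsigned integers are assumptions made for illustration; the source only states that there was one binary file per column, per day, per website.

```python
import array
import tempfile
from pathlib import Path


def column_path(root: Path, site_id: int, day: str, column: str) -> Path:
    # Hypothetical layout: <root>/<site_id>/<day>/<column>.bin
    return root / str(site_id) / day / f"{column}.bin"


def write_column(root: Path, site_id: int, day: str, column: str, values) -> Path:
    path = column_path(root, site_id, day, column)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        # Fixed-width values back to back, no header: a plain column file.
        array.array("Q", values).tofile(f)
    return path


def read_column(root: Path, site_id: int, day: str, column: str) -> list:
    values = array.array("Q")
    with column_path(root, site_id, day, column).open("rb") as f:
        values.frombytes(f.read())
    return list(values)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    written = write_column(root, 42, "2009-05-29", "page_id", [10, 11, 10])
    page_ids = read_column(root, 42, "2009-05-29", "page_id")
    relative = written.relative_to(root).as_posix()
    size = written.stat().st_size
```

With this kind of layout the file count is the product of columns, days, and websites, which is how the total reaches into the billions and why the filesystem choice mattered.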
```xml
<!-- Conceptual example of the early XML query API -->
<query>
  <columns>
    <column>site_id</column>
    <column>page_id</column>
  </columns>
  <aggregate>count</aggregate>
  <filter>date = '2009-05-29'</filter>
</query>
```
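As a rough illustration of how a query like this might be evaluated, the sketch below parses the conceptual XML and runs it over a few in-memory rows. The element names come from the example above; the sample rows, the `run_query` function, and the restriction to `count` with a single `column = 'value'` filter are assumptions, not a description of the original API.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

QUERY = """
<query>
  <columns>
    <column>site_id</column>
    <column>page_id</column>
  </columns>
  <aggregate>count</aggregate>
  <filter>date = '2009-05-29'</filter>
</query>
"""

# Hypothetical sample data.
ROWS = [
    {"site_id": 1, "page_id": 10, "date": "2009-05-29"},
    {"site_id": 1, "page_id": 10, "date": "2009-05-29"},
    {"site_id": 1, "page_id": 11, "date": "2009-05-29"},
    {"site_id": 1, "page_id": 10, "date": "2009-05-30"},
]


def run_query(xml_text: str, rows: list) -> dict:
    root = ET.fromstring(xml_text.strip())
    group_by = [c.text.strip() for c in root.findall("./columns/column")]

    aggregate = root.findtext("aggregate", default="count").strip()
    if aggregate != "count":
        raise ValueError(f"unsupported aggregate: {aggregate}")

    # Only a single equality filter of the form: column = 'value'
    filter_text = root.findtext("filter")
    condition = None
    if filter_text is not None:
        match = re.fullmatch(r"\s*(\w+)\s*=\s*'([^']*)'\s*", filter_text)
        if match is None:
            raise ValueError(f"unsupported filter: {filter_text}")
        condition = (match.group(1), match.group(2))

    counts = Counter()
    for row in rows:
        if condition and str(row[condition[0]]) != condition[1]:
            continue
        counts[tuple(row[c] for c in group_by)] += 1
    return dict(counts)


result = run_query(QUERY, ROWS)
```

Here `result` maps each `(site_id, page_id)` pair to the number of matching rows for the filtered day.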
The most grueling task of this phase was the "unaggregation" process—extracting historical data from MySQL to populate this new structured format.