F3: A Next-Generation Open-Source Data File Format
F3 is a data file format designed for extensibility, interoperability, and efficiency. It aims to address structural limitations of previous-generation formats such as Apache Parquet, and to remain "future-proof" by embedding WebAssembly (Wasm) decoders in the file itself.
> [!WARNING]
> This repository serves as a research prototype intended to validate the theoretical frameworks presented in the associated academic paper.
🚀 Core Philosophy & Architecture
Modern data analytics relies heavily on columnar storage. However, formats like Parquet and ORC were designed over a decade ago. The gap between their original design and current hardware and workload requirements often necessitates complete rewrites.
F3 addresses this by implementing a flexible data organization and a general-purpose API. The fundamental logic can be represented as:
[Diagram: System Workflow]
By embedding Wasm binaries (which only occupy a few kilobytes), F3 ensures that any platform can decode the data even if a native decoder is missing.
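The fallback idea described above can be sketched in a few lines of Rust. This is a minimal illustration, not the prototype's actual API: `EncodingId`, `EncodedChunk`, `DecodePlan`, `Reader`, and `plain_decode` are hypothetical names, and the sketch assumes a reader prefers a native decoder when it has one and otherwise falls back to the Wasm module shipped in the file. Executing the Wasm module would require a Wasm runtime, which is deliberately left out here.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for an encoding scheme (not taken from the F3 format).
type EncodingId = u32;

/// A native decoder: maps encoded bytes to decoded bytes.
type NativeDecoder = fn(&[u8]) -> Vec<u8>;

/// Hypothetical view of one encoded chunk as read from a file.
struct EncodedChunk<'a> {
    encoding: EncodingId,
    payload: &'a [u8],
    /// Wasm decoder binary embedded alongside the data, if any.
    embedded_wasm: Option<&'a [u8]>,
}

/// The decoding path a reader would take for a chunk.
#[derive(Debug, PartialEq)]
enum DecodePlan {
    /// A native decoder for this encoding is registered.
    Native,
    /// No native decoder; use the embedded Wasm module of the given size.
    EmbeddedWasm { module_len: usize },
    /// Neither a native decoder nor an embedded module is available.
    Unsupported,
}

/// Hypothetical reader holding the native decoders known to this platform.
struct Reader {
    native: HashMap<EncodingId, NativeDecoder>,
}

impl Reader {
    /// Assumed policy: prefer native, otherwise fall back to embedded Wasm.
    fn plan(&self, chunk: &EncodedChunk<'_>) -> DecodePlan {
        if self.native.contains_key(&chunk.encoding) {
            DecodePlan::Native
        } else if let Some(wasm) = chunk.embedded_wasm {
            DecodePlan::EmbeddedWasm { module_len: wasm.len() }
        } else {
            DecodePlan::Unsupported
        }
    }

    /// Decodes with the native decoder, or returns `None` if none is registered.
    fn decode_native(&self, chunk: &EncodedChunk<'_>) -> Option<Vec<u8>> {
        self.native
            .get(&chunk.encoding)
            .map(|decode| decode(chunk.payload))
    }
}

/// Toy "encoding" for demonstration only: the payload is already plain bytes.
fn plain_decode(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}
```

A reader built with `plain_decode` registered under one encoding ID would plan `Native` for chunks using that ID, and `EmbeddedWasm` for chunks that use an unknown ID but carry a module. A chunk with an unknown encoding and no embedded module is the only case reported as `Unsupported`.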
🛠️ Technical Implementation
Build & Installation
The project has been verified on Debian 12 using Intel architecture. To get the prototype running, follow these steps:
- Initialize submodules: `git submodule update --init --recursive`
- Prepare the environment: `./scripts/setup_debian.sh`
- Compile the PoC: `cargo build -p fff-poc`
- Execute unit tests: `cargo test -p fff-poc`
Repository Map
The project is organized into several key directories:
| Directory | Purpose |
|---|---|
| `format` | Contains the FlatBuffers definitions for the file structure. |
| `fff-bench` | Houses the micro and end-to-end experiments used in the paper. |
| `fff-ude*` | Implementation of User-Defined-Encoding (UDE) via Wasm. |
| `scripts` / `exp_scripts` | Automation tools for running experimental benchmarks. |
| `fff-core` / `fff-encoding` | Core logic and encoding schemes. |
📊 Project Statistics
Language Distribution
The codebase is primarily written in Rust to ensure memory safety and performance.
| Language | Percentage |
|---|---|
| Rust | 67.5% |
| WebAssembly | 25.3% |
| C++ | 4.2% |
| Shell | 1.7% |
| Python | 1.3% |
Key Contributors
- Xinyu Zeng
- Ruijun Meng
- Ruihang Xia
🎓 Academic Reference
This project is associated with a paper presented at SIGMOD 2026.
Abstract
Columnar storage is the bedrock of modern analytics. While open formats allow for data sharing, legacy specifications often struggle with modern hardware. F3 introduces a "future-proof" approach, utilizing a general-purpose API and embedded Wasm decoders to eliminate the need for constant format migrations. Evaluations show that F3's layout and Wasm-driven decoding offer significant advantages over state-of-the-art legacy formats.
Citation
If you utilize this research, please use the following BibTeX entry:
@article{zeng2025f3,
  author = {Zeng, Xinyu and Meng, Ruijun and Prammer, Martin and McKinney, Wes and Patel, Jignesh M. and Pavlo, Andrew and Zhang, Huanchen},
  title = {F3: The Open-Source Data File Format for the Future},
  year = {2025},
  issue_date = {September 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {3},
  number = {4},
  url = {https://doi.org/10.1145/3749163},
  doi = {10.1145/3749163},
  journal = {Proc. ACM Manag. Data},
  month = sep,
  articleno = {245},
  numpages = {27},
  keywords = {columnar storage, compression, extensibility, file format}
}
Additional Resources:
- License: MIT
- Reproduction: Detailed steps are available in `doc/paper_reproduction.md`.
- Paper Link: dl.acm.org/doi/10.1145/3749163