
F3

github.com|200 points|54 comments|by tosh|Jun 23, 2026

F3: A Next-Generation Open-Source Data File Format

F3 is a data file format designed for extensibility, interoperability, and efficiency. It aims to address structural limitations of previous-generation formats such as Apache Parquet, and to remain "future-proof" by embedding WebAssembly (Wasm) decoders in the file.

[!WARNING] This repository is a research prototype intended to validate the ideas presented in the associated academic paper.


🚀 Core Philosophy & Architecture

Modern data analytics relies heavily on columnar storage. However, formats like Parquet and ORC were designed over a decade ago, and the gap between their original design and current hardware and workload requirements often necessitates complete rewrites.

F3 addresses this by implementing a flexible data organization and a general-purpose API. The fundamental logic can be represented as:

$\text{F3} = \text{Optimized Layout} + \text{Metadata} + \text{Wasm Decoders}$

System Workflow

By embedding Wasm binaries, which occupy only a few kilobytes, F3 ensures that any platform can decode the data even if a native decoder is missing.
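The fallback described above can be sketched in a few lines of Rust. This is an illustration only, not the prototype's API: `DecoderChoice`, `NativeDecoder`, `decode_plain`, `choose_decoder`, and the encoding names are all hypothetical, and the embedded module is represented by stand-in bytes rather than being executed in a Wasm runtime.

```rust
use std::collections::HashMap;

/// Hypothetical: how a reader might decide to decode one encoded chunk.
#[derive(Debug, PartialEq)]
enum DecoderChoice {
    /// A decoder compiled into the reader exists for this encoding.
    Native,
    /// No native decoder; use the Wasm module shipped inside the file.
    EmbeddedWasm { module_len: usize },
    /// Neither is available, so the chunk cannot be read.
    Unsupported,
}

/// Hypothetical signature for a native decoder: bytes in, values out.
type NativeDecoder = fn(&[u8]) -> Vec<u64>;

/// Toy native decoder that treats every byte as one value.
fn decode_plain(bytes: &[u8]) -> Vec<u64> {
    bytes.iter().map(|&b| u64::from(b)).collect()
}

/// Prefer a native decoder; otherwise fall back to the embedded Wasm binary.
fn choose_decoder(
    encoding: &str,
    native: &HashMap<&str, NativeDecoder>,
    embedded_wasm: &HashMap<&str, Vec<u8>>,
) -> DecoderChoice {
    if native.contains_key(encoding) {
        DecoderChoice::Native
    } else if let Some(module) = embedded_wasm.get(encoding) {
        DecoderChoice::EmbeddedWasm { module_len: module.len() }
    } else {
        DecoderChoice::Unsupported
    }
}

fn main() {
    let mut native: HashMap<&str, NativeDecoder> = HashMap::new();
    native.insert("plain", decode_plain);

    // Stand-in bytes (the four-byte Wasm magic number); a real file would
    // carry a complete module of a few kilobytes.
    let mut embedded: HashMap<&str, Vec<u8>> = HashMap::new();
    embedded.insert("custom-rle", vec![0x00, 0x61, 0x73, 0x6d]);

    for enc in ["plain", "custom-rle", "unknown"] {
        println!("{}: {:?}", enc, choose_decoder(enc, &native, &embedded));
    }
}
```

The point of the sketch is the ordering: a native decoder is used when the reader has one, and the embedded binary makes the data readable when it does not.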


🛠️ Technical Implementation

Build & Installation

The project has been verified on Debian 12 using Intel architecture. To get the prototype running, follow these steps:

  • Initialize submodules: git submodule update --init --recursive
  • Prepare the environment: ./scripts/setup_debian.sh
  • Compile the PoC: cargo build -p fff-poc
  • Execute unit tests: cargo test -p fff-poc

Repository Map

The project is organized into several key directories:

| Directory | Purpose |
| --- | --- |
| format | Contains the FlatBuffer definitions for the file structure. |
| fff-bench | Houses the micro and end-to-end experiments used in the paper. |
| fff-ude* | Implementation of User-Defined-Encoding (UDE) via Wasm. |
| scripts / exp_scripts | Automation tools for running experimental benchmarks. |
| fff-core / fff-encoding | Core logic and encoding schemes. |

📊 Project Statistics

Language Distribution

The codebase is primarily written in Rust to ensure memory safety and performance.

| Language | Percentage |
| --- | --- |
| Rust | 67.5% |
| WebAssembly | 25.3% |
| C++ | 4.2% |
| Shell | 1.7% |
| Python | 1.3% |

Key Contributors

Xinyu Zeng | Ruijun Meng | Ruihang Xia


🎓 Academic Reference

This project is associated with a paper presented at SIGMOD 2026.

Abstract

Columnar storage is the bedrock of modern analytics. While open formats allow for data sharing, legacy specifications often struggle with modern hardware. F3 introduces a "future-proof" approach, utilizing a general-purpose API and embedded Wasm decoders to eliminate the need for constant format migrations. Evaluations show that F3's layout and Wasm-driven decoding offer significant advantages over state-of-the-art legacy formats.

Citation

If you utilize this research, please use the following BibTeX entry:

@article{zeng2025f3,
  author = {Zeng, Xinyu and Meng, Ruijun and Prammer, Martin and McKinney, Wes and Patel, Jignesh M. and Pavlo, Andrew and Zhang, Huanchen},
  title = {F3: The Open-Source Data File Format for the Future},
  year = {2025},
  issue_date = {September 2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {3},
  number = {4},
  url = {https://doi.org/10.1145/3749163},
  doi = {10.1145/3749163},
  journal = {Proc. ACM Manag. Data},
  month = sep,
  articleno = {245},
  numpages = {27},
  keywords = {columnar storage, compression, extensibility, file format}
}
