Show HN: cuTile Rust: Safe, data-race-free GPU kernels in Rust
cuTile Rust: Enabling Safe, Data-Race-Free GPU Kernels in Rust
cuTile Rust (cutile-rs) is a specialized tile-based domain-specific language (DSL) designed to bring the safety and ergonomics of the Rust programming language to GPU kernel development. Its primary goal is to ensure that kernels are memory-safe and free of data races.
🛠️ Technical Architecture
The core innovation of cuTile Rust is the extension of Rust's strict ownership and borrowing rules across the boundary between the CPU (host) and GPU (device).
Ownership Model
- Mutable Tensors: These are partitioned into disjoint segments before the kernel is launched to prevent overlapping writes.
- Immutable Tensors: These are shared across the GPU.
- Launchers: The generated launcher functions maintain ownership while the GPU is processing the workload.
This unified model is versatile enough to handle:
- Synchronous execution.
- Asynchronous pipelines.
- CUDA graph replay.
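The partition-then-share discipline above has a close analogue in ordinary host-side Rust. The sketch below is not cuTile code: it uses only the standard library, and `add_partitioned` is a hypothetical name chosen for illustration. It splits a mutable output into disjoint chunks, shares the inputs immutably, and relies on a scope that outlives all workers, which roughly mirrors how a launcher is described as holding ownership while work is in flight.

```rust
use std::thread;

/// Adds `x` and `y` into `z`, one scoped thread per disjoint chunk of `z`.
/// (Hypothetical helper; a CPU analogy, not part of cuTile Rust.)
fn add_partitioned(z: &mut [f32], x: &[f32], y: &[f32], chunk: usize) {
    assert!(chunk > 0, "chunk size must be non-zero");
    assert!(x.len() == z.len() && y.len() == z.len(), "length mismatch");

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` slices, so no two
        // workers can ever write to the same element.
        for (i, z_chunk) in z.chunks_mut(chunk).enumerate() {
            let start = i * chunk;
            let end = start + z_chunk.len();
            // Shared `&[f32]` views may be read by every worker at once.
            let (x_chunk, y_chunk) = (&x[start..end], &y[start..end]);
            s.spawn(move || {
                for ((out, a), b) in z_chunk.iter_mut().zip(x_chunk).zip(y_chunk) {
                    *out = a + b;
                }
            });
        }
        // The scope joins every worker before returning, so the borrows of
        // `z`, `x`, and `y` outlive all work performed on them.
    });
}

fn main() {
    let x = vec![1.0_f32; 1024];
    let y = vec![1.0_f32; 1024];
    let mut z = vec![0.0_f32; 1024];
    add_partitioned(&mut z, &x, &y, 128);
    assert!(z.iter().all(|&v| v == 2.0));
}
```

Trying to hand the same chunk to two workers, or to touch `z` while the scope is still running, is rejected by the borrow checker at compile time.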
Compilation Pipeline
The #[cutile::module] macro captures the Rust Abstract Syntax Tree (AST) and embeds it within the host binary. At runtime, this AST is JIT-compiled into a GPU cubin via the CUDA Tile IR.
Note: For developers requiring granular control, local opt-outs from these safety abstractions are available.
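The embed-now, compile-later flow of the compilation pipeline can be imitated in plain Rust. This is only a rough analogy under stated assumptions: `macro_rules!` with `stringify!` stands in for the procedural macro, a placeholder function stands in for the JIT, and caching the result after first use is an assumption rather than documented cuTile behavior. `embed_kernel!`, `jit_compile`, `get_kernel`, and `CompiledKernel` are hypothetical names.

```rust
use std::sync::OnceLock;

/// Stand-in for a compiled GPU binary; it only records what was "compiled".
#[derive(Debug)]
struct CompiledKernel {
    source_len: usize,
}

/// Hypothetical macro: defines the function as usual and also embeds its
/// source text in the binary as a `&'static str` constant.
macro_rules! embed_kernel {
    ($const_name:ident, $($body:tt)*) => {
        $($body)*
        const $const_name: &str = stringify!($($body)*);
    };
}

embed_kernel! {
    ADD_SRC,
    fn add(a: f32, b: f32) -> f32 { a + b }
}

/// Placeholder "JIT": a real pipeline would lower the embedded program to a
/// GPU binary at this point.
fn jit_compile(source: &str) -> CompiledKernel {
    CompiledKernel { source_len: source.len() }
}

/// Compiles on first use and caches the result for later launches
/// (the caching policy is an assumption made for this sketch).
fn get_kernel() -> &'static CompiledKernel {
    static CACHE: OnceLock<CompiledKernel> = OnceLock::new();
    CACHE.get_or_init(|| jit_compile(ADD_SRC))
}

fn main() {
    // The function is still callable on the host...
    assert_eq!(add(1.0, 2.0), 3.0);
    // ...and its embedded source is compiled once, then reused.
    let first = get_kernel();
    let second = get_kernel();
    assert!(std::ptr::eq(first, second));
    assert!(first.source_len > 0);
}
```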
⚠️ Project Status
Notice: This is a research project intended to demonstrate the feasibility of safe GPU programming in Rust.
The software is in an early stage and undergoing active development. Users should anticipate:
- Potential bugs
- Incomplete feature sets
- Breaking API changes
If you wish to contribute, please refer to the CONTRIBUTING.md file.
🚀 Quick Start
Below is a demonstration of a simple vector addition kernel.
use cutile::prelude::*;

#[cutile::module]
mod kernel {
    use cutile::core::*;

    #[cutile::entry()]
    fn add<const B: i32>(
        z: &mut Tensor<f32, { [B] }>, // Exclusive mutable output
        x: &Tensor<f32, { [-1] }>,    // Shared read-only input
        y: &Tensor<f32, { [-1] }>,    // Shared read-only input
    ) {
        let tx = load_tile_like(x, z);
        let ty = load_tile_like(y, z);
        z.store(tx + ty);
    }
}

fn main() -> Result<(), Error> {
    let x = api::ones::<f32>([1024]);
    let y = api::ones::<f32>([1024]);
    // Partition the output tensor into chunks of 128
    let z = api::zeros::<f32>([1024]).partition([128]);
    // JIT-compile and execute
    let (_z, _x, _y) = kernel::add(z, x, y).sync();
    Ok(())
}
How it works:
- The Macro: #[cutile::module] converts the add function into a GPU kernel and creates a corresponding host-side launcher.
- Execution: The host code defines lazy operations and partitions the output z into tiles.
- Grid Inference: The launch grid is automatically derived from the partition size.
- Safety: The signature ensures z is exclusively mutable while x and y remain read-only.
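The grid inference step comes down to simple arithmetic: a 1024-element output partitioned into chunks of 128 yields 8 tiles. A minimal sketch of that calculation follows. `infer_grid` is a hypothetical helper, not a cuTile function, and rounding up for shapes that do not divide evenly is an assumption, since the text does not say how such shapes are handled.

```rust
/// Number of tiles along each axis when `shape` is split into `tile`-sized pieces.
/// (Hypothetical helper for illustration only.)
fn infer_grid<const N: usize>(shape: [usize; N], tile: [usize; N]) -> [usize; N] {
    let mut grid = [0; N];
    for i in 0..N {
        assert!(tile[i] > 0, "tile extent must be non-zero");
        // Assumption: round up so a trailing partial tile still gets a block.
        grid[i] = shape[i].div_ceil(tile[i]);
    }
    grid
}

fn main() {
    // The Quick Start case: 1024 elements in chunks of 128.
    assert_eq!(infer_grid([1024], [128]), [8]);
}
```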
To try a similar example locally:
cargo run -p cutile-examples --example saxpy
📊 Performance & Evaluation
The research paper, Fearless Concurrency on the GPU, highlights the efficiency of cuTile Rust. Testing on the NVIDIA B200 demonstrates that safety does not come at the cost of speed.
Hardware Benchmarks (NVIDIA B200)
| Operation | Performance | % of Peak |
|---|---|---|
| Element-wise | Memory Bandwidth | |
| GEMM (Dense f16) | Peak | |
Safety Overhead: Microbenchmarks for persistent GEMM show that safe Rust performs within 0.3% of the low-level Tile IR implementation.
Grout: Real-world Application
In collaboration with Hugging Face, the team developed Grout, an inference engine for Qwen3.
- Qwen3-4B (RTX 5090): (Batch-1 decode).
- Qwen3-32B (B200): (Batch-1 decode).
HBM roofline analysis confirms that Grout delivers state-of-the-art performance for memory-bound inference tasks.
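For readers unfamiliar with roofline reasoning, batch-1 decode is usually memory-bound because each generated token has to stream the model weights from HBM at least once. That gives a simple upper bound on tokens per second. The sketch below shows the arithmetic with deliberately round placeholder inputs. They are not figures from the paper, and the model ignores KV-cache and activation traffic. Both function names are hypothetical.

```rust
/// Upper bound on batch-1 decode rate, assuming every token reads all weights
/// from memory exactly once and nothing else limits throughput.
fn roofline_tokens_per_sec(params: f64, bytes_per_param: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bandwidth_bytes_per_sec / (params * bytes_per_param)
}

/// Fraction of the roofline bound that a measured decode rate reaches.
fn fraction_of_roofline(measured_tokens_per_sec: f64, bound_tokens_per_sec: f64) -> f64 {
    measured_tokens_per_sec / bound_tokens_per_sec
}

fn main() {
    // Placeholder inputs: 4e9 parameters, 2 bytes each, 1e12 bytes/s of bandwidth.
    let bound = roofline_tokens_per_sec(4.0e9, 2.0, 1.0e12);
    assert!((bound - 125.0).abs() < 1e-9);

    // A hypothetical measured rate of 100 tokens/s would sit at 80% of the bound.
    let fraction = fraction_of_roofline(100.0, bound);
    assert!((fraction - 0.8).abs() < 1e-9);
}
```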
📚 Citation & References
If you utilize cuTile Rust in your academic work, please use the following BibTeX entry:
@misc{elibol2026fearlessconcurrencygpu,
title = { Fearless Concurrency on the GPU },
author = { Elibol, Melih and Roesch, Jared and Gelado, Isaac and Buehler, Eric and Garland, Michael },
year = { 2026 },
eprint = { 2606.15991 },
archivePrefix = { arXiv },
primaryClass = { cs.PL },
url = { https://arxiv.org/abs/2606.15991 }
}
Related Projects:
- Grout: A high-performance Qwen3 inference engine implemented in Rust.