cuTile Rust: Enabling Safe, Data-Race-Free GPU Kernels in Rust


cuTile Rust (cutile-rs) is a specialized tile-based domain-specific language (DSL) designed to bring the safety and ergonomics of the Rust programming language to GPU kernel development. Its primary goal is to ensure that kernels are memory-safe and free of data races.

🛠️ Technical Architecture

The core innovation of cuTile Rust is the extension of Rust's strict ownership and borrowing rules across the boundary between the CPU (host) and GPU (device).

Ownership Model

Mutable Tensors: These are partitioned into disjoint segments before the kernel is launched to prevent overlapping writes.
Immutable Tensors: These are shared across the GPU.
Launchers: The generated launcher functions maintain ownership while the GPU is processing the workload.

This unified model is versatile enough to handle:

Synchronous execution.
Asynchronous pipelines.
CUDA graph replay.
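
The partitioning rule in the ownership model has a close analogue in ordinary host-side Rust. The sketch below is a CPU-only illustration using the standard library, not cuTile code: `add_partitioned` is a hypothetical helper that splits the output into disjoint `&mut` chunks (one per worker thread) while every worker reads the inputs through shared `&` borrows. The borrow checker rejects any variant in which two workers could write to the same chunk.

```rust
use std::thread;

/// CPU analogy (hypothetical helper, not part of cuTile): each worker gets an
/// exclusive `&mut` tile of `z` and shared `&` views of `x` and `y`.
fn add_partitioned(z: &mut [f32], x: &[f32], y: &[f32], tile: usize) {
    assert!(tile > 0, "tile size must be positive");
    assert!(x.len() == z.len() && y.len() == z.len(), "length mismatch");

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping mutable slices, so no two
        // threads can ever alias the same output element.
        for (i, z_tile) in z.chunks_mut(tile).enumerate() {
            let start = i * tile;
            let x_tile = &x[start..start + z_tile.len()];
            let y_tile = &y[start..start + z_tile.len()];
            s.spawn(move || {
                for ((out, a), b) in z_tile.iter_mut().zip(x_tile).zip(y_tile) {
                    *out = a + b;
                }
            });
        }
        // All spawned threads are joined when the scope ends, so the
        // borrows of `z`, `x`, and `y` are released only after the work is done.
    });
}

fn main() {
    let x = vec![1.0_f32; 1024];
    let y = vec![1.0_f32; 1024];
    let mut z = vec![0.0_f32; 1024];
    add_partitioned(&mut z, &x, &y, 128);
    assert!(z.iter().all(|&v| v == 2.0));
}
```

The scope holding the borrows until all workers finish plays roughly the role that the text assigns to the launcher, which keeps ownership while the GPU is still working.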

Compilation Pipeline

The #[cutile::module] macro captures the Rust Abstract Syntax Tree (AST) and embeds it within the host binary. At runtime, this AST is JIT-compiled into a GPU cubin via the CUDA Tile IR.

Note: For developers requiring granular control, local opt-outs from these safety abstractions are available.

⚠️ Project Status

Notice: This is a research project intended to demonstrate the feasibility of safe GPU programming in Rust.

The software is in an early stage and under active development. Users should anticipate:

Potential bugs.
Incomplete feature sets.
Breaking API changes.

If you wish to contribute, please refer to the CONTRIBUTING.md file.

🚀 Quick Start

Below is a demonstration of a simple vector addition kernel.

use cutile::prelude::*;

#[cutile::module]
mod kernel {
    use cutile::core::*;

    #[cutile::entry()]
    fn add<const B: i32>(
        z: &mut Tensor<f32, { [B] }>, // Exclusive mutable output
        x: &Tensor<f32, { [-1] }>,    // Shared read-only input
        y: &Tensor<f32, { [-1] }>,    // Shared read-only input
    ) {
        let tx = load_tile_like(x, z);
        let ty = load_tile_like(y, z);
        z.store(tx + ty);
    }
}

fn main() -> Result<(), Error> {
    let x = api::ones::<f32>([1024]);
    let y = api::ones::<f32>([1024]);
    // Partition the output tensor into chunks of 128
    let z = api::zeros::<f32>([1024]).partition([128]);

    // JIT-compile and execute
    let (_z, _x, _y) = kernel::add(z, x, y).sync();
    
    Ok(())
}

How it works:

The Macro: #[cutile::module] converts the add function into a GPU kernel and creates a corresponding host-side launcher.
Execution: The host code defines lazy operations and partitions the output z into $1024 \div 128 = 8$ tiles.
Grid Inference: The launch grid $(8, 1, 1)$ is automatically derived from the partition size.
Safety: The signature ensures z is exclusively mutable while x and y remain read-only.

To try a similar example locally: cargo run -p cutile-examples --example saxpy
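
The grid inference step described above comes down to dividing each tensor dimension by the matching partition dimension. The following is a minimal sketch of that arithmetic, not the library's implementation: `infer_grid` is a hypothetical name, and rounding up for shapes that do not divide evenly is an assumption (the Quick Start example divides evenly, so it does not show how cuTile Rust treats partial tiles).

```rust
/// Hypothetical helper: derive a 3D launch grid from a tensor shape and a
/// tile (partition) shape. Unused trailing dimensions default to 1.
fn infer_grid(shape: &[usize], tile: &[usize]) -> Option<(usize, usize, usize)> {
    // Only ranks 1 to 3 map onto a (x, y, z) grid in this sketch.
    if shape.len() != tile.len() || shape.is_empty() || shape.len() > 3 {
        return None;
    }
    let mut grid = [1usize; 3];
    for (i, (&dim, &t)) in shape.iter().zip(tile).enumerate() {
        if t == 0 {
            return None;
        }
        // Ceiling division (assumption: partial tiles round up).
        grid[i] = (dim + t - 1) / t;
    }
    Some((grid[0], grid[1], grid[2]))
}

fn main() {
    // The Quick Start case: 1024 elements in tiles of 128.
    let grid = infer_grid(&[1024], &[128]);
    assert_eq!(grid, Some((8, 1, 1)));
    println!("{:?}", grid);
}
```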

📊 Performance & Evaluation

The research paper, Fearless Concurrency on the GPU, reports performance results for cuTile Rust. Measurements on the NVIDIA B200 indicate that the safety guarantees add very little overhead.

Hardware Benchmarks (NVIDIA B200)

Operation	Throughput	Fraction of Peak
Element-wise	$7 \text{ TB/s}$	$91\%$ of peak memory bandwidth
GEMM (dense f16)	$2 \text{ PFlop/s}$	$92\%$ of peak

Safety Overhead: Microbenchmarks for persistent GEMM at $M=N=K=8192$ show that safe Rust achieves $2.07 \text{ PFlop/s}$, which is within 0.3% of the low-level Tile IR implementation.
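
To put the GEMM figure in concrete terms, the sketch below works out the operation count and the implied time per multiplication. It assumes the usual convention of counting each multiply-add as two floating-point operations ($2MNK$ in total); the text does not state which convention the paper uses, and `gemm_flops` and `seconds_at_rate` are illustrative names.

```rust
/// Floating-point operations for an M x N x K matrix multiply, counting each
/// multiply-add as two operations (assumed convention).
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Time in seconds to execute `flops` operations at `rate` FLOP/s.
fn seconds_at_rate(flops: u64, rate: f64) -> f64 {
    flops as f64 / rate
}

fn main() {
    let flops = gemm_flops(8192, 8192, 8192); // 2 * 8192^3 = 2^40
    let rate = 2.07e15; // 2.07 PFlop/s, the figure quoted in the text
    let ms = seconds_at_rate(flops, rate) * 1e3;
    println!("{} FLOP per GEMM", flops);
    println!("{:.3} ms per GEMM", ms); // roughly half a millisecond
}
```

At this problem size a single GEMM takes about half a millisecond, so a 0.3% difference corresponds to a timing gap on the order of a microsecond or two.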

Grout: Real-world Application

In collaboration with Hugging Face, the team developed Grout, an inference engine for Qwen3.

Qwen3-4B (RTX 5090): $171 \text{ tokens/s}$ (Batch-1 decode).
Qwen3-32B (B200): $82 \text{ tokens/s}$ (Batch-1 decode).

HBM roofline analysis confirms that Grout delivers state-of-the-art performance for memory-bound inference tasks.
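
For readers unfamiliar with roofline bounds, the idea for batch-1 decode is that every generated token has to stream the model weights from memory at least once, so throughput cannot exceed memory bandwidth divided by weight size. The sketch below is an illustration under stated assumptions, not the paper's analysis: the function name is hypothetical, the 2 bytes per parameter (16-bit weights) and the nominal 32e9 parameter count are assumptions, KV-cache and activation traffic are ignored, and the bandwidth is the 7 TB/s element-wise figure from the benchmark table rather than a vendor peak.

```rust
/// Upper bound on batch-1 decode throughput (tokens/s) when each token
/// requires reading all weights from memory once.
fn decode_roofline_tokens_per_s(
    params: f64,
    bytes_per_param: f64,
    bandwidth_bytes_per_s: f64,
) -> f64 {
    assert!(params > 0.0 && bytes_per_param > 0.0, "model size must be positive");
    bandwidth_bytes_per_s / (params * bytes_per_param)
}

fn main() {
    // Assumptions (not from the source): nominal 32e9 parameters, 16-bit weights.
    let params = 32e9;
    let bytes_per_param = 2.0;
    // 7 TB/s: the achieved element-wise bandwidth reported for the B200 above.
    let bandwidth = 7e12;

    let bound = decode_roofline_tokens_per_s(params, bytes_per_param, bandwidth);
    println!("bandwidth-bound ceiling: {:.1} tokens/s", bound);
}
```

Under these assumptions the ceiling is about 109 tokens/s. A different weight precision or parameter count moves the bound proportionally, so the number is only a rough yardstick for reading the reported decode rates.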

📚 Citation & References

If you utilize cuTile Rust in your academic work, please use the following BibTeX entry:

@misc{elibol2026fearlessconcurrencygpu,
  title = { Fearless Concurrency on the GPU },
  author = { Elibol, Melih and Roesch, Jared and Gelado, Isaac and Buehler, Eric and Garland, Michael },
  year = { 2026 },
  eprint = { 2606.15991 },
  archivePrefix = { arXiv },
  primaryClass = { cs.PL },
  url = { https://arxiv.org/abs/2606.15991 }
}

Related Projects:

Grout: A high-performance Qwen3 inference engine implemented in Rust.