OmniNode Protocol: Pipeline-Parallel LLM Inference Across Consumer Devices via Rust, QUIC, and Native GGUF-to-MLX

Learn how to split LLMs across consumer devices for inference using Rust, QUIC, and native GGUF-to-MLX loading on Apple Silicon.

Overview

This is a live, two-machine demo of OmniNode Protocol, an open-source Rust/Python system that splits a large language model across multiple consumer devices and runs real autoregressive inference over a LAN.

We built four layers from scratch:

P2P Transport (omni-net): mDNS peer discovery plus encrypted QUIC streams via libp2p 0.55. All tensor routing uses a custom /omni/tensor-xfer/1 request-response protocol, with no centralized broker.

GGUF Model Sharding (omni-store): A zero-copy GGUF v2/v3 parser (memmap2) that classifies tensors by name (token_embd.*, blk.{N}.*, output.*), chunks them by transformer block, and content-addresses each shard with BLAKE3 → CIDv1. There is no iroh dependency: we built the 64 MiB sliding-window transfer protocol directly on libp2p request-response.
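The name-based classification and the 64 MiB windowing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the omni-store code (which is Rust): the function names, the shard keys, and the choice to cut a shard into fixed, non-overlapping 64 MiB windows are all hypothetical, and the BLAKE3 → CIDv1 step is left out because BLAKE3 is not in the Python standard library.

```python
import re
from collections import defaultdict

WINDOW_BYTES = 64 * 1024 * 1024  # 64 MiB, the window size named in the text

_BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def classify(name: str) -> str:
    """Map a GGUF tensor name to a (hypothetical) shard key."""
    if name.startswith("token_embd."):
        return "embedding"
    match = _BLOCK_RE.match(name)
    if match:
        return f"block-{int(match.group(1))}"
    if name.startswith("output."):
        return "output"
    # Names outside the three patterns (e.g. output_norm.weight) land here;
    # where the real system places them is not stated in the text.
    return "other"


def group_by_shard(names):
    """Group tensor names so each transformer block becomes one shard."""
    shards = defaultdict(list)
    for name in names:
        shards[classify(name)].append(name)
    return dict(shards)


def window_ranges(total_bytes: int, window: int = WINDOW_BYTES):
    """Return (offset, length) pairs that cover total_bytes window by window."""
    return [
        (offset, min(window, total_bytes - offset))
        for offset in range(0, total_bytes, window)
    ]
```

For example, `group_by_shard(["token_embd.weight", "blk.0.attn_q.weight", "blk.0.ffn_up.weight", "output.weight"])` puts both `blk.0.*` tensors under `"block-0"`, and `window_ranges(150 * 1024 * 1024)` yields two full 64 MiB windows followed by a 22 MiB tail.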

PyO3 FFI Bridge (omni-bridge): A Rust-to-Python zero-copy bridge using the Python Buffer Protocol. PyShardView implements __getbuffer__ over memmap2::Mmap, exposing raw shard bytes to NumPy and MLX without copying them in Apple Silicon unified memory.
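To make the buffer-protocol idea concrete, here is a minimal Python-only sketch of the same pattern: a memory-mapped file exposed through `memoryview`, where slicing produces another view over the mapping rather than a copy. It stands in for the Rust side only by analogy. `open_shard_view` and `demo` are hypothetical names, and the real PyShardView is a PyO3 class, not Python's `mmap`.

```python
import mmap
import os
import tempfile


def open_shard_view(path: str):
    """Map a file read-only and return (mapping, view) without copying bytes."""
    with open(path, "rb") as f:
        # mmap duplicates the file descriptor, so the mapping outlives `f`.
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mapping, memoryview(mapping)


def demo():
    # Write a tiny stand-in "shard" so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(16)))
        path = tmp.name
    try:
        mapping, view = open_shard_view(path)
        window = view[4:8]  # slicing a memoryview yields a view, not a copy
        result = (len(view), bytes(window), window.obj is mapping, view.readonly)
        # Views must be released before the mapping can be closed.
        window.release()
        view.release()
        mapping.close()
        return result
    finally:
        os.unlink(path)
```

Any consumer that accepts buffer-protocol objects can wrap such a view in place. With NumPy installed, for instance, `np.frombuffer(view, dtype=np.float16)` reinterprets the mapped bytes without copying them.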

Pipeline-Parallel Inference + Native GGUF Bridge: Each node loads only its assigned layer slice from the .gguf file using Apple’s mx.load() API directly, with no HuggingFace Hub, no config.json, and no mlx_lm.load(). The architecture (hidden_dim, layer count, rope_theta, etc.) is inferred entirely from GGUF metadata at runtime. After slicing, the full model is explicitly dropped (del model + gc.collect() + mx.metal.clear_cache()) so that each node in the two-machine setup holds only its half of the weights in Apple Silicon unified memory. Hidden states are routed over QUIC using hidden_dim as a wire-level type discriminator: hidden_dim == model_size marks a payload of float16 activations, while hidden_dim == 1 marks a 4-byte token ID.
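The hidden_dim discriminator can be sketched as a small encode/decode pair. Only the rule itself comes from the description above (hidden_dim == 1 means a 4-byte token ID, hidden_dim == model_size means float16 activations). The frame layout is an assumption made for illustration: a little-endian header of two unsigned 32-bit integers (hidden_dim, element count), an unsigned token ID, and the function names are all hypothetical.

```python
import struct

# Hypothetical frame header: hidden_dim and element count, both little-endian u32.
HEADER = struct.Struct("<II")


def encode_token(token_id: int) -> bytes:
    """Frame a sampled token: hidden_dim == 1, payload is a 4-byte token ID."""
    return HEADER.pack(1, 1) + struct.pack("<I", token_id)


def encode_activations(values, hidden_dim: int) -> bytes:
    """Frame hidden states: payload is len(values) float16 numbers."""
    if hidden_dim <= 1 or len(values) % hidden_dim != 0:
        raise ValueError("values must be a whole number of hidden_dim-wide rows")
    return HEADER.pack(hidden_dim, len(values)) + struct.pack(f"<{len(values)}e", *values)


def decode(frame: bytes, model_size: int):
    """Dispatch on hidden_dim, as the wire-level discriminator describes."""
    hidden_dim, count = HEADER.unpack_from(frame, 0)
    payload = frame[HEADER.size:]
    if hidden_dim == 1:
        (token_id,) = struct.unpack("<I", payload)
        return ("token", token_id)
    if hidden_dim == model_size:
        return ("activations", list(struct.unpack(f"<{count}e", payload)))
    raise ValueError(f"unexpected hidden_dim {hidden_dim} for model_size {model_size}")
```

A receiving stage would call `decode(frame, model_size)` and either run its layer slice on the activations or, if it is the first stage, feed the returned token back in for the next autoregressive step. Reusing one field as both a shape and a type tag keeps the header small, at the cost of reserving hidden_dim == 1 for tokens.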

Next step, Phase 5 (omni-zkml): wrap each pipeline stage in a zero-knowledge proof (ezkl/Halo2 SNARK or RISC Zero STARK) to cryptographically prove correct inference, enabling a trustless staking/slashing economy on SUM Chain.

Video

Links

Tech stack