OmniNode Protocol: Pipeline-Parallel LLM Inference Across Consumer Devices via Rust, QUIC, and Native GGUF-to-MLX
Learn how to split LLMs across consumer devices for inference using Rust, QUIC, and native GGUF-to-MLX loading on Apple Silicon.
A live, two-machine demo of OmniNode Protocol: an open-source Rust/Python system that splits a large language model across multiple consumer devices and runs real autoregressive inference over a LAN.
We built four layers from scratch:
- P2P Transport (omni-net): mDNS peer discovery + encrypted QUIC streams via libp2p 0.55. All tensor routing uses a custom /omni/tensor-xfer/1 request-response protocol, with no centralized broker.
- GGUF Model Sharding (omni-store): A zero-copy GGUF v2/v3 parser (memmap2) that classifies tensors by name (token_embd.*, blk.{N}.*, output.*), chunks them by transformer block, and content-addresses each shard with BLAKE3 → CIDv1. There is no iroh dependency; we built the 64 MiB sliding-window transfer protocol directly on libp2p request-response.
- PyO3 FFI Bridge (omni-bridge): A Rust-to-Python zero-copy bridge using the Python Buffer Protocol. PyShardView implements __getbuffer__ over memmap2::Mmap, exposing raw shard bytes to NumPy and MLX with zero memory copies on Apple Silicon unified memory.
- Pipeline-Parallel Inference + Native GGUF Bridge: Each node loads only its assigned layer slice from the .gguf file using Apple’s mx.load() API directly: no HuggingFace Hub, no config.json, no mlx_lm.load(). Architecture (hidden_dim, layer count, rope_theta, etc.) is inferred entirely from GGUF metadata at runtime. After slicing, the full model is explicitly dropped (del model + gc.collect() + mx.metal.clear_cache()) so each node holds only its share of the weights (50% in the two-machine demo) in Apple Silicon unified memory. Hidden states are routed over QUIC using hidden_dim as a wire-level type discriminator: hidden_dim == model_size means float16 activations; hidden_dim == 1 means a 4-byte token ID.
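The hidden_dim discriminator from the last bullet can be illustrated with a small Python sketch. The frame layout used here (a little-endian uint32 hidden_dim header followed by the payload, one position per frame) and the function names are assumptions made for illustration; the actual /omni/tensor-xfer/1 wire format is not specified in this abstract.

```python
import struct

def encode_activations(values, model_size):
    """Frame one position's hidden state as float16 (hypothetical layout)."""
    if model_size <= 1:
        raise ValueError("model_size must be > 1 to avoid colliding with the token-ID marker")
    if len(values) != model_size:
        raise ValueError("expected exactly model_size activations")
    header = struct.pack("<I", model_size)               # hidden_dim == model_size
    return header + struct.pack(f"<{model_size}e", *values)  # 'e' = IEEE 754 half precision

def encode_token(token_id):
    """Frame a sampled token as a 4-byte ID (hypothetical layout)."""
    return struct.pack("<I", 1) + struct.pack("<I", token_id)  # hidden_dim == 1

def decode_frame(frame, model_size):
    """Dispatch on the hidden_dim header, as the bullet above describes."""
    (hidden_dim,) = struct.unpack_from("<I", frame, 0)
    payload = frame[4:]
    if hidden_dim == 1:
        if len(payload) != 4:
            raise ValueError("token frame must carry exactly 4 payload bytes")
        (token_id,) = struct.unpack("<I", payload)
        return ("token", token_id)
    if hidden_dim == model_size:
        if len(payload) != 2 * model_size:
            raise ValueError("activation frame has the wrong payload size")
        return ("activations", list(struct.unpack(f"<{model_size}e", payload)))
    raise ValueError(f"unexpected hidden_dim {hidden_dim}")
```

A receiving stage would call decode_frame on each incoming message and either run its layer slice on the activations or feed the token ID back into the embedding layer.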
Next step, Phase 5 (omni-zkml): wrap each pipeline stage in a zero-knowledge proof (ezkl/Halo2 SNARK or RISC Zero STARK) to cryptographically prove correct inference, enabling a trustless staking/slashing economy on SUM Chain.
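How pipeline stages might be derived from GGUF metadata can be sketched as follows. The metadata keys (general.architecture, {arch}.block_count, {arch}.embedding_length, {arch}.rope.freq_base) are standard GGUF keys; the function names, the plain-dict input, the 10000.0 fallback for rope_theta, and the even contiguous split are assumptions for illustration, not necessarily how OmniNode assigns layers.

```python
def infer_arch(metadata):
    """Read architecture parameters from an already-parsed GGUF metadata dict."""
    arch = metadata["general.architecture"]
    return {
        "arch": arch,
        "n_layers": int(metadata[f"{arch}.block_count"]),
        "hidden_dim": int(metadata[f"{arch}.embedding_length"]),
        # Fallback value is an assumption for this sketch.
        "rope_theta": float(metadata.get(f"{arch}.rope.freq_base", 10000.0)),
    }

def split_layers(n_layers, n_nodes):
    """Assign contiguous layer ranges to nodes; earlier nodes absorb any remainder."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    base, extra = divmod(n_layers, n_nodes)
    ranges, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

With 32 layers and two nodes this yields layers 0-15 for the first stage and 16-31 for the second, matching the 50/50 split in the demo.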
OmniNode is a decentralized P2P network distributing LLM inference across consumer hardware using Rust, libp2p, and zero-copy GGUF sharding.
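The name-based tensor classification behind the sharding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the shard keys, the "other" bucket for unmatched names, and the function names are hypothetical, and the BLAKE3 → CIDv1 content-addressing step is left out.

```python
import re
from collections import defaultdict

# Matches per-layer tensors such as "blk.7.attn_q.weight".
_BLOCK_RE = re.compile(r"^blk\.(\d+)\.")

def classify_tensor(name):
    """Map a GGUF tensor name to a shard key (hypothetical key scheme)."""
    match = _BLOCK_RE.match(name)
    if match:
        return f"blk.{int(match.group(1))}"
    if name.startswith("token_embd."):
        return "token_embd"
    if name.startswith("output."):
        return "output"
    return "other"  # anything the three patterns do not cover

def group_by_shard(names):
    """Group tensor names by shard key, preserving input order within each shard."""
    shards = defaultdict(list)
    for name in names:
        shards[classify_tensor(name)].append(name)
    return dict(shards)
```

Each resulting group corresponds to one transformer block (or the embedding or output tensors), which is the unit a node would fetch for its layer slice.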
P2P Rust protocol sharding LLMs via QUIC and zero-copy MLX.
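The zero-copy idea can be shown from the Python side with the standard library alone. In this sketch, Python's own mmap object stands in for the Rust PyShardView (an assumption; the real class is implemented in Rust via PyO3), and the helper names are hypothetical. A memoryview over the mapping, and any slice of it, references the mapped pages rather than copying them; NumPy can wrap such a buffer with np.frombuffer in the same way.

```python
import mmap
import os
import tempfile

def open_shard_view(path):
    """Map a file read-only and return (mapping, memoryview) without copying bytes."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mm, memoryview(mm)

def demo():
    # Write a small stand-in "shard" to disk.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(16)))
        path = tmp.name
    try:
        mm, view = open_shard_view(path)
        window = view[4:8]  # slicing a memoryview creates a new view, not a copy
        result = (bytes(window), view.readonly, window.obj is mm)
        # Views must be released before the mapping can be closed.
        window.release()
        view.release()
        mm.close()
        return result
    finally:
        os.unlink(path)
```

The explicit release calls mirror the lifetime rule a buffer-exporting Rust type also has to respect: the mapping must outlive every view handed to Python.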
- Rust: Rust is a high-performance systems programming language that guarantees memory and thread safety via its compile-time ownership model. Rust is a statically-typed systems language engineered for performance and reliability, directly challenging C/C++ in speed. Its core innovation is the ownership model and 'borrow checker,' which enforces strict memory and thread safety at compile-time, eliminating data races and null pointer dereferences without a conventional garbage collector. Rust achieves near-native speed through 'zero-cost abstractions,' allowing high-level features to compile into highly optimized code. Major industry players, including Microsoft and Cloudflare, leverage Rust for critical infrastructure, and it is now officially supported for development in the Linux kernel.
- libp2p: A modular peer-to-peer networking stack that handles transport, security, and multiplexing across diverse network environments. libp2p provides the foundational networking layer for decentralized systems like IPFS and Ethereum 2.0. It solves the fragmentation of peer-to-peer connectivity by abstracting protocols (TCP, QUIC, WebRTC) and addressing nodes via content-agnostic multiaddresses. The framework manages complex tasks including NAT traversal, peer discovery (DHT), and pubsub messaging. By decoupling the application logic from the underlying network transport, libp2p enables developers to build resilient, distributed applications that function reliably across browsers, mobile devices, and data centers.
- PyO3: PyO3 provides Rust bindings for the Python interpreter, enabling seamless integration of high-performance Rust code into Python projects. PyO3 bridges the gap between Python's flexibility and Rust's memory safety and speed. It allows developers to write native Python modules in Rust or embed a Python interpreter within a Rust binary. The framework handles complex boilerplate (like reference counting and GIL management) automatically through procedural macros like #[pyfunction] and #[pymethods]. By leveraging Rust's zero-cost abstractions, PyO3 powers critical performance layers in industry-standard tools like Polars, Pydantic, and Cryptography.
- MLX: MLX is Apple's high-performance array framework for machine learning on Apple silicon, leveraging unified memory for zero-copy efficiency. MLX is an open-source array framework from Apple machine learning research, purpose-built for efficient ML on Apple Silicon (M-series chips). Its core strength is the unified memory model: this eliminates costly data transfers between the CPU and GPU, a major performance bottleneck in traditional frameworks. The API is immediately familiar, closely mirroring NumPy for array operations and PyTorch for higher-level packages like `mlx.nn` and `mlx.optimizers`. It supports Python, C++, C, and Swift bindings, making it highly flexible. Researchers use MLX to quickly train and deploy complex models, with examples including large-scale text generation with LLaMA and image creation via Stable Diffusion.
- GGUF: GGUF is a memory-mappable, single-file binary format for the efficient, quantized deployment of Large Language Models (LLMs). GGUF is the definitive file format for the GGML ecosystem (e.g., llama.cpp), engineered for streamlined LLM inference, especially on resource-constrained hardware. It functions as a single, self-contained binary, consolidating all model weights, metadata, and configuration (like tokenizer details) into one file. This design ensures mmap compatibility (memory-mapping) for rapid, lazy-loading of models like Llama, Mistral, and Phi-3. Crucially, GGUF supports a range of advanced blockwise quantization schemes, such as Q4_K and Q6_K, significantly reducing the memory footprint while maintaining performance.