Private AI R&D: Fleet Routing, Security Evals, and Knowing When Your Models Fail

See how to build a private AI fleet with diverse hardware, self-hosted routing, and continuous security testing, demonstrating real-world OWASP LLM Top 10 risks.

LiteLLM Ollama Hermia Docker Grafana

Overview

I built a heterogeneous LLM inference fleet and evaluation harness for a boutique AI and security consultancy R&D lab — spanning AMD, Nvidia, and Apple Silicon hardware, routed through a self-hosted LiteLLM gateway with named capability lanes, and tested continuously by Hermia, a local-first security eval TUI I wrote specifically because existing tools weren’t built for this use case.

For the demo: live Hermia run against the fleet, Grafana leaderboard updating in real time, and a walkthrough of what the security evals actually test and why the results differ across models.

Links

https://github.com/scottblydotcom
This GitHub profile serves as a placeholder for Scott Bly.
https://scottbly.com
Scott Bly managed secure infrastructure, HIPAA/PCI compliance, and creative production.

Tech stack

LiteLLM

LiteLLM is the unified LLM gateway: call 100+ models (OpenAI, Anthropic, Azure, etc.) using a single, standardized OpenAI-compatible API.

LiteLLM acts as your production-grade LLM gateway, simplifying complex multi-model deployments. It unifies over 100 LLM providers—including OpenAI, Anthropic, and VertexAI—under a single, consistent API call structure (the OpenAI format). This standardization eliminates SDK friction. Key features include the LiteLLM Router for automatic retry and fallback logic across deployments, ensuring high reliability. Additionally, the Proxy Server centralizes cost tracking, allows granular budget setting per virtual key, and provides load balancing, making it essential for ML Platform teams managing scalable, cost-optimized Gen AI applications.

https://litellm.ai

View projects
Ollama

Deploy and run open-source Large Language Models (LLMs) like Llama 3 and Mistral locally on your machine: achieve private, cost-effective AI via a simple command-line interface.

Ollama is the essential tool for running LLMs locally: consider it the Docker for AI models. It packages complex models and dependencies into a single, easy-to-use application for macOS, Linux, and Windows systems. You get immediate access to models like Gemma 2 and DeepSeek-R1 via a straightforward CLI or REST API. This local-first approach guarantees data privacy and security, eliminating cloud dependency and high API costs. Ollama also optimizes performance on consumer hardware using techniques like quantization, ensuring efficient execution even on standard desktops.

https://ollama.com

View projects
Hermia

Hermia is an AI-driven clinical coding platform that automates medical documentation and billing workflows for healthcare providers.

Hermia streamlines the revenue cycle by deploying autonomous agents to audit medical records and assign ICD-10 and CPT codes with 95% accuracy. The system integrates directly with existing EHRs (Electronic Health Records) to eliminate manual entry bottlenecks and reduce claim denials. By processing thousands of charts in minutes, Hermia allows hospital administrators to capture revenue faster while ensuring full compliance with evolving payer regulations.

https://hermia.io

View projects
Docker

Docker is the open-source platform that packages applications and dependencies into standardized, portable containers for consistent execution across any environment.

Docker is the industry-standard containerization platform, enabling developers to build, ship, and run applications efficiently. It uses the Docker Engine (the core runtime) to create lightweight, isolated environments called containers: these units bundle an application’s code, libraries, and configuration. This self-contained approach guarantees consistency, eliminating the 'it works on my machine' problem across development, testing, and production environments (local workstations, cloud, or on-premises). Docker debuted in 2013 and now serves over 20 million developers monthly, simplifying complex workflows like CI/CD and microservices architecture by leveraging tools like Docker Hub for image sharing and Docker Compose for multi-container applications.

https://www.docker.com

View projects
Grafana

Grafana is the open-source platform for operational observability: query, visualize, and alert on metrics, logs, and traces from any data source.

Grafana is your mission control for data visualization and monitoring. This open-source analytics platform connects directly to over 100 data sources (Prometheus, Loki, InfluxDB, MySQL, etc.) without requiring data ingestion. You build dynamic, flexible dashboards using customizable panels (graphs, heatmaps, tables) to transform raw metrics and logs into actionable, real-time insights. The built-in alerting engine allows you to define rules visually and send notifications to critical channels (Slack, PagerDuty, email). Use Grafana to consolidate your entire observability stack: see infrastructure performance, application health, and user behavior in a single pane of glass.

https://grafana.com/

View projects