# Awesome Private AI

Curated list of tools, frameworks, and resources for running, building, and deploying AI privately — on-prem, air-gapped, or self-hosted.

Private AI keeps your data, models, and infrastructure under your control, avoiding unnecessary exposure to third parties. This list covers inference runtimes, model management, privacy tools, and more.

## Contents
- [Inference Runtimes & Backends](#inference-runtimes--backends)
- [Model Management & Serving](#model-management--serving)
- [Fine-Tuning & Adapters](#fine-tuning--adapters)
- [Vector Databases & Embeddings](#vector-databases--embeddings)
- [Agents & Orchestration](#agents--orchestration)
- [VS Code Plugins & Extensions](#vs-code-plugins--extensions)
- [Privacy, Security & Governance](#privacy-security--governance)
- [Models for Private Deployment](#models-for-private-deployment)
- [UI & Interaction Layers](#ui--interaction-layers)
- [Datasets & Data Prep](#datasets--data-prep)
- [Learning Resources & Research](#learning-resources--research)
- [AI Routers & API Aggregators](#ai-routers--api-aggregators)
- [Contributing](#contributing)
- [License](#license)
## Inference Runtimes & Backends

Engines and frameworks to run LLMs, vision, and multimodal models locally. Most of these expose an OpenAI-compatible HTTP API, so one client works across them; see the sketch after this list.
- vLLM - High-throughput, low-latency inference engine for LLMs.
- whisper.cpp - C++ port of OpenAI's Whisper automatic speech recognition model, optimized for local CPU/GPU inference without internet connectivity.
- mlx-lm - Fast, Apple Silicon-optimized LLM inference engine for running models locally and privately.
- Jan - Privacy-first, offline AI assistant and LLM runtime for local, secure inference.
- LM Studio - Cross-platform desktop app for running local LLMs with an easy-to-use interface.
- Cherry Studio - Powerful and customizable cross-platform desktop app for LLM inference with built-in web search, RAG, MCP support, and a quick-assistant hotkey to summon your LLM from anywhere. Supports a wide variety of providers and OpenAI-compatible endpoints for local inference.
- LLM-D - Privacy-first, distributed LLM inference engine for scalable, local deployments.
- Ollama - Local LLM runner with model packaging. Uses a llama.cpp backend and serves models with conservative defaults.
- llama.cpp - Portable LLM inference engine, well suited to hybrid CPU + GPU setups.
- ik_llama.cpp - Fork of llama.cpp with bleeding edge feature implementations and quantization improvements.
- text-generation-inference - Optimized serving stack from Hugging Face.
- GPT4All - Local desktop model runner.
- exo - Run your own AI cluster at home with everyday devices. Dynamic model partitioning across multiple devices like iPhones, Macs, and Linux machines.
- exllama3 - An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs. Use TabbyAPI for an API server.
- tabbyAPI - Official API server for running exllamav2 and exllamav3 models. Aims to be a friendly backend with high customizability and an idiomatic OpenAI-compatible API.
- YALS (Yet another llamacpp server) - TabbyAPI's sister project, adapted for llama.cpp and GGUF models. Built from the ground up using libllama instead of wrapping llama-server.
- llama-swap - Model swapping for llama.cpp (or any local OpenAI-compatible server).
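
A minimal sketch of talking to any of the OpenAI-compatible servers above. The port, path, and model name are assumptions; adjust them to your setup (Ollama, for example, serves on a different port).

```python
# Minimal sketch: chat with a local OpenAI-compatible server
# (llama-server, vLLM, LM Studio, Ollama, and tabbyAPI all expose one).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint; adjust to yours
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # placeholder model name for your local setup
    messages=[{"role": "user", "content": "Why run AI privately?"}],
)
print(response.choices[0].message.content)
```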
## Model Management & Serving

Tools for hosting, scaling, and versioning AI models privately; a minimal Ray Serve sketch follows the list.
- Ray Serve - Scalable Python model serving.
- Seldon Core - Kubernetes-native model deployment.
- KServe - Serverless model inference on Kubernetes.
- BentoML - Model packaging & serving framework.
- vLLM Production Stack - End-to-end stack for deploying vLLM in production, including orchestration, monitoring, autoscaling, and best practices for private LLM serving.
- OME (Open Model Engine) - Unified, open-source engine for serving, managing, and scaling LLMs and multimodal models privately. Supports sglang, vLLM, and more.
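
As a taste of what private serving looks like with one of these tools, here is a minimal Ray Serve sketch. The deployment name and request shape are illustrative, not a prescribed API; load your real model in place of the echo logic.

```python
# Minimal Ray Serve sketch: wrap a local model behind a private HTTP endpoint.
from ray import serve
from starlette.requests import Request

@serve.deployment
class PrivateModel:
    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        # run real inference here; we just echo for illustration
        return {"output": body.get("text", "").upper()}

# Serves at http://localhost:8000/generate on the local Ray cluster.
serve.run(PrivateModel.bind(), route_prefix="/generate")
```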
## Fine-Tuning & Adapters

Private workflows for adapting models to your needs; a short PEFT/LoRA sketch follows the list.
- LoRA - Low-rank adaptation technique.
- PEFT - Parameter-efficient fine-tuning.
- QLoRA - Memory-efficient LoRA on quantized models.
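
To make these concrete, a short sketch that attaches a LoRA adapter to a base model with PEFT. The model id and target module names are assumptions; both depend on the architecture you fine-tune.

```python
# Sketch: parameter-efficient fine-tuning with a LoRA adapter via PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")  # placeholder id
config = LoraConfig(
    r=16,                                 # low-rank dimension of the adapter
    lora_alpha=32,                        # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

QLoRA follows the same pattern, with the base model loaded in 4-bit quantization to cut memory.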
## Vector Databases & Embeddings

Private semantic search and retrieval-augmented generation; a small FAISS sketch follows the list.
- Milvus - Scalable vector database.
- Qdrant - High-performance vector database and vector search engine.
- Weaviate - Open-source semantic search engine.
- Chroma - Local-first vector database.
- FAISS - Facebook AI Similarity Search.
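
A small sketch of fully local similarity search with FAISS. Random vectors stand in for real embeddings, which you would produce with a locally hosted embedding model.

```python
# Sketch: in-process vector search with FAISS; nothing leaves the machine.
import numpy as np
import faiss

dim = 384                        # typical small sentence-embedding size
index = faiss.IndexFlatL2(dim)   # exact L2 nearest-neighbor search

vectors = np.random.rand(1000, dim).astype("float32")  # stand-in embeddings
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, k=5)  # top-5 closest stored vectors
print(ids[0], distances[0])
```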
## Agents & Orchestration

Frameworks for chaining private AI tools and agents; a minimal local-model chain sketch follows the list.
- AG2 - Open-source operating system for agentic AI with native Ollama support for local model deployment and multi-agent collaboration.
- LangChain - Agent and LLM orchestration framework.
- Langflow - Visual workflow builder for creating and deploying AI-powered agents and workflows with built-in API servers.
- Haystack - End-to-end RAG pipelines.
- Flowise - No-code LangChain UI.
- LlamaIndex - Data framework for LLM apps.
- Trae Agent - Open-source, LLM-based agent for software engineering workflows that can run against local or self-hosted models.
- Qwen-Agent - Open-source agent framework built around the Qwen models, with tool use for secure, local, and scalable AI workflows.
- Crush - Privacy-first, open-source agentic coding and automation platform for local AI workflows.
- OpenCode AI - Open-source agentic coding platform for private, local, and secure AI-powered development workflows.
- PydanticAI - Python agent framework by the Pydantic team, model-agnostic with Ollama support for local deployment.
- sglang - Fast LLM serving framework with a structured-generation frontend language for building composable, local AI workflows.
- dspy - Framework for programming, rather than prompting, LLMs; composes declarative modules into optimizable pipelines.
- CUA - Enables AI agents to control full operating systems in virtual containers, deployable locally or in the cloud.
- Bytebot - Self-hosted desktop agent with its own full virtual desktop; unlike browser-only agents or traditional RPA tools, it can drive any application.
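
As an example of the orchestration pattern these frameworks share, a minimal LangChain chain against a locally served Ollama model. The model tag is an assumption and must be pulled locally first.

```python
# Sketch: prompt -> local model pipeline with LangChain and Ollama.
# Assumes the langchain-ollama package and a locally pulled model.
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="llama3")  # assumed local model tag
prompt = ChatPromptTemplate.from_messages(
    [("system", "Answer in one sentence."), ("user", "{question}")]
)
chain = prompt | llm  # LCEL: pipe the rendered prompt into the model

result = chain.invoke({"question": "What is retrieval-augmented generation?"})
print(result.content)
```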
## VS Code Plugins & Extensions

Privacy-first, open-source agentic coding plugins and extensions for VS Code and other editors.
- Roo Code - Open-source agentic coding assistant for VS Code that can plan, edit, and run tasks using local or self-hosted models.
- cline - Open-source autonomous coding agent for VS Code that can create and edit files and run commands, usable with local model endpoints.
## Privacy, Security & Governance

Keep AI deployments secure and compliant; a skeleton federated-learning client sketch follows the list.
- BlindAI - Confidential AI inference using trusted execution environments (TEEs).
- OpenFL - Federated learning framework.
- Flower - Federated learning at scale.
- Concrete - Fully homomorphic encryption for AI.
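
To illustrate the federated pattern, a skeleton Flower client: raw data stays on the device, and only weight updates travel to the aggregation server. The model is a placeholder array and the training logic is elided.

```python
# Sketch: skeleton Flower client; private data never leaves this machine.
import numpy as np
import flwr as fl

class PrivateClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return [np.zeros(10, dtype=np.float32)]  # placeholder model weights

    def fit(self, parameters, config):
        # train locally on private data here, then return updated weights
        return parameters, 1, {}

    def evaluate(self, parameters, config):
        return 0.0, 1, {}  # loss, number of examples, metrics

fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=PrivateClient())
```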
## Models for Private Deployment

Open-weight models and model libraries you can self-host; a sketch for fetching weights for offline use follows the list.
- Llama 3 - Meta’s open-weight language model family.
- Mistral 7B - Dense 7B parameter model.
- Qwen 3 - A wide variety of general and specialized models in both dense and Mixture-of-Experts variants.
- Kimi K2 - Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters.
- Phi-4 - Small, high-quality models from Microsoft.
- Mixtral - Mixture-of-experts model.
- Falcon - Open-source model from TII.
- Gemma 3 - Open-weight model family from Google.
- MLX Community - Community-driven Hugging Face page for open MLX models, optimized for Apple Silicon and private deployment.
- Bielik - Open-source project providing data, tools, and LLMs for the Polish-language AI ecosystem.
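
For air-gapped or offline deployments, the usual pattern is to fetch open weights once and serve them from the local cache. A minimal sketch, with an example (not prescribed) repo id:

```python
# Sketch: download open weights once for offline use (run on a connected
# machine, then move the cache to the air-gapped host).
from huggingface_hub import snapshot_download

local_path = snapshot_download("mistralai/Mistral-7B-Instruct-v0.2")  # example repo id
print(local_path)  # point your inference runtime at this directory
```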
## UI & Interaction Layers

Self-hosted chat and AI frontends.
- Chatbot UI - Open-source ChatGPT clone.
- LibreChat - Enhanced web UI for LLMs.
- AnythingLLM - Full-stack private LLM workspace.
- Open WebUI - Widely recommended web UI frontend featuring built-in search, web scraping, RAG, and optional user authentication.
## Datasets & Data Prep

Create and manage private training corpora; a streaming-load sketch follows the list.
- OpenWebText - Open-source reproduction of OpenAI's WebText corpus (GPT-2's training data).
- RedPajama - Open LLM training dataset.
- Datamixers - Privacy-focused data preprocessing tools.
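
A short sketch of pulling one of these corpora for local preprocessing. The repo id is an assumption (mirrors vary), and streaming avoids downloading the full dump up front:

```python
# Sketch: stream an open corpus locally for private preprocessing.
from datasets import load_dataset

# Repo id is an assumption; some corpora live under different mirrors
# or may require extra loading options.
ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:80])  # inspect a few records without a full download
    if i >= 2:
        break
```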
## Learning Resources & Research

Guides, papers, and tutorials on private AI.
#TODO
## AI Routers & API Aggregators

Centralized routers and proxy layers for aggregating, governing, and securing your private AI stack. These tools simplify connections to multiple model servers, optimize LLM routing, and provide observability, security, and compliance.
- Nexus - Open-source AI router that aggregates Model Context Protocol (MCP) servers, intelligently routes requests to the best LLM, and provides security, governance, observability, and a simplified architecture for private AI deployments.
## Contributing

Contributions welcome! See the contributing guidelines first.
## License

Released under the CC0-1.0 license. See LICENSE.