Self-hosted AI
Also known as: Self-hosting, local LLMs, on-premise AI
Self-hosted AI means running language models on a company’s own infrastructure instead of in a vendor’s cloud, so requests and data never leave the organisation. The foundation is usually an open-source LLM.
Technical requirements
Graphics memory (VRAM) is the deciding factor. Small models with 7–8 billion parameters run in compressed (quantised) form from around 8 GB of VRAM, which a decent consumer GPU provides. The 70B class needs roughly 40–48 GB compressed, for example two 24 GB cards. Uncompressed at 16-bit precision, its weights alone take around 140 GB (70 billion parameters at 2 bytes each), which calls for data centre hardware.
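The figures above follow from simple arithmetic, which can be sketched as below. This is a rough estimate, not a sizing tool: the function names are made up for illustration, and the 20% overhead for context cache and runtime buffers is an assumption that varies with context length and serving software.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    One billion parameters at 8 bits per weight occupy 1 GB.
    """
    return params_billion * bits_per_weight / 8


def estimated_vram_gb(
    params_billion: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumed headroom, not a measured value
) -> float:
    """Weights plus an assumed share for context cache and runtime buffers."""
    return weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)


if __name__ == "__main__":
    # 70B uncompressed at 16-bit precision: weights only
    print(weight_memory_gb(70, 16))
    # 70B compressed to 4 bits per weight, with assumed overhead
    print(round(estimated_vram_gb(70, 4), 1))
    # 8B compressed to 4 bits per weight, with assumed overhead
    print(round(estimated_vram_gb(8, 4), 1))
```

With these assumptions, a 70B model at 4 bits lands inside the 40–48 GB range quoted above, and the 16-bit case reproduces the 140 GB figure.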
Common tools
Three open-source tools dominate: Ollama for an easy start on single machines, vLLM for high-performance production serving, and Open WebUI as a browser-based chat interface. All three are actively maintained.
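As an illustration of how an application might talk to a locally running Ollama server, here is a minimal sketch using only the Python standard library. It assumes Ollama's default address (http://localhost:11434) and its /api/generate endpoint; the model tag "llama3.1:8b" is only an example and must already be pulled on the machine. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address (assumed unchanged)


def build_generate_request(
    prompt: str,
    model: str = "llama3.1:8b",  # example tag; use whichever model is installed
    base_url: str = OLLAMA_URL,
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_seconds: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False, the full answer arrives in the "response" field.
    return body["response"]
```

Calling ask("Summarise this contract clause: ...") sends the text only to the local machine, which is the point of self-hosting. vLLM offers an OpenAI-compatible HTTP interface instead, so a production setup would typically use a different endpoint and payload shape.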
Benefits and drawbacks
The benefits: full data sovereignty, no per-token API costs for LLM inference, and the freedom to adapt models, up to and including fine-tuning. The drawbacks: hardware investment, operational responsibility (updates, security, monitoring), and model upgrades that have to be handled in-house.
Who it makes sense for
Self-hosting pays off with sensitive data, high request volumes, or strict compliance requirements — such as data residency in the EU. At low volumes, cloud inference is usually cheaper and faster to launch.
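The volume argument can be made concrete with a simple break-even calculation. The sketch below is illustrative only: the function names are made up, and every price, hardware cost, and amortisation period passed in is a placeholder to be replaced with real quotes, not market data.

```python
def monthly_cloud_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Cloud inference cost: pay per token processed."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_self_hosted_cost(
    hardware_cost: float,
    amortisation_months: int,
    monthly_operations: float,
) -> float:
    """Self-hosting cost: hardware spread over its lifetime plus running costs.

    monthly_operations bundles power, hosting, and staff time (an assumed lump sum).
    """
    return hardware_cost / amortisation_months + monthly_operations


def break_even_tokens_per_month(
    hardware_cost: float,
    amortisation_months: int,
    monthly_operations: float,
    price_per_million_tokens: float,
) -> float:
    """Monthly token volume at which both options cost the same."""
    fixed = monthly_self_hosted_cost(hardware_cost, amortisation_months, monthly_operations)
    return fixed / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 36,000 for hardware over 36 months, 1,000 per month
    # for operations, and 2.00 per million tokens from a cloud API.
    volume = break_even_tokens_per_month(36_000, 36, 1_000, 2.0)
    print(f"Break-even at about {volume:,.0f} tokens per month")
```

Below the break-even volume the cloud option is cheaper under these assumptions; above it, self-hosting is. The model ignores non-cost drivers such as compliance, which can justify self-hosting at any volume.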
As of: June 2026