LLM Inference

Also known as: Inference, Model Serving

LLM inference is the moment a trained language model actually works: it receives a request and generates a response, token by token. Every chat answer is inference.

Training vs. inference

Training is the degree, inference is the day job. Training happens once per model version and can cost millions for large models. Inference happens with every single request, which for a busy service means millions of times a day. That is why inference drives the running costs of AI operations.

How costs are measured

Providers bill per token, split into input (the request) and output (the response). Output tokens usually cost significantly more per token than input tokens. Planning a budget means estimating both: a long answer weighs more heavily than an equally long prompt.
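
The budget estimate described above can be sketched in a few lines of Python. This is a minimal sketch, not any provider's billing logic: the class name, the function names, and the prices (1 and 4 USD per million tokens) are hypothetical placeholders chosen only to illustrate the input/output split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Hypothetical prices in USD per one million tokens (not real quotes)."""
    input_per_million: float
    output_per_million: float


def request_cost(prices: TokenPrices, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are billed separately."""
    total = (input_tokens * prices.input_per_million
             + output_tokens * prices.output_per_million)
    return total / 1_000_000


def monthly_cost(prices: TokenPrices, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Monthly estimate, assuming every request has the same token counts."""
    return request_cost(prices, input_tokens, output_tokens) * requests_per_day * days


# Assumed example prices: output tokens cost four times as much as input tokens.
prices = TokenPrices(input_per_million=1.0, output_per_million=4.0)

long_prompt = request_cost(prices, input_tokens=2000, output_tokens=200)
long_answer = request_cost(prices, input_tokens=200, output_tokens=2000)
print(f"long prompt: ${long_prompt:.4f}, long answer: ${long_answer:.4f}")
```

With these assumed prices, the request with the long answer costs roughly three times as much as the request with the long prompt, even though both move the same total number of tokens.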

What determines speed

Three factors matter most. First, model size: more parameters mean more computation per token. Second, hardware: GPUs are the standard, while specialised chips such as LPUs (Groq's Language Processing Units) are built to generate tokens faster. Third, batching: processing many requests in parallel raises total throughput but can delay individual responses.
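
The batching trade-off can be illustrated with a toy model. The sketch below assumes that each generation step has a fixed cost plus a small extra cost per request in the batch. The function names and both timing constants are invented for illustration and are not measurements of any real hardware or serving framework.

```python
def decode_step_ms(batch_size: int, fixed_ms: float = 20.0,
                   per_request_ms: float = 0.5) -> float:
    """Assumed time for one generation step that yields one token per request."""
    return fixed_ms + per_request_ms * batch_size


def throughput_tokens_per_s(batch_size: int, fixed_ms: float = 20.0,
                            per_request_ms: float = 0.5) -> float:
    """Tokens produced per second across all requests in the batch."""
    return batch_size * 1000.0 / decode_step_ms(batch_size, fixed_ms, per_request_ms)


for batch_size in (1, 8, 32):
    step = decode_step_ms(batch_size)
    total = throughput_tokens_per_s(batch_size)
    print(f"batch={batch_size:>2}  per-token wait={step:.1f} ms  "
          f"throughput={total:.0f} tokens/s")
```

Under these assumptions, larger batches increase the total number of tokens per second, while each individual user waits longer for every token. Real systems behave less linearly, so the numbers show only the direction of the trade-off.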

Cloud or self-hosted

Hyperscaler platforms such as AWS Bedrock, Azure AI Foundry, and Google Vertex AI host a wide range of models. Specialists such as Groq, Cerebras, Together AI, and Fireworks AI compete on speed and price. The alternative is self-hosted AI on company hardware: no per-token billing, but full operational responsibility.
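
A rough way to compare the two options is a break-even calculation. The following sketch assumes a flat monthly cost for self-hosting and a single blended per-token price for the cloud option. All names and figures are hypothetical, and the model ignores staff time, utilisation, and hardware depreciation, which are part of the operational responsibility mentioned above.

```python
def break_even_tokens_per_month(self_hosted_monthly_usd: float,
                                blended_price_per_million_usd: float) -> float:
    """Monthly token volume at which both options cost the same."""
    if blended_price_per_million_usd <= 0:
        raise ValueError("price per million tokens must be positive")
    return self_hosted_monthly_usd / blended_price_per_million_usd * 1_000_000


def cheaper_option(tokens_per_month: float, self_hosted_monthly_usd: float,
                   blended_price_per_million_usd: float) -> str:
    """Name the cheaper option under this simplified cost model."""
    api_cost = tokens_per_month / 1_000_000 * blended_price_per_million_usd
    return "self-hosted" if self_hosted_monthly_usd < api_cost else "per-token API"


# Assumed figures: 5,000 USD per month for own hardware and operations,
# 2 USD per million tokens as a blended cloud price.
threshold = break_even_tokens_per_month(5000.0, 2.0)
print(f"break-even at about {threshold:,.0f} tokens per month")
print(cheaper_option(1_000_000_000, 5000.0, 2.0))
print(cheaper_option(5_000_000_000, 5000.0, 2.0))
```

Below the break-even volume, per-token billing is cheaper in this model. Above it, the fixed cost of self-hosting starts to pay off.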

As of: June 2026