Ollama in the Enterprise: Local AI for Data That Never Leaves the Building
Ollama setup for SMBs: hardware selection, model comparison (Llama 3.3 vs. Mistral Large vs. GPT-OSS), installation, your first agent. When the local route really pays off -- and when it doesn't.
Local AI is the most expensive way to run an AI agent. It is also the only one where you can answer honestly in two sentences when a client asks: "Where does my data go?"
"It doesn't go anywhere. It sits in our server room, on a machine with no internet uplink."
That answer is gold for law firms, medical practices, tax advisors, accounting firms, government agencies -- and for any business whose clients explicitly demand "no US providers, no EU cloud." For everyone else, Ollama is often overkill. In 2026, Azure OpenAI is the better default for 70 percent of SMBs.
If you followed the pillar guide to the third matrix question and answered "yes, sensitive enough for local," this article is for you.
What Ollama Actually Is
Ollama is open-source software that makes language models runnable locally on your own server. You install Ollama, Ollama downloads the model of your choice, and you can access your own "mini-OpenAI" through an API.
What Ollama is not: a model. The models (Llama 3.3, Mistral Large, GPT-OSS, Gemma 3) come from Meta, Mistral, OpenAI, and Google. Ollama is just the runtime that gets them ready to use.
The practical reality in 2026: local models are no longer far behind cloud models, but they are still behind. Llama 3.3 70B is roughly at GPT-4-Turbo level for standard tasks, somewhat weaker on truly complex reasoning. For 90 percent of agent use cases, that is enough.
Hardware: What You Actually Need
Hardware selection is the cost trap. The problem isn't "too small" -- it's "too small for the model you actually need." Then you're sitting on EUR 4,000 of hardware that won't load Llama 3.3 70B.
Rule of thumb: you need roughly as much VRAM as the model file is large, plus headroom for the context window. An 8B model runs in 8 GB of VRAM. A 70B model in its 4-bit quantized variant needs around 40 GB. You don't need a 405B model -- that's research territory, not SMB.
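If you want to sanity-check a hardware quote before signing off, the rule of thumb is easy to turn into numbers. A rough sketch; the 20 percent overhead factor for context and runtime buffers is an assumption, not a fixed constant:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% headroom for context and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8  # 8 billion params at 8-bit would be ~8 GB
    return weights_gb * overhead

print(estimate_vram_gb(8))   # ~4.8 GB -> fits comfortably on an 8 GB card
print(estimate_vram_gb(70))  # ~42 GB  -> matches the ~40 GB row in the table below
```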
| Model Size | VRAM Required | Hardware Suggestion | Investment | For What |
|---|---|---|---|---|
| 8B (e.g., Llama 3.1 8B) | 8 GB | RTX 4060 / Mac M2/M3 | ~EUR 1,500 | Classification, simple tasks |
| 70B (e.g., Llama 3.3 70B) | 40 GB (4-bit quantized) | RTX 4090 + 64 GB RAM, Mac M3 Max 64 GB | EUR 3-5k | Standard agents, RAG, mid-complexity tasks |
| 120B (GPT-OSS-120B) | ~80 GB | 2x RTX 4090 or server with A100/H100 | EUR 10-25k | Complex reasoning tasks |
Standard setup for 80 percent of SMBs: RTX 4090 with 24 GB VRAM in a workstation chassis. One-time cost ~EUR 4,000. Llama 3.3 70B runs quantized, with part of the model offloaded to system RAM, which costs some speed. Pays off versus cloud once monthly usage passes EUR 200.
Important: NVIDIA, not AMD. AMD GPUs are technically supported by Ollama, but in 2026 the ROCm setup is still significantly more finicky than CUDA. If you aren't running it yourself, choose NVIDIA.
Installation in 15 Minutes
I'm assuming a Linux server (Ubuntu 24.04) with NVIDIA drivers and CUDA already installed. If you're installing on macOS with Apple Silicon, it's even easier -- Ollama ships with Metal support out of the box.
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
The script installs the binary, creates a systemd service, and starts Ollama on port 11434. Thirty seconds later, it's running.
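If you'd rather verify from code than from the shell, a quick request against the local port confirms the service is up before you pull anything. A minimal sketch in Python, assuming the `requests` package is available:

```python
import requests

# The systemd service listens on port 11434; /api/version answers without any model loaded
resp = requests.get("http://localhost:11434/api/version", timeout=5)
print(resp.json())  # e.g. {"version": "..."}
```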
Pull Your First Model
ollama pull llama3.3:70b
# or smaller for a first test:
ollama pull llama3.1:8b
The download is about 4 GB for the 8B variant and 40 GB for the 70B. Plan your bandwidth accordingly.
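Once the pull has finished, you can check what actually landed on disk. Ollama's local API lists every installed model with its size in bytes; a short sketch, again assuming `requests`:

```python
import requests

# /api/tags lists all locally installed models
models = requests.get("http://localhost:11434/api/tags", timeout=5).json()["models"]
for m in models:
    print(f'{m["name"]:<24} {m["size"] / 1e9:.1f} GB')
```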
First Test
ollama run llama3.1:8b "Reply with OK"
If you get a response back, the setup is working. The first response is slower than subsequent ones because the model has to be loaded into VRAM.
API Access: Ollama Speaks OpenAI
The nice thing about Ollama: it offers an OpenAI-compatible API. So if you already have code that talks to the OpenAI API, you only need to change the base URL and swap the model name.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3:8b",
"messages": [
{"role": "user", "content": "Reply with OK"}
]
}'
That's exactly the same structure as OpenAI -- just without the Authorization header (local, no key needed) and with a different model name. Tools like n8n, LangChain, and LlamaIndex support Ollama as a drop-in.
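The same swap with the official `openai` Python package is a minimal sketch: only the `base_url` and the model name change, the rest of your client code stays untouched. The API key is a throwaway value because Ollama ignores it, but the library insists on one:

```python
from openai import OpenAI

# Standard OpenAI client, pointed at the local Ollama endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Reply with OK"}],
)
print(response.choices[0].message.content)
```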
Which Model for Which Purpose
Here's an honest assessment from our own testing, as of April 2026:
Llama 3.3 70B -- the standard. Good at following instructions, solid English and German, average on complex reasoning. The right choice for 80 percent of agent tasks.
Mistral Large 2 -- somewhat better at code generation and structured output (JSON). If your agent has to return JSON frequently, the switch is worth it.
GPT-OSS-120B -- OpenAI's open-weights variant. Stronger at reasoning. The catch: it needs twice as much hardware. Only worth it for genuinely complex tasks.
Gemma 3 27B -- Google's model. Very strong on multimodal tasks (reading images). If you're doing document analysis, a good secondary candidate.
Qwen 2.5 72B -- Alibaba's model, better in Chinese, but solid in English and German too. Stronger on math-heavy tasks. Niche, but capable.
Test two models side by side against your real tasks. Benchmarks tell you little. Only your use cases decide.
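A side-by-side test needs no tooling: send the same prompt to both candidates and read the answers next to each other. A minimal sketch; the prompt is a placeholder for one of your real tasks, and the model tags are examples that have to be pulled first:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
prompt = "Summarize the following contract clause in two sentences: ..."  # replace with a real task

for model in ["llama3.3:70b", "mistral-large"]:  # example tags -- pull them first with `ollama pull <tag>`
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---\n{reply.choices[0].message.content}\n")
```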
Performance Reality Check
The uncomfortable truth: local models are slower than cloud models. Not in model quality, but in throughput. While GPT-4o on an Azure server delivers roughly 100 tokens per second, Llama 3.3 70B on an RTX 4090 runs at about 15 to 25 tokens per second.
What that means in practice: a response that takes two seconds in the cloud takes eight to twelve seconds locally. For asynchronous agents (email triage, document classification) that's irrelevant. For real-time chat with customers, it's noticeable.
If speed is critical, you either need better hardware (multiple GPUs in parallel, inference-optimized cards like A100/H100) or you combine local and cloud: local for sensitive data, cloud for non-critical fast responses.
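Rather than taking any published tokens-per-second figure at face value, you can measure your own box. Ollama's native generate endpoint reports how many tokens it produced and how long that took; a small sketch, assuming `requests`:

```python
import requests

# Non-streaming /api/generate returns eval_count (generated tokens) and eval_duration (nanoseconds)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.3:70b", "prompt": "Explain in three sentences what a DPIA is.", "stream": False},
    timeout=600,
).json()

tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f'{resp["eval_count"]} tokens in {resp["eval_duration"] / 1e9:.1f} s -> {tok_per_s:.1f} tokens/s')
```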
What the GDPR Documentation Says (and Why It's Shorter)
The big advantage of Ollama for your DPIA: the data processor section disappears. There is no external processor. You process the data yourself, on your own premises, on your own infrastructure.
Data processor: None. Processing takes place entirely on your own premises and on your own hardware.
Model supplier: Meta (Llama) / Mistral AI / OpenAI (GPT-OSS) / Google (Gemma) -- but only for one-time provision of the model, with no ongoing data flow.
Data location: [Your address], server room [designation]. Network with no internet uplink.
Legal basis: Identical to your other IT data processing -- Art. 6(1)(b) or (f) GDPR.
Model training: Does not occur. Models are used for inference only and are not retrained.
What you have to document more than with cloud: the technical and organizational measures (TOMs) for your server. Who has physical access? How is patching handled? What are the backups like? That's standard IT security, applies to any server, isn't AI-specific -- but it has to be in the DPIA.
When You Should NOT Go Local
Three situations where we at kiba advise against local:
When nobody on your team knows Linux or server operations. Ollama isn't "click and it runs." It runs at first. Six months later, when a driver update breaks something or the server has to reboot, you need someone who knows what they're doing. If that's nobody and you have no external IT partner, cloud is the more practical path despite the weaker GDPR position.
When your volume is below 100,000 tokens per day. At 2026 token prices, that's less than EUR 50 per month in the cloud. A EUR 4,000 hardware investment never pays off there. Cloud is cheaper when volume is small.
When you need maximum model quality. Local models in 2026 are good, but GPT-4o, Claude Opus 4, and Gemini 2.5 Pro are still a step ahead. If your use case demands the very best reasoning -- complex contract analysis, multi-step planning tasks -- local works, but you're giving up quality.
Your First Local Agent
Once the setup is in place, building your first agent is no different than in the cloud. You use n8n, LangChain, or your own script, point it at http://localhost:11434/v1 instead of OpenAI, and the workflow is identical.
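To make that concrete, here is a minimal sketch of an asynchronous classification step of the kind mentioned above (email triage); the categories and the sample email are illustrative, not a finished blueprint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # the only line that differs from a cloud setup

CATEGORIES = ["invoice", "support request", "contract question", "spam"]  # illustrative labels

def classify_email(body: str) -> str:
    """Ask the local model to sort an incoming email into exactly one category."""
    result = client.chat.completions.create(
        model="llama3.3:70b",
        messages=[
            {"role": "system", "content": "Classify the email into exactly one of: "
                                          + ", ".join(CATEGORIES) + ". Reply with the category only."},
            {"role": "user", "content": body},
        ],
    )
    return result.choices[0].message.content.strip()

print(classify_email("Hello, attached you'll find the invoice for the March delivery."))
```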
We'll lay out the concrete blueprint for a first agent -- with trigger, steps, outputs -- in the next article in the series. There we use n8n as the orchestrator and can plug in either local or cloud as the backend.
When the Hardware Choice Isn't Clear
The hardware decision is the most expensive single decision in a local setup. The wrong GPU means either paying for capacity you never use or a model that won't fit. Through our AI consulting we help with concrete hardware recommendations, build the setup, and hand over a fully documented local AI system.
Contact: info@kiba.berlin.
Part 4 of the series. Back to Azure OpenAI | Next: Your first n8n agent
This article is part of our comprehensive guide: AI for SMEs — The Complete Guide for Medium-Sized Businesses