Building Your First Sovereign AI Pipeline: From Cloud to Owned Infrastructure
---
By Wingston Sharon | January 2025
When I started building AI pipelines for EU clients who needed data sovereignty, the options were either "use OpenAI and accept the compliance trade-offs" or "build something from scratch that would take six months." That gap has closed significantly. In 2025, running capable AI on your own infrastructure is genuinely achievable for organizations with modest technical capacity.
This guide covers the path from a cloud AI API dependency to a sovereign, locally-operated AI pipeline. I'll be direct about where the capability gaps still exist, because overselling local models is a disservice. But for a large class of use cases, local models are production-ready today.
Step 1: Inventory Your AI Use Cases
Before touching infrastructure, you need to understand what you're actually doing with AI. The right migration strategy depends entirely on which use cases need frontier capability and which don't.
Use cases that typically work well with local models (7B–70B parameter range):
- Document summarization and extraction
- Text classification and categorization
- Named entity recognition
- RAG (retrieval-augmented generation) pipelines where the retrieval handles most of the factual work
- Code completion and boilerplate generation
- Translation between major languages
- Sentiment analysis
- Email drafting from structured data
Use cases that currently benefit from frontier models:
- Complex multi-step reasoning chains
- Novel code architecture decisions
- Research synthesis across large, ambiguous corpora
- Nuanced legal and financial document analysis
- Tasks where the model needs to catch its own errors reliably
- Long-context tasks requiring coherence over 50k+ tokens
Go through your current AI API usage and categorize each use case. Most organizations will find that 60–80% of their AI calls fall into the "works well locally" category. The remaining 20–40% might still justify a frontier model API relationship, but a hybrid approach means that data stays at home for the majority of your processing.
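To make the inventory concrete, here is a minimal sketch of that categorization pass. It assumes you can export your API call history with a use-case label per call; the log format and category names are illustrative, not a real export format:

```python
# Sketch: tally existing API calls by use case to size the migration.
# Category names and log shape are illustrative assumptions.
from collections import Counter

LOCAL_FRIENDLY = {"summarization", "classification", "extraction",
                  "rag", "translation", "sentiment"}

def local_share(call_log: list[dict]) -> float:
    """Fraction of calls whose use case runs well on local models."""
    counts = Counter(call["use_case"] for call in call_log)
    local = sum(n for uc, n in counts.items() if uc in LOCAL_FRIENDLY)
    total = sum(counts.values())
    return local / total if total else 0.0

# Toy log: 70 summarization calls, 30 complex-reasoning calls
log = ([{"use_case": "summarization"}] * 70
       + [{"use_case": "complex_reasoning"}] * 30)
print(f"{local_share(log):.0%} of calls could move local")  # → 70%
```

The output of this pass is the number you bring to Step 6: it tells you how much of your traffic a hybrid architecture can keep in-house.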
Step 2: Model Selection
The open-source model landscape has matured significantly. Here are the models I currently recommend for sovereign deployments:
General text generation and instruction following:
| Model | Parameters | VRAM Required | Capability Level | Best For |
|---|---|---|---|---|
| Mistral 7B Instruct | 7B | 8GB | Good | High-throughput, cost-sensitive |
| Mixtral 8x7B Instruct | 47B (MoE) | 24–48GB | Very good | Complex reasoning, production quality |
| Llama 3.1 8B Instruct | 8B | 8–10GB | Good | General use, well-documented |
| Llama 3.1 70B Instruct | 70B | 40–80GB | Excellent | Near-frontier for many tasks |
| Mistral Small | 22B | 16–24GB | Very good | Balance of quality and resource use |
Embeddings (essential for RAG pipelines):
- nomic-embed-text – 768 dimensions, fast, good quality, runs on CPU
- mxbai-embed-large – 1024 dimensions, higher quality for retrieval tasks
- all-minilm – small and fast for high-throughput scenarios
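As a quick sketch of putting one of these to work, here is how you might fetch embeddings from Ollama's native embeddings endpoint and compare two texts. It assumes the server is running on the default port and `nomic-embed-text` has been pulled; Ollama also exposes an OpenAI-compatible embeddings route if you prefer the SDK:

```python
# Sketch: embeddings from a locally running Ollama server, plus the
# cosine-similarity scoring a RAG pipeline would use for retrieval.
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector from the local Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the usual ranking metric for retrieval."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# With the server running, rank chunks against a query, e.g.:
# score = cosine(embed("payment deadline"), embed("invoice due date"))
```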
Code-specific models:
- deepseek-coder-v2 – strong performance on code generation and completion
- codestral (Mistral) – fast and capable for code tasks
Recommendation for most EU organizations starting out: Begin with Mistral 7B or Llama 3.1 8B on Ollama for development and low-stakes production. Upgrade to Mixtral 8x7B or Llama 3.1 70B as you identify use cases that need more capability.
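Getting that starting set onto a machine is a handful of pull commands. The tags below are the ones the Ollama registry uses at the time of writing; check the registry if a pull fails:

```shell
# Starting models for development and low-stakes production
ollama pull mistral            # Mistral 7B Instruct
ollama pull llama3.1:8b        # Llama 3.1 8B Instruct
ollama pull nomic-embed-text   # embeddings for RAG

# When a use case demands more capability
ollama pull mixtral:8x7b       # Mixtral 8x7B Instruct
```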
Step 3: Serving Infrastructure
Single-node deployment (most organizations should start here):
Ollama is the easiest path to running local models. It handles model download, quantization selection, GPU offloading, and provides an OpenAI-compatible API endpoint.
# Install Ollama on Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and serve a model
ollama pull mistral
ollama serve # Starts API server on localhost:11434
Ollama works on CPU-only machines (slower, but viable for low-throughput use cases), NVIDIA GPU, and Apple Silicon. It handles quantization automatically: by default it will select a 4-bit quantized version of the model that fits in your available VRAM.
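You can verify what Ollama actually selected with its own CLI before you start benchmarking:

```shell
ollama list            # models on disk, with file sizes
ollama show mistral    # architecture, parameter count, quantization
ollama ps              # loaded models and how they split across CPU/GPU
```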
Multi-GPU or high-throughput deployment:
vLLM is the production-grade option when you need throughput and have NVIDIA GPU infrastructure.
pip install vllm
# Serve Mistral 7B with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --host 0.0.0.0 \
    --port 8000
vLLM supports continuous batching, which dramatically improves GPU utilization for concurrent requests. Use this when you have >10 concurrent users or are processing large document queues.
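A quick smoke test against the running server; the request body is the standard OpenAI chat-completions payload:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```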
Apple Silicon (M-series Macs):
On Apple Silicon, Ollama uses the GPU through Apple's Metal API, giving you accelerated inference without NVIDIA hardware. A MacBook Pro M3 Max can run Mistral 7B at production-usable speeds. A Mac Studio with 192GB of unified memory can run 70B models.
This is a legitimate option for EU organizations that need sovereignty without standing up dedicated GPU servers: you can run inference on an Apple Silicon Mac mini in your own data centre for around €1,200.
EU cloud infrastructure options (if you prefer managed hosting over self-hosted):
If you need cloud flexibility without US sovereignty concerns, GPU compute is available from EU-domiciled providers:
- Hetzner (Germany) โ GPU servers available, good price point
- OVHcloud (France) โ A100 instances, enterprise SLAs
- Exoscale (Switzerland) โ GPU compute with Swiss jurisdiction
Step 4: Integration – The Drop-In Replacement
This is where the practical value of Ollama and vLLM becomes clear. Both expose an OpenAI-compatible API, which means you can repoint existing OpenAI API calls at your own server by changing the base URL.
Before (OpenAI):
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize this document: ..."}
    ]
)

print(response.choices[0].message.content)
After (Ollama, drop-in replacement):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK but not validated
)

response = client.chat.completions.create(
    model="mistral",  # Local model name
    messages=[
        {"role": "user", "content": "Summarize this document: ..."}
    ]
)

print(response.choices[0].message.content)
The diff is two lines: base_url and api_key. Your existing application code, error handling, and response parsing all continue to work.
For applications using LangChain, LlamaIndex, or other orchestration frameworks, the swap is similarly straightforward: these frameworks all support Ollama and vLLM as backends.
Environment-based configuration (recommended pattern):
import os
from openai import OpenAI
# Set AI_BASE_URL=http://localhost:11434/v1 for local
# Set AI_BASE_URL=https://api.openai.com/v1 for cloud fallback
client = OpenAI(
    base_url=os.getenv("AI_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.getenv("AI_API_KEY", ""),
)
model = os.getenv("AI_MODEL", "gpt-4o-mini")
This pattern lets you switch between local and cloud via environment variables, which is useful during migration and for maintaining a cloud fallback for specific use cases.
Step 5: Evaluating Quality
Do not assume local models are good enough for your use cases โ evaluate them against your actual tasks.
A practical A/B testing approach:
- Take 50–100 representative examples of your actual production inputs
- Run them through both your current cloud model and your candidate local model
- Score outputs against your quality criteria (manually for subjective tasks, automatically for structured outputs)
- Calculate the percentage of outputs where local model quality is acceptable
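The loop above fits in a few lines. In this sketch, `call_cloud`, `call_local`, and `acceptable` are stand-ins for your real API calls and your own quality criterion:

```python
# Sketch of the A/B evaluation loop. The callables are placeholders:
# wire in your real cloud call, local call, and quality check.
from typing import Callable

def ab_eval(examples: list[str],
            call_cloud: Callable[[str], str],
            call_local: Callable[[str], str],
            acceptable: Callable[[str, str], bool]) -> float:
    """Share of examples where the local output is acceptable
    relative to the cloud output."""
    ok = sum(acceptable(call_cloud(x), call_local(x)) for x in examples)
    return ok / len(examples)

# Toy run with stand-in functions that always agree:
rate = ab_eval(
    ["doc1", "doc2", "doc3", "doc4"],
    call_cloud=lambda x: x.upper(),
    call_local=lambda x: x.upper(),
    acceptable=lambda cloud, local: cloud == local,
)
print(f"{rate:.0%} acceptable")  # → 100% acceptable
```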
For structured extraction tasks (pulling specific fields from documents), local models often match cloud model accuracy to within a few percentage points. For open-ended creative or complex reasoning tasks, the gap can be larger.
Things to specifically test:
- Instruction following – does the model reliably follow your output format instructions (JSON, structured lists, specific response formats)?
- Refusals – some smaller open-source models over-refuse legitimate requests. Test with your actual content.
- Consistency – run the same prompt 5 times. Does the model give consistent answers? Frontier models tend to be more consistent on ambiguous tasks.
- Edge cases – test with the hardest 10% of your real inputs, not just the typical cases
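The consistency check in particular is easy to automate. A minimal sketch, with `generate` standing in for a real model call at a non-zero temperature:

```python
# Sketch: run the same prompt several times and measure answer spread.
# `generate` is a placeholder for your real model call.
from collections import Counter
from typing import Callable

def consistency(generate: Callable[[str], str],
                prompt: str, runs: int = 5) -> float:
    """Fraction of runs agreeing with the most common answer."""
    answers = Counter(generate(prompt) for _ in range(runs))
    return answers.most_common(1)[0][1] / runs

# A deterministic stand-in always agrees with itself:
print(consistency(lambda p: "yes", "Is the invoice overdue?"))  # → 1.0
```

For exact-match comparison to be meaningful, normalize outputs first (strip whitespace, lower-case, parse JSON) so that trivial formatting differences don't count as disagreement.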
Document your evaluation results. You will need them to justify the migration to stakeholders, and they will serve as your baseline for monitoring quality over time.
Step 6: The Hybrid Architecture
Running everything locally is not always the right answer. The architecture I most commonly recommend to EU clients is a hybrid:
Local processing for:
- High-volume, routine tasks (classification, extraction, summarization)
- Tasks involving personal data that requires EU-sovereign handling
- Low-latency applications where you want deterministic infrastructure costs
Cloud models for:
- Highest-stakes, complex reasoning where quality is paramount
- Novel tasks you haven't evaluated local models against yet
- Cases where your evaluation shows a material quality gap
In a hybrid architecture, you can route requests based on a simple classification: task type, data sensitivity, or quality requirements. The routing can be plain rule-based code rather than another AI call.
def select_model(task_type: str, contains_pii: bool) -> tuple[str, str]:
    """Returns (base_url, model_name) for the given task."""
    if contains_pii or task_type in ("classification", "extraction", "summarization"):
        return ("http://localhost:11434/v1", "mixtral")
    else:
        return ("https://api.openai.com/v1", "gpt-4o")
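The one invariant worth pinning down in an automated test is that data flagged as PII never leaves your infrastructure. A minimal standalone check (restating the routing function so the snippet runs on its own):

```python
# Sanity check of the routing rules: anything flagged as PII must
# resolve to the local endpoint, regardless of task type.
def select_model(task_type: str, contains_pii: bool) -> tuple[str, str]:
    """Returns (base_url, model_name) for the given task."""
    if contains_pii or task_type in ("classification", "extraction", "summarization"):
        return ("http://localhost:11434/v1", "mixtral")
    else:
        return ("https://api.openai.com/v1", "gpt-4o")

# PII never leaves local infrastructure
for task in ("classification", "reasoning", "code_review"):
    base_url, _ = select_model(task, contains_pii=True)
    assert base_url.startswith("http://localhost"), task

print(select_model("reasoning", contains_pii=False))
# → ('https://api.openai.com/v1', 'gpt-4o')
```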
What to Expect Going In
I want to be honest about the current state of things, because I've seen organizations get burned by overpromising on local AI.
The capability gap is real. Llama 3.1 70B is excellent. It is not GPT-4o. For tasks that genuinely require frontier reasoning (complex code architecture, nuanced multi-document analysis, catching subtle contradictions across long texts), frontier models still have an edge. That edge is narrowing with each model release, but it exists today.
GPU hardware is a real cost. A single NVIDIA A100 for running a 70B model costs €2,000–3,000/month from an EU cloud provider, or €10,000+ to purchase. The economics make sense for organizations with high AI API spend (say, >€3,000/month on cloud APIs) but not for organizations just getting started.
Maintenance is real work. Self-hosted infrastructure requires someone to monitor it, update models, handle hardware failures, and manage GPU drivers. Cloud APIs have none of this overhead.
Start with Ollama on a machine you already have. Evaluate your use cases against real workloads. Make the infrastructure investment only when you have evidence that it makes sense for your specific situation.
If you want to discuss what a sovereign AI pipeline would look like for your organization's specific use cases, reach out at hello@agentosaurus.com.