Running Ollama on Apple Silicon in Production: Lessons from 6 Months
---
By Wingston Sharon | January 2025
Six months ago I bought two M2 Mac Minis, set them up as Ollama inference nodes, connected them to our Tailscale mesh, and pointed our Agentosaurus pipeline at them. At that point it felt like a bet: Mac Minis as production infrastructure seemed slightly absurd. Now we've served approximately 500,000 inference requests off them, and I have opinions.
This is what I've learned.
Why Apple Silicon for Inference
The fundamental advantage is the unified memory architecture. On an M2 Mac Mini with 16GB of RAM, Ollama can run a 13B parameter model because the GPU and CPU share the same memory pool. A discrete GPU with 16GB of VRAM can also hold a 13B model, but that VRAM is walled off from the rest of the system, which still needs its own RAM. On the Mac Mini, the single 16GB pool serves the model, the OS, and everything else. That sharing cuts both ways: memory pressure is real, but there is no hard VRAM partition, so the usable ceiling is higher than it looks on paper.
For our use case, running llama3.1:8b and nomic-embed-text on an always-on inference server, this was perfect. The 8B model fits comfortably in 8GB, leaving headroom for the OS and Ollama's overhead.
The other advantage: the M2 has genuine GPU acceleration via Metal. Ollama's llama.cpp backend runs inference on the Apple GPU through the Metal API (the Neural Engine isn't used for this). Tokens per second on an M2 Mini running an 8B model: roughly 35-45 tok/s for generation. That's not GPU-server fast, but it's fast enough for our organization analysis tasks, which don't need interactive response times.
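At those speeds, a back-of-the-envelope capacity check shows why this is fine for non-interactive analysis. Both numbers below are illustrative assumptions, not measured figures from our logs:

```python
# Rough single-node throughput at the generation speeds above.
TOKENS_PER_REQUEST = 500   # assumed average generation length, not measured
TOKENS_PER_SECOND = 40     # mid-range of the 35-45 tok/s figure

seconds_per_request = TOKENS_PER_REQUEST / TOKENS_PER_SECOND
requests_per_hour = 3600 / seconds_per_request

print(f"{seconds_per_request:.1f}s per request, ~{requests_per_hour:.0f} requests/hour")
```

Roughly 300 requests an hour per node is nowhere near GPU-server territory, but it comfortably covers a steady-state analysis workload.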
Initial Setup
Install Ollama on macOS:
brew install ollama
# or download from ollama.com
Pull the models you need:
ollama pull llama3.1:8b
ollama pull nomic-embed-text
Start the server:
OLLAMA_HOST=0.0.0.0 ollama serve
The OLLAMA_HOST=0.0.0.0 binding is important: by default Ollama only listens on localhost, which means it's unreachable from other nodes on the Tailscale mesh.
Verify with a quick API call from another machine:
curl http://mac-mini-amsterdam.tail1234.ts.net:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "Say hello", "stream": false}'
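From Python, the same endpoint wraps in a few lines of standard library. A sketch, assuming the example hostname above and non-streaming responses:

```python
import json
import urllib.request

# Example Tailscale hostname from above; substitute your own node.
OLLAMA_URL = "http://mac-mini-amsterdam.tail1234.ts.net:11434"

def build_generate_payload(prompt, model="llama3.1:8b", stream=False):
    """Body for /api/generate; stream=False returns a single JSON object."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt, model="llama3.1:8b", url=OLLAMA_URL, timeout=120):
    """POST to /api/generate and return the generated text."""
    req = urllib.request.Request(
        f"{url}/api/generate",
        data=json.dumps(build_generate_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```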
Auto-Start with launchd
On macOS, the equivalent of systemd is launchd. Create a plist at ~/Library/LaunchAgents/com.ollama.server.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.server</string>
<key>ProgramArguments</key>
<array>
<!-- Homebrew on Apple Silicon installs to /opt/homebrew/bin;
     the Ollama.app installer symlinks /usr/local/bin/ollama instead -->
<string>/opt/homebrew/bin/ollama</string>
<string>serve</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0</string>
<key>OLLAMA_KEEP_ALIVE</key>
<string>24h</string>
<key>OLLAMA_MAX_LOADED_MODELS</key>
<string>2</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ollama.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ollama-error.log</string>
</dict>
</plist>
Load and start it:
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist
launchctl start com.ollama.server
The OLLAMA_KEEP_ALIVE=24h environment variable is critical. By default, Ollama unloads models from memory after 5 minutes of inactivity. With our workload pattern of burst crawl jobs followed by quiet periods, models would get evicted, and the first request after a quiet period would trigger a cold load. On an M2 Mini, loading llama3.1:8b takes 8-12 seconds. That's tolerable for interactive use but unacceptable when a Celery task is waiting. KEEP_ALIVE=24h keeps the model hot.
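KEEP_ALIVE prevents eviction but doesn't help with the very first load after a reboot, so a deploy-time warm-up is worth pairing with it. A minimal sketch, where the node and model lists are illustrative; per the Ollama API docs, a generate request with no prompt simply loads the model:

```python
import json
import time
import urllib.request

NODES = ["mac-mini-amsterdam.tail1234.ts.net"]   # illustrative node list
MODELS = ["llama3.1:8b", "nomic-embed-text"]

def build_warmup_payload(model, keep_alive="24h"):
    """A generate request with no prompt loads the model without generating."""
    return {"model": model, "keep_alive": keep_alive}

def warm(host, model, port=11434, timeout=120):
    """Send the warm-up request; returns the observed load time in seconds."""
    req = urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(build_warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    urllib.request.urlopen(req, timeout=timeout).read()
    return time.monotonic() - start
```

Looping warm() over every (node, model) pair at the end of a deploy means the 8-12 second cold load happens once, on your schedule, instead of inside a Celery task.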
Health Check Script
We run this health check from our monitoring system every 5 minutes:
#!/bin/bash
# /usr/local/bin/ollama-health-check.sh

OLLAMA_HOST="${1:-localhost}"
OLLAMA_PORT="${2:-11434}"

# Check if Ollama is responding
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "http://${OLLAMA_HOST}:${OLLAMA_PORT}/api/tags" \
  --max-time 5)

if [ "$HTTP_STATUS" != "200" ]; then
  echo "CRITICAL: Ollama not responding (HTTP $HTTP_STATUS)"
  exit 2
fi

# Check that the required model is available on this node.
# Note: /api/tags lists models pulled to disk; /api/ps shows what's loaded in memory.
TAGS=$(curl -s "http://${OLLAMA_HOST}:${OLLAMA_PORT}/api/tags")
LLAMA_PRESENT=$(echo "$TAGS" | python3 -c \
  "import sys, json; models = [m['name'] for m in json.load(sys.stdin)['models']]; \
print('OK' if any('llama3.1:8b' in m for m in models) else 'MISSING')")

if [ "$LLAMA_PRESENT" != "OK" ]; then
  echo "WARNING: llama3.1:8b not available on this node"
  exit 1
fi

echo "OK: Ollama healthy, required model available"
exit 0
Usage:
./ollama-health-check.sh mac-mini-amsterdam.tail1234.ts.net 11434
The Batching Problem
This is the biggest operational limitation of Ollama on Apple Silicon: it doesn't do real batching.
Out of the box, Ollama effectively processes one request at a time per model. If two Celery tasks hit the same Ollama node simultaneously, the second request queues. Recent releases can serve a few requests in parallel via OLLAMA_NUM_PARALLEL (at the cost of splitting the context window across slots), but there's no continuous batching of prompts through the same forward pass.
On a GPU server running vLLM or TGI, you can batch 16+ requests through the model simultaneously because the matrix multiplications parallelize across CUDA cores. Apple's MPS doesn't expose the same kind of deep batch control.
In practice this means: for high-throughput batch jobs (crawling 200 organizations and scoring each one), we end up serializing through the inference nodes. Our Celery configuration accounts for this:
# celery_config.py
CELERY_TASK_ROUTES = {
'agentosaurus.tasks.score_organization': {
'queue': 'inference',
'rate_limit': '10/m', # Max 10 inference tasks per minute per worker
},
}
# One inference worker per node to avoid queuing overhead
CELERY_WORKER_CONCURRENCY = 1 # For inference workers
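Outside Celery, the same one-in-flight-per-node discipline is just a lock per host. A stripped-down sketch of the idea, with hypothetical node names:

```python
import threading

class PerNodeGate:
    """Allow at most one in-flight inference call per Ollama node.

    Calls to different nodes run in parallel; calls to the same node
    serialize, mirroring the queue Ollama would impose anyway.
    """

    def __init__(self, hosts):
        self._locks = {host: threading.Lock() for host in hosts}

    def call(self, host, fn, *args, **kwargs):
        with self._locks[host]:
            return fn(*args, **kwargs)

gate = PerNodeGate(["node-a", "node-b"])  # hypothetical hostnames
```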
If you need high-throughput batch inference, Apple Silicon is the wrong tool. For steady-state analysis with moderate volume, it's fine.
Memory Pressure and macOS Memory Compression
macOS aggressively uses memory compression and swap, and this interacts with Ollama in subtle ways.
When memory pressure is high, macOS may compress model weights in RAM. The next inference request decompresses them, adding latency. We've seen this cause the first few requests after a high-memory event to be 2-3x slower than normal.
Monitor memory pressure with:
# One-shot check ("Pages occupied by compressor" is the counter
# vm_stat actually prints; Activity Monitor shows it as Compressed)
vm_stat | perl -ne '/page size of (\d+)/ and $size=$1;
  /Pages free:\s+(\d+)/ and printf "Free: %.2fGB\n", $1*$size/1073741824;
  /Pages occupied by compressor:\s+(\d+)/ and printf "Compressed: %.2fGB\n", $1*$size/1073741824'
# Or use the Activity Monitor memory pressure graph
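If you'd rather collect the same numbers from a Python monitoring agent, vm_stat output parses easily. A sketch, assuming the field names printed by recent macOS releases:

```python
import re
import subprocess

def parse_vm_stat(text):
    """Return free and compressor-occupied memory in GB from vm_stat output."""
    page_size = int(re.search(r"page size of (\d+)", text).group(1))
    gb = {}
    for key, label in (("free_gb", "Pages free"),
                       ("compressed_gb", "Pages occupied by compressor")):
        match = re.search(rf"{label}:\s+(\d+)", text)
        if match:
            gb[key] = int(match.group(1)) * page_size / 2**30
    return gb

def memory_pressure_snapshot():
    """Run vm_stat and parse it (macOS only)."""
    out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
    return parse_vm_stat(out)
```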
We keep OLLAMA_MAX_LOADED_MODELS=2 to prevent Ollama from loading too many models simultaneously and causing its own memory pressure.
Context Window vs Memory Tradeoffs
With llama3.1:8b, the default context window in Ollama is 2048 tokens. You can extend this, but memory usage scales with context length (KV cache grows linearly with context window size).
For our organization analysis prompts, which include crawled content that can be long, we extended the context:
# Create a custom Modelfile with extended context
cat > /tmp/Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8b-8k -f /tmp/Modelfile
Going from 2048 to 8192 tokens roughly quadruples KV cache memory usage, since it grows linearly with context length. On a 16GB machine with two models loaded, this is the knob you fiddle with when you see OOM-adjacent behavior. We settled on 4096 tokens as a reasonable compromise for our prompts.
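The linear scaling is easy to make concrete. Using the published Llama 3.1 8B architecture numbers (32 layers, 8 KV heads of dimension 128 under grouped-query attention) and assuming an fp16 cache:

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(num_ctx):
    """KV cache size for a given context window, fp16, unquantized cache."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * num_ctx

for ctx in (2048, 4096, 8192):
    print(f"num_ctx={ctx}: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

About 0.25 GiB at the 2048 default versus a full GiB at 8192; the 4096 compromise lands at 0.5 GiB.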
What We'd Do Differently
Start with model pre-loading. We lost the first month to cold-start latency spikes. Set OLLAMA_KEEP_ALIVE from day one.
Size the nodes correctly. 16GB unified memory is fine for two 7-8B models. If you need a 13B model plus embeddings simultaneously, get the 32GB configuration. The memory math is: model weights (at Q4 quantization, roughly 0.6GB per billion parameters, so 8B is about 5GB) + KV cache (context window dependent) + overhead.
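That sizing rule as arithmetic. The bits-per-parameter figure is an approximation: Q4_K_M files average roughly 4.5-5 bits per weight across tensors, and exact file sizes vary by model:

```python
def weight_gb(params_billion, bits_per_param=4.9):
    """Approximate size of quantized model weights in GB (assumes ~Q4_K_M)."""
    return params_billion * bits_per_param / 8

print(f"8B at ~Q4:  {weight_gb(8):.1f} GB")
print(f"13B at ~Q4: {weight_gb(13):.1f} GB")
```

Add the KV cache and a couple of GB of OS and Ollama overhead on top, and two 8B-class models plus embeddings sits right at the edge of a 16GB machine.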
Don't use Mac Minis for high-throughput. For bulk batch processing at scale, we run those jobs against our Oracle Cloud GPU node instead. The Mac Minis handle steady-state single-request work well. They're the wrong shape for 500 concurrent inference requests.
Where GPU still beats MPS. Training (don't even try on MPS), very high concurrent throughput (more than ~50 simultaneous requests), and models larger than your unified memory. Also: mixed-precision fine-tuning, and anything that needs bfloat16 support at scale. MPS support in PyTorch has improved enormously, but for production inference at scale, a dedicated NVIDIA card running vLLM is still faster per dollar at the high end.
For a small team running a production AI pipeline at moderate volume, two M2 Mac Minis running Ollama is a legitimate infrastructure choice. Ours are still running.
Questions about our Ollama setup or inference architecture? Reach out at hello@agentosaurus.com.