Running Ollama on Apple Silicon in Production: Lessons from 6 Months
---
By Wingston Sharon | January 2025
Six months ago I bought two M2 Mac Minis, set them up as Ollama inference nodes, connected them to our Tailscale mesh, and pointed our Agentosaurus pipeline at them. At that point it felt like a bet: Mac Minis as production infrastructure seemed slightly absurd. Now we've served approximately 500,000 inference requests off them, and I have opinions.
This is what I've learned.
Why Apple Silicon for Inference
The fundamental advantage is the unified memory architecture. On an M2 Mac Mini with 16GB of RAM, Ollama can run a 13B parameter model because the GPU and CPU share the same memory pool. A discrete GPU with 16GB of VRAM can also hold a 13B model, but that VRAM is walled off from the rest of the system, which still needs its own RAM. On the Mac Mini, the single 16GB pool serves the model, the OS, and everything else. That sharing cuts both ways: memory pressure is real, but there is no hard VRAM partition, so the usable ceiling is higher than it looks on paper.
For our use case, running llama3.1:8b and nomic-embed-text on an always-on inference server, this was perfect. The 8B model fits comfortably in 8GB, leaving headroom for the OS and Ollama's overhead.
The other advantage: the M2 has genuine GPU acceleration via Metal. Ollama's llama.cpp backend runs inference on the Apple GPU through the Metal API (the Neural Engine isn't used for this). Tokens per second on an M2 Mini running an 8B model: roughly 35-45 tok/s for generation. That's not GPU-server fast, but it's fast enough for our organization analysis tasks, which don't need interactive response times.
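At those speeds, a back-of-the-envelope capacity check shows why this is fine for non-interactive analysis. Both numbers below are illustrative assumptions, not measured figures from our logs:

```python
# Rough single-node throughput at the generation speeds above.
TOKENS_PER_REQUEST = 500   # assumed average generation length, not measured
TOKENS_PER_SECOND = 40     # mid-range of the 35-45 tok/s figure

seconds_per_request = TOKENS_PER_REQUEST / TOKENS_PER_SECOND
requests_per_hour = 3600 / seconds_per_request

print(f"{seconds_per_request:.1f}s per request, ~{requests_per_hour:.0f} requests/hour")
```

Roughly 300 requests an hour per node is nowhere near GPU-server territory, but it comfortably covers a steady-state analysis workload.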
Initial Setup
Install Ollama on macOS:
brew install ollama
# or download from ollama.com
Pull the models you need:
ollama pull llama3.1:8b
ollama pull nomic-embed-text
Start the server:
OLLAMA_HOST=0.0.0.0 ollama serve
The OLLAMA_HOST=0.0.0.0 binding is important: by default Ollama only listens on localhost, which means it's unreachable from other nodes on the Tailscale mesh.
Verify with a quick API call from another machine:
curl http://mac-mini-amsterdam.tail1234.ts.net:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "Say hello", "stream": false}'
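From Python, the same endpoint wraps in a few lines of standard library. A sketch, assuming the example hostname above and non-streaming responses:

```python
import json
import urllib.request

# Example Tailscale hostname from above; substitute your own node.
OLLAMA_URL = "http://mac-mini-amsterdam.tail1234.ts.net:11434"

def build_generate_payload(prompt, model="llama3.1:8b", stream=False):
    """Body for /api/generate; stream=False returns a single JSON object."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt, model="llama3.1:8b", url=OLLAMA_URL, timeout=120):
    """POST to /api/generate and return the generated text."""
    req = urllib.request.Request(
        f"{url}/api/generate",
        data=json.dumps(build_generate_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```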
Auto-Start with launchd
On macOS, the equivalent of systemd is launchd. Create a plist at ~/Library/LaunchAgents/com.ollama.server.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.server</string>
<key>ProgramArguments</key>
<array>
<!-- Homebrew on Apple Silicon installs to /opt/homebrew/bin;
     the Ollama.app installer symlinks /usr/local/bin/ollama instead -->
<string>/opt/homebrew/bin/ollama</string>
<string>serve</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0</string>
<key>OLLAMA_KEEP_ALIVE</key>
<string>24h</string>
<key>OLLAMA_MAX_LOADED_MODELS</key>
<string>2</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ollama.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ollama-error.log</string>
</dict>
</plist>
Load and start it:
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist
launchctl start com.ollama.server
The OLLAMA_KEEP_ALIVE=24h environment variable is critical. By default, Ollama unloads models from memory after 5 minutes of inactivity. With our workload pattern of burst crawl jobs followed by quiet periods, models would get evicted, and the first request after a quiet period would trigger a cold load. On an M2 Mini, loading llama3.1:8b takes 8-12 seconds. That's tolerable for interactive use but unacceptable when a Celery task is waiting. KEEP_ALIVE=24h keeps the model hot.
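KEEP_ALIVE prevents eviction but doesn't help with the very first load after a reboot, so a deploy-time warm-up is worth pairing with it. A minimal sketch, where the node and model lists are illustrative; per the Ollama API docs, a generate request with no prompt simply loads the model:

```python
import json
import time
import urllib.request

NODES = ["mac-mini-amsterdam.tail1234.ts.net"]   # illustrative node list
MODELS = ["llama3.1:8b", "nomic-embed-text"]

def build_warmup_payload(model, keep_alive="24h"):
    """A generate request with no prompt loads the model without generating."""
    return {"model": model, "keep_alive": keep_alive}

def warm(host, model, port=11434, timeout=120):
    """Send the warm-up request; returns the observed load time in seconds."""
    req = urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(build_warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    urllib.request.urlopen(req, timeout=timeout).read()
    return time.monotonic() - start
```

Looping warm() over every (node, model) pair at the end of a deploy means the 8-12 second cold load happens once, on your schedule, instead of inside a Celery task.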
Health Check Script
We run this health check from our monitoring system every 5 minutes:
#!/bin/bash
# /usr/local/bin/ollama-health-check.sh

OLLAMA_HOST="${1:-localhost}"
OLLAMA_PORT="${2:-11434}"

# Check if Ollama is responding
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  "http://${OLLAMA_HOST}:${OLLAMA_PORT}/api/tags" \
  --max-time 5)

if [ "$HTTP_STATUS" != "200" ]; then
  echo "CRITICAL: Ollama not responding (HTTP $HTTP_STATUS)"
  exit 2
fi

# Check that the required model is available on this node.
# Note: /api/tags lists models pulled to disk; /api/ps shows what's loaded in memory.
TAGS=$(curl -s "http://${OLLAMA_HOST}:${OLLAMA_PORT}/api/tags")
LLAMA_PRESENT=$(echo "$TAGS" | python3 -c \
  "import sys, json; models = [m['name'] for m in json.load(sys.stdin)['models']]; \
print('OK' if any('llama3.1:8b' in m for m in models) else 'MISSING')")

if [ "$LLAMA_PRESENT" != "OK" ]; then
  echo "WARNING: llama3.1:8b not available on this node"
  exit 1
fi

echo "OK: Ollama healthy, required model available"
exit 0
Usage:
./ollama-health-check.sh mac-mini-amsterdam.tail1234.ts.net 11434
The Batching Problem
This is the biggest operational limitation of Ollama on Apple Silicon: it doesn't do real batching.
Out of the box, Ollama effectively processes one request at a time per model. If two Celery tasks hit the same Ollama node simultaneously, the second request queues. Recent releases can serve a few requests in parallel via OLLAMA_NUM_PARALLEL (at the cost of splitting the context window across slots), but there's no continuous batching of prompts through the same forward pass.
On a GPU server running vLLM or TGI, you can batch 16+ requests through the model simultaneously because the matrix multiplications parallelize across CUDA cores. Apple's MPS doesn't expose the same kind of deep batch control.
In practice this means: for high-throughput batch jobs (crawling 200 organizations and scoring each one), we end up serializing through the inference nodes. Our Celery configuration accounts for this:
# celery_config.py
CELERY_TASK_ROUTES = {
'agentosaurus.tasks.score_organization': {
'queue': 'inference',
'rate_limit': '10/m', # Max 10 inference tasks per minute per worker
},
}
# One inference worker per node to avoid queuing overhead
CELERY_WORKER_CONCURRENCY = 1 # For inference workers
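Outside Celery, the same one-in-flight-per-node discipline is just a lock per host. A stripped-down sketch of the idea, with hypothetical node names:

```python
import threading

class PerNodeGate:
    """Allow at most one in-flight inference call per Ollama node.

    Calls to different nodes run in parallel; calls to the same node
    serialize, mirroring the queue Ollama would impose anyway.
    """

    def __init__(self, hosts):
        self._locks = {host: threading.Lock() for host in hosts}

    def call(self, host, fn, *args, **kwargs):
        with self._locks[host]:
            return fn(*args, **kwargs)

gate = PerNodeGate(["node-a", "node-b"])  # hypothetical hostnames
```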
If you need high-throughput batch inference, Apple Silicon is the wrong tool. For steady-state analysis with moderate volume, it's fine.
Memory Pressure and macOS Memory Compression
macOS aggressively uses memory compression and swap, and this interacts with Ollama in subtle ways.
When memory pressure is high, macOS may compress model weights in RAM. The next inference request decompresses them, adding latency. We've seen this cause the first few requests after a high-memory event to be 2-3x slower than normal.
Monitor memory pressure with:
# One-shot check ("Pages occupied by compressor" is the counter
# vm_stat actually prints; Activity Monitor shows it as Compressed)
vm_stat | perl -ne '/page size of (\d+)/ and $size=$1;
  /Pages free:\s+(\d+)/ and printf "Free: %.2fGB\n", $1*$size/1073741824;
  /Pages occupied by compressor:\s+(\d+)/ and printf "Compressed: %.2fGB\n", $1*$size/1073741824'
# Or use the Activity Monitor memory pressure graph
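If you'd rather collect the same numbers from a Python monitoring agent, vm_stat output parses easily. A sketch, assuming the field names printed by recent macOS releases:

```python
import re
import subprocess

def parse_vm_stat(text):
    """Return free and compressor-occupied memory in GB from vm_stat output."""
    page_size = int(re.search(r"page size of (\d+)", text).group(1))
    gb = {}
    for key, label in (("free_gb", "Pages free"),
                       ("compressed_gb", "Pages occupied by compressor")):
        match = re.search(rf"{label}:\s+(\d+)", text)
        if match:
            gb[key] = int(match.group(1)) * page_size / 2**30
    return gb

def memory_pressure_snapshot():
    """Run vm_stat and parse it (macOS only)."""
    out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
    return parse_vm_stat(out)
```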
We keep OLLAMA_MAX_LOADED_MODELS=2 to prevent Ollama from loading too many models simultaneously and causing its own memory pressure.
Context Window vs Memory Tradeoffs
With llama3.1:8b, the default context window in Ollama is 2048 tokens. You can extend this, but memory usage scales with context length (KV cache grows linearly with context window size).
For our organization analysis prompts, which include crawled content that can be long, we extended the context:
# Create a custom Modelfile with extended context
cat > /tmp/Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8b-8k -f /tmp/Modelfile
Going from 2048 to 8192 tokens roughly quadruples KV cache memory usage, since it grows linearly with context length. On a 16GB machine with two models loaded, this is the knob you fiddle with when you see OOM-adjacent behavior. We settled on 4096 tokens as a reasonable compromise for our prompts.
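The linear scaling is easy to make concrete. Using the published Llama 3.1 8B architecture numbers (32 layers, 8 KV heads of dimension 128 under grouped-query attention) and assuming an fp16 cache:

```python
# KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

def kv_cache_bytes(num_ctx):
    """KV cache size for a given context window, fp16, unquantized cache."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * num_ctx

for ctx in (2048, 4096, 8192):
    print(f"num_ctx={ctx}: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

About 0.25 GiB at the 2048 default versus a full GiB at 8192; the 4096 compromise lands at 0.5 GiB.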
What We'd Do Differently
Start with model pre-loading. We lost the first month to cold-start latency spikes. Set OLLAMA_KEEP_ALIVE from day one.
Size the nodes correctly. 16GB unified memory is fine for two 7-8B models. If you need a 13B model plus embeddings simultaneously, get the 32GB configuration. The memory math is: model weights (at Q4 quantization, roughly 0.6GB per billion parameters, so 8B is about 5GB) + KV cache (context window dependent) + overhead.
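That sizing rule as arithmetic. The bits-per-parameter figure is an approximation: Q4_K_M files average roughly 4.5-5 bits per weight across tensors, and exact file sizes vary by model:

```python
def weight_gb(params_billion, bits_per_param=4.9):
    """Approximate size of quantized model weights in GB (assumes ~Q4_K_M)."""
    return params_billion * bits_per_param / 8

print(f"8B at ~Q4:  {weight_gb(8):.1f} GB")
print(f"13B at ~Q4: {weight_gb(13):.1f} GB")
```

Add the KV cache and a couple of GB of OS and Ollama overhead on top, and two 8B-class models plus embeddings sits right at the edge of a 16GB machine.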
Don't use Mac Minis for high-throughput. For bulk batch processing at scale, we run those jobs against our Oracle Cloud GPU node instead. The Mac Minis handle steady-state single-request work well. They're the wrong shape for 500 concurrent inference requests.
Where GPU still beats MPS. Training (don't even try on MPS), very high concurrent throughput (more than ~50 simultaneous requests), and models larger than your unified memory. Also: mixed-precision fine-tuning, and anything that needs bfloat16 support at scale. MPS support in PyTorch has improved enormously, but for production inference at scale, a dedicated NVIDIA card running vLLM is still faster per dollar at the high end.
For a small team running a production AI pipeline at moderate volume, two M2 Mac Minis running Ollama is a legitimate infrastructure choice. Ours are still running.
Questions about our Ollama setup or inference architecture? Reach out at hello@agentosaurus.com.