Before Ollama, running a large language model locally required: downloading a multi-gigabyte GGUF file, installing llama.cpp, compiling with the right CUDA or Metal flags, figuring out the correct context window settings, and writing your own HTTP server if you wanted to call it from an application.
Experienced ML engineers treated this as routine. For application developers, it was a day-long project with a 40% chance of ending in “failed to allocate memory.”
Ollama ships as a single binary. You run `ollama pull llama3.2` and then call `http://localhost:11434/api/generate`. That is it.
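To make that concrete, here is a minimal sketch of calling the native endpoint from Python using only the standard library. The model name and prompt are placeholders, and `"stream": False` asks Ollama for a single JSON object instead of streamed NDJSON chunks:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> bytes:
    # Ollama's native /api/generate takes a JSON body; stream=False
    # requests one JSON response rather than streamed NDJSON chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streamed response carries the generated text under "response"
        return json.loads(resp.read())["response"]
```

Against a running local instance, `generate('llama3.2', 'Why is the sky blue?')` returns the completion as a plain string.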
## What Ollama Actually Is
Ollama is a local model server that wraps llama.cpp. It handles:
- Downloading and storing model files from a registry
- GPU acceleration detection and configuration (Apple Metal, NVIDIA CUDA, AMD ROCm)
- Model loading and memory management
- A REST API with an OpenAI-compatible endpoint
- Concurrent model execution (it can load multiple models and swap between them)
The CLI mirrors Docker’s interface deliberately:
```shell
ollama pull mistral     # Download a model
ollama run llama3.2     # Run it in the terminal
ollama list             # Show downloaded models
ollama rm llama3.2      # Delete a model
```
If you have used Docker, the mental model transfers immediately.
## The OpenAI Compatibility Layer

This is the detail that made Ollama actually useful for developers. Ollama serves its native API under `/api/` and an OpenAI-compatible API under `/v1/`:
```python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # Any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain database indexing in two paragraphs'}
    ]
)
```
This is the same code that calls OpenAI’s API. You change `base_url` and `model` and you are calling a local model. Any application built with OpenAI’s SDK can point at Ollama with two configuration changes.
This enabled an ecosystem of local-first tools almost immediately. Open WebUI, Continue.dev, and dozens of other developer tools added Ollama support within months of its launch.
## The Models Worth Running
Model availability has expanded significantly. The practical local options in 2025:
| Model | Size | Use case |
|---|---|---|
| llama3.2:3b | 2.0GB | Fast, general purpose, 8GB RAM minimum |
| llama3.1:8b | 4.7GB | Better quality, 16GB RAM recommended |
| mistral:7b | 4.1GB | Strong reasoning, code |
| codellama:7b | 3.8GB | Code completion, explanation |
| deepseek-coder-v2:16b | 9.0GB | Best local coding model, 24GB RAM |
| qwen2.5-coder:7b | 4.7GB | Strong code model, 16GB RAM |
| nomic-embed-text | 274MB | Embeddings for RAG |
On Apple Silicon with 16GB of unified memory, an 8B model runs at 20-40 tokens per second, usable for development assistance, though slower than API-hosted models.
## The Privacy Case
The most compelling reason for Ollama in production contexts is data privacy. Sending code to OpenAI’s API means the code leaves your machine and is processed on external servers. That matters for:
- Code containing business logic or trade secrets
- Personal data subject to GDPR or HIPAA
- Security-sensitive operations
- Air-gapped environments
Local models eliminate the data residency problem entirely: your prompts, code, and documents never leave the machine.
Several companies have built internal coding assistants on Ollama specifically because they can offer developers AI assistance without routing proprietary code to third-party servers.
## Developer Workflow Integration
Ollama integrates with the tools developers already use:
Continue.dev for VS Code (`.continue/config.json`):

```json
{
  "models": [{
    "title": "Llama 3.2",
    "provider": "ollama",
    "model": "llama3.2"
  }]
}
```

A shell alias for quick queries:

```shell
alias ask='ollama run llama3.2'
ask "What flags does rsync use to preserve permissions?"
```

RAG with local embeddings:

```python
import ollama

def embed(text: str) -> list[float]:
    response = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return response['embedding']
```
The embedding endpoint enables local RAG pipelines that never send documents to external APIs.
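On top of `embed()`, a retrieval step needs only a similarity function. A minimal pure-Python sketch, with cosine similarity standing in for a real vector store (the documents and `k` are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec: list[float],
             docs: list[str],
             doc_vecs: list[list[float]],
             k: int = 2) -> list[str]:
    # Rank documents by similarity to the query embedding, best first.
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

In a real pipeline, `doc_vecs` would come from calling `embed()` on each document and `query_vec` from embedding the user's question; everything stays on the local machine.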
## Where Ollama Falls Short
**Quality gap:** The best local models (Llama 3.2, Mistral, Qwen 2.5) are significantly behind GPT-4o and Claude 3.5 Sonnet on complex reasoning tasks. For code generation on simple tasks, the gap is smaller. For multi-step reasoning, architecture review, or tasks requiring broad context, API models are better.
**Memory requirements:** A useful coding assistant requires at least 8GB of RAM dedicated to the model. On a machine with 16GB total, this is manageable. On an 8GB laptop, model performance degrades significantly from memory pressure.
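A rough back-of-the-envelope for why 8GB is the floor: at the 4-bit quantization most Ollama default tags use, each parameter costs about half a byte, plus headroom for the KV cache and runtime buffers. The overhead figure below is a guessed constant, not a measured value:

```python
def q4_ram_estimate_gb(params_billions: float, overhead_gb: float = 1.5) -> float:
    # ~0.5 bytes per parameter at 4-bit quantization, plus an assumed
    # fixed overhead for the KV cache and runtime buffers (a rough guess).
    return params_billions * 0.5 + overhead_gb
```

For an 8B model this yields roughly 5.5GB, consistent with the 4.7GB file size in the table plus runtime overhead; a 3B model fits comfortably in about 3GB.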
**Inference speed:** Even on an M3 Pro, 20-40 tokens/second feels slow compared to API-hosted inference at 80-150 tokens/second. For interactive use, the wait is noticeable.

**Model diversity:** The hosted API ecosystem has models trained for specific domains (legal, medical, finance). Local models are mostly general-purpose.
## Bottom Line
Ollama removed the engineering friction from local model deployment entirely. Whether you are building privacy-sensitive applications, experimenting with models without API costs, or just want a local coding assistant, the setup is now measured in minutes rather than days.
The quality ceiling for local models is real but rising rapidly. For many developer assistance tasks, Llama 3.2 and Qwen 2.5-Coder are good enough. For tasks requiring the best available reasoning, you still need API access. Ollama does not replace cloud inference; it makes local inference accessible enough to use for the right subset of tasks.