Eighteen months ago our AI tooling bill crossed $50,000 per year and kept growing. Commercial LLM APIs, a vector database with per-query pricing, an embedding service, an AI observability platform, and a prompt management tool. Each individually defensible. Together, unsustainable.
We spent 90 days migrating to open source alternatives. The result: roughly $11,000 per year (mostly infrastructure) for the workloads we migrated, and tooling we can extend without hitting API limits.
The Before Stack
| Tool | Purpose | Annual cost |
|---|---|---|
| OpenAI GPT-4 API | Primary LLM inference | $28,000 |
| Pinecone | Vector database | $8,400 |
| Cohere | Embeddings | $4,800 |
| Langsmith | LLM observability | $3,600 |
| PromptLayer | Prompt versioning | $2,400 |
| Helicone | API proxy + analytics | $1,800 |
| Total | | $49,000 |
Usage was growing 15% per quarter. Projecting forward: $80K+ by end of year.
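That projection is simple quarterly compounding. A quick sanity check on the figure, using the numbers above:

```python
# Sanity-check the spend projection: $49K/year growing 15% per quarter.
annual_spend = 49_000
quarterly_growth = 0.15

run_rate = annual_spend * (1 + quarterly_growth) ** 4  # four quarters out
print(f"Projected annual run rate: ${run_rate:,.0f}")
```

Four quarters of 15% growth puts the run rate around $85K, so "$80K+ by end of year" is, if anything, conservative.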
The Replacement Stack
Ollama + Local Models for Internal Tooling
```bash
ollama pull llama3.1:70b
ollama pull mistral:7b
ollama serve
```
Ollama runs LLMs locally with a simple REST API. For internal tooling - summarization, classification, code review assistance, content moderation - you do not need GPT-4. Llama 3.1 70B on an A100 handles these tasks at equivalent quality.
We kept OpenAI API access for customer-facing features where quality differences are visible. Internal tooling (40% of our tokens) moved to self-hosted models on a single A100 ($2.10/hour when in use). Monthly cost: $180-220. Annual cost for that workload: ~$2,200 instead of $11,200.
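The split between self-hosted and commercial inference comes down to routing. A minimal sketch of the idea (the task names and model choices here are illustrative, not our exact config; Ollama does expose an OpenAI-compatible endpoint at `/v1`):

```python
# Route internal workloads to the self-hosted Ollama endpoint and
# customer-facing ones to the commercial API. Illustrative sketch only.
INTERNAL_TASKS = {"summarization", "classification", "code_review", "moderation"}

def pick_backend(task: str) -> tuple[str, str]:
    """Return (base_url, model) for a given task type."""
    if task in INTERNAL_TASKS:
        # Ollama's OpenAI-compatible API on the local GPU box
        return ("http://localhost:11434/v1", "llama3.1:70b")
    # Quality-sensitive, customer-facing work stays on the commercial API
    return ("https://api.openai.com/v1", "gpt-4o")

print(pick_backend("summarization"))
```

Because Ollama speaks the OpenAI wire format, the same client code can hit either backend; only the base URL and model name change.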
pgvector for Vector Storage
```sql
CREATE EXTENSION vector;

CREATE TABLE embeddings (
    id UUID PRIMARY KEY,
    content TEXT,
    metadata JSONB,
    -- dimension must match your embedding model
    -- (ada-002: 1536; nomic-embed-text: 768)
    embedding VECTOR(1536)
);

CREATE INDEX embeddings_vector_idx
    ON embeddings USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Semantic search
SELECT content, metadata, embedding <=> $1::vector AS distance
FROM embeddings
ORDER BY embedding <=> $1::vector
LIMIT 10;
```
pgvector runs inside our existing Postgres instance. No separate service, no per-query pricing. A table with 1 million embeddings searches in under 50ms with an ivfflat index. For our use case (documentation search, similar article recommendations), it matches Pinecone’s performance.
Cost: zero additional infrastructure. We already run Postgres.
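The `<=>` operator in the query above is cosine distance (1 minus cosine similarity). A pure-Python version of the same quantity is handy for sanity-checking results outside the database:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """The same quantity pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```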
nomic-embed-text for Embeddings
Nomic AI released nomic-embed-text, an MIT-licensed embedding model that matches OpenAI’s text-embedding-ada-002 quality on most benchmarks. It runs on Ollama.
```python
import ollama

def embed(text: str) -> list[float]:
    response = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return response['embedding']
```
Embedding generation moved from $0.0001 per 1,000 tokens to hardware cost (negligible alongside the LLM inference).
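Documents still need to be chunked before embedding, since any embedding model has a finite context window. A minimal word-window chunker with overlap (the sizes here are illustrative, not our production splitter):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks for embedding."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 500, chunk_size=200, overlap=40)
print(len(chunks))  # 3 overlapping chunks covering 500 words
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.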
Langfuse for Observability
Langfuse is open source LLM observability. Traces, prompt versions, evaluation runs - the full suite that Langsmith provides.
```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def generate_summary(text: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}]
    )
    return response.choices[0].message.content
```
Self-hosted Langfuse runs on a $20/month VPS. Replaces $3,600/year Langsmith and $1,800/year Helicone. Combined replacement cost: $240/year.
Langfuse Prompt Management
Langfuse also handles prompt versioning - storing, versioning, and A/B testing prompt templates. This replaced PromptLayer.
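Under the hood, A/B testing prompt templates comes down to deterministically assigning each user (or trace) to a variant, then comparing evaluation scores per variant. A sketch of the assignment step (the variant names and templates are illustrative, and this is plain Python, not the Langfuse SDK):

```python
import hashlib

# Hypothetical prompt variants under test
PROMPT_VARIANTS = {
    "a": "Summarize the following text in two sentences:\n{text}",
    "b": "You are a concise editor. Summarize:\n{text}",
}

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user into a prompt variant (stable across sessions)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    keys = sorted(PROMPT_VARIANTS)
    return keys[int(digest, 16) % len(keys)]

variant = assign_variant("user-42")
prompt = PROMPT_VARIANTS[variant].format(text="...")
```

Hash-based assignment means the same user always sees the same variant, with no assignment table to maintain.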
What We Kept (And Why)
OpenAI API access for customer-facing features: $22,000/year. We tried replacing with Mistral Large for some of these. The quality difference was visible in user testing for our specific use cases (nuanced writing assistance). We kept the API.
That $22K is justified. The roughly $11K we were spending on internal-tooling inference was not.
The Migration Effort
Real costs of the migration:
| Task | Engineering hours |
|---|---|
| pgvector setup and embedding migration | 12 hours |
| Ollama deployment and model evaluation | 8 hours |
| Langfuse self-hosted setup | 6 hours |
| Updating integration code | 20 hours |
| Testing and validation | 15 hours |
| Total | 61 hours |
At a blended rate of $150/hour: $9,150 in engineering time. The first-year savings were $38,000. The migration paid for itself in 90 days.
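The payback figure follows directly from the numbers above:

```python
# Migration payback, using the figures from the effort table.
migration_hours = 61
blended_rate = 150
first_year_savings = 38_000

migration_cost = migration_hours * blended_rate          # $9,150
payback_days = migration_cost / first_year_savings * 365
print(f"Migration cost: ${migration_cost:,}; payback in ~{payback_days:.0f} days")
```

That works out to just under 90 days of savings to cover the engineering time.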
The Tradeoffs
Self-hosting has real costs beyond money:
- Infrastructure responsibility: Ollama server needs monitoring, updates, scaling
- Model updates require evaluation and deployment work
- No vendor SLA for open source components
- GPU procurement and management is a skill some teams lack
If your team does not have infrastructure engineering capacity, self-hosting the LLM inference layer is not free - the labor cost is real. Start with pgvector and Langfuse (easy to self-host, no GPU required) before taking on local model inference.
Bottom Line
The open source AI tooling ecosystem matured rapidly in 2024-2025. Self-hosting vector storage with pgvector, running open models with Ollama, and deploying Langfuse for observability replaces commercial equivalents with a fraction of the infrastructure cost. The migration takes 60-80 engineering hours. For most teams spending over $30K annually on AI SaaS, the payback period is under six months.