You’ve used ChatGPT through the web interface. Now you want to build it into your own app - a customer support bot, a code review tool, a content generator. You open the OpenAI docs, see 47 pages of API reference, and close the tab.

It’s actually simpler than it looks. One endpoint, a few lines of code, and you’re calling GPT from your app. This guide covers everything from your first API call to running it in production.

The Basics

The ChatGPT API is a single HTTP endpoint:

POST https://api.openai.com/v1/chat/completions

You send a list of messages (the conversation), and get a response back. That’s it. The API is stateless - it doesn’t remember previous requests. You manage the conversation history yourself.

Authentication: You need an API key from platform.openai.com. Pass it as a Bearer token:

Authorization: Bearer sk-xxxxxxxxxxxx

Never hardcode this in your source code. Use environment variables.

Models and Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
|---|---|---|---|---|
| gpt-4o | $3.75 | $15.00 | 128K | Complex reasoning, code generation |
| gpt-4o-mini | $0.30 | $1.20 | 128K | High-volume simple tasks, classification |
| gpt-4-turbo | $10.00 | $30.00 | 128K | Legacy - don't use for new projects |
| gpt-3.5-turbo | $0.50 | $1.50 | 16K | Legacy - gpt-4o-mini is cheaper and better |

What’s a token? Roughly 4 characters or 3/4 of a word. “hamburger” is 3 tokens. Both input and output tokens are billed.

Real cost examples:

  • 1,000 customer support conversations (2,000 tokens each) with gpt-4o-mini: ~$1.50
  • Same with gpt-4o: ~$18.75
  • A chatbot handling 10,000 queries/day at 500 input tokens each with gpt-4o-mini: ~$1.50/day (output tokens billed on top)

For most applications, gpt-4o-mini is the right starting point. It’s 12.5x cheaper than gpt-4o on input and handles classification, extraction, simple Q&A, and summarization perfectly well. Upgrade to gpt-4o only when you need stronger reasoning.
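
To sanity-check estimates like these yourself, the arithmetic is just tokens times price. A quick sketch using the prices from the table above, assuming a 50/50 input/output split:

```python
# Prices per 1M tokens from the table above: (input, output)
PRICES = {"gpt-4o-mini": (0.30, 1.20), "gpt-4o": (3.75, 15.00)}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimated cost in dollars for a given number of tokens."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# 1,000 conversations x 2,000 tokens, split evenly between input and output
print(estimate_cost("gpt-4o-mini", 1_000_000, 1_000_000))  # 1.5
print(estimate_cost("gpt-4o", 1_000_000, 1_000_000))       # 18.75
```

Run the numbers for your own traffic before committing to a model.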

Your First API Call

Python

pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain DNS in one paragraph."}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)

Node.js

npm install openai
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from environment

const response = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Explain DNS in one paragraph.' }
    ],
    temperature: 0.7,
    max_tokens: 200
});

console.log(response.choices[0].message.content);

curl

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain DNS in one paragraph."}
    ]
  }'

Three languages, same concept. Send messages, get a response.

Message Roles

Every message has a role that tells the model who’s speaking:

| Role | Purpose | Example |
|---|---|---|
| system | Sets behavior, personality, constraints | "You are a Python expert. Always include error handling." |
| user | The end user's message | "How do I read a CSV file?" |
| assistant | The model's previous responses | Used for multi-turn conversations |
| tool | Function call results | Results sent back after function execution |

A multi-turn conversation looks like this:

messages = [
    {"role": "system", "content": "You are a senior Python developer."},
    {"role": "user", "content": "How do I read a CSV?"},
    {"role": "assistant", "content": "Use the csv module or pandas..."},
    {"role": "user", "content": "Show me the pandas version."}
]

You send the full conversation every time. The API doesn’t remember anything between requests.
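
In practice that means your code appends both sides of each exchange to a list and resends the whole list every turn. A minimal sketch (`ask` is a hypothetical helper, not an SDK method; the client is passed in):

```python
def ask(client, messages, user_input, model="gpt-4o-mini"):
    """Send one user turn and record both sides in the history list."""
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Usage, with client = OpenAI():
# history = [{"role": "system", "content": "You are a senior Python developer."}]
# ask(client, history, "How do I read a CSV?")
# ask(client, history, "Show me the pandas version.")  # model sees the full history
```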

System Prompts

The system message is the most important part of your integration. It controls how the model behaves, what it knows, and what it refuses to do.

A bad system prompt:

You are a helpful assistant.

A good system prompt:

You are a customer support agent for Acme Corp.

## Guidelines
- Be friendly but professional
- Only answer questions about Acme products
- If you don't know something, say so - never make up information
- Keep responses under 3 sentences unless the user asks for detail

## Products
- Widget Pro: $49.99, red/blue/green
- Widget Lite: $29.99, black/white

## Escalation
If the customer is angry or requests a refund, respond with:
"Let me connect you with a specialist who can help."

Best practices:

  • Define identity first (who the assistant is)
  • Be specific about what to do AND what not to do
  • Use Markdown headers and bullet lists for structure
  • Include examples of desired input/output
  • Put static content at the beginning of messages - OpenAI automatically caches repeated prefixes, saving up to 90% on input costs

Streaming

Without streaming, your user stares at a blank screen for 5-30 seconds waiting for the full response. With streaming, the first tokens appear in ~200ms.

Python Streaming

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a short story about a robot."}
    ],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Node.js Streaming

const stream = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
        { role: 'user', content: 'Write a short story about a robot.' }
    ],
    stream: true
});

for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
}

How it works under the hood:

The API uses Server-Sent Events (SSE). Each chunk is a line starting with data: containing a JSON object with a delta field instead of the full message. The stream ends with data: [DONE].

data: {"choices":[{"delta":{"role":"assistant"}}]}
data: {"choices":[{"delta":{"content":"Once"}}]}
data: {"choices":[{"delta":{"content":" upon"}}]}
data: {"choices":[{"delta":{"content":" a"}}]}
data: {"choices":[{"delta":{"content":" time"}}]}
...
data: [DONE]

Always stream in production. The UX difference is massive.

Function Calling (Tool Use)

This is where the API gets powerful. You can give the model access to your own functions - weather APIs, databases, internal tools - and it decides when to call them. The function calling guide covers the full specification.

Step 1: Define your tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. London, UK"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location", "units"],
                "additionalProperties": False
            },
            "strict": True
        }
    }
]

Step 2: Send the request

messages = [{"role": "user", "content": "What's the weather in Delhi?"}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # kept in a variable so we can append to it in Step 4
    tools=tools
)

Step 3: The model asks to call your function

Instead of returning text, the model returns a tool_calls array:

{
  "tool_calls": [{
    "id": "call_abc123",
    "function": {
      "name": "get_weather",
      "arguments": "{\"location\": \"Delhi, India\", \"units\": \"celsius\"}"
    }
  }]
}

Step 4: Execute and send results back

import json

assistant_message = response.choices[0].message
messages.append(assistant_message)

for tool_call in assistant_message.tool_calls:
    args = json.loads(tool_call.function.arguments)
    result = get_weather(**args)  # your actual function

    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result)
    })

# Get final response with the function results
final = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print(final.choices[0].message.content)
# "The current temperature in Delhi is 34C with clear skies."

The model can call multiple functions in a single turn (parallel function calling). Set parallel_tool_calls: false if you need sequential execution.

tool_choice options:

  • "auto" (default) - model decides whether to call a function
  • "required" - must call at least one function
  • "none" - never call functions
  • {"type": "function", "function": {"name": "get_weather"}} - force a specific function
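
For example, forcing the weather tool on a specific turn (a small hypothetical wrapper; "get_weather" is the tool defined in Step 1):

```python
def build_request(messages, tools, force=None, model="gpt-4o"):
    """Assemble kwargs for chat.completions.create, optionally forcing one tool."""
    kwargs = {"model": model, "messages": messages, "tools": tools}
    if force:
        kwargs["tool_choice"] = {"type": "function", "function": {"name": force}}
    return kwargs

# response = client.chat.completions.create(
#     **build_request(messages, tools, force="get_weather")
# )
```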

Token Management

The API is stateless. You send the full conversation every time, and it all counts toward your token limit. As conversations grow, you’ll eventually hit the context window ceiling.

Counting Tokens

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, how are you?")
print(len(tokens))  # token count for this string
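
The trimming code in the next section calls a count_tokens helper on whole message lists. A rough sketch (the flat 4-token per-message overhead is an approximation of the chat format's framing, not an exact accounting):

```python
def count_tokens(messages, model="gpt-4o", encode=None):
    """Approximate token count for a list of chat messages."""
    if encode is None:
        import tiktoken  # deferred so the helper imports without tiktoken installed
        encode = tiktoken.encoding_for_model(model).encode
    # ~4 tokens of framing per message (role, separators), plus the content
    return sum(4 + len(encode(m["content"])) for m in messages)
```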

Managing Conversation Length

When your conversation approaches the context limit, you have three options:

1. Sliding window - drop oldest messages:

def trim_messages(messages, max_tokens=100000):
    """Keep system prompt + most recent messages within token limit."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # count_tokens is your tiktoken-based helper for message lists
    while count_tokens(system + others) > max_tokens and len(others) > 2:
        others.pop(0)  # remove oldest non-system message

    return system + others

2. Summarize older messages:

if count_tokens(messages) > 100000:
    old = messages[1:-10]  # everything except system + last 10
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this conversation:\n{old}"}]
    )
    summary = response.choices[0].message.content
    messages = (
        [messages[0]]  # system prompt
        + [{"role": "system", "content": f"Previous context: {summary}"}]
        + messages[-10:]  # recent messages
    )

3. Set max_tokens on responses:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    max_tokens=500  # cap response length
)

Always set max_tokens. Without it, the model can generate unlimited output, burning money and time.

Rate Limits and Error Handling

OpenAI enforces rate limits per organization:

| Tier | Requires | Key Limits (gpt-4o-mini) |
|---|---|---|
| Free | Account | 3 RPM, 200 RPD |
| Tier 1 | $5 paid | 500 RPM, 200K TPM |
| Tier 2 | $50 + 7 days | 5,000 RPM, 2M TPM |
| Tier 3 | $100 + 7 days | 5,000 RPM, 4M TPM |
| Tier 5 | $1,000 + 30 days | 30,000 RPM, 150M TPM |

The Python and Node.js SDKs have built-in retry logic - 2 retries with exponential backoff by default. (Limits change over time, so check the OpenAI rate limits documentation for your organization's current numbers.)

# Increase retries globally
client = OpenAI(max_retries=5)

# Or per request
client.with_options(max_retries=5).chat.completions.create(...)

For custom retry logic:

import time
import random
from openai import RateLimitError

def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = (2 ** attempt) + random.random()
            time.sleep(delay)

Check remaining limits via response headers:

x-ratelimit-remaining-requests: 499
x-ratelimit-remaining-tokens: 199500
x-ratelimit-reset-requests: 120ms
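
With the Python SDK you can read these headers through with_raw_response, which returns the raw HTTP response alongside the parsed completion (a sketch; header values come back as strings):

```python
def check_quota(client, messages, model="gpt-4o-mini"):
    """Make a call and return the completion plus remaining rate-limit quota."""
    raw = client.chat.completions.with_raw_response.create(
        model=model, messages=messages
    )
    completion = raw.parse()  # the usual ChatCompletion object
    remaining = {
        "requests": raw.headers.get("x-ratelimit-remaining-requests"),
        "tokens": raw.headers.get("x-ratelimit-remaining-tokens"),
    }
    return completion, remaining
```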

Cost Optimization

The difference between a $50/month and $5,000/month API bill often comes down to a few decisions:

1. Use the right model. gpt-4o-mini handles 80% of use cases at 1/12th the cost of gpt-4o. Use gpt-4o only for complex reasoning, nuanced analysis, or code generation.

2. Leverage prompt caching. Put static content (system prompt, examples, tool definitions) at the start of your messages. OpenAI automatically caches repeated prefixes - up to 90% savings on cached input tokens. No code changes needed.

3. Set max_tokens. Cap response length. A customer support bot doesn’t need 4,000 token responses.

4. Trim conversation history. Don’t send 50 turns of conversation when the last 10 are enough.

5. Use the Batch API for non-urgent work. 50% discount on requests that can wait up to 24 hours. Great for content generation, data classification, bulk processing.

6. Cache common responses. If 20% of your queries are the same FAQ questions, cache the answers instead of hitting the API every time.
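
A minimal in-memory version of that idea (a hypothetical helper; keyed on the normalized question, so it only makes sense for deterministic, FAQ-style answers):

```python
import hashlib

_answer_cache = {}

def cached_answer(client, system_prompt, question, model="gpt-4o-mini"):
    """Answer repeated questions from a local cache; hit the API only on a miss."""
    key = hashlib.sha256(
        f"{model}|{system_prompt}|{question.strip().lower()}".encode()
    ).hexdigest()
    if key not in _answer_cache:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        _answer_cache[key] = response.choices[0].message.content
    return _answer_cache[key]
```

In production you would swap the dict for Redis or similar, with a TTL so cached answers expire.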

| Strategy | Savings |
|---|---|
| gpt-4o-mini instead of gpt-4o | 90% |
| Prompt caching (automatic) | Up to 90% on cached input |
| Batch API | 50% |
| Trim history to last 10 turns | 30-60% |
| Response caching for FAQs | 100% on cache hits |

Production Architecture

Here’s how to build a real chatbot backend, not a tutorial demo. If you’re new to system design thinking, start with the system design interview structure post.

User
 |
 v
Frontend (React/Next.js)
 |
 v
Your API (Express/FastAPI)
 |
 ├── Auth + rate limiting (per-user)
 ├── Load conversation from DB
 ├── Append user message
 ├── Trim/summarize if too long
 ├── Call OpenAI API (stream)
 ├── Stream response back to user (SSE)
 └── Save assistant response to DB

Database Schema

CREATE TABLE conversations (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE messages (
    id UUID PRIMARY KEY,
    conversation_id UUID REFERENCES conversations(id),
    role VARCHAR(20) NOT NULL,  -- system, user, assistant, tool
    content TEXT NOT NULL,
    tokens INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

Streaming to the Frontend (Express.js)

app.post('/api/chat', async (req, res) => {
    const { conversationId, message } = req.body;

    // Set SSE headers
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    // Load conversation history from DB
    const messages = await loadMessages(conversationId);
    messages.push({ role: 'user', content: message });

    // Stream from OpenAI
    const stream = await client.chat.completions.create({
        model: 'gpt-4o-mini',
        messages,
        stream: true
    });

    let fullResponse = '';
    for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
            fullResponse += content;
            res.write(`data: ${JSON.stringify({ content })}\n\n`);
        }
    }

    // Save to DB after stream completes
    await saveMessage(conversationId, 'user', message);
    await saveMessage(conversationId, 'assistant', fullResponse);

    res.write('data: [DONE]\n\n');
    res.end();
});

Streaming to the Frontend (FastAPI)

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
async_client = AsyncOpenAI()  # reads OPENAI_API_KEY from environment

class ChatRequest(BaseModel):
    conversation_id: str
    message: str

@app.post("/api/chat")
async def chat(request: ChatRequest):
    messages = await load_messages(request.conversation_id)
    messages.append({"role": "user", "content": request.message})

    async def generate():
        stream = await async_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            stream=True
        )
        full_response = ""
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                full_response += content
                yield f"data: {json.dumps({'content': content})}\n\n"

        await save_message(request.conversation_id, "assistant", full_response)
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Security Checklist

  • Never expose your API key to the frontend - always proxy through your backend
  • Validate and sanitize user input before sending to the API
  • Implement per-user rate limiting (e.g., 20 requests/minute)
  • Set the user parameter for abuse tracking: user="user_12345"
  • Set up spending alerts and hard caps in the OpenAI dashboard
  • Moderate inputs and outputs if your app is public-facing
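
For that last item, the SDK exposes a moderation endpoint; a sketch of a pre-flight check (what you do with flagged content is up to your product):

```python
def is_flagged(client, text):
    """Return True if the moderation endpoint flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged

# if is_flagged(client, user_message):
#     return "Sorry, I can't help with that request."
```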

Common Mistakes

1. Not streaming. Your users are staring at a loading spinner for 10 seconds. Stream.

2. Sending the full conversation forever. Conversations grow unbounded. At turn 50, you’re sending 50,000 tokens of history per request. Trim or summarize.

3. Using gpt-4o for everything. gpt-4o-mini is 12.5x cheaper and handles simple tasks just as well. Route based on complexity.

4. Not setting max_tokens. The model will happily generate a 4,000 token essay when you wanted a one-sentence answer.

5. Hardcoding API keys. Use environment variables. Always. OPENAI_API_KEY is auto-read by both SDKs.

6. Ignoring function call validation. The model can hallucinate function arguments. Always validate before executing - especially if the function touches your database or external APIs.

7. No error handling. The API returns 429 (rate limit), 500 (server error), and timeouts. The SDKs retry automatically, but you should handle persistent failures gracefully.

8. Building without a system prompt. A good system prompt is the difference between a useful product and a random chatbot. Invest time in it.

Bottom Line

The ChatGPT API is one endpoint, a few lines of code, and a system prompt. Start with gpt-4o-mini, stream everything, manage your conversation history, and add function calling when you need the model to interact with your systems. The hard part isn’t the integration - it’s designing a good system prompt and managing costs at scale. Get those right and you can build surprisingly powerful AI features with minimal infrastructure.