How to Set Up LiteLLM as Your AI Gateway
Use LiteLLM to access 100+ AI models through a single unified API.
Jay Banlasan
The AI Systems Guy
Managing direct integrations with 5 different AI providers is a maintenance nightmare. Every provider has a different SDK, different authentication, different request format, and different error types. LiteLLM solves this by giving you one OpenAI-compatible API that proxies to any provider you configure. You write your code once against the OpenAI interface and switch models by changing a string.
I use LiteLLM as the gateway layer in every production AI system I build now. It handles fallbacks, load balancing, cost tracking, and caching out of the box. What used to take 400 lines of custom code is now a config file and 20 lines of integration code.
What You Need Before Starting
- Python 3.9+
- API keys for the providers you want to use
- Docker (optional, for running LiteLLM as a proxy server)
- The litellm Python package
Step 1: Install LiteLLM
```bash
pip install litellm
```
For the proxy server (recommended for team environments):
```bash
pip install "litellm[proxy]"
```
Step 2: Basic Usage - Call Any Model with One Interface
LiteLLM wraps any provider in the OpenAI message format.
```python
import litellm
import os

# Set your API keys
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["GEMINI_API_KEY"] = "your-google-key"

def complete(model: str, messages: list, **kwargs) -> str:
    response = litellm.completion(
        model=model,
        messages=messages,
        **kwargs
    )
    return response.choices[0].message.content

# Same code, different models
answer = complete("gpt-4o", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("claude-3-5-sonnet-20241022", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("gemini/gemini-1.5-pro", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("groq/llama-3.1-70b-versatile", [{"role": "user", "content": "What is 2+2?"}])
```
All four calls use the identical interface. Switching providers is a one-word change.
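The only thing that varies between providers is the model string, which follows LiteLLM's provider-prefix convention (bare names like `gpt-4o` default to OpenAI). If you want friendlier names in your own code, a small lookup table keeps the strings in one place. The aliases below are illustrative choices, not a LiteLLM feature:

```python
# Map friendly aliases to LiteLLM model strings.
# The alias names are our own convention; LiteLLM only sees the values.
MODEL_ALIASES = {
    "openai-flagship": "gpt-4o",
    "anthropic-flagship": "claude-3-5-sonnet-20241022",
    "google-flagship": "gemini/gemini-1.5-pro",
    "groq-fast": "groq/llama-3.1-70b-versatile",
}

def resolve_model(alias: str) -> str:
    """Return the LiteLLM model string for a friendly alias."""
    try:
        return MODEL_ALIASES[alias]
    except KeyError:
        raise ValueError(f"Unknown model alias: {alias}")
```

Then `complete(resolve_model("groq-fast"), messages)` reads the same everywhere, and swapping providers means editing one dictionary.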
Step 3: Configure Fallbacks
Define a fallback chain in your completion call.
```python
import litellm

litellm.set_verbose = False

def complete_with_fallback(
    messages: list,
    primary_model: str = "gpt-4o-mini",
    fallback_models: list = None,
    **kwargs
) -> dict:
    if fallback_models is None:
        fallback_models = ["claude-3-haiku-20240307", "gpt-4o"]
    response = litellm.completion(
        model=primary_model,
        messages=messages,
        fallbacks=fallback_models,
        **kwargs
    )
    return {
        "content": response.choices[0].message.content,
        "model_used": response.model,
        "usage": {
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens
        }
    }

result = complete_with_fallback(
    messages=[{"role": "user", "content": "Summarize the concept of compounding interest."}]
)
print(f"Response from: {result['model_used']}")
print(result["content"])
```
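Conceptually, the `fallbacks` parameter tries each model in order until one succeeds. Here is a simplified, dependency-free sketch of that logic, leaving out the retry delays and provider cooldowns LiteLLM adds on top:

```python
def complete_with_manual_fallback(models: list, call_fn):
    """Try each model in order; return (model, result) for the first success.

    call_fn(model) performs the completion and raises on failure. This is
    a simplified sketch of fallback behavior -- no retry backoff, no
    cooldown of failing providers.
    """
    last_error = None
    for model in models:
        try:
            return model, call_fn(model)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"All models failed; last error: {last_error}")
```

This is also a useful pattern when you need fallback decisions LiteLLM cannot express, such as skipping a provider based on the content of the error.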
Step 4: Enable Cost Tracking
LiteLLM tracks costs automatically. Read them from the response.
```python
import litellm

def tracked_complete(messages: list, model: str = "gpt-4o-mini") -> dict:
    response = litellm.completion(model=model, messages=messages)
    # completion_cost() looks up the model's per-token pricing and
    # multiplies by the token counts in the response
    cost = litellm.completion_cost(completion_response=response)
    return {
        "content": response.choices[0].message.content,
        "model": response.model,
        "cost_usd": round(cost, 6),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens
    }

result = tracked_complete([{"role": "user", "content": "Write a haiku about APIs."}])
print(f"Cost: ${result['cost_usd']}")
```
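If you want running totals rather than per-call numbers, a small ledger that accumulates the `cost_usd` values returned above is enough. This is a plain-Python sketch, not a LiteLLM feature (the proxy in Step 5 does this for you server-side):

```python
from collections import defaultdict

class CostLedger:
    """Accumulate per-model spend from tracked_complete()-style results."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, model: str, cost_usd: float) -> None:
        """Add one call's cost to the running total for that model."""
        self.spend[model] += cost_usd

    def total(self) -> float:
        """Total spend across all models so far."""
        return sum(self.spend.values())
```

Call `ledger.record(result["model"], result["cost_usd"])` after each completion and you have per-model and overall spend for the process lifetime.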
Step 5: Set Up the LiteLLM Proxy Server
For team environments, run LiteLLM as a proxy server that your whole team calls instead of individual providers.
Create litellm_config.yaml:
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 30

litellm_settings:
  success_callback: ["langfuse"]  # Optional: observability
  drop_params: true               # Drop unsupported params instead of erroring
  request_timeout: 60

general_settings:
  master_key: your-proxy-master-key  # Required for API auth
  database_url: "postgresql://user:password@localhost:5432/litellm"  # Postgres is required for key and budget tracking
```
Start the proxy:
```bash
litellm --config litellm_config.yaml --port 4000
```
Step 6: Connect to the Proxy with the OpenAI SDK
Once the proxy is running, point the OpenAI client at it.
```python
from openai import OpenAI

# Point to LiteLLM proxy instead of OpenAI directly
proxy_client = OpenAI(
    api_key="your-proxy-master-key",
    base_url="http://localhost:4000"
)

# Now call any model you configured
response = proxy_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

# Switch to Claude with zero code changes
response = proxy_client.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
Your entire codebase uses the OpenAI SDK. Provider switching is a config file change.
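To avoid hard-coding the proxy address in every service, one option is to read it from the environment. The variable names here (`LITELLM_PROXY_URL`, `LITELLM_PROXY_KEY`) are illustrative choices for this sketch, not LiteLLM conventions:

```python
import os

def proxy_client_kwargs(default_url: str = "http://localhost:4000") -> dict:
    """Build OpenAI-client constructor kwargs for the LiteLLM proxy.

    LITELLM_PROXY_URL and LITELLM_PROXY_KEY are illustrative env var
    names -- pick whatever fits your deployment.
    """
    return {
        "base_url": os.environ.get("LITELLM_PROXY_URL", default_url),
        "api_key": os.environ.get("LITELLM_PROXY_KEY", "your-proxy-master-key"),
    }
```

Then every service constructs its client with `OpenAI(**proxy_client_kwargs())` and moving the proxy is a deployment change, not a code change.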
Step 7: Add Budget Controls via the Proxy
Set per-user or per-team spending limits through the proxy API.
```python
import requests

PROXY_URL = "http://localhost:4000"
PROXY_KEY = "your-proxy-master-key"

def create_budget_key(
    user_id: str,
    monthly_budget_usd: float,
    models_allowed: list = None
) -> str:
    """Create an API key with a monthly spending limit."""
    response = requests.post(
        f"{PROXY_URL}/key/generate",
        headers={"Authorization": f"Bearer {PROXY_KEY}"},
        json={
            "user_id": user_id,
            "max_budget": monthly_budget_usd,
            "budget_duration": "30d",  # budget resets every 30 days
            "models": models_allowed or ["gpt-4o-mini", "claude-haiku"],
            "metadata": {"created_for": user_id}
        }
    )
    return response.json()["key"]

def get_spend_report() -> dict:
    """Get current spend across all keys."""
    response = requests.get(
        f"{PROXY_URL}/global/spend",
        headers={"Authorization": f"Bearer {PROXY_KEY}"}
    )
    return response.json()

# Create a restricted key for a team member
team_key = create_budget_key("team-member-1", monthly_budget_usd=10.0)
print(f"Team key: {team_key}")
```
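Once keys have budgets, the next step is flagging users who are close to their limit. Assuming you shape the spend data into rows with `user_id`, `spend`, and `max_budget` keys (the exact response shape varies by endpoint and LiteLLM version, so adapt the field names), a simple check looks like:

```python
def users_near_budget(report_rows: list, threshold: float = 0.8) -> list:
    """Return user_ids whose spend is at or above threshold * budget.

    Assumes each row is a dict with 'user_id', 'spend', and 'max_budget'
    keys -- an assumed shape, not the literal proxy response format.
    Rows without a budget are skipped.
    """
    flagged = []
    for row in report_rows:
        budget = row.get("max_budget") or 0
        if budget and row.get("spend", 0) >= threshold * budget:
            flagged.append(row["user_id"])
    return flagged
```

Run this on a schedule and alert before keys hit their hard limit, rather than discovering the cutoff when requests start failing.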
What to Build Next
- Add Langfuse or Helicone observability to the proxy so you have per-request tracing, latency histograms, and quality scores in one dashboard
- Configure model groups in the proxy config so model: "fast" automatically routes to your cheapest available model and model: "smart" routes to your best
- Set up Redis caching in the proxy config so semantically similar requests are deduplicated at the gateway level before they reach any provider
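The model-group idea maps directly onto the proxy config: giving several deployments the same model_name creates a load-balanced group that clients address by alias. A sketch, with illustrative model choices for "fast" and "smart":

```yaml
model_list:
  - model_name: fast        # alias clients request
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fast        # same alias -> requests are balanced across both
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

Application code then asks for "fast" or "smart", and which concrete model serves the request becomes an operations decision.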
Related Reading
- How to Build a Multi-Model AI Router - build custom routing on top of LiteLLM for task-type awareness
- How to Implement Cost-Based AI Model Selection - cost-based selection logic integrates cleanly with the LiteLLM proxy
- How to Build an AI Load Balancer Across Providers - LiteLLM's built-in load balancing can replace a custom implementation for most use cases
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment