How to Set Up LiteLLM as Your AI Gateway
Use LiteLLM to access 100+ AI models through a single unified API.
Jay Banlasan
The AI Systems Guy
Managing direct integrations with 5 different AI providers is a maintenance nightmare. Every provider has a different SDK, different authentication, different request format, and different error types. LiteLLM solves this by giving you one OpenAI-compatible API that proxies to any provider you configure. You write your code once against the OpenAI interface and switch models by changing a string.
I use LiteLLM as the gateway layer in every production AI system I build now. It handles fallbacks, load balancing, cost tracking, and caching out of the box. What used to take 400 lines of custom code is now a config file and 20 lines of integration code.
What You Need Before Starting
- Python 3.9+
- API keys for the providers you want to use
- Docker (optional, for running LiteLLM as a proxy server)
- The litellm Python package
Step 1: Install LiteLLM
```bash
pip install litellm
```
For the proxy server (recommended for team environments):
```bash
pip install "litellm[proxy]"
```
Step 2: Basic Usage - Call Any Model with One Interface
LiteLLM wraps any provider in the OpenAI message format.
```python
import litellm
import os

# Set your API keys
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["GEMINI_API_KEY"] = "your-google-key"

def complete(model: str, messages: list, **kwargs) -> str:
    response = litellm.completion(
        model=model,
        messages=messages,
        **kwargs
    )
    return response.choices[0].message.content

# Same code, different models
answer = complete("gpt-4o", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("claude-3-5-sonnet-20241022", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("gemini/gemini-1.5-pro", [{"role": "user", "content": "What is 2+2?"}])
answer = complete("groq/llama-3.1-70b-versatile", [{"role": "user", "content": "What is 2+2?"}])
```
All four calls use the identical interface. Switching providers is a one-word change.
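The only thing that varies between providers is the model string, which follows LiteLLM's provider-prefix convention (bare names like `gpt-4o` default to OpenAI). If you want friendlier names in your own code, a small lookup table keeps the strings in one place. The aliases below are illustrative choices, not a LiteLLM feature:

```python
# Map friendly aliases to LiteLLM model strings.
# The alias names are our own convention; LiteLLM only sees the values.
MODEL_ALIASES = {
    "openai-flagship": "gpt-4o",
    "anthropic-flagship": "claude-3-5-sonnet-20241022",
    "google-flagship": "gemini/gemini-1.5-pro",
    "groq-fast": "groq/llama-3.1-70b-versatile",
}

def resolve_model(alias: str) -> str:
    """Return the LiteLLM model string for a friendly alias."""
    try:
        return MODEL_ALIASES[alias]
    except KeyError:
        raise ValueError(f"Unknown model alias: {alias}")
```

Then `complete(resolve_model("groq-fast"), messages)` reads the same everywhere, and swapping providers means editing one dictionary.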
Step 3: Configure Fallbacks
Define a fallback chain in your completion call.
```python
import litellm

litellm.set_verbose = False

def complete_with_fallback(
    messages: list,
    primary_model: str = "gpt-4o-mini",
    fallback_models: list = None,
    **kwargs
) -> dict:
    if fallback_models is None:
        fallback_models = ["claude-3-haiku-20240307", "gpt-4o"]
    response = litellm.completion(
        model=primary_model,
        messages=messages,
        fallbacks=fallback_models,
        **kwargs
    )
    return {
        "content": response.choices[0].message.content,
        "model_used": response.model,
        "usage": {
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens
        }
    }

result = complete_with_fallback(
    messages=[{"role": "user", "content": "Summarize the concept of compounding interest."}]
)
print(f"Response from: {result['model_used']}")
print(result["content"])
```
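Conceptually, the `fallbacks` parameter tries each model in order until one succeeds. Here is a simplified, dependency-free sketch of that logic, leaving out the retry delays and provider cooldowns LiteLLM adds on top:

```python
def complete_with_manual_fallback(models: list, call_fn):
    """Try each model in order; return (model, result) for the first success.

    call_fn(model) performs the completion and raises on failure. This is
    a simplified sketch of fallback behavior -- no retry backoff, no
    cooldown of failing providers.
    """
    last_error = None
    for model in models:
        try:
            return model, call_fn(model)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"All models failed; last error: {last_error}")
```

This is also a useful pattern when you need fallback decisions LiteLLM cannot express, such as skipping a provider based on the content of the error.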
Step 4: Enable Cost Tracking
LiteLLM tracks costs automatically. Read them from the response.
```python
import litellm

def tracked_complete(messages: list, model: str = "gpt-4o-mini") -> dict:
    response = litellm.completion(model=model, messages=messages)
    # completion_cost() looks up the model's per-token pricing and
    # multiplies by the token counts in the response
    cost = litellm.completion_cost(completion_response=response)
    return {
        "content": response.choices[0].message.content,
        "model": response.model,
        "cost_usd": round(cost, 6),
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens
    }

result = tracked_complete([{"role": "user", "content": "Write a haiku about APIs."}])
print(f"Cost: ${result['cost_usd']}")
```
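If you want running totals rather than per-call numbers, a small ledger that accumulates the `cost_usd` values returned above is enough. This is a plain-Python sketch, not a LiteLLM feature (the proxy in Step 5 does this for you server-side):

```python
from collections import defaultdict

class CostLedger:
    """Accumulate per-model spend from tracked_complete()-style results."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, model: str, cost_usd: float) -> None:
        """Add one call's cost to the running total for that model."""
        self.spend[model] += cost_usd

    def total(self) -> float:
        """Total spend across all models so far."""
        return sum(self.spend.values())
```

Call `ledger.record(result["model"], result["cost_usd"])` after each completion and you have per-model and overall spend for the process lifetime.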
Step 5: Set Up the LiteLLM Proxy Server
For team environments, run LiteLLM as a proxy server that your whole team calls instead of individual providers.
Create litellm_config.yaml:
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  timeout: 30

litellm_settings:
  success_callback: ["langfuse"]  # Optional: observability
  drop_params: true               # Drop unsupported params instead of erroring
  request_timeout: 60

general_settings:
  master_key: your-proxy-master-key  # Required for API auth
  database_url: "postgresql://user:password@localhost:5432/litellm"  # Postgres is required for key and budget tracking
```
Start the proxy:
```bash
litellm --config litellm_config.yaml --port 4000
```
Step 6: Connect to the Proxy with the OpenAI SDK
Once the proxy is running, point the OpenAI client at it.
```python
from openai import OpenAI

# Point to LiteLLM proxy instead of OpenAI directly
proxy_client = OpenAI(
    api_key="your-proxy-master-key",
    base_url="http://localhost:4000"
)

# Now call any model you configured
response = proxy_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

# Switch to Claude with zero code changes
response = proxy_client.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
Your entire codebase uses the OpenAI SDK. Provider switching is a config file change.
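To avoid hard-coding the proxy address in every service, one option is to read it from the environment. The variable names here (`LITELLM_PROXY_URL`, `LITELLM_PROXY_KEY`) are illustrative choices for this sketch, not LiteLLM conventions:

```python
import os

def proxy_client_kwargs(default_url: str = "http://localhost:4000") -> dict:
    """Build OpenAI-client constructor kwargs for the LiteLLM proxy.

    LITELLM_PROXY_URL and LITELLM_PROXY_KEY are illustrative env var
    names -- pick whatever fits your deployment.
    """
    return {
        "base_url": os.environ.get("LITELLM_PROXY_URL", default_url),
        "api_key": os.environ.get("LITELLM_PROXY_KEY", "your-proxy-master-key"),
    }
```

Then every service constructs its client with `OpenAI(**proxy_client_kwargs())` and moving the proxy is a deployment change, not a code change.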
Step 7: Add Budget Controls via the Proxy
Set per-user or per-team spending limits through the proxy API.
```python
import requests

PROXY_URL = "http://localhost:4000"
PROXY_KEY = "your-proxy-master-key"

def create_budget_key(
    user_id: str,
    monthly_budget_usd: float,
    models_allowed: list = None
) -> str:
    """Create an API key with a monthly spending limit."""
    response = requests.post(
        f"{PROXY_URL}/key/generate",
        headers={"Authorization": f"Bearer {PROXY_KEY}"},
        json={
            "user_id": user_id,
            "max_budget": monthly_budget_usd,
            "budget_duration": "30d",  # budget resets every 30 days
            "models": models_allowed or ["gpt-4o-mini", "claude-haiku"],
            "metadata": {"created_for": user_id}
        }
    )
    return response.json()["key"]

def get_spend_report() -> dict:
    """Get current spend across all keys."""
    response = requests.get(
        f"{PROXY_URL}/global/spend",
        headers={"Authorization": f"Bearer {PROXY_KEY}"}
    )
    return response.json()

# Create a restricted key for a team member
team_key = create_budget_key("team-member-1", monthly_budget_usd=10.0)
print(f"Team key: {team_key}")
```
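Once keys have budgets, the next step is flagging users who are close to their limit. Assuming you shape the spend data into rows with `user_id`, `spend`, and `max_budget` keys (the exact response shape varies by endpoint and LiteLLM version, so adapt the field names), a simple check looks like:

```python
def users_near_budget(report_rows: list, threshold: float = 0.8) -> list:
    """Return user_ids whose spend is at or above threshold * budget.

    Assumes each row is a dict with 'user_id', 'spend', and 'max_budget'
    keys -- an assumed shape, not the literal proxy response format.
    Rows without a budget are skipped.
    """
    flagged = []
    for row in report_rows:
        budget = row.get("max_budget") or 0
        if budget and row.get("spend", 0) >= threshold * budget:
            flagged.append(row["user_id"])
    return flagged
```

Run this on a schedule and alert before keys hit their hard limit, rather than discovering the cutoff when requests start failing.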
What to Build Next
- Add Langfuse or Helicone observability to the proxy so you have per-request tracing, latency histograms, and quality scores in one dashboard
- Configure model groups in the proxy config so model: "fast" automatically routes to your cheapest available model and model: "smart" routes to your best
- Set up Redis caching in the proxy config so semantically similar requests are deduplicated at the gateway level before they reach any provider
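The model-group idea maps directly onto the proxy config: giving several deployments the same model_name creates a load-balanced group that clients address by alias. A sketch, with illustrative model choices for "fast" and "smart":

```yaml
model_list:
  - model_name: fast        # alias clients request
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fast        # same alias -> requests are balanced across both
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

Application code then asks for "fast" or "smart", and which concrete model serves the request becomes an operations decision.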
Related Reading
- How to Build a Multi-Model AI Router - build custom routing on top of LiteLLM for task-type awareness
- How to Implement Cost-Based AI Model Selection - cost-based selection logic integrates cleanly with the LiteLLM proxy
- How to Build an AI Load Balancer Across Providers - LiteLLM's built-in load balancing can replace a custom implementation for most use cases
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment