How to Set Up Replicate for Model Hosting
Run open-source AI models on demand with Replicate's serverless GPU platform.
Jay Banlasan
The AI Systems Guy
Replicate model hosting setup solves a specific problem: you need to run an open-source model like Llama 3, SDXL, or Whisper without standing up your own GPU infrastructure. Replicate runs the model on demand, charges per second of compute, and you pay nothing when the model is not running. I use this for image generation pipelines and audio transcription where GPU costs at idle would kill the margin.
The platform has two use cases: running public models that Replicate already hosts, and deploying your own custom models. Start with public models first. The API is clean, the cold start is the main tradeoff, and fine-tuning is built in for supported architectures.
What You Need Before Starting
- A Replicate account at replicate.com
- Your API token from replicate.com/account/api-tokens
- Python 3.10+ with the `replicate` library (`pip install replicate`)
- Set `REPLICATE_API_TOKEN` in your environment
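The `replicate` client reads `REPLICATE_API_TOKEN` from the environment automatically, so it is worth failing fast if the variable is missing rather than getting an auth error mid-pipeline. A minimal check (my own helper, not part of the client):

```python
import os

def require_replicate_token() -> str:
    # Fail fast if the token is missing so later API calls don't error
    token = os.environ.get("REPLICATE_API_TOKEN", "")
    if not token:
        raise RuntimeError("Set REPLICATE_API_TOKEN before running")
    return token
```

Call it once at startup before any `replicate.run` call.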
Step 1: Run Your First Public Model
Replicate has thousands of public models. Every model has a version string that pins the exact weights and container. Always pin versions in production.
```python
import replicate

# Runs the latest version of the model; Step 2 shows how to pin one
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",
    input={
        "prompt": "Explain serverless GPU inference in two sentences.",
        "max_new_tokens": 200,
        "temperature": 0.7,
    },
)

# Most text models return an iterator of string chunks
result = "".join(output)
print(result)
```
Find the model identifier on any model's page on replicate.com. The format is `{owner}/{model-name}`, or `{owner}/{model-name}:{version-hash}` for a pinned version.
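If you store these identifiers in config, a small helper (my own, not part of the client) keeps the parsing in one place:

```python
def parse_model_ref(ref: str):
    """Split '{owner}/{model-name}' or '{owner}/{model-name}:{version-hash}'
    into (owner, model, version); version is None when unpinned."""
    name, _, version = ref.partition(":")
    owner, model = name.split("/", 1)
    return owner, model, version or None
```

This makes it easy to, say, log which version actually served a request.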
Step 2: Pin a Specific Version for Production
Model versions on Replicate are immutable. If you do not pin, the model owner can update weights and change your outputs.
```python
import replicate

def run_llama_pinned(prompt: str, max_tokens: int = 500) -> str:
    # Pinned version hash from the model's versions tab
    output = replicate.run(
        "meta/meta-llama-3-8b-instruct:dd9d8d3d99a02ae8d5f7a91c7e63044a6d7f3d9e0b8b08d6e0b3b4e2f9a1c3d5",
        input={
            "prompt": prompt,
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "top_p": 0.9,
            "repetition_penalty": 1.1,
        },
    )
    return "".join(output)
```
Get the version hash from the model page under "Versions." Copy the full hash.
Step 3: Run Async Predictions for Long Tasks
For anything that takes more than a few seconds, use async predictions instead of blocking calls. This is essential for image generation and video processing.
```python
import replicate
import time

def run_async_prediction(model_version: str, input_params: dict):
    # Start the prediction without waiting; predictions.create takes the
    # bare version hash, so strip any "owner/name:" prefix
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],
        input=input_params,
    )
    print(f"Prediction started: {prediction.id}")
    print(f"Status: {prediction.status}")

    # Poll until complete
    while prediction.status not in ["succeeded", "failed", "canceled"]:
        time.sleep(2)
        prediction.reload()
        print(f"Status: {prediction.status}")

    if prediction.status == "failed":
        raise RuntimeError(f"Prediction failed: {prediction.error}")
    return prediction.output

# Example: SDXL image generation
output = run_async_prediction(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    {
        "prompt": "Professional headshot, business attire, clean background, photorealistic",
        "negative_prompt": "blurry, low quality, cartoon, illustration",
        "width": 1024,
        "height": 1024,
        "num_outputs": 1,
    },
)
print(f"Image URL: {output[0]}")
```
Step 4: Set Up Webhooks for Production Workflows
Polling works for testing. Production should use webhooks so your server gets notified when the prediction completes.
```python
import replicate
from flask import Flask, request, jsonify

app = Flask(__name__)

def start_prediction_with_webhook(model_version: str, input_params: dict, webhook_url: str) -> str:
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],  # bare version hash
        input=input_params,
        webhook=webhook_url,
        webhook_events_filter=["completed"],  # Only notify on completion
    )
    return prediction.id

@app.route("/webhook/replicate", methods=["POST"])
def handle_replicate_webhook():
    data = request.json
    prediction_id = data.get("id")
    status = data.get("status")
    output = data.get("output")

    if status == "succeeded":
        # Process the output
        print(f"Prediction {prediction_id} succeeded: {output}")
        # Trigger your downstream processing here
        process_output(prediction_id, output)
    elif status == "failed":
        error = data.get("error")
        print(f"Prediction {prediction_id} failed: {error}")
        handle_failure(prediction_id, error)

    return jsonify({"received": True}), 200

def process_output(prediction_id: str, output):
    # Your business logic here
    pass

def handle_failure(prediction_id: str, error: str):
    # Alert, retry logic, etc.
    pass
```
The webhook URL must be publicly accessible. Use ngrok locally for testing.
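A public endpoint will also receive forged requests, so verify signatures before trusting a payload. Replicate signs webhooks with a svix-style scheme: an HMAC-SHA256 over `{webhook-id}.{webhook-timestamp}.{body}` using your base64-encoded webhook secret, sent in a `webhook-signature` header. A sketch of the verification (check the current Replicate docs for the exact header names before relying on it):

```python
import base64
import hashlib
import hmac

def verify_replicate_webhook(secret: str, webhook_id: str, timestamp: str,
                             body: str, signature_header: str) -> bool:
    # secret looks like "whsec_<base64>"; the signature header holds one
    # or more space-separated "v1,<base64-signature>" entries
    key = base64.b64decode(secret.split("_", 1)[1])
    signed_content = f"{webhook_id}.{timestamp}.{body}".encode()
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()
    return any(
        hmac.compare_digest(expected, part.split(",", 1)[1])
        for part in signature_header.split()
        if part.startswith("v1,")
    )
```

Reject the request with a 401 when this returns False, and also reject stale timestamps to block replays.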
Step 5: Fine-Tune a Model on Your Data
Replicate has built-in fine-tuning for several model architectures including SDXL and Llama. This lets you train a custom version on your data and deploy it as a new model version.
```python
import replicate

def start_fine_tune(
    training_data_url: str,  # URL to a .zip of images or a JSONL file
    model_name: str = "my-custom-llama",
    # Use the trainable version hash from the base model's versions tab
    base_model_version: str = "meta/meta-llama-3-8b:TRAINING_VERSION_HASH",
) -> str:
    # destination must be a model you have already created on Replicate
    training = replicate.trainings.create(
        version=base_model_version,
        input={
            "train_data": training_data_url,
            "num_train_epochs": 3,
            "learning_rate": 2e-4,
            "batch_size": 4,
        },
        destination=f"your-username/{model_name}",
    )
    print(f"Training started: {training.id}")
    print(f"Model will be at: your-username/{model_name}")
    return training.id
```
Training input format varies by model. Check the specific model's documentation for the expected data format. Image fine-tunes typically need a ZIP of images with a metadata.jsonl describing them.
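For image fine-tunes, packing that ZIP is easy to script. A sketch (the `file_name`/`text` keys in `metadata.jsonl` are an assumption here; check the target model's docs for its expected schema):

```python
import json
import zipfile
from pathlib import Path

def build_image_training_zip(image_dir: str, captions: dict, out_path: str = "train.zip") -> str:
    """Pack images plus a metadata.jsonl caption file into one ZIP.

    captions maps file names inside image_dir to caption strings, e.g.
    {"photo1.png": "a photo of my product on a desk"}.
    """
    with zipfile.ZipFile(out_path, "w") as zf:
        lines = []
        for name, caption in captions.items():
            zf.write(Path(image_dir) / name, arcname=name)
            lines.append(json.dumps({"file_name": name, "text": caption}))
        zf.writestr("metadata.jsonl", "\n".join(lines) + "\n")
    return out_path
```

Upload the resulting ZIP somewhere publicly fetchable and pass that URL as `train_data`.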
Step 6: Build a Cost Tracking Wrapper
Replicate charges per second. Track costs per operation so you can report back to clients or set budget alerts.
```python
import replicate
import time
from dataclasses import dataclass
from typing import Any

# Approximate costs per second by GPU type (check replicate.com/pricing for current rates)
GPU_COSTS_PER_SECOND = {
    "T4": 0.000225,
    "A40": 0.000575,
    "A100": 0.001150,
}

@dataclass
class PredictionResult:
    output: Any
    duration_seconds: float
    estimated_cost_usd: float
    prediction_id: str

def tracked_prediction(model_version: str, input_params: dict, gpu_type: str = "A40") -> PredictionResult:
    start_time = time.time()
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],  # bare version hash
        input=input_params,
    )
    while prediction.status not in ["succeeded", "failed", "canceled"]:
        time.sleep(1)
        prediction.reload()
    end_time = time.time()

    if prediction.status == "failed":
        raise RuntimeError(f"Failed: {prediction.error}")

    # Billed time is predict_time; fall back to wall-clock if metrics are missing
    duration = (prediction.metrics or {}).get("predict_time", end_time - start_time)
    cost = duration * GPU_COSTS_PER_SECOND.get(gpu_type, GPU_COSTS_PER_SECOND["A40"])

    return PredictionResult(
        output=prediction.output,
        duration_seconds=duration,
        estimated_cost_usd=cost,
        prediction_id=prediction.id,
    )

# Usage
result = tracked_prediction(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    {"prompt": "Modern office interior, photorealistic"},
    gpu_type="A40",
)
print(f"Cost: ${result.estimated_cost_usd:.4f} | Time: {result.duration_seconds:.1f}s")
```
The actual cost appears in your Replicate billing dashboard. `prediction.metrics["predict_time"]` gives you the model's execution time, which is what Replicate bills.
What to Build Next
- Build a caching layer that stores prediction outputs so identical inputs do not re-run the model
- Set up a queue system (Redis + Celery) to batch image generation requests and stay under rate limits
- Create a model comparison tool that runs the same prompt through multiple model versions and scores outputs
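The caching idea in the first bullet can be sketched in a few lines. This is a minimal in-memory version; `run_fn` is a stand-in for `replicate.run`, and in production you would back the dict with Redis so the cache survives restarts:

```python
import hashlib
import json

class PredictionCache:
    """Cache keyed on (model_version, input) so identical requests
    never hit the API twice."""

    def __init__(self):
        self._store = {}

    def key(self, model_version: str, input_params: dict) -> str:
        # sort_keys makes the hash stable regardless of dict ordering
        payload = json.dumps(
            {"model": model_version, "input": input_params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, model_version: str, input_params: dict, run_fn):
        k = self.key(model_version, input_params)
        if k not in self._store:
            self._store[k] = run_fn(model_version, input_params)
        return self._store[k]
```

Note this only makes sense for deterministic inputs: if you pass a random seed or temperature > 0, two "identical" requests may legitimately want different outputs.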
Related Reading
- Setting Up Multi-Model AI Operations
- The Operator Model
- The AI Maturity Model
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment