How to Set Up Replicate for Model Hosting
Run open-source AI models on demand with Replicate's serverless GPU platform.
Jay Banlasan
The AI Systems Guy
Replicate model hosting setup solves a specific problem: you need to run an open-source model like Llama 3, SDXL, or Whisper without standing up your own GPU infrastructure. Replicate runs the model on demand, charges per second of compute, and you pay nothing when the model is not running. I use this for image generation pipelines and audio transcription where GPU costs at idle would kill the margin.
The platform has two use cases: running public models that Replicate already hosts, and deploying your own custom models. Start with public models first. The API is clean, the cold start is the main tradeoff, and fine-tuning is built in for supported architectures.
What You Need Before Starting
- A Replicate account at replicate.com
- Your API token from replicate.com/account/api-tokens
- Python 3.10+ with the `replicate` library (`pip install replicate`)
- Set `REPLICATE_API_TOKEN` in your environment
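The `replicate` client reads `REPLICATE_API_TOKEN` from the environment automatically, so it is worth failing fast if the variable is missing rather than getting an auth error mid-pipeline. A minimal check (my own helper, not part of the client):

```python
import os

def require_replicate_token() -> str:
    # Fail fast if the token is missing so later API calls don't error
    token = os.environ.get("REPLICATE_API_TOKEN", "")
    if not token:
        raise RuntimeError("Set REPLICATE_API_TOKEN before running")
    return token
```

Call it once at startup before any `replicate.run` call.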
Step 1: Run Your First Public Model
Replicate has thousands of public models. Every model has a version string that pins the exact weights and container. Always pin versions in production.
```python
import replicate

# Runs the latest version of the model; Step 2 shows how to pin one
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",
    input={
        "prompt": "Explain serverless GPU inference in two sentences.",
        "max_new_tokens": 200,
        "temperature": 0.7,
    },
)

# Most text models return an iterator of string chunks
result = "".join(output)
print(result)
```
Find the model identifier on any model's page on replicate.com. The format is `{owner}/{model-name}`, or `{owner}/{model-name}:{version-hash}` for a pinned version.
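If you store these identifiers in config, a small helper (my own, not part of the client) keeps the parsing in one place:

```python
def parse_model_ref(ref: str):
    """Split '{owner}/{model-name}' or '{owner}/{model-name}:{version-hash}'
    into (owner, model, version); version is None when unpinned."""
    name, _, version = ref.partition(":")
    owner, model = name.split("/", 1)
    return owner, model, version or None
```

This makes it easy to, say, log which version actually served a request.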
Step 2: Pin a Specific Version for Production
Model versions on Replicate are immutable. If you do not pin, the model owner can update weights and change your outputs.
```python
import replicate

def run_llama_pinned(prompt: str, max_tokens: int = 500) -> str:
    # Pinned version hash from the model's versions tab
    output = replicate.run(
        "meta/meta-llama-3-8b-instruct:dd9d8d3d99a02ae8d5f7a91c7e63044a6d7f3d9e0b8b08d6e0b3b4e2f9a1c3d5",
        input={
            "prompt": prompt,
            "max_new_tokens": max_tokens,
            "temperature": 0.7,
            "top_p": 0.9,
            "repetition_penalty": 1.1,
        },
    )
    return "".join(output)
```
Get the version hash from the model page under "Versions." Copy the full hash.
Step 3: Run Async Predictions for Long Tasks
For anything that takes more than a few seconds, use async predictions instead of blocking calls. This is essential for image generation and video processing.
```python
import replicate
import time

def run_async_prediction(model_version: str, input_params: dict):
    # Start the prediction without waiting; predictions.create takes the
    # bare version hash, so strip any "owner/name:" prefix
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],
        input=input_params,
    )
    print(f"Prediction started: {prediction.id}")
    print(f"Status: {prediction.status}")

    # Poll until complete
    while prediction.status not in ["succeeded", "failed", "canceled"]:
        time.sleep(2)
        prediction.reload()
        print(f"Status: {prediction.status}")

    if prediction.status == "failed":
        raise RuntimeError(f"Prediction failed: {prediction.error}")
    return prediction.output

# Example: SDXL image generation
output = run_async_prediction(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    {
        "prompt": "Professional headshot, business attire, clean background, photorealistic",
        "negative_prompt": "blurry, low quality, cartoon, illustration",
        "width": 1024,
        "height": 1024,
        "num_outputs": 1,
    },
)
print(f"Image URL: {output[0]}")
```
Step 4: Set Up Webhooks for Production Workflows
Polling works for testing. Production should use webhooks so your server gets notified when the prediction completes.
```python
import replicate
from flask import Flask, request, jsonify

app = Flask(__name__)

def start_prediction_with_webhook(model_version: str, input_params: dict, webhook_url: str) -> str:
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],  # bare version hash
        input=input_params,
        webhook=webhook_url,
        webhook_events_filter=["completed"],  # Only notify on completion
    )
    return prediction.id

@app.route("/webhook/replicate", methods=["POST"])
def handle_replicate_webhook():
    data = request.json
    prediction_id = data.get("id")
    status = data.get("status")
    output = data.get("output")

    if status == "succeeded":
        # Process the output
        print(f"Prediction {prediction_id} succeeded: {output}")
        # Trigger your downstream processing here
        process_output(prediction_id, output)
    elif status == "failed":
        error = data.get("error")
        print(f"Prediction {prediction_id} failed: {error}")
        handle_failure(prediction_id, error)

    return jsonify({"received": True}), 200

def process_output(prediction_id: str, output):
    # Your business logic here
    pass

def handle_failure(prediction_id: str, error: str):
    # Alert, retry logic, etc.
    pass
```
The webhook URL must be publicly accessible. Use ngrok locally for testing.
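A public endpoint will also receive forged requests, so verify signatures before trusting a payload. Replicate signs webhooks with a svix-style scheme: an HMAC-SHA256 over `{webhook-id}.{webhook-timestamp}.{body}` using your base64-encoded webhook secret, sent in a `webhook-signature` header. A sketch of the verification (check the current Replicate docs for the exact header names before relying on it):

```python
import base64
import hashlib
import hmac

def verify_replicate_webhook(secret: str, webhook_id: str, timestamp: str,
                             body: str, signature_header: str) -> bool:
    # secret looks like "whsec_<base64>"; the signature header holds one
    # or more space-separated "v1,<base64-signature>" entries
    key = base64.b64decode(secret.split("_", 1)[1])
    signed_content = f"{webhook_id}.{timestamp}.{body}".encode()
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    ).decode()
    return any(
        hmac.compare_digest(expected, part.split(",", 1)[1])
        for part in signature_header.split()
        if part.startswith("v1,")
    )
```

Reject the request with a 401 when this returns False, and also reject stale timestamps to block replays.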
Step 5: Fine-Tune a Model on Your Data
Replicate has built-in fine-tuning for several model architectures including SDXL and Llama. This lets you train a custom version on your data and deploy it as a new model version.
```python
import replicate

def start_fine_tune(
    training_data_url: str,  # URL to a .zip of images or a JSONL file
    model_name: str = "my-custom-llama",
    # Use the trainable version hash from the base model's versions tab
    base_model_version: str = "meta/meta-llama-3-8b:TRAINING_VERSION_HASH",
) -> str:
    # destination must be a model you have already created on Replicate
    training = replicate.trainings.create(
        version=base_model_version,
        input={
            "train_data": training_data_url,
            "num_train_epochs": 3,
            "learning_rate": 2e-4,
            "batch_size": 4,
        },
        destination=f"your-username/{model_name}",
    )
    print(f"Training started: {training.id}")
    print(f"Model will be at: your-username/{model_name}")
    return training.id
```
Training input format varies by model. Check the specific model's documentation for the expected data format. Image fine-tunes typically need a ZIP of images with a metadata.jsonl describing them.
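For image fine-tunes, packing that ZIP is easy to script. A sketch (the `file_name`/`text` keys in `metadata.jsonl` are an assumption here; check the target model's docs for its expected schema):

```python
import json
import zipfile
from pathlib import Path

def build_image_training_zip(image_dir: str, captions: dict, out_path: str = "train.zip") -> str:
    """Pack images plus a metadata.jsonl caption file into one ZIP.

    captions maps file names inside image_dir to caption strings, e.g.
    {"photo1.png": "a photo of my product on a desk"}.
    """
    with zipfile.ZipFile(out_path, "w") as zf:
        lines = []
        for name, caption in captions.items():
            zf.write(Path(image_dir) / name, arcname=name)
            lines.append(json.dumps({"file_name": name, "text": caption}))
        zf.writestr("metadata.jsonl", "\n".join(lines) + "\n")
    return out_path
```

Upload the resulting ZIP somewhere publicly fetchable and pass that URL as `train_data`.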
Step 6: Build a Cost Tracking Wrapper
Replicate charges per second. Track costs per operation so you can report back to clients or set budget alerts.
```python
import replicate
import time
from dataclasses import dataclass
from typing import Any

# Approximate costs per second by GPU type (check replicate.com/pricing for current rates)
GPU_COSTS_PER_SECOND = {
    "T4": 0.000225,
    "A40": 0.000575,
    "A100": 0.001150,
}

@dataclass
class PredictionResult:
    output: Any
    duration_seconds: float
    estimated_cost_usd: float
    prediction_id: str

def tracked_prediction(model_version: str, input_params: dict, gpu_type: str = "A40") -> PredictionResult:
    start_time = time.time()
    prediction = replicate.predictions.create(
        version=model_version.split(":")[-1],  # bare version hash
        input=input_params,
    )
    while prediction.status not in ["succeeded", "failed", "canceled"]:
        time.sleep(1)
        prediction.reload()
    end_time = time.time()

    if prediction.status == "failed":
        raise RuntimeError(f"Failed: {prediction.error}")

    # Billed time is predict_time; fall back to wall-clock if metrics are missing
    duration = (prediction.metrics or {}).get("predict_time", end_time - start_time)
    cost = duration * GPU_COSTS_PER_SECOND.get(gpu_type, GPU_COSTS_PER_SECOND["A40"])

    return PredictionResult(
        output=prediction.output,
        duration_seconds=duration,
        estimated_cost_usd=cost,
        prediction_id=prediction.id,
    )

# Usage
result = tracked_prediction(
    "stability-ai/sdxl:39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b",
    {"prompt": "Modern office interior, photorealistic"},
    gpu_type="A40",
)
print(f"Cost: ${result.estimated_cost_usd:.4f} | Time: {result.duration_seconds:.1f}s")
```
The actual cost appears in your Replicate billing dashboard. `prediction.metrics["predict_time"]` gives you the model's execution time, which is what Replicate bills.
What to Build Next
- Build a caching layer that stores prediction outputs so identical inputs do not re-run the model
- Set up a queue system (Redis + Celery) to batch image generation requests and stay under rate limits
- Create a model comparison tool that runs the same prompt through multiple model versions and scores outputs
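The caching idea in the first bullet can be sketched in a few lines. This is a minimal in-memory version; `run_fn` is a stand-in for `replicate.run`, and in production you would back the dict with Redis so the cache survives restarts:

```python
import hashlib
import json

class PredictionCache:
    """Cache keyed on (model_version, input) so identical requests
    never hit the API twice."""

    def __init__(self):
        self._store = {}

    def key(self, model_version: str, input_params: dict) -> str:
        # sort_keys makes the hash stable regardless of dict ordering
        payload = json.dumps(
            {"model": model_version, "input": input_params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, model_version: str, input_params: dict, run_fn):
        k = self.key(model_version, input_params)
        if k not in self._store:
            self._store[k] = run_fn(model_version, input_params)
        return self._store[k]
```

Note this only makes sense for deterministic inputs: if you pass a random seed or temperature > 0, two "identical" requests may legitimately want different outputs.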
Related Reading
- Setting Up Multi-Model AI Operations
- The Operator Model
- The AI Maturity Model
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment