
How to Set Up Groq for Ultra-Fast AI Inference

Configure Groq hardware-accelerated API for sub-second AI responses.

Jay Banlasan

The AI Systems Guy

Groq is the fastest AI inference provider available right now, delivering responses at 500 to 800 tokens per second, which means a 200-word answer comes back in under a second. I use Groq when latency is the constraint: real-time voice assistants, streaming chatbots where users watch words appear, and classification pipelines where downstream processing is waiting on the AI. The models are open source (Llama, Mixtral, Gemma), but the hardware is proprietary silicon that makes them run absurdly fast.
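That "200 words in under a second" claim checks out with quick arithmetic, assuming the common rule of thumb of roughly 0.75 English words per token (an approximation, not an exact figure):

```python
# Back-of-envelope math behind the latency claim
words = 200
words_per_token = 0.75  # rough rule of thumb for English text
tokens = words / words_per_token  # ~267 tokens

for tps in (500, 800):
    print(f"{tokens:.0f} tokens at {tps} tok/s -> {tokens / tps:.2f}s")
```

At 500 tokens per second that is about half a second; at 800, about a third.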

The tradeoff is rate limits. Groq's free tier has aggressive token-per-minute caps. For production volume, you will need a paid plan. But for latency-sensitive use cases, nothing else comes close.

What You Need Before Starting

- A Groq account (the free tier is enough to follow along)
- Python 3.10 or later, with pip
- A .env file in your project root for the API key

Step 1: Get Your API Key

Go to console.groq.com. Sign up and log in. Navigate to "API Keys" in the left sidebar. Click "Create API Key." Name it. Copy it.

Add to .env:

GROQ_API_KEY=gsk_your-key-here

Install the SDK:

pip install groq python-dotenv

Groq also supports the OpenAI SDK with a base URL swap, covered in Step 4.
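If you are curious what python-dotenv is doing under the hood, it essentially parses KEY=VALUE lines from your .env file into the environment. A stdlib-only sketch of the idea (the real library also handles quoting, interpolation, and export prefixes):

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: parse KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks and comments; split on the first "=" only
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```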

Step 2: Make Your First Request

import os
from groq import Groq
from dotenv import load_dotenv
import time

load_dotenv()

client = Groq(api_key=os.getenv("GROQ_API_KEY"))


def ask_groq(prompt: str, model: str = "llama-3.1-8b-instant", system: str | None = None) -> dict:
    """
    Send a prompt to Groq and return the response with timing info.
    
    Args:
        prompt: User message
        model: Groq model to use
        system: Optional system prompt
    
    Returns:
        Dict with 'text', 'tokens_per_second', 'total_tokens'
    """
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    start = time.time()
    
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=500
    )
    
    elapsed = time.time() - start
    
    content = response.choices[0].message.content
    total_tokens = response.usage.total_tokens
    tps = total_tokens / elapsed if elapsed > 0 else 0
    
    return {
        "text": content,
        "elapsed_seconds": round(elapsed, 2),
        "total_tokens": total_tokens,
        "tokens_per_second": round(tps, 0)
    }


result = ask_groq("What are 5 ways to reduce customer churn for a SaaS company?")
print(result["text"])
print(f"\nSpeed: {result['tokens_per_second']} tokens/second in {result['elapsed_seconds']}s")

You will typically see 500-800+ tokens per second, compared to 50-100 for GPT-4.

Step 3: Know Your Model Options

Groq serves several open-source models. Choose based on the task:

GROQ_MODELS = {
    # Fast and light - classification, simple Q&A, quick tasks
    "llama-3.1-8b-instant": {
        "use_case": "Fast responses, classification, simple tasks",
        "context_window": 128000,
        "speed": "fastest"
    },
    
    # Balanced - general business tasks, complex reasoning, analysis
    "llama-3.1-70b-versatile": {
        "use_case": "General tasks requiring good reasoning",
        "context_window": 128000,
        "speed": "fast"
    },
    
    # Function calling and tool use
    "llama3-groq-8b-8192-tool-use-preview": {
        "use_case": "Function calling and tool use",
        "context_window": 8192,
        "speed": "fast"
    }
}


def pick_model(task_type: str) -> str:
    """Return the right Groq model for a task type."""
    routing = {
        "classify": "llama-3.1-8b-instant",
        "extract": "llama-3.1-8b-instant",
        "chat": "llama-3.1-70b-versatile",
        "analyze": "llama-3.1-70b-versatile",
        "tools": "llama3-groq-8b-8192-tool-use-preview",
        "summarize": "llama-3.1-8b-instant"
    }
    return routing.get(task_type, "llama-3.1-8b-instant")

Step 4: Use the OpenAI SDK with Groq

If you already have OpenAI code, swap in Groq with a base URL change:

from openai import OpenAI

groq_via_openai = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

response = groq_via_openai.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Classify this email: billing or support?"}],
    max_tokens=10
)

print(response.choices[0].message.content)
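To make the swap reversible, keep the provider choice in one place. A small helper (a sketch; the env-var names match the ones used in this article) that returns the constructor kwargs, so downstream code never mentions a provider:

```python
import os


def client_config(provider: str = "groq") -> dict:
    """Return OpenAI(...) constructor kwargs for the chosen provider.

    Swapping providers is just a different base_url and API key;
    every chat.completions call downstream stays identical.
    """
    if provider == "groq":
        return {
            "api_key": os.getenv("GROQ_API_KEY"),
            "base_url": "https://api.groq.com/openai/v1",
        }
    return {"api_key": os.getenv("OPENAI_API_KEY")}


# usage: client = OpenAI(**client_config(os.getenv("LLM_PROVIDER", "groq")))
```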

Step 5: Build a Real-Time Chat Interface with Groq Streaming

Groq's streaming is noticeably faster than other providers. Words appear almost instantly:

def stream_groq(prompt: str, system: str | None = None) -> str:
    """Stream a Groq response with live token output."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    
    print("Response: ", end="", flush=True)
    full_response = ""
    
    stream = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=messages,
        max_tokens=1000,
        stream=True
    )
    for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            print(token, end="", flush=True)
            full_response += token
    
    print()
    return full_response


stream_groq(
    "Explain how LLM inference hardware works in simple terms.",
    system="You are a technical explainer. Keep it under 200 words."
)
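Since latency is the whole point of Groq, the metric worth tracking for streaming is time-to-first-token rather than total time. A measurement sketch written against a plain iterator of text chunks, so it works with any stream (the fake_stream here is purely illustrative; in practice you would yield chunk.choices[0].delta.content from the call above):

```python
import time


def measure_ttft(chunks) -> tuple[str, float]:
    """Consume a stream of text chunks; return (full_text, seconds to first chunk)."""
    start = time.time()
    ttft = 0.0
    parts = []
    for token in chunks:
        if not parts:
            ttft = time.time() - start  # record latency of the first chunk
        parts.append(token)
    return "".join(parts), ttft


# Illustrative stand-in for a real Groq stream
def fake_stream():
    for word in ["Groq ", "is ", "fast"]:
        time.sleep(0.01)
        yield word


text, ttft = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```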

Step 6: Handle Groq Rate Limits

Groq's free tier limits are strict (around 6,000 tokens per minute on some models). Handle them properly:

from groq import RateLimitError
import time


def groq_call_safe(prompt: str, model: str = "llama-3.1-8b-instant", max_retries: int = 4) -> str:
    """Groq call with exponential backoff on rate limit errors."""
    
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=500
            )
            return response.choices[0].message.content
        
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            
            wait = 2 ** (attempt + 1)  # 2s, 4s, 8s before the final attempt
            print(f"Groq rate limited. Waiting {wait}s...")
            time.sleep(wait)
    
    raise Exception("Max retries exceeded")


# For high-volume tasks, batch requests with delays
def batch_classify(texts: list[str], delay_seconds: float = 0.5) -> list[str]:
    """Classify multiple texts with rate limiting."""
    results = []
    
    for text in texts:
        result = groq_call_safe(
            f"Classify as POSITIVE, NEGATIVE, or NEUTRAL. Return one word only. Text: {text}",
            model="llama-3.1-8b-instant"
        )
        results.append(result.strip())
        time.sleep(delay_seconds)
    
    return results
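Fixed delays work, but a client-side tokens-per-minute budget maps more directly onto how Groq meters the free tier. A sketch of a sliding-window throttle (the 6,000 TPM default matches the figure above; check the console for the actual limit on your model and plan):

```python
import time
from collections import deque


class TokenBudget:
    """Client-side sliding-window tokens-per-minute throttle."""

    def __init__(self, tpm_limit: int = 6000):
        self.tpm_limit = tpm_limit
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def used_last_minute(self, now: float) -> int:
        # Drop events older than 60 seconds, then sum what's left
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def acquire(self, tokens: int) -> None:
        """Block until `tokens` fit inside the rolling 60-second window."""
        while self.used_last_minute(time.time()) + tokens > self.tpm_limit and self.events:
            # Sleep until the oldest event ages out of the window
            time.sleep(max(60 - (time.time() - self.events[0][0]), 0) + 0.01)
        self.events.append((time.time(), tokens))


# usage: budget.acquire(estimated_tokens) before each groq_call_safe(...)
```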

Step 7: Build a Speed Comparison Test

Benchmark Groq against your existing provider:

def benchmark_providers(prompt: str) -> dict:
    """Compare response speed across providers."""
    results = {}
    
    # Groq
    start = time.time()
    groq_response = ask_groq(prompt, model="llama-3.1-70b-versatile")
    results["groq_llama70b"] = {
        "time": round(time.time() - start, 2),
        "preview": groq_response["text"][:100]
    }
    
    # OpenAI for comparison
    from openai import OpenAI as OAI
    oai = OAI(api_key=os.getenv("OPENAI_API_KEY"))
    start = time.time()
    oai_resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    results["openai_gpt4o_mini"] = {
        "time": round(time.time() - start, 2),
        "preview": oai_resp.choices[0].message.content[:100]
    }
    
    for provider, data in results.items():
        print(f"{provider}: {data['time']}s")
    
    return results


benchmark_providers("What are 3 key metrics for a SaaS business?")
