How to Install and Run Local LLMs with Ollama
Install Ollama and run open-source language models locally on your machine.
Jay Banlasan
The AI Systems Guy
Running a local LLM with Ollama is the move for anything you cannot send to a cloud API: client contracts with data confidentiality clauses, internal HR documents, financial records, anything where the data cannot leave your machine. This setup takes about 10 minutes, and once it is done you have a local AI server that behaves like an API you control completely.
Ollama is also the right call for volume work where cloud API costs add up. Once the model is running locally, every call is essentially free. The tradeoff is quality and speed, which is why I use Ollama for classification and extraction tasks where Llama 3 or Mistral is good enough, and reserve Claude/GPT-4 for tasks that need top-tier reasoning.
What You Need Before Starting
- A machine with at least 8GB RAM (16GB recommended for 7B models)
- macOS, Linux, or Windows (WSL2 recommended on Windows)
- About 5-10GB of disk space per model
- Basic terminal comfort
- Python 3.9+ for the API integration steps
Step 1: Install Ollama
On macOS: Download the Ollama app from ollama.com, or install with Homebrew:
brew install ollama
(The install.sh script below is Linux-only and will refuse to run on macOS.)
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows: Download the installer from ollama.com. It installs as a background service and adds ollama to your PATH. If you use WSL2, run the Linux install command inside your WSL terminal.
Verify the install:
ollama --version
You should see a version number. Ollama automatically starts a local server at http://localhost:11434.
Step 2: Pull and Run Your First Model
Download Llama 3.1 8B (good balance of speed and quality, about 4.7GB):
ollama pull llama3.1
For a smaller, faster model (great for classification on older hardware):
ollama pull llama3.2:3b
For code tasks:
ollama pull codellama
Run a model in interactive chat mode to test it:
ollama run llama3.1
Type a message and press Enter. Type /bye to exit.
Step 3: Use the REST API Directly
Ollama exposes a REST API at http://localhost:11434. You can hit it with any HTTP client:
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1",
    "prompt": "What are the benefits of running AI models locally?",
    "stream": false
  }'
For chat-style (with message history):
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.1",
    "messages": [
      {"role": "system", "content": "You are a helpful business analyst."},
      {"role": "user", "content": "Summarize the benefits of local AI for data privacy."}
    ],
    "stream": false
  }'
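With "stream": true (the API's default), Ollama returns newline-delimited JSON chunks instead of one object, each carrying a piece of the response. A minimal sketch of consuming that stream from Python using only the standard library; the helper names extract_text and stream_generate are my own, not part of Ollama:

```python
import json
import urllib.request

def extract_text(ndjson_lines):
    """Concatenate the 'response' field from each streamed JSON chunk."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final chunk is marked done
            break
    return "".join(parts)

def stream_generate(prompt, model="llama3.1", host="http://localhost:11434"):
    """POST to /api/generate and consume the NDJSON stream line by line."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_text(line.decode() for line in resp)
```

Streaming matters more locally than in the cloud: a 7B model on modest hardware can take several seconds for a long answer, and printing tokens as they arrive makes the wait feel shorter.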
Step 4: Connect Ollama to Python
Install the official Python library:
pip install ollama
Basic usage:
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a document analyst. Be concise and specific."
        },
        {
            "role": "user",
            "content": "What are the key clauses I should look for in an NDA?"
        }
    ]
)

print(response["message"]["content"])
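The chat endpoint is stateless: the server does not remember earlier turns, so for a multi-turn conversation you resend the full message list each time. A sketch of keeping that history yourself; the helper names new_history, add_turn, and chat_turn are my own:

```python
def new_history(system_prompt):
    """Start a conversation; the caller owns the running message list."""
    return [{"role": "system", "content": system_prompt}]

def add_turn(history, role, content):
    """Append one message dict to the conversation history."""
    history.append({"role": role, "content": content})

def chat_turn(history, user_message, model="llama3.1"):
    """Resend the full history plus the new user message, then record the reply."""
    import ollama  # imported here so the history helpers above work without the package
    add_turn(history, "user", user_message)
    reply = ollama.chat(model=model, messages=history)["message"]["content"]
    add_turn(history, "assistant", reply)
    return reply
```

Keep an eye on history length: every turn resends everything, and once the conversation exceeds the model's context window the oldest turns get truncated.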
Step 5: Build a Local AI Function That Mirrors the Cloud Pattern
I build local functions to match the same signature as my cloud AI wrappers. This way I can swap models without changing application code:
import ollama
from typing import Optional

def ask_local(
    prompt: str,
    system_prompt: Optional[str] = None,
    model: str = "llama3.1",
    temperature: float = 0.3
) -> str:
    """
    Send a prompt to a local Ollama model.

    Args:
        prompt: User message
        system_prompt: Optional system instructions
        model: Ollama model name (must be pulled first)
        temperature: 0.0 deterministic, 1.0 creative

    Returns:
        Response text
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    response = ollama.chat(
        model=model,
        messages=messages,
        options={"temperature": temperature}
    )
    return response["message"]["content"]

# Example: Classify documents without sending data to the cloud
def classify_document(text: str) -> str:
    system = "Classify this document as one of: CONTRACT, INVOICE, REPORT, EMAIL, OTHER. Return only the category word."
    return ask_local(text, system_prompt=system, temperature=0.0)

if __name__ == "__main__":
    doc_snippet = "This agreement is entered into as of January 1, 2024 between Party A and Party B..."
    result = classify_document(doc_snippet)
    print(f"Document type: {result}")
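Local models fail differently from cloud APIs: the server may not be running, or the model may not have been pulled yet. A hedged sketch of a fallback wrapper so a classification pipeline degrades instead of crashing; the helper with_fallback and the "OTHER" default are my choices, not part of Ollama:

```python
def with_fallback(fn, fallback):
    """Wrap a text function so any runtime failure returns a default instead of raising."""
    def wrapped(text):
        try:
            return fn(text)
        except Exception as exc:  # e.g. connection refused when the server is down
            print(f"Local model unavailable ({exc!r}); using fallback")
            return fallback
    return wrapped

# Usage with the classifier above:
# safe_classify = with_fallback(classify_document, "OTHER")
# print(safe_classify(doc_snippet))
```

In a routing setup, the fallback branch is also a natural place to hand the document to a cloud model instead of returning a static category, as long as the document is not in the must-stay-local bucket.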
Step 6: List Available Models and Manage Storage
See what models you have pulled:
ollama list
Check a model's details:
ollama show llama3.1
Remove a model to free disk space:
ollama rm llama3.2:3b
In Python, list models programmatically:
import ollama

# ollama-python 0.4+ returns typed response objects; the model name field is "model"
models = ollama.list()
for model in models.models:
    size_gb = model.size / (1024**3)
    print(f"{model.model}: {size_gb:.1f}GB")
Step 7: Run Ollama as a Background Service
On Linux, set it to start automatically:
sudo systemctl enable ollama
sudo systemctl start ollama
Check status:
sudo systemctl status ollama
On macOS, Ollama runs as a menu bar app and starts at login automatically.
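By default the server binds to 127.0.0.1, so only your own machine can reach it. To let other machines on your network use it, set OLLAMA_HOST via a systemd override. A sketch assuming the Linux systemd install; note the Ollama API has no built-in authentication, so only do this on a trusted network:

```shell
# Open an override file for the ollama service
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
# Then apply the change:
sudo systemctl restart ollama
```

Clients on the network then point at http://YOUR_SERVER_IP:11434 instead of localhost.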
What to Build Next
- Create a document processing pipeline that routes sensitive documents to Ollama and everything else to Claude
- Run Ollama on a VPS and expose it to your internal network for team use
- Use Ollama's model library to test different open-source models against your specific tasks before committing to one
Related Reading
- How to Handle AI API Rate Limits Gracefully - Rate limiting patterns apply even to local models under load
- How to Create AI API Keys Securely - Not needed for local models, but relevant when you mix local and cloud
- How to Set Up Groq for Ultra-Fast AI Inference - If local speed is not enough, Groq runs the same open-source models faster
Want this system built for your business?
Get a free assessment. We will map every system your business needs and show you the ROI.
Get Your Free Assessment