Meta's LLaMA 3 is the leading open-weight model family, and it's genuinely capable for production use cases. The appeal isn't competing with frontier models on benchmarks — it's running on your own hardware, with your own data, at zero per-token cost.
## Why LLaMA 3 for Local Inference
No API costs. Running LLaMA 3 locally means no per-token fees. For high-volume workloads — processing thousands of documents, building products, experimenting rapidly — this changes the economics entirely.
Data privacy. No data leaves your infrastructure. Critical for regulated industries (healthcare, finance, legal), processing sensitive customer data, or working under data residency requirements.
Fine-tuning control. Open weights means you can fine-tune on your own data to specialize the model for your use case — something you can't do with closed-weight models.
Offline capability. Once downloaded, runs without internet. Useful for embedded systems, air-gapped environments, or edge deployment.
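The cost point is easy to make concrete with back-of-envelope arithmetic. The per-token API price below is an illustrative assumption, not a quote from any provider:

```python
# Rough daily cost of a document-processing workload via a hosted API
docs_per_day = 10_000
tokens_per_doc = 2_000            # input + output combined (assumed average)
api_price_per_million = 5.00      # assumed hosted-API price, USD per 1M tokens

daily_tokens = docs_per_day * tokens_per_doc
daily_api_cost = daily_tokens / 1_000_000 * api_price_per_million
print(f"{daily_tokens:,} tokens/day -> ${daily_api_cost:.2f}/day via API, $0 marginal cost locally")
```

At these assumed numbers, the workload costs about $100/day hosted — roughly the price of a capable GPU every few months — while the local marginal cost is electricity only.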
## Getting Started With Ollama
Ollama is the simplest way to run LLaMA 3 locally. It handles model download, serves pre-quantized builds, and exposes a local API server.
Install and run:
```bash
# Install Ollama (macOS)
brew install ollama

# Pull and run Llama 3.2 (3B by default)
ollama pull llama3.2
ollama run llama3.2

# For the 8B model, use the Llama 3.1 tag instead: ollama run llama3.1:8b
```
Use the API (OpenAI-compatible):
```python
from openai import OpenAI

# Point the OpenAI client to your local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in 3 sentences."},
    ],
)
print(response.choices[0].message.content)
```
Ollama's API is OpenAI-compatible, so any code written for the OpenAI SDK works with a simple URL swap.
## Model Size Selection
| Model | Parameters | RAM (Q4 quantized) | Use For |
|---|---|---|---|
| Llama 3.2 1B | 1B | ~1GB | Edge devices, simple classification |
| Llama 3.2 3B | 3B | ~2GB | Mobile, fast responses |
| Llama 3.1 8B | 8B | ~5GB | General use, development |
| Llama 3.3 70B | 70B | ~40GB | Production, complex reasoning |
| Llama 3.1 405B | 405B | ~230GB | Near-frontier tasks |
For most development: start with the 8B model for speed and iteration, and move to 70B when you need higher quality.
Quantization: Ollama automatically uses quantized versions. A Q4 quantized 70B model runs in ~40GB RAM with minimal quality loss vs. the full precision model.
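The RAM figures in the table follow from simple arithmetic: weights take roughly `parameters × bits-per-weight / 8` bytes, with the KV cache and runtime adding more on top. A quick estimator (the 4.5 bits/weight figure for Q4 is an approximation — real quant formats mix precisions):

```python
def approx_weight_ram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate RAM for model weights alone; KV cache and runtime add more."""
    return params_billion * bits_per_weight / 8  # billions of params * bytes per weight

# Q4 quantization works out to roughly 4.5 bits/weight in practice
print(approx_weight_ram_gb(70, 4.5))   # ~39 GB, matching the ~40GB figure in the table
print(approx_weight_ram_gb(70, 16))    # 140 GB at FP16 -- why quantization matters
print(approx_weight_ram_gb(405, 4.5))  # ~228 GB, matching the 405B row
```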
## Prompting LLaMA 3
LLaMA 3 uses a specific chat template format. When using Ollama or the transformers library, this is handled automatically. But understanding it helps when fine-tuning or working directly with the model.
The LLaMA 3 instruct format:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```
When using the chat API (Ollama, together.ai, etc.), you just write normal system/user/assistant messages and the template is handled for you.
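To make the template concrete, here is a minimal pure-Python renderer that turns a message list into the format above. This is a sketch for illustration — in practice `tokenizer.apply_chat_template` in `transformers`, or Ollama itself, does this for you:

```python
def render_llama3_prompt(messages: list[dict]) -> str:
    """Render chat messages into the LLaMA 3 instruct template."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
        )
    # Trailing assistant header cues the model to generate its reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent in 3 sentences."},
])
print(prompt)
```

The model then generates until it emits `<|eot_id|>`, which is why the rendered prompt ends with an open assistant header.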
Effective system prompts for LLaMA 3:
LLaMA 3 follows system prompt instructions reliably. Structure them clearly:
```
You are a Python code reviewer. Your job is to review code for bugs,
security issues, and style problems.

Rules:
- Point out problems specifically (line numbers when possible)
- Suggest fixes, not just problems
- Prioritize: security > correctness > performance > style
- Do not add praise or filler text
- If the code has no problems, say so directly

Response format:
Issue 1: [type] line N — [description]
Fix: [specific change]
...
```
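A structured response format like this pays off when you parse replies downstream. A minimal sketch of parsing the issue lines above — the regex assumes the exact format shown, and real model output can deviate, so unmatched lines are simply skipped:

```python
import re

# Accept either an em dash or a hyphen between the line number and description
ISSUE_RE = re.compile(r"Issue (\d+): \[(\w+)\] line (\d+) [—-] (.+)")

def parse_review(text: str) -> list[dict]:
    """Extract structured issues from a reviewer reply in the format above."""
    issues = []
    for line in text.splitlines():
        m = ISSUE_RE.match(line.strip())
        if m:
            issues.append({
                "number": int(m.group(1)),
                "type": m.group(2),
                "line": int(m.group(3)),
                "description": m.group(4),
            })
    return issues

sample = "Issue 1: [security] line 12 — SQL built via string concatenation\nFix: use parameterized queries"
print(parse_review(sample))
```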
## When to Use LLaMA 3 vs. Hosted APIs
Use LLaMA 3 locally when:
- Processing sensitive or private data
- High volume (thousands of documents per day)
- Rapid experimentation without API cost concerns
- You need to fine-tune on proprietary data
- Offline or air-gapped environments
Use hosted frontier models (Claude, GPT-4o, Gemini) when:
- You need maximum capability on complex tasks
- Long context (>128K tokens)
- Native multimodal reasoning
- Reliability and uptime matter more than cost
- You don't have local GPU resources
Consider cloud-hosted LLaMA (Groq, Together AI, Replicate) when:
- You want open-source model characteristics (cost, control)
- But don't have local GPU hardware
- Latency matters (Groq's LPU hardware runs LLaMA at extremely high speeds)
## Fine-Tuning Basics
If LLaMA 3's default behavior doesn't match your use case, fine-tuning on your own data is an option not available with closed models.
When fine-tuning makes sense:
- Consistent output format across thousands of calls
- Domain-specific knowledge that's not in training data
- Style matching (writing in a specific brand voice)
- When you have labeled examples of exactly the input/output you want
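If you have labeled input/output pairs, the usual first step is rendering them into chat-format JSONL. A minimal sketch, assuming the common `messages` schema that conversational SFT trainers (including trl's SFTTrainer) accept — check your trainer's expected format:

```python
import json

# Hypothetical labeled pairs: (input, desired output)
pairs = [
    ("Summarize: The meeting covered the Q3 budget and hiring.", "Q3 budget review; hiring discussed."),
    ("Summarize: The deploy failed overnight and was reverted.", "Deploy rolled back; root cause pending."),
]

system = "You are a terse summarizer."

with open("train.jsonl", "w") as f:
    for user_text, assistant_text in pairs:
        record = {"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]}
        f.write(json.dumps(record) + "\n")
```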
Minimum viable fine-tuning (LoRA with unsloth):
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Fine-tune on your dataset using SFTTrainer...
```
LoRA fine-tuning runs on consumer GPUs (RTX 4090 or similar) and takes a fraction of the compute of full fine-tuning. The adapter files are small (~100MB) compared to the full model.
## Common Mistakes With LLaMA 3
Expecting frontier-level performance without fine-tuning. LLaMA 3 70B is capable, but falls behind GPT-4o and Claude on nuanced reasoning tasks. If you're hitting consistent failures, either fine-tune or consider a hosted model.
Not testing quantization quality impact. Q4 quantization is usually fine; Q2 can degrade meaningfully. Test your specific task at different quantization levels.
Forgetting to set temperature for deterministic tasks. Same as any model — set temperature to 0 for data extraction and classification tasks, higher for creative tasks.
Not structuring system prompts. LLaMA 3 follows instructions reliably but performs better with explicit, structured system prompts than vague ones.