
How to Prompt LLaMA 3: Local Inference and Ollama Setup

LLaMA 3 from Meta is the most capable open-source model family available. Here's how to run it locally with Ollama and prompt it effectively, plus when to choose it over hosted APIs.


LLaMA 3 from Meta is the leading open-source model family, and it's genuinely capable for production use cases. The appeal isn't competing with frontier models on benchmarks — it's running on your own hardware, with your own data, at zero per-token cost.


Why LLaMA 3 for Local Inference

No API costs. Running LLaMA 3 locally means no per-token fees. For high-volume workloads — processing thousands of documents, building products, experimenting rapidly — this changes the economics entirely.

Data privacy. No data leaves your infrastructure. Critical for regulated industries (healthcare, finance, legal), processing sensitive customer data, or working under data residency requirements.

Fine-tuning control. Open weights means you can fine-tune on your own data to specialize the model for your use case — something you can't do with closed-weight models.

Offline capability. Once downloaded, runs without internet. Useful for embedded systems, air-gapped environments, or edge deployment.


Getting Started With Ollama

Ollama is the simplest way to run LLaMA 3 locally. It handles model download, quantization, and a local API server.

Install and run:

# Install Ollama (macOS)
brew install ollama

# Pull and run Llama 3.2 (3B by default; use llama3.1:8b for the 8B model)
ollama pull llama3.2
ollama run llama3.2

Use the API (OpenAI-compatible):

from openai import OpenAI

# Point the OpenAI client to your local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ]
)
print(response.choices[0].message.content)

Ollama's API is OpenAI-compatible, so any code written for the OpenAI SDK works with a simple URL swap.


Model Size Selection

| Model | Parameters | RAM required | Use for |
|-------|------------|--------------|---------|
| Llama 3.2 1B | 1B | ~1GB | Edge devices, simple classification |
| Llama 3.2 3B | 3B | ~2GB | Mobile, fast responses |
| Llama 3.1 8B | 8B | ~5GB | General use, development |
| Llama 3.3 70B | 70B | ~40GB | Production, complex reasoning |
| Llama 3.1 405B | 405B | ~230GB | Near-frontier tasks |

For most development: Start with the 8B model for speed and iteration. Move to 70B when you need higher quality.

Quantization: Ollama automatically uses quantized versions. A Q4 quantized 70B model runs in ~40GB RAM with minimal quality loss vs. the full precision model.
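As a sanity check on the RAM figures above, a rough back-of-envelope estimate is parameters × bits per weight ÷ 8, plus runtime overhead for the KV cache and buffers. A minimal sketch; the ~4.5 bits/weight for Q4 (quantization metadata included) and the 20% overhead factor are illustrative assumptions, not Ollama internals:

```python
# Rough heuristic: weights take params * bits / 8 bytes, plus runtime overhead.
# The overhead factor is an illustrative assumption, not an official formula.
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Approximate resident memory in GB for a quantized model."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

# Q4 (~4.5 bits/weight including grouping metadata) for a 70B model
# lands in the same ballpark as the ~40GB figure in the table:
print(estimate_ram_gb(70, 4.5))  # roughly 47 GB
print(estimate_ram_gb(8, 4.5))   # roughly 5 GB
```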


Prompting LLaMA 3

LLaMA 3 uses a specific chat template format. When using Ollama or the transformers library, this is handled automatically. But understanding it helps when fine-tuning or working directly with the model.

The LLaMA 3 instruct format:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

{system prompt}
<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{user message}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

When using the chat API (Ollama, together.ai, etc.), you just write normal system/user/assistant messages and the template is handled for you.
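To make the mapping concrete, here is a simplified sketch of how a message list is rendered into the raw template above. In practice Ollama (or `tokenizer.apply_chat_template` in transformers) does this for you; this version ignores details like tool-use headers:

```python
# Simplified renderer for the LLaMA 3 instruct template shown above.
# Real toolchains handle this automatically; this is for illustration only.
def render_llama3(messages: list[dict]) -> str:
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Leave the assistant header open so the model generates the reply next
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = render_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent in 3 sentences."},
])
```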

Effective system prompts for LLaMA 3:

LLaMA 3 follows system prompt instructions reliably. Structure them clearly:

You are a Python code reviewer. Your job is to review code for bugs,
security issues, and style problems.

Rules:
- Point out problems specifically (line numbers when possible)
- Suggest fixes, not just problems
- Prioritize: security > correctness > performance > style
- Do not add praise or filler text
- If the code has no problems, say so directly

Response format:
Issue 1: [type] line N — [description]
Fix: [specific change]
...
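One payoff of pinning the response format down like this is that outputs become machine-parseable. A minimal parser sketch for the hypothetical `Issue N: [type] line N` format above (the regex and field names are this article's convention, not a standard):

```python
import re

# Matches lines like: Issue 1: [security] line 12 — SQL injection in query
# (accepts either an em dash or a hyphen as the separator)
ISSUE_RE = re.compile(r"Issue\s+\d+:\s*\[(\w+)\]\s*line\s+(\d+)\s*[—-]\s*(.+)")

def parse_issues(text: str) -> list[dict]:
    """Extract structured issues from a model response in the format above."""
    issues = []
    for line in text.splitlines():
        m = ISSUE_RE.match(line.strip())
        if m:
            kind, line_no, desc = m.groups()
            issues.append({"type": kind, "line": int(line_no), "description": desc})
    return issues
```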

When to Use LLaMA 3 vs. Hosted APIs

Use LLaMA 3 locally when:

  • Processing sensitive or private data
  • High volume (thousands of documents per day)
  • Rapid experimentation without API cost concerns
  • You need to fine-tune on proprietary data
  • Offline or air-gapped environments

Use hosted frontier models (Claude, GPT-4o, Gemini) when:

  • You need maximum capability on complex tasks
  • Long context (>128K tokens)
  • Native multimodal reasoning
  • Reliability and uptime matter more than cost
  • You don't have local GPU resources

Consider cloud-hosted LLaMA (Groq, Together AI, Replicate) when:

  • You want open-source model characteristics (cost, control) but don't have local GPU hardware
  • Latency matters (Groq's LPU hardware runs LLaMA at extremely high speeds)

Fine-Tuning Basics

If LLaMA 3's default behavior doesn't match your use case, fine-tuning on your own data is an option not available with closed models.

When fine-tuning makes sense:

  • Consistent output format across thousands of calls
  • Domain-specific knowledge that's not in training data
  • Style matching (writing in a specific brand voice)
  • When you have labeled examples of exactly the input/output you want

Minimum viable fine-tuning (LoRA with unsloth):

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Fine-tune on your dataset using SFTTrainer...

LoRA fine-tuning runs on consumer GPUs (RTX 4090 or similar) and takes a fraction of the compute of full fine-tuning. The adapter files are small (~100MB) compared to the full model.
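The training data itself usually needs to be in a chat-messages format. A sketch of converting labeled input/output pairs into JSONL records that chat-style SFT tooling generally accepts; the field names and file path here are illustrative, not a fixed unsloth API:

```python
import json

# Hypothetical labeled examples: the exact input/output pairs you want,
# here reusing the code-review task from earlier in the article.
SYSTEM = "You are a Python code reviewer."
examples = [
    {"input": "def add(a,b): return a+b",
     "output": "Issue 1: [style] line 1 - missing spaces around operators"},
]

def to_chat_record(ex: dict) -> dict:
    """Wrap one input/output pair in the messages format used for chat SFT."""
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": ex["output"]},
    ]}

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(to_chat_record(ex)) + "\n")
```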


Common Mistakes With LLaMA 3

Expecting frontier-level performance without fine-tuning. LLaMA 3 70B is capable, but falls behind GPT-4o and Claude on nuanced reasoning tasks. If you're hitting consistent failures, either fine-tune or consider a hosted model.

Not testing quantization quality impact. Q4 quantization is usually fine; Q2 can degrade meaningfully. Test your specific task at different quantization levels.

Forgetting to set temperature for deterministic tasks. Same as any model — set temperature to 0 for data extraction and classification tasks, higher for creative tasks.

Not structuring system prompts. LLaMA 3 follows instructions reliably but performs better with explicit, structured system prompts than vague ones.

Want to compare models side by side?

See how Claude, GPT-4o, Gemini, and open-source models stack up for different use cases.

View model comparison →