Mistral AI produces some of the most efficient models available — both in terms of quality-per-parameter and cost-per-token. The Mistral model family ranges from the lightweight 7B model to frontier-competitive Mistral Large, all optimized for practical deployment.
## The Mistral Model Family
| Model | Parameters | Context | Best For |
|---|---|---|---|
| Mistral 7B Instruct | 7B | 32K | Fast, low-cost, local inference |
| Mixtral 8x7B | ~46B (13B active) | 32K | Cost-efficient quality at scale |
| Mistral Small | ~22B | 32K | Balanced API performance |
| Mistral Medium | Undisclosed | 128K | Mid-tier production tasks |
| Mistral Large | ~123B | 128K | Complex reasoning, frontier tasks |
| Codestral | 22B | 32K | Code generation specialist |
The Mixtral architecture advantage: Mixture-of-experts (MoE) routes each token through only a subset of model parameters. You get near-70B quality at the inference cost of a ~13B model — significant savings at scale.
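The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of the control flow only, not Mixtral's actual implementation (the real router is a learned linear layer picking the top-2 of 8 feed-forward experts per token, per layer):

```python
# Toy mixture-of-experts routing: each "expert" is a simple function,
# and the router selects the top-2 scoring experts per token.

def route_top2(scores):
    """Return indices of the two highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

def moe_layer(token, experts, router):
    scores = router(token)           # one score per expert
    top2 = route_top2(scores)
    total = sum(scores[i] for i in top2)
    # Weighted sum over only the chosen experts' outputs:
    return sum(scores[i] / total * experts[i](token) for i in top2)

# 8 dummy "experts" that just scale the input differently
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
# Dummy router with fixed scores, so experts 6 and 7 win
router = lambda x: [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.8, 0.9]

out = moe_layer(2.0, experts, router)
print(out)  # only 2 of the 8 experts actually ran for this token
```

The cost saving comes from the last line of `moe_layer`: only the selected experts' compute is spent per token, even though all experts' parameters sit in memory.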
## The Mistral Instruct Format
Mistral instruct models use a specific format. When using the API directly, the template is handled automatically. Here's what it looks like under the hood (useful if loading weights manually):
```
<s>[INST] {user_message_1} [/INST] {assistant_message_1}</s>
[INST] {user_message_2} [/INST]
```
For system prompts, Mistral's convention is to include them at the start of the first user message:
```
<s>[INST] {system_prompt}
{user_message_1} [/INST] {assistant_message_1}</s>
```
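For hand-rolled inference, the template can be assembled mechanically. Here is a minimal sketch; for production, prefer the tokenizer's own chat template (e.g. `apply_chat_template` in `transformers`), since exact token boundaries matter:

```python
def build_mistral_prompt(messages):
    """Assemble the Mistral instruct template from OpenAI-style messages.
    A system prompt, if present, is folded into the first user turn."""
    system = ""
    turns = []
    for m in messages:
        if m["role"] == "system":
            system = m["content"] + "\n"
        else:
            turns.append(m)

    prompt = "<s>"
    for i, m in enumerate(turns):
        if m["role"] == "user":
            content = (system + m["content"]) if i == 0 else m["content"]
            prompt += f"[INST] {content} [/INST]"
        else:  # assistant turn closes with </s>
            prompt += f" {m['content']}</s>"
    return prompt

print(build_mistral_prompt([
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Hi"},
]))
```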
Using the API (automatic formatting):
````python
from mistralai import Mistral

client = Mistral(api_key="your-api-key")

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {
            "role": "system",
            "content": "You are a technical documentation writer. Write clear, concise docs targeting senior developers."
        },
        {
            "role": "user",
            "content": "Document this Python function:\n\n```python\ndef retry(fn, max_attempts=3, backoff=2.0):\n    for attempt in range(max_attempts):\n        try:\n            return fn()\n        except Exception as e:\n            if attempt == max_attempts - 1:\n                raise\n            time.sleep(backoff ** attempt)\n```"
        }
    ]
)
print(response.choices[0].message.content)
````
## Codestral: Mistral's Code Specialist
Codestral is fine-tuned specifically for code tasks and supports a fill-in-the-middle (FIM) API for code completion:
```python
response = client.fim.complete(
    model="codestral-latest",
    prompt="def calculate_tax(income: float, rate: float) -> float:\n    ",
    suffix="\n    return tax_amount"
)
# Returns: the middle part of the function
```
Fill-in-the-middle is powerful for:
- IDE-style code completion (you have the beginning and end, fill the middle)
- Refactoring a specific section of a longer function
- Test generation (given the function signature and assertions, fill the implementation)
For code generation without FIM, Codestral also works as a regular chat model; it simply has stronger coding priors than the general-purpose models.
## Efficiency Tips for Production
Use Mixtral 8x7B for cost-sensitive production. The quality-to-cost ratio is exceptional. For classification, extraction, summarization, and many analysis tasks, Mixtral 8x7B matches larger models at a fraction of the cost.
Use function calling for structured extraction:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_invoice_data",
            "description": "Extract structured data from an invoice",
            "parameters": {
                "type": "object",
                "properties": {
                    "vendor_name": {"type": "string"},
                    "invoice_number": {"type": "string"},
                    "total_amount": {"type": "number"},
                    "due_date": {"type": "string", "description": "ISO 8601 format"}
                },
                "required": ["vendor_name", "invoice_number", "total_amount"]
            }
        }
    }
]

# invoice_text holds the raw invoice text to extract from
response = client.chat.complete(
    model="mistral-small-latest",
    messages=[
        {"role": "user", "content": f"Extract the data from this invoice:\n{invoice_text}"}
    ],
    tools=tools,
    tool_choice="any"
)
```
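With `tool_choice="any"`, the model returns the extraction as a tool call whose `arguments` field is a JSON string. A sketch of reading it back and checking the required keys (the response path assumes the SDK's usual `choices[0].message.tool_calls` shape):

```python
import json

REQUIRED = ["vendor_name", "invoice_number", "total_amount"]

def parse_tool_arguments(arguments_json, required=REQUIRED):
    """Decode a tool call's JSON arguments and verify required fields."""
    data = json.loads(arguments_json)
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

# With a real response you would pass:
#   response.choices[0].message.tool_calls[0].function.arguments
example = '{"vendor_name": "Acme", "invoice_number": "INV-42", "total_amount": 129.5}'
invoice = parse_tool_arguments(example)
print(invoice["vendor_name"])  # Acme
```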
Batch requests when possible. The Mistral API supports batch inference for offline workloads, significantly reducing cost for non-real-time tasks.
## Running Mistral Locally
Mistral's open-weight models are published on Hugging Face, and the popular ones run with Ollama:
```shell
# Pull and run Mistral 7B
ollama pull mistral

# Or Mixtral 8x7B (requires ~26GB RAM for Q4)
ollama pull mixtral

# Or Mistral Small (via Ollama's model library)
ollama run mistral-small
```
Mistral 7B is one of the most popular local models because:
- Runs well on consumer hardware (8GB RAM)
- Fast inference speed
- Strong instruction following for its size
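Once a model is pulled, Ollama serves an HTTP API on `localhost:11434`. A sketch of calling its `/api/generate` endpoint from Python using only the standard library (the request itself naturally requires a running Ollama server):

```python
import json
import urllib.request

def build_generate_request(model, prompt, stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def ollama_generate(model, prompt, host="http://localhost:11434"):
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=False, Ollama returns one JSON object
        return json.loads(resp.read())["response"]

# Requires a local Ollama server:
# print(ollama_generate("mistral", "Explain MoE routing in one sentence."))
```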
## Prompting Patterns That Work Well
Explicit output structure:
Mistral models follow format instructions reliably. Be explicit:
```
Analyze this customer review and return:
1. Sentiment: positive / negative / neutral
2. Main issue (one sentence)
3. Urgency: high / medium / low

Review: "The product arrived damaged and support hasn't responded in 4 days."
```
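Because the format is pinned down, the reply is trivially machine-parseable. A sketch that reads the three labeled fields back out of a model response (the field names come from the prompt above):

```python
import re

def parse_triage(reply):
    """Parse the numbered Sentiment / Main issue / Urgency reply format."""
    fields = {}
    for label in ("Sentiment", "Main issue", "Urgency"):
        m = re.search(rf"{label}:\s*(.+)", reply, re.IGNORECASE)
        fields[label.lower()] = m.group(1).strip() if m else None
    return fields

reply = """1. Sentiment: negative
2. Main issue: The product arrived damaged and support is unresponsive.
3. Urgency: high"""
print(parse_triage(reply))
```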
Step-by-step reasoning for complex tasks:
Unlike reasoning models (o1, Claude extended thinking), Mistral benefits from explicit chain-of-thought prompting for complex reasoning:
```
Work through this problem step by step before giving your final answer.
Show your reasoning clearly.

Problem: [complex problem]
```
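A tiny helper that applies this pattern and pulls the answer back out of the reply. The `Final answer:` marker is an assumption of this sketch, not anything Mistral emits by default, so the prompt has to ask for it explicitly:

```python
COT_TEMPLATE = """Work through this problem step by step before giving your final answer.
Show your reasoning clearly. End with a line starting with "Final answer:".

Problem: {problem}"""

def cot_prompt(problem):
    return COT_TEMPLATE.format(problem=problem)

def extract_final_answer(reply):
    """Return the text after the last 'Final answer:' marker, if present."""
    marker = "Final answer:"
    idx = reply.rfind(marker)
    return reply[idx + len(marker):].strip() if idx != -1 else None

print(cot_prompt("What is 17 * 23?"))
print(extract_final_answer("17 * 20 = 340, 17 * 3 = 51.\nFinal answer: 391"))
```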
Use JSON mode for data extraction:
```python
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[...],
    response_format={"type": "json_object"}
)
```
JSON mode constrains the model to produce valid JSON. Combine with a schema description in your prompt for structured extraction.
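JSON mode guarantees syntactically valid JSON, not that the keys you asked for are present, so validate on the way in. A lightweight sketch (jsonschema or Pydantic are the heavier-duty options):

```python
import json

def load_with_schema(raw, expected_types):
    """Parse JSON-mode output and check each expected key's type."""
    data = json.loads(raw)
    for key, typ in expected_types.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], typ):
            raise TypeError(f"{key}: expected {typ.__name__}")
    return data

# response.choices[0].message.content would be the raw string in practice
raw = '{"title": "Q3 report", "page_count": 12}'
doc = load_with_schema(raw, {"title": str, "page_count": int})
print(doc)
```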
## Common Mistakes With Mistral
Not using JSON mode for extraction tasks. Asking for JSON in the prompt is less reliable than enabling JSON mode in the API call. Use both for critical extraction pipelines.
Choosing Mistral Large when Small or Mixtral would suffice. Benchmark your task at multiple model tiers. Mistral Small handles most common NLP tasks well, and Mixtral 8x7B handles complex tasks at a fraction of Large's cost.
Forgetting system prompt placement. For direct weight inference (not via the API), the system prompt goes at the start of the first user message per Mistral's convention; the instruct template has no dedicated system role, unlike Llama 3's chat format.
Under-specifying the output format. Mistral follows instructions well, but "give me a summary" produces variable results. "Write a 3-sentence summary in a formal tone" produces consistent output.