Mistral AI produces some of the most efficient models available — both in terms of quality-per-parameter and cost-per-token. The Mistral model family ranges from the lightweight 7B model to frontier-competitive Mistral Large, all optimized for practical deployment.
## The Mistral Model Family
| Model | Parameters | Context | Best For |
|---|---|---|---|
| Mistral 7B Instruct | 7B | 32K | Fast, low-cost, local inference |
| Mixtral 8x7B | ~46B (13B active) | 32K | Cost-efficient quality at scale |
| Mistral Small | ~22B | 32K | Balanced API performance |
| Mistral Medium | Undisclosed | 128K | Mid-tier production tasks |
| Mistral Large | ~123B | 128K | Complex reasoning, frontier tasks |
| Codestral | 22B | 32K | Code generation specialist |
The Mixtral architecture advantage: Mixture-of-experts (MoE) routes each token through only a subset of model parameters. You get near-70B quality at the inference cost of a ~13B model — significant savings at scale.
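The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of the control flow only, not Mixtral's actual implementation (the real router is a learned linear layer picking the top-2 of 8 feed-forward experts per token, per layer):

```python
# Toy mixture-of-experts routing: each "expert" is a simple function,
# and the router selects the top-2 scoring experts per token.

def route_top2(scores):
    """Return indices of the two highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

def moe_layer(token, experts, router):
    scores = router(token)           # one score per expert
    top2 = route_top2(scores)
    total = sum(scores[i] for i in top2)
    # Weighted sum over only the chosen experts' outputs:
    return sum(scores[i] / total * experts[i](token) for i in top2)

# 8 dummy "experts" that just scale the input differently
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
# Dummy router with fixed scores, so experts 6 and 7 win
router = lambda x: [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.8, 0.9]

out = moe_layer(2.0, experts, router)
print(out)  # only 2 of the 8 experts actually ran for this token
```

The cost saving comes from the last line of `moe_layer`: only the selected experts' compute is spent per token, even though all experts' parameters sit in memory.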
## The Mistral Instruct Format
Mistral instruct models use a specific format. When using the API directly, the template is handled automatically. Here's what it looks like under the hood (useful if loading weights manually):
```
<s>[INST] {user_message_1} [/INST] {assistant_message_1}</s>
[INST] {user_message_2} [/INST]
```
For system prompts, Mistral's convention is to include them at the start of the first user message:
```
<s>[INST] {system_prompt}
{user_message_1} [/INST] {assistant_message_1}</s>
```
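For hand-rolled inference, the template can be assembled mechanically. Here is a minimal sketch; for production, prefer the tokenizer's own chat template (e.g. `apply_chat_template` in `transformers`), since exact token boundaries matter:

```python
def build_mistral_prompt(messages):
    """Assemble the Mistral instruct template from OpenAI-style messages.
    A system prompt, if present, is folded into the first user turn."""
    system = ""
    turns = []
    for m in messages:
        if m["role"] == "system":
            system = m["content"] + "\n"
        else:
            turns.append(m)

    prompt = "<s>"
    for i, m in enumerate(turns):
        if m["role"] == "user":
            content = (system + m["content"]) if i == 0 else m["content"]
            prompt += f"[INST] {content} [/INST]"
        else:  # assistant turn closes with </s>
            prompt += f" {m['content']}</s>"
    return prompt

print(build_mistral_prompt([
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Hi"},
]))
```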
Using the API (automatic formatting):
````python
from mistralai import Mistral

client = Mistral(api_key="your-api-key")

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {
            "role": "system",
            "content": "You are a technical documentation writer. Write clear, concise docs targeting senior developers."
        },
        {
            "role": "user",
            "content": "Document this Python function:\n\n```python\ndef retry(fn, max_attempts=3, backoff=2.0):\n    for attempt in range(max_attempts):\n        try:\n            return fn()\n        except Exception as e:\n            if attempt == max_attempts - 1:\n                raise\n            time.sleep(backoff ** attempt)\n```"
        }
    ]
)
print(response.choices[0].message.content)
````
## Codestral: Mistral's Code Specialist
Codestral is fine-tuned specifically for code tasks and supports a fill-in-the-middle (FIM) API for code completion:
```python
response = client.fim.complete(
    model="codestral-latest",
    prompt="def calculate_tax(income: float, rate: float) -> float:\n    ",
    suffix="\n    return tax_amount"
)
# Returns: the middle part of the function
```
Fill-in-the-middle is powerful for:
- IDE-style code completion (you have the beginning and end, fill the middle)
- Refactoring a specific section of a longer function
- Test generation (given the function signature and assertions, fill the implementation)
For code generation without FIM, Codestral also works as a regular chat model; it simply has stronger coding priors than the general-purpose models.
## Efficiency Tips for Production
Use Mixtral 8x7B for cost-sensitive production. The quality-to-cost ratio is exceptional. For classification, extraction, summarization, and many analysis tasks, Mixtral 8x7B matches larger models at a fraction of the cost.
Use function calling for structured extraction:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_invoice_data",
            "description": "Extract structured data from an invoice",
            "parameters": {
                "type": "object",
                "properties": {
                    "vendor_name": {"type": "string"},
                    "invoice_number": {"type": "string"},
                    "total_amount": {"type": "number"},
                    "due_date": {"type": "string", "description": "ISO 8601 format"}
                },
                "required": ["vendor_name", "invoice_number", "total_amount"]
            }
        }
    }
]

# invoice_text holds the raw invoice text to extract from
response = client.chat.complete(
    model="mistral-small-latest",
    messages=[
        {"role": "user", "content": f"Extract the data from this invoice:\n{invoice_text}"}
    ],
    tools=tools,
    tool_choice="any"
)
```
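With `tool_choice="any"`, the model returns the extraction as a tool call whose `arguments` field is a JSON string. A sketch of reading it back and checking the required keys (the response path assumes the SDK's usual `choices[0].message.tool_calls` shape):

```python
import json

REQUIRED = ["vendor_name", "invoice_number", "total_amount"]

def parse_tool_arguments(arguments_json, required=REQUIRED):
    """Decode a tool call's JSON arguments and verify required fields."""
    data = json.loads(arguments_json)
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

# With a real response you would pass:
#   response.choices[0].message.tool_calls[0].function.arguments
example = '{"vendor_name": "Acme", "invoice_number": "INV-42", "total_amount": 129.5}'
invoice = parse_tool_arguments(example)
print(invoice["vendor_name"])  # Acme
```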
Batch requests when possible. The Mistral API supports batch inference for offline workloads, significantly reducing cost for non-real-time tasks.
## Running Mistral Locally
Mistral's open-weight models are published on Hugging Face, and the popular ones run with Ollama:
```shell
# Pull and run Mistral 7B
ollama pull mistral

# Or Mixtral 8x7B (requires ~26GB RAM for Q4)
ollama pull mixtral

# Or Mistral Small (via Ollama's model library)
ollama run mistral-small
```
Mistral 7B is one of the most popular local models because:
- Runs well on consumer hardware (8GB RAM)
- Fast inference speed
- Strong instruction following for its size
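Once a model is pulled, Ollama serves an HTTP API on `localhost:11434`. A sketch of calling its `/api/generate` endpoint from Python using only the standard library (the request itself naturally requires a running Ollama server):

```python
import json
import urllib.request

def build_generate_request(model, prompt, stream=False):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def ollama_generate(model, prompt, host="http://localhost:11434"):
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # With stream=False, Ollama returns one JSON object
        return json.loads(resp.read())["response"]

# Requires a local Ollama server:
# print(ollama_generate("mistral", "Explain MoE routing in one sentence."))
```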
## Prompting Patterns That Work Well
Explicit output structure:
Mistral models follow format instructions reliably. Be explicit:
```
Analyze this customer review and return:
1. Sentiment: positive / negative / neutral
2. Main issue (one sentence)
3. Urgency: high / medium / low

Review: "The product arrived damaged and support hasn't responded in 4 days."
```
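Because the format is pinned down, the reply is trivially machine-parseable. A sketch that reads the three labeled fields back out of a model response (the field names come from the prompt above):

```python
import re

def parse_triage(reply):
    """Parse the numbered Sentiment / Main issue / Urgency reply format."""
    fields = {}
    for label in ("Sentiment", "Main issue", "Urgency"):
        m = re.search(rf"{label}:\s*(.+)", reply, re.IGNORECASE)
        fields[label.lower()] = m.group(1).strip() if m else None
    return fields

reply = """1. Sentiment: negative
2. Main issue: The product arrived damaged and support is unresponsive.
3. Urgency: high"""
print(parse_triage(reply))
```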
Step-by-step reasoning for complex tasks:
Unlike reasoning models (o1, Claude extended thinking), Mistral benefits from explicit chain-of-thought prompting for complex reasoning:
```
Work through this problem step by step before giving your final answer.
Show your reasoning clearly.

Problem: [complex problem]
```
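A tiny helper that applies this pattern and pulls the answer back out of the reply. The `Final answer:` marker is an assumption of this sketch, not anything Mistral emits by default, so the prompt has to ask for it explicitly:

```python
COT_TEMPLATE = """Work through this problem step by step before giving your final answer.
Show your reasoning clearly. End with a line starting with "Final answer:".

Problem: {problem}"""

def cot_prompt(problem):
    return COT_TEMPLATE.format(problem=problem)

def extract_final_answer(reply):
    """Return the text after the last 'Final answer:' marker, if present."""
    marker = "Final answer:"
    idx = reply.rfind(marker)
    return reply[idx + len(marker):].strip() if idx != -1 else None

print(cot_prompt("What is 17 * 23?"))
print(extract_final_answer("17 * 20 = 340, 17 * 3 = 51.\nFinal answer: 391"))
```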
Use JSON mode for data extraction:
```python
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[...],
    response_format={"type": "json_object"}
)
```
JSON mode constrains the model to produce valid JSON. Combine with a schema description in your prompt for structured extraction.
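JSON mode guarantees syntactically valid JSON, not that the keys you asked for are present, so validate on the way in. A lightweight sketch (jsonschema or Pydantic are the heavier-duty options):

```python
import json

def load_with_schema(raw, expected_types):
    """Parse JSON-mode output and check each expected key's type."""
    data = json.loads(raw)
    for key, typ in expected_types.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], typ):
            raise TypeError(f"{key}: expected {typ.__name__}")
    return data

# response.choices[0].message.content would be the raw string in practice
raw = '{"title": "Q3 report", "page_count": 12}'
doc = load_with_schema(raw, {"title": str, "page_count": int})
print(doc)
```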
## Common Mistakes With Mistral
Not using JSON mode for extraction tasks. Asking for JSON in the prompt is less reliable than enabling JSON mode in the API call. Use both for critical extraction pipelines.
Choosing Mistral Large when Small or Mixtral would suffice. Benchmark your task at multiple model tiers. Mistral Small handles most common NLP tasks well, and Mixtral 8x7B handles complex tasks at a fraction of Large's cost.
Forgetting system prompt placement. For direct weight inference (not via the API), the system prompt goes at the start of the first user message per Mistral's convention; the instruct template has no dedicated system role, unlike Llama 3's chat format.
Under-specifying the output format. Mistral follows instructions well, but "give me a summary" produces variable results. "Write a 3-sentence summary in a formal tone" produces consistent output.