Semantic Caching for LLM Apps — Cut API Costs with Redis and GPTCache

An FAQ bot I built for a SaaS company was burning $4,200/month on Claude API calls. After implementing semantic caching, that dropped to $1,800. The code took two days. The ROI was immediate.

The core idea: users ask the same question in different ways. "What's your refund policy?" and "how do I get my money back?" are semantically identical. Exact-match string caching misses the second. Semantic caching catches both.

Exact-match vs semantic caching

Exact-match caching stores the raw query string as a cache key. If two requests are byte-for-byte identical, you return the cached response. Works great for programmatic callers with templated queries. Fails for any natural language interface where users paraphrase.

Semantic caching embeds the query into a vector, searches for similar cached queries, and returns a cached response if similarity exceeds a threshold. This is the one you want for chatbots, support agents, and search interfaces.
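To make the difference concrete, here is a minimal sketch of both lookups side by side. The three-dimensional vectors, the 0.9 threshold, and the cached answer text are all made up for illustration; a real embedding model returns hundreds of dimensions.

```python
import math
from typing import Optional, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical toy embeddings standing in for a real embedding model.
TOY_EMBEDDINGS = {
    "What's your refund policy?": [0.90, 0.10, 0.00],
    "how do I get my money back?": [0.85, 0.20, 0.05],
    "What are your support hours?": [0.10, 0.90, 0.20],
}

ANSWER = "Refunds are available within 30 days."  # placeholder answer text

# Both caches hold a single entry for the first question.
exact_cache = {"What's your refund policy?": ANSWER}
semantic_cache = [(TOY_EMBEDDINGS["What's your refund policy?"], ANSWER)]

def exact_lookup(query: str) -> Optional[str]:
    """Hit only when the query string is identical to a cached one."""
    return exact_cache.get(query)

def semantic_lookup(query: str, threshold: float = 0.9) -> Optional[str]:
    """Hit when the closest cached embedding scores at or above the threshold."""
    q = TOY_EMBEDDINGS[query]
    score, answer = max((cosine(q, vec), ans) for vec, ans in semantic_cache)
    return answer if score >= threshold else None

paraphrase = "how do I get my money back?"
print(exact_lookup(paraphrase))     # None: the strings differ
print(semantic_lookup(paraphrase))  # the cached answer: the vectors are close
```

The unrelated question about support hours misses in both caches, which is the behavior the threshold is there to protect.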

When semantic caching applies:

FAQ and support bots where users phrase the same question differently
Search interfaces with synonymous queries
Structured data extraction tasks (same underlying data, slightly rephrased request)

When it doesn't:

Personalized responses (user account data, history-dependent answers)
Time-sensitive queries ("what's the stock price now?")
Creative tasks where variation is the point
Anything with user-specific context that changes the correct answer

GPTCache: the fast path

GPTCache is a Python library that handles the embedding, vector search, and similarity check for you. It ships adapters for OpenAI and LangChain, but it does not intercept calls made through the Anthropic client, so wrap your own call with its generic get/put API. Setup is fast:

pip install gptcache anthropic

import anthropic
from gptcache import cache
from gptcache.adapter.api import get, init_similar_cache, put
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize GPTCache with local SQLite + FAISS (good for dev/testing)
onnx = Onnx()
init_similar_cache(
    cache_obj=cache,
    embedding=onnx,
    data_manager=get_data_manager(
        CacheBase("sqlite"),
        # The index dimension must match the embedding model's output size
        VectorBase("faiss", dimension=onnx.dimension)
    ),
    evaluation=SearchDistanceEvaluation(),
)

client = anthropic.Anthropic()

def ask(question: str) -> str:
    # Look for a semantically similar cached question before hitting the API
    cached = get(question)
    if cached is not None:
        return cached

    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=500,
        messages=[{"role": "user", "content": question}]
    )
    answer = response.content[0].text
    put(question, answer)  # store the answer for future similar questions
    return answer

# First call: cache miss, hits the API
print(ask("What is your return policy?"))

# Paraphrase: served from cache if it scores above the similarity threshold
print(ask("How do I get a refund?"))

GPTCache's default similarity threshold is 0.8, set through Config(similarity_threshold=...). You can tune it: a lower threshold means more aggressive caching (more hits, more risk of returning slightly wrong answers).

For production, swap the local SQLite/FAISS backend for Redis:

from gptcache.manager import VectorBase

vector_store = VectorBase(
    "redis",
    host="your-redis-host",
    port=6379,
    password="your-password",
    dimension=768  # must match the embedding model; GPTCache's default Onnx model outputs 768
)

Upstash offers managed Redis with a free tier and pay-per-request pricing, a good fit for semantic cache workloads with bursty traffic.

Prefer self-hosting? A Hostinger KVM 1 VPS (~₹300/month) running Redis directly gives you unlimited requests with no per-call pricing.

Building a semantic cache from scratch

If you want full control, skip GPTCache and build it yourself. This is more code but gives you explicit control over the similarity threshold, TTL, and cache invalidation.

import json

import numpy as np
import openai
import redis
from anthropic import Anthropic

client = Anthropic()

class SemanticCache:
    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 86_400,  # 24 hours
        embedding_model: str = "text-embedding-3-small"
    ):
        self.r = redis.from_url(redis_url)
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.embedding_model = embedding_model
        self.cache_prefix = "semcache:"

    def _embed(self, text: str) -> np.ndarray:
        """Get embedding for a query. Using OpenAI here for embeddings;
        swap with any embedding API or local model."""
        response = openai.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return np.array(response.data[0].embedding, dtype=np.float32)
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    def get(self, query: str) -> str | None:
        """Check cache. Returns cached response if similar query found, else None."""
        query_embedding = self._embed(query)
        
        # Scan all cached entries (scan_iter uses SCAN, which unlike KEYS does not block Redis)
        keys = self.r.scan_iter(f"{self.cache_prefix}*")
        
        best_similarity = 0.0
        best_response = None
        
        for key in keys:
            entry = self.r.get(key)
            if not entry:
                continue
            
            data = json.loads(entry)
            cached_embedding = np.array(data["embedding"], dtype=np.float32)
            similarity = self._cosine_similarity(query_embedding, cached_embedding)
            
            if similarity > best_similarity:
                best_similarity = similarity
                best_response = data["response"]
        
        if best_similarity >= self.threshold:
            print(f"Cache hit (similarity={best_similarity:.3f})")
            return best_response
        
        return None
    
    def set(self, query: str, response: str) -> None:
        """Store query-response pair in cache."""
        embedding = self._embed(query)
        
        # Use a hash of the query as the key
        import hashlib
        key = f"{self.cache_prefix}{hashlib.md5(query.encode()).hexdigest()}"
        
        data = {
            "query": query,
            "response": response,
            "embedding": embedding.tolist()
        }
        
        self.r.setex(key, self.ttl, json.dumps(data))
    
    def ask(self, query: str, system_prompt: str = "") -> str:
        """Check cache first, hit API if miss.

        Note: the cache key ignores system_prompt, so use one cache
        (or one key prefix) per system prompt."""
        cached = self.get(query)
        if cached is not None:
            return cached

        # Cache miss: call the API
        messages = [{"role": "user", "content": query}]
        kwargs = {"model": "claude-3-5-haiku-latest", "max_tokens": 500, "messages": messages}
        if system_prompt:
            kwargs["system"] = system_prompt

        response = client.messages.create(**kwargs)
        answer = response.content[0].text

        self.set(query, answer)
        return answer


# Usage
cache = SemanticCache(similarity_threshold=0.92, ttl_seconds=86_400)

answer1 = cache.ask("What is your return policy?")
answer2 = cache.ask("How do I request a refund?")  # Cache hit if similar enough

The 0.92 threshold is a good starting point. Below 0.88 you'll get false positives: cached answers returned for questions that are similar but not equivalent. Above 0.95 you'll miss obvious paraphrases. These cutoffs depend on the embedding model, so tune the threshold by logging similarity scores for a week and examining borderline cases.
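One way to examine the logged scores is to replay them against a few candidate thresholds. The sketch below assumes you have logged each lookup's best similarity score and hand-labeled whether the cached answer would have been correct. The function name and the sample log are hypothetical.

```python
from typing import Iterable, Tuple

def evaluate_threshold(log: Iterable[Tuple[float, bool]],
                       threshold: float) -> Tuple[float, float]:
    """Return (hit_rate, false_positive_rate) for one candidate threshold.

    log holds (best_similarity, cached_answer_was_correct) pairs.
    hit_rate: share of all lookups that would be served from cache.
    false_positive_rate: share of those cache hits whose answer was wrong.
    """
    entries = list(log)
    hits = [correct for score, correct in entries if score >= threshold]
    hit_rate = len(hits) / len(entries) if entries else 0.0
    false_positive_rate = hits.count(False) / len(hits) if hits else 0.0
    return hit_rate, false_positive_rate

# Hypothetical labeled log: (similarity, was the cached answer correct?)
SAMPLE_LOG = [
    (0.97, True), (0.95, True), (0.93, True), (0.91, True),
    (0.90, False), (0.89, True), (0.87, False), (0.80, False),
]

for t in (0.88, 0.92, 0.95):
    hit_rate, fp_rate = evaluate_threshold(SAMPLE_LOG, t)
    print(f"threshold={t:.2f} hit_rate={hit_rate:.2f} false_positive_rate={fp_rate:.2f}")
```

Pick the lowest threshold whose false-positive rate you can live with; on real traffic the labeled sample needs to be much larger than eight rows.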

Using Redis Vector Search (RediSearch) for scale

The implementation above scans all cache entries linearly — it's O(n) and won't scale past a few thousand entries. For production, use RediSearch's vector similarity search:

import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
import numpy as np

r = redis.Redis(host="localhost", port=6379)

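# Create vector index (run once at startup; skip if it already exists).
# Only HASH keys under the semcache: prefix are indexed, and the embedding
# field must hold raw FLOAT32 bytes.
schema = [
    TextField("query"),
    TextField("response"),
    VectorField(
        "embedding",
        "HNSW",
        {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}
    )
]

try:
    r.ft("semcache_idx").create_index(
        schema,
        definition=IndexDefinition(prefix=["semcache:"], index_type=IndexType.HASH)
    )
except redis.ResponseError as exc:
    # Re-raise anything other than "the index is already there"
    if "Index already exists" not in str(exc):
        raise

def vector_search(query_embedding: np.ndarray, top_k: int = 1):
    """Find most similar cached query using HNSW index.

    With DISTANCE_METRIC COSINE, `score` is a distance (1 - cosine similarity),
    so lower is better; compare 1 - float(doc.score) against your threshold."""
    query_bytes = query_embedding.astype(np.float32).tobytes()

    q = (
        Query(f"*=>[KNN {top_k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query", "response", "score")
        .dialect(2)
    )

    results = r.ft("semcache_idx").search(q, query_params={"vec": query_bytes})
    return results.docs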

HNSW is an approximate nearest-neighbor index: search time grows roughly logarithmically with the number of entries, and it handles millions of them. This is what you want for high-traffic production deployments.
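For the index to see an entry, the entry has to be stored as a Redis hash under the semcache: prefix with the embedding as raw FLOAT32 bytes; JSON strings written with SETEX are not indexed. The sketch below shows one way to write such an entry and to turn the KNN score back into a similarity. The helper names and the ttl_seconds default are assumptions, and `r` is any redis-py client.

```python
import hashlib
import struct
from typing import Sequence

def pack_embedding(vec: Sequence[float]) -> bytes:
    """Serialize floats as little-endian FLOAT32, the same bytes that
    np.float32 tobytes() produces on little-endian machines."""
    return struct.pack(f"<{len(vec)}f", *vec)

def store_entry(r, query: str, response: str, embedding: Sequence[float],
                ttl_seconds: int = 86_400) -> str:
    """Write one cache entry as a hash so the semcache_idx index picks it up."""
    key = "semcache:" + hashlib.md5(query.encode()).hexdigest()
    r.hset(key, mapping={
        "query": query,
        "response": response,
        "embedding": pack_embedding(embedding),
    })
    r.expire(key, ttl_seconds)  # HSET has no TTL argument, so set it separately
    return key

def similarity_from_score(score) -> float:
    """Convert a COSINE KNN score (a distance) back to cosine similarity."""
    return 1.0 - float(score)
```

A lookup would then call vector_search, take the first document, and treat it as a hit only when similarity_from_score(doc.score) is at or above the threshold.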

Real cache hit rates and cost savings

Cache hit rates vary by use case:

Use case	Expected cache hit rate
FAQ bot (support)	30-60%
E-commerce search	20-40%
General chat	5-15%
Structured extraction	40-70%
Creative writing	<5%

Cost savings math, counting input tokens only: if you're running 10,000 queries/day on Claude Haiku at $0.80/million input tokens (average 500 tokens per query = $0.0004 per query), and you achieve a 40% cache hit rate:

Without cache: 10,000 × $0.0004 = $4.00/day → $1,460/year
With 40% cache hits: 6,000 × $0.0004 = $2.40/day → $876/year
Annual savings: $584

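At Sonnet pricing ($3/million input tokens), those numbers multiply by 3.75x: $2,190/year savings from a two-day implementation. Cache hits also skip output tokens, which are priced higher than input tokens, so real savings are larger.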

Cache infrastructure costs matter too. A Redis instance on Upstash or AWS ElastiCache costs $15-50/month depending on size. You'll recoup that quickly at any meaningful traffic level.

For the embedding API calls: text-embedding-3-small is priced at $0.02/million tokens, so embedding a 500-token query costs about $0.00001. That's negligible against API savings. You can also use a local embedding model (all-MiniLM-L6-v2 via sentence-transformers) to eliminate that cost entirely.
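The arithmetic in this section can be reproduced with a few lines, which makes it easy to plug in your own traffic and prices. This is a sketch with hypothetical function names; like the figures above it counts input tokens only, and monthly_infra_cost is an optional assumption for the Redis bill.

```python
def annual_input_cost(queries_per_day: int, tokens_per_query: int,
                      price_per_million: float, hit_rate: float = 0.0) -> float:
    """Yearly input-token spend when hit_rate of queries never reach the API."""
    billable_queries = queries_per_day * (1.0 - hit_rate)
    daily = billable_queries * tokens_per_query / 1_000_000 * price_per_million
    return daily * 365

def annual_savings(queries_per_day: int, tokens_per_query: int,
                   price_per_million: float, hit_rate: float,
                   monthly_infra_cost: float = 0.0) -> float:
    """Yearly savings from caching, minus what the cache itself costs to run."""
    without_cache = annual_input_cost(queries_per_day, tokens_per_query, price_per_million)
    with_cache = annual_input_cost(queries_per_day, tokens_per_query,
                                   price_per_million, hit_rate)
    return without_cache - with_cache - 12 * monthly_infra_cost

# Haiku and Sonnet scenarios from the text: 10,000 queries/day, 500 tokens, 40% hits
print(f"${annual_savings(10_000, 500, 0.80, 0.40):,.0f}")  # $584
print(f"${annual_savings(10_000, 500, 3.00, 0.40):,.0f}")  # $2,190
```

Passing monthly_infra_cost shows how much of the saving the Redis instance consumes at low traffic or low per-token prices.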

TTL strategy

Don't cache indefinitely. Your underlying information changes, and stale cached answers are worse than no cache.

FAQ/policy content: 24-48 hour TTL. Refresh when you update your docs.
Product information: 1-4 hour TTL if prices/availability change frequently
General knowledge queries: 7-day TTL
Time-sensitive queries: Don't cache at all
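These rules can live in a small lookup table so that every cache write gets its TTL from one place. The category names below are hypothetical, and each value is simply one point picked from the ranges listed above.

```python
from typing import Optional

HOUR = 3600
DAY = 24 * HOUR

# Hypothetical category names; None means "do not cache at all".
TTL_BY_CATEGORY = {
    "faq_policy": 24 * HOUR,
    "product_info": 1 * HOUR,
    "general_knowledge": 7 * DAY,
    "time_sensitive": None,
}

def ttl_for(category: str) -> Optional[int]:
    """Return the TTL in seconds, or None when the response must not be cached."""
    # Unknown categories fall back to the shortest TTL as a conservative default.
    return TTL_BY_CATEGORY.get(category, 1 * HOUR)
```

The caller checks for None before writing to the cache and passes the returned number of seconds as the expiry otherwise.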

When you update your knowledge base or system prompt significantly, flush the cache. A simple cache version prefix in your Redis keys makes this trivial:

import hashlib

CACHE_VERSION = "v3"  # bump CACHE_VERSION to invalidate all entries

def cache_key(query: str) -> str:
    return f"semcache:{CACHE_VERSION}:{hashlib.md5(query.encode()).hexdigest()}"

What NOT to cache

Some queries look cacheable but aren't:

"What's my order status?" — answer depends on user identity
"Summarize the document I uploaded" — answer depends on the attached file
"Is this medication safe for me?" — requires personalization; returning a generic answer is dangerous
"What's happening in the news today?" — time-sensitive
Any query where the correct answer requires the user's account data

Add a query classifier that routes personalized queries past the cache entirely. A simple keyword list works as a first pass:

SKIP_CACHE_PATTERNS = [
    "my order", "my account", "my subscription", "my usage",
    "today", "right now", "current", "latest",
    "i uploaded", "the file i", "the document i"
]

def should_skip_cache(query: str) -> bool:
    q = query.lower()
    return any(pattern in q for pattern in SKIP_CACHE_PATTERNS)
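Plain substring matching has a known weakness: "current" also matches "concurrent", and "the file i" matches "the file is". A drop-in variant that matches on word boundaries avoids those false positives. This is a sketch using the same pattern list; the compiled regex name is an assumption.

```python
import re

SKIP_CACHE_PATTERNS = [
    "my order", "my account", "my subscription", "my usage",
    "today", "right now", "current", "latest",
    "i uploaded", "the file i", "the document i",
]

# One alternation wrapped in \b...\b so patterns only match whole words.
_SKIP_CACHE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in SKIP_CACHE_PATTERNS) + r")\b",
    re.IGNORECASE,
)

def should_skip_cache(query: str) -> bool:
    """True when the query contains any skip pattern as whole words."""
    return _SKIP_CACHE_RE.search(query) is not None

print(should_skip_cache("What's my order status?"))               # True
print(should_skip_cache("How do I handle concurrent requests?"))  # False
```

A keyword list remains a first pass either way; it will still miss personalized queries that use none of these phrases.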

Combine this with the token counting guide to build a complete cost optimization layer: cache hits eliminate API calls entirely, and for cache misses, token budgeting ensures you're not spending more than necessary per call.

For larger-scale cost optimization across your full agent stack, the agent cost optimization guide covers model routing, batching, and other strategies that compound with caching. The build vector store guide goes deeper on FAISS, Chroma, and Pinecone if you want more control over the vector search layer.

Semantic caching won't work for every app. But for any LLM feature with a FAQ or support use case, it's one of the highest-ROI optimizations available.
