Semantic Cache

Cache LLM responses by query similarity and serve them for semantically similar queries, which can cut API costs by 30-50% depending on how repetitive your traffic is. All embedding runs locally by default.

from promptise import build_agent, SemanticCache, CallerContext
from promptise.config import HTTPServerSpec

cache = SemanticCache()
cache.warmup()

agent = await build_agent(
    servers={"tools": HTTPServerSpec(url="http://localhost:8000/mcp")},
    model="openai:gpt-5-mini",
    cache=cache,
)

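# Example payload; any valid agent input works here
input = {"messages": [{"role": "user", "content": "What is Python?"}]}

# First call → LLM, result cached
result = await agent.ainvoke(input, caller=CallerContext(user_id="user-42"))

# Second similar call → cache hit, no LLM call, near-instant response
result = await agent.ainvoke(input, caller=CallerContext(user_id="user-42"))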

Not legal or compliance advice

The information here is general technical information, not legal, regulatory, or compliance advice. Descriptions of any law, regulation, or standard (such as the GDPR, the EU AI Act, HIPAA, SOC 2, or PCI DSS) are simplified and may be incomplete, out of date, or inaccurate, and requirements vary by jurisdiction and situation. Promptise Foundry makes no warranty as to the accuracy or completeness of this content and is not responsible for how you use or rely on it. Using Promptise does not by itself make you or your product compliant with any law or standard. Consult a qualified lawyer or compliance professional before acting on anything here.

How It Works

User sends a message
Input guardrails scan the message (block injection attacks)
Memory search runs (so the cache key includes memory context)
Cache check — embed the query, search for similar cached queries with matching context
Cache hit → run output guardrails on cached response → return instantly
Cache miss → continue to tools, LLM → output guardrails → cache the post-guardrail response → return

The cache check runs after input guardrails and memory search, so the cache key reflects the current memory state, but before tool selection and the LLM call. Responses are written to the cache only after output guardrails have run, so only safe, redacted content is ever persisted.
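
The ordering described above can be sketched as a single function. This is an illustrative outline rather than Promptise's implementation: every callable passed in (guardrails, memory search, cache lookup and store, LLM call) is a hypothetical stand-in.

```python
from typing import Callable, Optional


def handle_message(
    message: str,
    cache_lookup: Callable[[str, str], Optional[str]],
    cache_store: Callable[[str, str, str], None],
    call_llm: Callable[[str, str], str],
    input_guardrails: Callable[[str], str] = lambda m: m,
    output_guardrails: Callable[[str], str] = lambda r: r,
    memory_search: Callable[[str], str] = lambda m: "",
) -> tuple[str, bool]:
    """Return (response, served_from_cache). All callables are hypothetical."""
    message = input_guardrails(message)        # 1. scan the input first
    context = memory_search(message)           # 2. memory feeds the cache key
    cached = cache_lookup(message, context)    # 3. cache check
    if cached is not None:
        # 4. hit: output guardrails still run on the cached text
        return output_guardrails(cached), True
    # 5. miss: tools and LLM, then output guardrails
    response = output_guardrails(call_llm(message, context))
    # 6. only the post-guardrail text is stored
    cache_store(message, context, response)
    return response, False
```

Because the store step receives the post-guardrail text, a redaction applied by the output guardrails is already baked into whatever is persisted.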

Security: Per-User Isolation

Default scope is per_user — every user gets an isolated cache partition. User A's cached responses are invisible to User B.

No CallerContext = no caching. With the default per_user scope, if you don't pass caller=CallerContext(user_id=...), caching is silently disabled for that request. This prevents accidental cross-user data leakage.

# ✅ Cached (user isolated)
result = await agent.ainvoke(input, caller=CallerContext(user_id="user-42"))

# ❌ Not cached (no identity)
result = await agent.ainvoke(input)

Three scopes:

Scope	Behavior	Use case
`per_user` (default)	Each user has their own cache	Any personalized agent
`per_session`	Each session has its own cache	Conversation-specific
`shared`	All users share one cache	Public knowledge (weather, docs, FAQ)

# Shared scope — requires explicit acknowledgment
cache = SemanticCache(scope="shared", shared_data_acknowledged=True)
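
To make the three scopes concrete, here is a minimal sketch of how a request could be mapped to a cache partition. The function name partition_key and the session: key format are assumptions for illustration; only the user:{user_id} prefix and the "no identity, no caching" rule come from this page.

```python
from typing import Optional


def partition_key(
    scope: str,
    user_id: Optional[str] = None,
    session_id: Optional[str] = None,
) -> Optional[str]:
    """Return the cache partition for a request, or None to skip caching."""
    if scope == "shared":
        return "shared"                 # one partition for everyone
    if not user_id:
        return None                     # no identity: do not cache
    if scope == "per_user":
        return f"user:{user_id}"
    if scope == "per_session":
        if not session_id:
            return None                 # no session: do not cache
        return f"session:{user_id}:{session_id}"  # hypothetical format
    raise ValueError(f"unknown scope: {scope!r}")
```

Returning None instead of falling back to a shared partition is the design choice that keeps an unauthenticated request from ever reading another user's entries.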

Standalone / Shared Mode (No Multi-User)

If you're building a single-user app, internal tool, or public FAQ agent where there's no concept of "users," use scope="shared":

cache = SemanticCache(scope="shared", shared_data_acknowledged=True)

agent = await build_agent(
    servers={...}, model="openai:gpt-5-mini", cache=cache,
)

# No CallerContext needed — works immediately
result = await agent.ainvoke({"messages": [{"role": "user", "content": "What is Python?"}]})

With scope="shared", caching works without CallerContext. Everyone shares the same cache. Use this when:

Your agent answers public knowledge questions (docs, FAQ, weather)
There's only one user (CLI tools, internal scripts)
Responses never contain personalized data

Shared scope = no isolation

Every user sees everyone else's cached responses. Never use shared scope when the agent accesses user-specific data (accounts, orders, personal info).

Multi-User Mode

For apps with multiple users (SaaS, customer support, multi-tenant), the default per_user scope isolates each user's cache automatically:

cache = SemanticCache()  # scope="per_user" is the default

agent = await build_agent(..., cache=cache)

# Each user has their own cache partition
await agent.ainvoke(input, caller=CallerContext(user_id="alice"))  # Alice's cache
await agent.ainvoke(input, caller=CallerContext(user_id="bob"))    # Bob's cache (separate)

How it works internally:

The cache key prefix is user:{user_id}, so Alice's entries are keyed user:alice and Bob's are keyed user:bob
With CallerContext(tenant_id="acme") the key is tenant-qualified: an injective, colon-prefixed hash (user:t:<sha256>) disjoint from the untenanted namespace, so two tenants with the same user_id can never share a cache partition
Similarity search only runs within a user's own partition, so cross-user matching is not possible
If no CallerContext is provided, caching is silently disabled for that request (with a debug log: "Cache: no CallerContext or user_id provided")
purge_user("alice") removes all of Alice's cached entries (for example, to honor a GDPR erasure request)
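
One way to build a tenant-qualified key of the shape user:t:<sha256> is sketched below. The length-prefixed encoding is an assumption chosen to make the pre-hash payload unambiguous; the library's actual encoding, and how it keeps tenant keys disjoint from untenanted user ids, is not documented here.

```python
import hashlib
from typing import Optional


def user_scope_key(user_id: str, tenant_id: Optional[str] = None) -> str:
    """Build a per-user scope key, tenant-qualified when a tenant is given."""
    if tenant_id is None:
        return f"user:{user_id}"
    # Length-prefix each field so ("ab", "c") and ("a", "bc") produce
    # different payloads (hypothetical encoding, for illustration only).
    payload = f"{len(tenant_id)}:{tenant_id}{len(user_id)}:{user_id}"
    return "user:t:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Simply concatenating tenant_id and user_id before hashing would let different pairs collide on the same string, which is why each field carries its length.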

Per-session mode isolates even further — each conversation session has its own cache:

cache = SemanticCache(scope="per_session")

caller = CallerContext(
    user_id="alice",
    metadata={"session_id": "sess_abc123"},
)
await agent.ainvoke(input, caller=caller)

Configuration

cache = SemanticCache(
    backend="memory",                # "memory" or "redis"
    similarity_threshold=0.92,       # 0.0-1.0 (higher = stricter matching)
    default_ttl=3600,                # seconds
    scope="per_user",                # "per_user", "per_session", "shared"
    max_entries_per_user=1000,
    max_total_entries=100_000,
    invalidate_on_write=True,        # evict cache when write tools fire
    ttl_patterns={                   # regex → TTL for time-sensitive queries
        r"current|now|today|latest": 60,
        r"price|stock|rate": 30,
    },
)
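
The ttl_patterns option maps regular expressions to TTLs. A minimal sketch of how such a mapping could be resolved for a query follows; the function resolve_ttl is hypothetical, and both case-insensitive matching and "shortest TTL wins when several patterns match" are assumptions, not documented behavior.

```python
import re


def resolve_ttl(
    query: str,
    ttl_patterns: dict[str, int],
    default_ttl: int = 3600,
) -> int:
    """Pick a TTL in seconds for a query from regex -> TTL overrides."""
    matches = [
        ttl
        for pattern, ttl in ttl_patterns.items()
        if re.search(pattern, query, flags=re.IGNORECASE)
    ]
    # Assumption: when several patterns match, the shortest TTL wins.
    return min(matches) if matches else default_ttl


patterns = {
    r"current|now|today|latest": 60,
    r"price|stock|rate": 30,
}
```

With substring matching like re.search, a pattern such as now also matches inside "know"; anchoring with word boundaries (for example \bnow\b) is tighter if that matters for your queries.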

Parameter	Type	Default	Description
`backend`	`str`	`"memory"`	`"memory"` or `"redis"`
`redis_url`	`str`	`None`	Redis connection URL
`embedding`	`EmbeddingProvider | str`	Local model	Embedding provider or model name
`similarity_threshold`	`float`	`0.92`	Min cosine similarity for cache hit
`default_ttl`	`int`	`3600`	Default time-to-live in seconds
`scope`	`str`	`"per_user"`	Cache isolation scope
`max_entries_per_user`	`int`	`1000`	Max entries per scope partition
`max_total_entries`	`int`	`100_000`	Max entries across all scopes
`encrypt_values`	`bool`	`False`	AES encryption at rest (Redis)
`ttl_patterns`	`dict`	`None`	Regex → TTL overrides
`invalidate_on_write`	`bool`	`True`	Evict cache on write tool calls
`cache_multi_turn`	`bool`	`False`	Cache multi-turn conversations
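
To illustrate what similarity_threshold controls, the sketch below computes cosine similarity between a query vector and cached vectors and returns a response only when the best score reaches the threshold. It is a plain linear scan with hypothetical names, not the library's search code.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    query_vec: list[float],
    entries: list[tuple[list[float], str]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the cached response most similar to the query, or None."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

Raising the threshold trades hit rate for precision: fewer queries are served from cache, but those that are match the cached question more closely.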

Cache Backends

In-Memory (default)

Zero dependencies. Sub-millisecond lookups. Lost on restart. Single-process only.

cache = SemanticCache(backend="memory")

Redis

Shared across workers and servers. Survives restarts. Optional AES encryption at rest.

cache = SemanticCache(
    backend="redis",
    redis_url="redis://localhost:6379",
    encrypt_values=True,  # AES encryption — set PROMPTISE_CACHE_KEY env var
)

Requires pip install redis. Encryption requires pip install cryptography.

Set PROMPTISE_CACHE_KEY to a Fernet key for persistent encryption across restarts. If not set, a key is auto-generated per process (cache won't survive restart).
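
A Fernet key is 32 random bytes in URL-safe base64. If you want to generate one without importing cryptography, a standard-library sketch looks like this (the helper name generate_fernet_key is ours, not part of Promptise):

```python
import base64
import os


def generate_fernet_key() -> str:
    """Return 32 random bytes, URL-safe base64 encoded (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
```

Print the result once, store it in your secret manager, and export it as PROMPTISE_CACHE_KEY for every worker that shares the Redis cache.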

Graceful degradation: If Redis is unreachable, cache operations fail silently (logged as warnings) and the agent continues normally, calling the LLM directly.
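
The fail-open pattern described here can be sketched as a wrapper that turns any backend error into a cache miss. The names safe_lookup and cache_sketch are hypothetical; this shows the general technique, not the library's internals.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("cache_sketch")


def safe_lookup(
    lookup: Callable[[str], Optional[str]],
    query: str,
) -> Optional[str]:
    """Run a cache lookup; on any backend error, log and report a miss."""
    try:
        return lookup(query)
    except Exception as exc:  # e.g. the Redis connection was refused
        logger.warning("cache lookup failed, falling back to LLM: %s", exc)
        return None
```

Treating an error as a miss means an outage costs you the savings from caching but never blocks a response.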

Embedding Providers

Local (default)

Uses sentence-transformers — the same model used for semantic tool optimization. Zero API calls, runs locally.

cache = SemanticCache()  # all-MiniLM-L6-v2
cache = SemanticCache(embedding="BAAI/bge-small-en-v1.5")
cache = SemanticCache(embedding="/models/local/custom")

OpenAI / Azure OpenAI

import os

from promptise import OpenAIEmbeddingProvider

cache = SemanticCache(
    embedding=OpenAIEmbeddingProvider(
        model="text-embedding-3-small",
        api_key=os.environ["OPENAI_API_KEY"],  # read the key from the environment
    ),
)

Custom provider

Any object implementing the EmbeddingProvider protocol:

class MyProvider:
    async def embed(self, texts: list[str]) -> list[list[float]]:
        # my_model is a placeholder for your own embedding model
        return my_model.encode(texts)

cache = SemanticCache(embedding=MyProvider())
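
For a fully self-contained provider, for example in unit tests, the sketch below implements the same async embed signature with a toy hashing embedder. It is lexical rather than semantic (it only counts tokens), so treat it as a stand-in for wiring and tests, not as a replacement for a real model. The class name and the dim parameter are our own.

```python
import asyncio
import hashlib
import math


class HashingEmbeddingProvider:
    """Toy bag-of-words embedder: hashes each token into one of `dim` buckets."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    async def embed(self, texts: list[str]) -> list[list[float]]:
        return [self._embed_one(text) for text in texts]

    def _embed_one(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        # L2-normalize so a dot product equals cosine similarity
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec


# Quick check outside an event loop
demo_vectors = asyncio.run(
    HashingEmbeddingProvider(dim=32).embed(["hello world", "Hello  world", ""])
)
```

Since the provider is deterministic and needs no model download, cache tests built on it are fast and reproducible.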

Cache Key

The cache key determines when a hit occurs. It includes:

Component	What it prevents
Scope prefix (`user:42`)	Cross-user data leakage
Query embedding	Semantic similarity matching
Context fingerprint	Stale answers after memory/history changes
Model ID	Serving GPT responses as Claude responses
Instruction hash	Stale responses after prompt updates

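If the scope prefix, context fingerprint, model ID, or instruction hash changes, or the query embedding falls below the similarity threshold against every cached entry, the cache misses and a fresh LLM call is made.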
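
The exact-match components of the key (everything except the query embedding) can be pictured as a tuple that must be identical before similarity search even begins. The sketch below uses hypothetical helper names and a truncated SHA-256 as the fingerprint; the library's real hashing scheme may differ.

```python
import hashlib
import json


def fingerprint(*parts: str) -> str:
    """Short, stable hash of one or more strings."""
    encoded = json.dumps(parts).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


def exact_key(
    scope_prefix: str,
    model_id: str,
    instructions: str,
    memory_context: list[str],
) -> tuple[str, str, str, str]:
    """Components that must match exactly; the embedding is compared separately."""
    return (
        scope_prefix,                  # e.g. "user:42"
        model_id,                      # e.g. "openai:gpt-5-mini"
        fingerprint(instructions),     # instruction hash
        fingerprint(*memory_context),  # context fingerprint
    )
```

A tuple is used instead of a joined string because the scope prefix and model ID already contain colons, which would make a colon-joined key ambiguous.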

Write Invalidation

When a tool with read_only_hint=False fires (create, update, delete), the cache for that scope is evicted. This prevents stale data:

"How many tickets are open?" → cached "47"
create_ticket() fires → cache evicted
"How many tickets are open?" → fresh LLM call → "48"

Disable with invalidate_on_write=False if your tools don't affect query results.
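
The ticket example can be reproduced with a toy, exact-match cache that evicts a scope whenever a non-read-only tool fires. ScopedCache and on_tool_call are hypothetical names for this sketch; the real cache matches by similarity rather than exact text.

```python
from typing import Optional


class ScopedCache:
    """Toy exact-match cache with per-scope eviction on write tools."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, str]] = {}

    def store(self, scope: str, query: str, response: str) -> None:
        self._entries.setdefault(scope, {})[query] = response

    def lookup(self, scope: str, query: str) -> Optional[str]:
        return self._entries.get(scope, {}).get(query)

    def on_tool_call(
        self,
        scope: str,
        read_only_hint: bool,
        invalidate_on_write: bool = True,
    ) -> None:
        # A write tool (read_only_hint=False) evicts only the caller's scope.
        if invalidate_on_write and not read_only_hint:
            self._entries.pop(scope, None)
```

Evicting the whole scope is coarse but safe: the cache cannot know which cached answers a given write affected, so it drops all of them for that scope.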

To remove a user's cached entries on demand, independent of write invalidation, call purge_user:

# Delete all cached data for a user
count = await cache.purge_user("user-42")

# Tenant-scoped callers: purge exactly that tenant's scope
count = await cache.purge_user("user-42", tenant_id="acme")

Observability

Cache events appear in the observability timeline:

cache.hit — response served from cache (with similarity score, cache age)
cache.miss — no cache hit, proceeding to LLM
cache.store — new response stored in cache
cache.error — cache operation failed (non-blocking)
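
If you export these event names from the timeline, a hit rate is straightforward to compute. The sketch assumes you already have the events as a plain list of name strings; how you extract them from the observability timeline is not covered on this page.

```python
from collections import Counter


def hit_rate(event_names: list[str]) -> float:
    """Fraction of cache lookups (hits + misses) that were hits."""
    counts = Counter(event_names)
    lookups = counts["cache.hit"] + counts["cache.miss"]
    return counts["cache.hit"] / lookups if lookups else 0.0
```

Tracking this number over time is a direct way to check whether your similarity_threshold and TTL settings are producing the savings you expect.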

What's Next?

Guardrails — output guardrails always run on cached responses
Tool Optimization — shares the same embedding model
Building Agents — the cache parameter on build_agent()