Replay Engine¶
The ReplayEngine reconstructs agent process state from journal entries after a crash. It reads the journal, finds the last checkpoint, and replays subsequent entries to rebuild the context state and lifecycle state -- enabling the process to resume where it left off.
from promptise.runtime.journal import ReplayEngine, FileJournal
journal = FileJournal()
engine = ReplayEngine(journal)
recovered = await engine.recover("data-watcher")
# {
# "context_state": {"pipeline_status": "degraded", "check_count": 42},
# "lifecycle_state": "running",
# "last_entry_type": "invocation_result",
# "entries_replayed": 7,
# }
Concepts¶
Long-running agent processes may crash due to hardware failures, OOM kills, or bugs. Without recovery, the process would restart from scratch, losing all accumulated state. The replay engine solves this by reconstructing state from the durable journal:
- Load the last checkpoint -- the most recent full state snapshot.
- Read entries after the checkpoint -- incremental changes since the snapshot.
- Replay entries sequentially -- apply state transitions and context updates to rebuild the current state.
This is a classic event sourcing recovery pattern: the checkpoint is the snapshot, and journal entries are the event log.
How Recovery Works¶
Step 1: Load the last checkpoint¶
The engine calls journal.last_checkpoint(process_id). If a checkpoint exists, it provides the starting state:
{
"context_state": {"pipeline_status": "healthy", "check_count": 40},
"lifecycle_state": "running",
}
If no checkpoint exists, recovery starts from an empty state (context_state={}, lifecycle_state="created").
Step 2: Find entries after the checkpoint¶
The engine reads all journal entries for the process and locates the last checkpoint entry. All entries after that point are collected for replay.
Step 3: Replay entries¶
Each entry is applied in chronological order:
state_transitionentries update thelifecycle_statefrom theto_statefield.context_updateentries updatecontext_stateby setting thekeytovalue.- Other entry types are noted (the last entry type is tracked) but do not modify the recovered state.
Using the ReplayEngine¶
Basic recovery¶
from promptise.runtime.journal import ReplayEngine, FileJournal
journal = FileJournal(base_path=".promptise/journal")
engine = ReplayEngine(journal)
# Recover state for a process
recovered = await engine.recover("data-watcher")
print(f"Lifecycle state: {recovered['lifecycle_state']}")
print(f"Context state: {recovered['context_state']}")
print(f"Last entry type: {recovered['last_entry_type']}")
print(f"Entries replayed: {recovered['entries_replayed']}")
Recovery result¶
The recover method returns a dict with:
| Key | Type | Description |
|---|---|---|
context_state |
dict[str, Any] |
Reconstructed context key-value state |
lifecycle_state |
str |
Final lifecycle state ("running", "suspended", etc.) |
last_entry_type |
str |
Type of the last replayed entry |
entries_replayed |
int |
Number of entries replayed after the checkpoint |
Applying recovered state¶
After recovery, apply the state to a new AgentContext and ProcessLifecycle:
from promptise.runtime.context import AgentContext
from promptise.runtime.lifecycle import ProcessLifecycle, ProcessState
recovered = await engine.recover("data-watcher")
# Restore context
ctx = AgentContext(initial_state=recovered["context_state"])
# Restore lifecycle
lc = ProcessLifecycle(initial=ProcessState(recovered["lifecycle_state"]))
Recovery Scenarios¶
Scenario 1: Crash after several invocations¶
Journal entries:
1. state_transition: created -> starting
2. state_transition: starting -> running
3. trigger_event: cron fired
4. invocation_result: success
5. checkpoint: {check_count: 10, lifecycle: "running"}
6. trigger_event: cron fired
7. context_update: check_count = 11
8. invocation_result: success
--- CRASH ---
Recovery:
- Loads checkpoint from entry 5:
check_count=10,lifecycle="running" - Replays entries 6-8:
context_updatesetscheck_count=11 - Returns
context_state={"check_count": 11},lifecycle_state="running",entries_replayed=3
Scenario 2: Crash with no checkpoint¶
Journal entries:
1. state_transition: created -> starting
2. state_transition: starting -> running
3. context_update: pipeline_status = "degraded"
--- CRASH ---
Recovery:
- No checkpoint found. Start from empty state.
- Replay all 3 entries.
- Returns
context_state={"pipeline_status": "degraded"},lifecycle_state="running",entries_replayed=3
Scenario 3: No journal entries¶
recovered = await engine.recover("new-process")
# {"context_state": {}, "lifecycle_state": "created", "last_entry_type": "", "entries_replayed": 0}
API Summary¶
| Method | Description |
|---|---|
ReplayEngine(journal) |
Create with a JournalProvider |
await recover(process_id) |
Recover state from journal entries |
Tips and Gotchas¶
Frequent checkpoints reduce replay time
The replay engine only needs to process entries after the last checkpoint. More frequent checkpoints mean fewer entries to replay, but more disk writes. The default "checkpoint" journal level creates a checkpoint after each trigger-invoke-result cycle.
Use full journal level for detailed recovery
With level="full", the journal captures every context_update, giving the replay engine fine-grained state reconstruction. With level="checkpoint", only the checkpoint snapshots are available.
Replay does not restore memory providers
The replay engine reconstructs context_state (key-value store) but does not restore the MemoryProvider. Long-term memory is managed by its own persistence layer (ChromaDB, Mem0, etc.) and does not need journal-based recovery.
Replay does not restore conversation buffer
The conversation buffer (short-term message history) is not replayed from journal entries. If you need conversation persistence across crashes, include it in checkpoint data.
Idempotency matters
The replay engine assumes entries can be applied sequentially. If your journal contains entries with side effects beyond state mutations (e.g., external API calls), those side effects are not replayed -- only the state changes recorded in the journal.
What's Next¶
- Journal Overview -- journal concepts and entry types
- Journal Backends --
FileJournalandInMemoryJournal - Lifecycle -- the state machine that produces
state_transitionentries - Context -- the state layer that produces
context_updateentries