Replay Engine¶

The ReplayEngine reconstructs agent process state from journal entries after a crash. It reads the journal, finds the last checkpoint, and replays subsequent entries to rebuild the context state and lifecycle state -- enabling the process to resume where it left off.

from promptise.runtime.journal import ReplayEngine, FileJournal

journal = FileJournal()
engine = ReplayEngine(journal)

recovered = await engine.recover("data-watcher")
# {
#     "context_state": {"pipeline_status": "degraded", "check_count": 42},
#     "lifecycle_state": "running",
#     "last_entry_type": "invocation_result",
#     "entries_replayed": 7,
# }

Concepts¶

Long-running agent processes may crash due to hardware failures, OOM kills, or bugs. Without recovery, the process would restart from scratch, losing all accumulated state. The replay engine solves this by reconstructing state from the durable journal:

Load the last checkpoint -- the most recent full state snapshot.
Read entries after the checkpoint -- incremental changes since the snapshot.
Replay entries sequentially -- apply state transitions and context updates to rebuild the current state.

This is a classic event sourcing recovery pattern: the checkpoint is the snapshot, and journal entries are the event log.

How Recovery Works¶

Step 1: Load the last checkpoint¶

The engine calls journal.last_checkpoint(process_id). If a checkpoint exists, it provides the starting state:

{
    "context_state": {"pipeline_status": "healthy", "check_count": 40},
    "lifecycle_state": "running",
}

If no checkpoint exists, recovery starts from an empty state (context_state={}, lifecycle_state="created").

Step 2: Find entries after the checkpoint¶

The engine reads all journal entries for the process and locates the last checkpoint entry. All entries after that point are collected for replay.

Step 3: Replay entries¶

Each entry is applied in chronological order:

state_transition entries update the lifecycle_state from the to_state field.
context_update entries update context_state by setting the key to value.
Other entry types are noted (the last entry type is tracked) but do not modify the recovered state.

Using the ReplayEngine¶

Basic recovery¶

from promptise.runtime.journal import ReplayEngine, FileJournal

journal = FileJournal(base_path=".promptise/journal")
engine = ReplayEngine(journal)

# Recover state for a process
recovered = await engine.recover("data-watcher")

print(f"Lifecycle state: {recovered['lifecycle_state']}")
print(f"Context state:   {recovered['context_state']}")
print(f"Last entry type: {recovered['last_entry_type']}")
print(f"Entries replayed: {recovered['entries_replayed']}")

Recovery result¶

The recover method returns a dict with:

Key	Type	Description
`context_state`	`dict[str, Any]`	Reconstructed context key-value state
`lifecycle_state`	`str`	Final lifecycle state (`"running"`, `"suspended"`, etc.)
`last_entry_type`	`str`	Type of the last replayed entry
`entries_replayed`	`int`	Number of entries replayed after the checkpoint

Applying recovered state¶

After recovery, apply the state to a new AgentContext and ProcessLifecycle:

from promptise.runtime.context import AgentContext
from promptise.runtime.lifecycle import ProcessLifecycle, ProcessState

recovered = await engine.recover("data-watcher")

# Restore context
ctx = AgentContext(initial_state=recovered["context_state"])

# Restore lifecycle
lc = ProcessLifecycle(initial=ProcessState(recovered["lifecycle_state"]))

Recovery Scenarios¶

Scenario 1: Crash after several invocations¶

Journal entries:
  1. state_transition: created -> starting
  2. state_transition: starting -> running
  3. trigger_event: cron fired
  4. invocation_result: success
  5. checkpoint: {check_count: 10, lifecycle: "running"}
  6. trigger_event: cron fired
  7. context_update: check_count = 11
  8. invocation_result: success
  --- CRASH ---

Recovery:

Loads checkpoint from entry 5: check_count=10, lifecycle="running"
Replays entries 6-8: context_update sets check_count=11
Returns context_state={"check_count": 11}, lifecycle_state="running", entries_replayed=3

Scenario 2: Crash with no checkpoint¶

Journal entries:
  1. state_transition: created -> starting
  2. state_transition: starting -> running
  3. context_update: pipeline_status = "degraded"
  --- CRASH ---

Recovery:

No checkpoint found. Start from empty state.
Replay all 3 entries.
Returns context_state={"pipeline_status": "degraded"}, lifecycle_state="running", entries_replayed=3

Scenario 3: No journal entries¶

recovered = await engine.recover("new-process")
# {"context_state": {}, "lifecycle_state": "created", "last_entry_type": "", "entries_replayed": 0}

API Summary¶

Method	Description
`ReplayEngine(journal)`	Create with a `JournalProvider`
`await recover(process_id)`	Recover state from journal entries

Tips and Gotchas¶

Frequent checkpoints reduce replay time

The replay engine only needs to process entries after the last checkpoint. More frequent checkpoints mean fewer entries to replay, but more disk writes. The default "checkpoint" journal level creates a checkpoint after each trigger-invoke-result cycle.

Use full journal level for detailed recovery

With level="full", the journal captures every context_update, giving the replay engine fine-grained state reconstruction. With level="checkpoint", only the checkpoint snapshots are available.

Replay does not restore memory providers

The replay engine reconstructs context_state (key-value store) but does not restore the MemoryProvider. Long-term memory is managed by its own persistence layer (ChromaDB, Mem0, etc.) and does not need journal-based recovery.

Replay does not restore conversation buffer

The conversation buffer (short-term message history) is not replayed from journal entries. If you need conversation persistence across crashes, include it in checkpoint data.

Idempotency matters

The replay engine assumes entries can be applied sequentially. If your journal contains entries with side effects beyond state mutations (e.g., external API calls), those side effects are not replayed -- only the state changes recorded in the journal.

What's Next¶

Journal Overview -- journal concepts and entry types
Journal Backends -- FileJournal and InMemoryJournal
Lifecycle -- the state machine that produces state_transition entries
Context -- the state layer that produces context_update entries