STATEK Replay and Recovery

STATEK persists agent execution state. It does not replay the outside world.

That distinction matters. A STATEK job can keep Python variables, chat history, console output, tool results, pushed messages, errors, usage, and continuation fields. A dbzero-backed application object can keep durable business state. After a process restart, a worker can find persisted jobs and keep processing them.

But if a tool already sent an email, moved a meeting in an external calendar, wrote a file, charged a card, or called a third-party API, STATEK does not make that side effect deterministic, reversible, or exactly-once by itself.

Use STATEK durability as the persisted record of what happened and what the agent knew. Use application-level recovery rules for external systems.

What STATEK Persists

A job is a durable Python workspace.

At the job level, STATEK persists:

  • JobDef: the frozen definition of work, including agent, metadata, prompt params, warmup code, chat style, and locale
  • Job: the durable execution unit and current status
  • PyEnv.local_state: Python variables the job can reuse later
  • PyEnv.global_state: globals available to executed code
  • PyEnv.console: output captured from Python print(...)
  • PyEnv.push_log: messages pushed into an active job
  • chat_log: user messages, LLM responses, warmup entries, and subtask notifications
  • tool_log: tool results and tool errors attached to chat items
  • usage: provider usage and cost where available
  • error state and exception history
  • parent and child job relationships
  • awaited_result and next_instr_num for suspended execution

At the application level, dbzero-backed Python objects are durable application state:

user
calendar
meeting
dataset
analysis_result

If those are dbzero-backed objects, changes to them are persisted according to your dbzero runtime and storage configuration.

What This Gives You

The practical result is inspection and continuation.

If a job started with:

user = current_user
calendar = user.calendar
today = request_day

and later created:

events = calendar.events_for(today)
meeting = calendar.find_meeting("planning", day=today)
slot = calendar.find_empty_slot(after=meeting.ends_at)

the job can keep those variables in its Python state. Operators can inspect the history: what the user asked, what the model answered, what code ran, what was printed, what tools returned, and where the job paused or failed.

That is different from rerunning a script from the beginning. STATEK stores the workspace and history as they exist now, so continuation picks up from that state instead of recomputing it.

Job Status After Restart

After a process restart, a STATEK worker can look at persisted jobs by status.

  • READY: the job exists and can be picked up by the worker loop.
  • WARMING_UP: startup code was in progress or still needs to advance.
  • STARTED: normal job execution can continue from persisted state and history.
  • SUSPENDED: the job is waiting for a future or external condition.
  • DONE: the job's current unit of work is complete.

The worker loop schedules runnable jobs and checks suspended jobs. If a suspended job's awaited condition is ready, the job can move back to STARTED.

For callbacks and user follow-ups, a pushed message can also reactivate a DONE job. The message is added to the job history or push log, the job returns to STARTED, and the agent can continue from the persisted context.
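
The status rules above can be read as a small state machine. The sketch below is an illustrative model, not STATEK's API: `Status`, `SketchJob`, `worker_tick`, and `push_message` are hypothetical names, and treating READY, WARMING_UP, and STARTED as the runnable set is an assumption. Only the status names come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Status(Enum):
    READY = "READY"
    WARMING_UP = "WARMING_UP"
    STARTED = "STARTED"
    SUSPENDED = "SUSPENDED"
    DONE = "DONE"


@dataclass
class SketchJob:
    # Hypothetical stand-in for a persisted job record.
    status: Status
    push_log: List[str] = field(default_factory=list)
    # Assumed shape: returns True once the awaited condition is ready.
    awaited_ready: Optional[Callable[[], bool]] = None


# Assumption: these statuses are what the worker loop treats as runnable.
RUNNABLE = {Status.READY, Status.WARMING_UP, Status.STARTED}


def worker_tick(jobs: List[SketchJob]) -> List[SketchJob]:
    """Return the jobs a worker would schedule on this pass."""
    scheduled = []
    for job in jobs:
        if job.status is Status.SUSPENDED and job.awaited_ready and job.awaited_ready():
            job.status = Status.STARTED  # awaited condition is ready
        if job.status in RUNNABLE:
            scheduled.append(job)
    return scheduled


def push_message(job: SketchJob, message: str) -> None:
    """Record a pushed message and reactivate a DONE job."""
    job.push_log.append(message)
    if job.status is Status.DONE:
        job.status = Status.STARTED


waiting = SketchJob(Status.SUSPENDED, awaited_ready=lambda: False)
ready = SketchJob(Status.SUSPENDED, awaited_ready=lambda: True)
finished = SketchJob(Status.DONE)

scheduled = worker_tick([waiting, ready, finished])  # only `ready` is scheduled
push_message(finished, "user follow-up")             # DONE -> STARTED
```

In this model a DONE job is never scheduled by the tick alone; only a pushed message brings it back.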

Suspended Jobs

Suspension preserves where a job paused.

When Python reaches a future-like value that is not ready, STATEK stores:

  • awaited_result: what the job is waiting for
  • next_instr_num: where Python execution should continue
  • the job status, usually SUSPENDED

Later, the worker checks whether the awaited result is ready. If it is, the job continues from the stored point rather than repeating earlier instructions.

The current futures implementation uses polling for suspended jobs. Keep waits bounded and cheap, and prefer callbacks or event queues for broad external event delivery.
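
To make the resume point concrete, here is a minimal sketch of suspend-and-continue. It assumes a job is a list of instruction callables and that `next_instr_num` points at the instruction after the one that produced the pending value. `PendingFuture`, `ResumableJob`, `run`, and the `"awaited"` key are hypothetical names for illustration. How STATEK actually indexes instructions is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


class PendingFuture:
    """Hypothetical future-like value that becomes ready later."""

    def __init__(self) -> None:
        self.ready = False
        self.value: Any = None

    def resolve(self, value: Any) -> None:
        self.ready = True
        self.value = value


@dataclass
class ResumableJob:
    instructions: List[Callable[[Dict[str, Any]], Optional[PendingFuture]]]
    local_state: Dict[str, Any] = field(default_factory=dict)
    status: str = "STARTED"
    next_instr_num: int = 0
    awaited_result: Optional[PendingFuture] = None


def run(job: ResumableJob) -> None:
    """Run from next_instr_num; suspend at the first unready future."""
    if job.awaited_result is not None:
        if not job.awaited_result.ready:
            return  # still waiting: a cheap poll, nothing re-runs
        job.local_state["awaited"] = job.awaited_result.value
        job.awaited_result = None
        job.status = "STARTED"
    while job.next_instr_num < len(job.instructions):
        result = job.instructions[job.next_instr_num](job.local_state)
        job.next_instr_num += 1  # earlier instructions are never repeated
        if isinstance(result, PendingFuture) and not result.ready:
            job.awaited_result = result
            job.status = "SUSPENDED"
            return
    job.status = "DONE"


calls: List[str] = []
future = PendingFuture()


def step_one(state):
    calls.append("one")
    state["x"] = 1


def step_wait(state):
    calls.append("wait")
    return future


def step_three(state):
    calls.append("three")
    state["y"] = state["x"] + state["awaited"]


job = ResumableJob([step_one, step_wait, step_three])
run(job)            # runs steps one and wait, then suspends
run(job)            # poll: future not ready, job stays suspended
future.resolve(41)
run(job)            # resumes at step_three without repeating earlier steps
```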

Durable State Is Not Deterministic Replay

Durable state means STATEK remembers persisted Python objects and job history.

Deterministic replay would mean STATEK can safely rerun past execution and recreate the same world, including every external side effect. That is not what STATEK promises.

Consider this conceptual code:

meeting.move_to(slot)
send_confirmation_email(user, meeting)

If meeting is dbzero-backed, the meeting update can be durable application state. If send_confirmation_email delivered an email, that external side effect is outside STATEK's durable object model.

If the process crashes after the email is sent but before the next durable marker is written, your application must decide what recovery means. It might check an email provider event ID, inspect a durable notification_sent field, resend only when safe, or create a follow-up task for an operator.
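
One possible shape for that decision is sketched below. It assumes the email provider can look up a send by a caller-chosen key, which not every provider supports. `ConfirmationRecord`, `FakeEmailProvider`, and `recover_confirmation` are hypothetical names, and the in-memory provider stands in for a real service.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ConfirmationRecord:
    # Stand-in for a durable application object (illustrative fields).
    meeting_id: str
    notification_sent: bool = False
    email_event_id: Optional[str] = None


class FakeEmailProvider:
    """In-memory stand-in for a provider that can look up sends by key."""

    def __init__(self) -> None:
        self.events: Dict[str, str] = {}  # dedupe key -> provider event ID
        self.deliveries = 0

    def send(self, dedupe_key: str) -> str:
        self.deliveries += 1
        event_id = f"evt-{self.deliveries}"
        self.events[dedupe_key] = event_id
        return event_id

    def find_event(self, dedupe_key: str) -> Optional[str]:
        return self.events.get(dedupe_key)


def recover_confirmation(record: ConfirmationRecord, provider: FakeEmailProvider) -> str:
    """Decide what recovery means for one confirmation email."""
    key = f"confirm:{record.meeting_id}"
    if record.notification_sent:
        return "already_recorded"       # durable state is authoritative
    event_id = provider.find_event(key)
    if event_id is None:
        event_id = provider.send(key)   # no evidence of a send, so resend
        outcome = "resent"
    else:
        outcome = "reconciled"          # email went out before the crash
    record.email_event_id = event_id
    record.notification_sent = True
    return outcome


provider = FakeEmailProvider()
record = ConfirmationRecord("m-1")
provider.send("confirm:m-1")            # email delivered, then the process crashed
first = recover_confirmation(record, provider)
second = recover_confirmation(record, provider)
```

Here recovery records the existing provider event instead of sending a second email, and a repeated recovery pass is a no-op.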

Design Idempotent Tools

Treat side-effecting tools like distributed-system write paths.

A side-effecting tool should either be safe to retry or be able to detect that it already completed. Prefer durable state transitions around external calls:

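# Only issue the external move if no request has been recorded yet.
if meeting.move_request_id is None:
    # Store the external operation ID so later code can look it up.
    meeting.move_request_id = external_calendar.move(meeting, slot)
    meeting.status = "move_requested"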

Then later code can check current state before applying another change:

# Reconcile with the external system before applying another change.
if meeting.status == "move_requested":
    result = external_calendar.lookup(meeting.move_request_id)
    if result.completed:
        meeting.status = "moved"
The exact fields are application-specific. The pattern is what matters:

  • store external operation IDs
  • persist intent before or around side effects
  • check durable state before retrying
  • make duplicate callbacks safe
  • make late callbacks safe
  • record failures explicitly
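
As one way to apply "persist intent before or around side effects", the sketch below generates a request ID and stores it before calling out, so a retry after a crash reuses the same ID. This assumes the external system accepts a caller-supplied ID and deduplicates on it, which is a different contract from an API that returns its own ID. `MovableMeeting`, `FakeExternalCalendar`, and `request_move` are hypothetical names.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MovableMeeting:
    # Stand-in for a durable object; field names are illustrative.
    title: str
    status: str = "scheduled"
    move_request_id: Optional[str] = None


class FakeExternalCalendar:
    """Hypothetical external API that deduplicates on a caller-supplied ID."""

    def __init__(self) -> None:
        self.moves: Dict[str, str] = {}  # request ID -> slot
        self.calls = 0

    def move(self, request_id: str, slot: str) -> None:
        self.calls += 1
        self.moves.setdefault(request_id, slot)  # a repeated ID is a no-op


def request_move(meeting: MovableMeeting, slot: str, calendar: FakeExternalCalendar) -> None:
    if meeting.status == "moved":
        return  # durable state says the work is finished
    if meeting.move_request_id is None:
        # Persist intent first: a crash after this point leaves a key to retry with.
        meeting.move_request_id = str(uuid.uuid4())
        meeting.status = "move_requested"
    calendar.move(meeting.move_request_id, slot)
    meeting.status = "moved"


calendar = FakeExternalCalendar()
meeting = MovableMeeting("planning")
request_move(meeting, "14:00", calendar)
meeting.status = "move_requested"          # simulate a crash before "moved" was persisted
request_move(meeting, "14:00", calendar)   # retry reuses the same request ID
```

The external API is called twice, but only one move is recorded because both calls carry the same ID.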

Separate Planning From Committing

Agents are good at planning with Python state:

meeting = calendar.find_meeting("planning", day=today)
slot = calendar.find_empty_slot(after=meeting.ends_at)
summary = describe_move(meeting, slot)

Committing a risky action should usually go through a controlled boundary:

approval = request_human_approval(user, summary)

and then:

if approval.accepted:
    move_meeting_after_approval(meeting, slot)

This gives recovery a durable shape. You can inspect what was proposed, who approved it, whether the action started, and whether it completed.
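
A minimal sketch of such a boundary follows, assuming the proposal is itself a durable record with an explicit state. `Proposal` and its methods are hypothetical, and the in-memory dataclass stands in for a persisted object.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proposal:
    summary: str
    approved_by: str = ""
    state: str = "proposed"  # proposed -> approved -> started -> completed
    history: List[str] = field(default_factory=lambda: ["proposed"])

    def _advance(self, expected: str, new: str) -> None:
        if self.state != expected:
            raise RuntimeError(f"cannot go from {self.state} to {new}")
        self.state = new
        self.history.append(new)

    def approve(self, approver: str) -> None:
        self._advance("proposed", "approved")
        self.approved_by = approver

    def commit(self, action: Callable[[], None]) -> None:
        self._advance("approved", "started")   # record that the action began
        action()
        self._advance("started", "completed")  # record that it finished


moved: List[str] = []
proposal = Proposal("Move planning to 14:00")

try:
    proposal.commit(lambda: moved.append("too early"))
except RuntimeError:
    pass  # unapproved proposals cannot commit

proposal.approve("alice")
proposal.commit(lambda: moved.append("planning -> 14:00"))
```

If the action raises partway through, the record stays in `started`, which tells an operator that the commit began but did not finish.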

Compensating Actions

Some side effects cannot be undone.

If an agent sends a message to a user, calls an external service, or publishes a report, the recovery path may be a compensating action instead of a rollback.

Examples:

  • send a correction message
  • restore a previous calendar time
  • mark an export as failed and start a replacement
  • create a manual review task
  • record that an external action happened but could not be reversed

Store compensation state in application objects. Put the explanation in job history. The operator should be able to answer: what did the agent believe, what did it do, what persisted, what external side effect happened, and what recovery action followed.
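
The following sketch shows one way to keep that record: try compensating actions in order, note each outcome, and escalate when none succeeds. `ActionRecord`, `recover`, and the status values are hypothetical. A real application would store them on durable objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ActionRecord:
    name: str
    status: str = "done"  # done -> compensated | needs_review
    notes: List[str] = field(default_factory=list)


def recover(record: ActionRecord, compensations: List[Callable[[], str]]) -> None:
    """Try each compensating action in order; escalate if none succeeds."""
    for attempt in compensations:
        try:
            record.notes.append(attempt())
            record.status = "compensated"
            return
        except Exception as exc:
            record.notes.append(f"failed: {exc}")  # record failures explicitly
    record.status = "needs_review"  # hand off to a manual review task


def restore_previous_time() -> str:
    raise RuntimeError("calendar API unavailable")


def send_correction() -> str:
    return "correction message sent"


record = ActionRecord("move planning meeting")
recover(record, [restore_previous_time, send_correction])

stuck = ActionRecord("publish report")
recover(stuck, [])  # nothing can compensate, so escalate
```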

Snapshots and Inspection

dbzero-backed state gives your application durable objects that can be inspected according to your dbzero setup.

For STATEK work, that means inspection can include:

  • job objects and statuses
  • job definitions and frozen metadata
  • Python local state for jobs
  • chat history, console output, tool results, errors, and usage
  • dbzero-backed application objects referenced by jobs

Use that for debugging and audit. Do not assume it gives you a public API for forking jobs from arbitrary historical states unless that capability is explicitly implemented and documented for your version.

Failure Scenarios

Use these mental models when designing recovery:

  • If a job fails before a side effect, retrying may be safe.
  • If a job fails after a dbzero object update, the durable object may already be changed.
  • If a job fails after an external API call, the external system may already have acted.
  • If a callback is delivered twice, the handler should check durable state before writing.
  • If a suspended job waits forever, your application needs a timeout or escalation policy.
  • If a child job fails, the parent should receive a durable error signal and decide what to do next.
  • If a process restarts, the worker can continue from persisted job state, but external systems still need reconciliation.
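
Two of these scenarios, duplicate callbacks and waits that never finish, can be sketched together. The example assumes each callback carries a unique ID and that elapsed wait time is tracked by the application. `ExportRecord`, `handle_callback`, and `expire_if_overdue` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ExportRecord:
    status: str = "pending"  # pending -> completed | timed_out
    seen_callbacks: Set[str] = field(default_factory=set)
    writes: int = 0


def handle_callback(record: ExportRecord, callback_id: str) -> str:
    """Check durable state before writing."""
    if callback_id in record.seen_callbacks:
        return "duplicate_ignored"
    record.seen_callbacks.add(callback_id)
    if record.status != "pending":
        return "late_ignored"  # e.g. the wait already timed out
    record.status = "completed"
    record.writes += 1
    return "applied"


def expire_if_overdue(record: ExportRecord, waited_seconds: float, limit_seconds: float) -> bool:
    """Apply a timeout policy to a record that is still pending."""
    if record.status == "pending" and waited_seconds > limit_seconds:
        record.status = "timed_out"
        return True
    return False


first_export = ExportRecord()
results = [
    handle_callback(first_export, "cb-1"),
    handle_callback(first_export, "cb-1"),  # same callback delivered twice
]

second_export = ExportRecord()
expired = expire_if_overdue(second_export, waited_seconds=120, limit_seconds=60)
late = handle_callback(second_export, "cb-2")  # arrives after the timeout
```

Whether a late callback should be ignored, as here, or should reopen the record is an application decision.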

Recovery Checklist

For every side-effecting workflow, decide:

  • what durable object records the intent
  • what durable object records completion
  • what external ID proves the side effect happened
  • whether retry is safe
  • what duplicate callbacks should do
  • what timeout means
  • what an operator should inspect
  • what compensating action is possible

Warning

STATEK durable history is not deterministic replay. It does not guarantee exactly-once external side effects, automatic rollback, or safe re-execution of third-party calls. Production recovery needs idempotent tools, durable status fields, authorization, audit logs, backups, and operational monitoring. See Security for the code-execution and credential boundary.

Practical Rule

Use STATEK to persist the agent's Python workspace and execution record.

Use dbzero objects as the durable source of application truth.

Use explicit application recovery rules for anything outside that durable object model: APIs, files, email, payments, notifications, human approvals, and other systems that may act independently of the STATEK process.

Where to go next

Read Security for runtime boundaries, Operations for worker and restart expectations, Tools for side-effecting function design, and Callbacks and Interruptions for external event handling.