STATEK Error Handling

This guide covers errors that happen inside a durable STATEK job.

Use it to decide where a failure belongs: a normal application outcome, an inspectable job failure, or runtime cleanup that must run when a job terminates. The most important pattern is cleanup for claimed external work: if a job takes ownership of a queued item, message, or task, the application needs a way to mark that item as failed or release the claim if the job dies.

⚠️ Warning: STATEK error handlers are cleanup and notification hooks. They do not make external side effects reversible, exactly-once, or safe. Keep side effects idempotent, store durable status fields, and avoid leaking secrets or sensitive data through exception messages.

Error Taxonomy

Expected application outcomes should usually be explicit return values.

For example, conditions such as "no work is available", "approval was denied", or "the item is already processed" should usually not be raised as Python exceptions. Return a status the agent or worker can handle:

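from dataclasses import dataclass
 
 
@dataclass
class FetchResult:
    status: str
    item: WorkItem | None = None  # WorkItem is the application's own model
 
 
def describe_next_item(owner):
    # Illustrative wrapper; fetch_next_item is assumed to return a FetchResult.
    result = fetch_next_item(owner)
 
    if result.status == "empty":
        return "No item is available right now."
    return f"Claimed: {result.item}"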

Unexpected failures are different:

  • tool exceptions are captured as tool errors and counted by harness exception policies
  • Python execution exceptions are stored in the job's execution history and console state
  • FutureError is for future or temporal suspension, not generic tool failure
  • LLM_HarnessError stops the job through the critical-error path
  • critical worker or job failures invoke registered job error handlers

The application should decide which failures are recoverable by the agent and which ones are operational failures that need durable cleanup.

Tool Errors

If a tool raises, STATEK records the exception type and message in the tool result. The agent can see that failure in job history, and harness exception policies count repeated tool failures.

Prefer structured results when the model can take a useful next step:

from statek import tool
 
 
@tool
def fetch_next_item(owner, **kwargs):
    """Return the next item for this worker, or an empty status."""
    item = WorkItem.claim_next(owner)
    if item is None:
        return {"status": "empty"}
    return {"status": "claimed", "owner": owner, "item": item}

Raise when the tool really failed and the failure should be visible as an exception:

@tool
def process_item(item, **kwargs):
    """Process a claimed work item."""
    if item.status != "CLAIMED":
        raise RuntimeError("item is not claimed by this job")
    return item.process()

Avoid putting credentials, raw user data, webhook bodies, or private infrastructure details into exception messages. Those messages can become part of persisted job history and model-visible context.
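
One way to follow this rule is to catch the low-level failure inside the tool and raise a new exception whose message names only a non-sensitive identifier and the error type. The sketch below is illustrative: `ProcessingError`, `call_upstream`, and `process` are hypothetical names, not part of STATEK, and the leaking upstream call is simulated.

```python
class ProcessingError(RuntimeError):
    """Hypothetical error whose message carries only non-sensitive identifiers."""


def call_upstream(api_key: str) -> None:
    # Stand-in for a client call whose own error message leaks a credential.
    raise ConnectionError(f"auth rejected for key {api_key}")


def process(item_id: str, api_key: str) -> str:
    try:
        call_upstream(api_key)
    except ConnectionError as exc:
        # Name the item and the error type only. "from None" suppresses display
        # of the original exception, so its message is not rendered with ours.
        raise ProcessingError(
            f"processing failed for item {item_id} ({type(exc).__name__})"
        ) from None
    return "ok"
```

With this shape, the text that may end up in job history identifies what failed without repeating the credential that the upstream error contained.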

Job Error Handlers

Use @error_handler for runtime cleanup that should run if a job terminates through the critical-error path.

The verified handler protocol is:

  • @error_handler marks a handler
  • the handler name must start with _
  • the handler signature is (context, error=None)
  • @tool(error_handler=_handler) registers the handler with the tool result as context
  • if the tool returns FutureResult, registration is deferred until .value resolves
  • Job.notify_handlers(error=...) invokes registered handlers and clears the list
  • handler exceptions are suppressed

Because handler exceptions are suppressed, keep handlers small. If cleanup is critical, the handler should persist its own status, tags, timestamps, or audit record so an operator can see whether cleanup ran.

Claim and Cleanup Pattern

A common pattern is a worker tool that claims one durable item and registers cleanup for that claimed context.

from statek import error_handler, tool
 
 
class QueueOwner:
    id: str
 
 
class WorkItem:
    status: str
    owner_id: str | None
    error_message: str | None
 
    @classmethod
    def claim_next(cls, owner):
        ...
 
    def mark_error(self, message):
        self.status = "ERROR"
        self.error_message = message
 
    def release_claim(self):
        self.status = "READY"
        self.owner_id = None
 
 
@error_handler
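def _mark_item_error(context, error=None):
    # Defensive guard: fetch_next_item returns None when nothing was claimed.
    if context is None:
        return
    owner, item = context
    # Only touch the item if this owner still holds the claim.
    if item.status == "CLAIMED" and item.owner_id == owner.id:
        item.mark_error(str(error) if error else "job terminated")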
 
 
@tool(error_handler=_mark_item_error)
def fetch_next_item(owner: QueueOwner, **kwargs):
    """Claim and return the next work item for this worker."""
    item = WorkItem.claim_next(owner)
    if item is None:
        return None
    return owner, item

When fetch_next_item(...) returns (owner, item), STATEK registers _mark_item_error on the current job with that tuple as handler context. If the worker later fails through critical job termination, the handler can mark the item ERROR or release the claim.

The handler is not a normal LLM-facing capability. It is an internal runtime cleanup path associated with the tool result. If the cleanup function should not be visible in prompts or provider tool payloads, do not register it as a public tool. If you expose related operational functions, make them explicit administrative tools with their own permissions.
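
The claim, register, and notify sequence can be modeled without the runtime to make the ordering concrete. This is a toy model under stated assumptions: `FakeJob`, `Owner`, `Item`, and `claim` are hypothetical stand-ins, and `FakeJob.notify_handlers` only imitates the "invoke, suppress, clear" behavior listed in the handler protocol; it is not the STATEK `Job` class.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Owner:
    id: str


@dataclass
class Item:
    owner_id: Optional[str] = None
    status: str = "READY"
    error_message: Optional[str] = None


@dataclass
class FakeJob:
    """Hypothetical stand-in for a job's handler list."""
    handlers: list = field(default_factory=list)

    def register(self, handler, context):
        self.handlers.append((handler, context))

    def notify_handlers(self, error=None):
        # Invoke every registered handler once, suppress failures, clear the list.
        pending, self.handlers = self.handlers, []
        for handler, context in pending:
            try:
                handler(context, error=error)
            except Exception:
                pass


def _mark_item_error(context, error=None):
    owner, item = context
    if item.status == "CLAIMED" and item.owner_id == owner.id:
        item.status = "ERROR"
        item.error_message = str(error) if error else "job terminated"


def claim(job, owner, item):
    # Mimics a tool that claims an item and registers cleanup with its result.
    item.status, item.owner_id = "CLAIMED", owner.id
    job.register(_mark_item_error, (owner, item))
    return owner, item
```

In this model, calling `job.notify_handlers(error=...)` after `claim(...)` marks the item as ERROR once. A second call does nothing, because the handler list has been cleared and the ownership guard no longer matches.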

Future-Returning Claims

If a tool returns FutureResult, STATEK does not have the claimed context yet. The error handler is bound to the future and registered only when the future's .value resolves successfully.

@tool(error_handler=_mark_item_error)
def wait_for_next_item(owner: QueueOwner, **kwargs):
    """Return a future that resolves to (owner, item) when work is available."""
    return next_item_future(owner)

When the result is not ready, accessing the future raises FutureError and the job can suspend. That suspension is not a generic failure. Once the future resolves to (owner, item), STATEK registers _mark_item_error with the resolved tuple as context.
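
The deferred registration can be pictured with a toy future. This is a sketch of the ordering only: `ToyFuture`, `NotReady` (standing in for FutureError), and the `registered` list are hypothetical and say nothing about how STATEK implements FutureResult internally.

```python
class NotReady(Exception):
    """Stand-in for FutureError in this toy model."""


class ToyFuture:
    def __init__(self, on_first_value):
        self._resolved = False
        self._result = None
        self._on_first_value = on_first_value
        self._registered = False

    def resolve(self, result):
        self._result, self._resolved = result, True

    @property
    def value(self):
        if not self._resolved:
            # No context exists yet, so nothing is registered.
            raise NotReady("result is not available yet")
        if not self._registered:
            # Registration happens only now, with the resolved result as context.
            self._on_first_value(self._result)
            self._registered = True
        return self._result


registered = []  # (handler name, context) pairs on the toy job

future = ToyFuture(lambda ctx: registered.append(("_mark_item_error", ctx)))
```

Reading `future.value` before `future.resolve(...)` raises `NotReady` and leaves `registered` empty. After resolution, the first read registers the handler once with the resolved tuple, and later reads do not register it again.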

Use this only when future polling is an acceptable fit for the wait. For product-facing event delivery, callbacks and queues are usually a better public integration model.

Harness Failures

The LLM harness enforces per-job limits for turns, token usage, total exceptions, and consecutive exceptions. When a limit is exceeded, it raises LLM_HarnessError.

In the normal worker path, STATEK records the error in the job console, sets an exit status, marks the job DONE, and calls critical-error handling. Registered job error handlers are invoked with the harness error.

This makes harness failures a strong reason to register cleanup for claimed external work. A job can hit a limit after it has already claimed an item, called a tool, or updated application state.
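
To see why a limit can trip in the middle of claimed work, it helps to model the two exception counters. The sketch below is an assumption-laden toy: `ExceptionCounter`, `LimitExceeded` (standing in for LLM_HarnessError), the field names, and the rule that a success resets only the consecutive count are illustrative choices, not STATEK's configuration or implementation.

```python
from dataclasses import dataclass


class LimitExceeded(Exception):
    """Stand-in for LLM_HarnessError in this toy model."""


@dataclass
class ExceptionCounter:
    max_total: int
    max_consecutive: int
    total: int = 0
    consecutive: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.consecutive = 0  # a success resets the streak, not the total
            return
        self.total += 1
        self.consecutive += 1
        if self.consecutive > self.max_consecutive:
            raise LimitExceeded("too many consecutive exceptions")
        if self.total > self.max_total:
            raise LimitExceeded("too many exceptions overall")
```

In this model a job can succeed at claiming an item, fail a few later tool calls, and then exceed the total limit on an unrelated call. That is the situation in which a registered cleanup handler is the only remaining path for releasing the claim.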

Delegated Jobs

Child jobs can inherit parent error handlers when they are created with a parent job. Use this when delegated work still owns parent-claimed context or must participate in the same cleanup path.

Be explicit about ownership. If both the parent and child can register cleanup for the same external item, design the cleanup operation to be idempotent and guarded by durable ownership checks:

if item.status == "CLAIMED" and item.owner_id == owner.id:
    item.mark_error("delegated job terminated")

That check matters because an inherited handler and a newly registered child handler could otherwise both act on the same item.
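
A small sketch of why the ownership guard makes duplicate cleanup harmless. `Item` and `cleanup` are hypothetical names used only for this illustration; the guard condition is the one shown above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    owner_id: Optional[str]
    status: str = "CLAIMED"
    error_messages: list = field(default_factory=list)

    def mark_error(self, message):
        self.status = "ERROR"
        self.error_messages.append(message)


def cleanup(owner_id, item, message):
    # Acts only while this owner still holds the claim; returns whether it acted.
    if item.status == "CLAIMED" and item.owner_id == owner_id:
        item.mark_error(message)
        return True
    return False
```

If a child handler and a parent handler both call `cleanup` for the same item, only the first call changes state. The second sees that the status is no longer CLAIMED and returns without recording a second error.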

Warmup Failures

Warmup runs before normal LLM turns. Warmup failures are startup or job-definition failures, not errors the agent can recover from during a normal LLM turn.

Do not rely on LLM-turn exception counters for warmup cleanup. If warmup claims external work, either keep claim acquisition outside fragile warmup code or use a verified tool and handler pattern so the claim registers cleanup when it succeeds.

For queue-picking warmup, prefer narrow, deterministic code:

claim = fetch_next_item(owner) #STATEK: as tool
if claim is None:
    print("No item is available.")
else:
    owner, item = claim
    print("Claimed:", item)

If "no item available" is expected, return or wait explicitly. Reserve exceptions for actual startup failures that should be visible in job history or operational logs.

Design Rules

Use explicit return values for normal business outcomes. Let the agent handle statuses it can reasonably act on.

Use exceptions for unexpected failures. They are inspectable and counted by harness policies.

Use job error handlers for cleanup of durable external ownership, notifications, or failure marking. They are not a retry framework.

Use durable state for cleanup decisions. Status fields, owner IDs, idempotency keys, tags, and audit logs are more reliable than trying to infer recovery from an exception string.

Keep secrets out of exceptions and handler context that may appear in persisted history or operational views.

Where to go next

Read Jobs for lifecycle and history, Tools for tool errors and hidden tools, Harness Policies for exception limits, Futures for suspension behavior, Warmup Code for startup patterns, and Subtasks for delegated work.