SDK Resilience

TapPass should not become a single point of failure in your AI agent hot path. The SDK includes a resilience layer that handles TapPass outages gracefully.

Quick start

from tappass import Agent
from tappass.resilience import ResiliencePolicy, FailMode

policy = ResiliencePolicy(
    mode=FailMode.FAIL_OPEN_CACHED,
    cache_ttl_seconds=300,
    max_offline_requests=100,
)

agent = Agent("https://tappass.example.com", "tp_...", resilience=policy)

When TapPass is unreachable, the agent continues by serving the last-known-good cached response and queues audit entries locally.

Fail modes

Mode	Behavior	Use case
`FAIL_CLOSED`	Agent stops. Raises `TapPassConnectionError`.	Regulated environments where ungoverned operation is unacceptable. Default.
`FAIL_OPEN_CACHED`	Agent continues with last-known-good cached response. Audit entries queued locally.	Production agents that must remain available.
`FAIL_OPEN_LOGGED`	Agent continues ungoverned. Every call logged locally as `DEGRADED`.	Development and low-risk agents.

Choosing the right mode

Financial, healthcare, legal agents: use FAIL_CLOSED. It is better to stop than to operate without governance.
Customer support, internal tools: use FAIL_OPEN_CACHED. Availability matters, and a cached response is acceptable for short outages.
Development, testing: use FAIL_OPEN_LOGGED. This gives maximum availability; a post-incident audit is sufficient.

Circuit breaker

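The SDK includes a circuit breaker that tracks consecutive failed requests to TapPass. Once the circuit opens, the SDK stops sending requests to TapPass until the recovery timeout expires, which prevents request storms during outages.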

States

CLOSED ──── requests flow normally
   │
   │  N consecutive failures (default: 3)
   ▼
OPEN ────── use fallback (cache or fail-closed)
   │
   │  recovery timeout expires (default: 30s)
   ▼
HALF_OPEN ── probe one request
   │
   ├── success → CLOSED (reset)
   └── failure → OPEN (restart timer)
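
The state machine above can be sketched in a few lines of plain Python. This is an illustration only, not the SDK's internal implementation: the class name `CircuitBreaker`, its methods, and the injectable `clock` argument are all assumptions made for the example. Only the three states and the two defaults (3 failures, 30 seconds) come from the diagram.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not the SDK's class)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timer can be driven in tests
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Return True if a request may be sent to the server right now."""
        if self.state is CircuitState.CLOSED:
            return True
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # let exactly one probe through
                return True
            return False
        return False  # HALF_OPEN: a probe is already outstanding

    def record_success(self):
        """A success (including a successful probe) closes the circuit."""
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0

    def record_failure(self):
        """Open on a failed probe or once the failure threshold is reached."""
        self.consecutive_failures += 1
        if (self.state is CircuitState.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()  # restart the recovery timer
```

A caller would check `allow_request()` before each call and use the fallback path whenever it returns False. The sketch is single-threaded; a real implementation would need a lock around the state transitions.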

Configuration

ResiliencePolicy(
    mode=FailMode.FAIL_OPEN_CACHED,
    circuit_failure_threshold=3,     # failures before circuit opens
    circuit_recovery_timeout=30.0,   # seconds before probing recovery
)

Parameter	Default	Env var	Description
`circuit_failure_threshold`	3	`TAPPASS_CIRCUIT_FAILURE_THRESHOLD`	Consecutive failures before the circuit opens
`circuit_recovery_timeout`	30.0	`TAPPASS_CIRCUIT_RECOVERY_TIMEOUT`	Seconds before transitioning from OPEN to HALF_OPEN

Checking circuit state

status = agent.resilience.status()
# {
#   "mode": "fail_open_cached",
#   "circuit": {"state": "closed", "consecutive_failures": 0, ...},
#   "cache_size": 12,
#   "buffered_audit_entries": 0,
#   ...
# }

Local audit buffer

When TapPass is unreachable, the SDK writes audit entries to a local JSONL file. When connectivity recovers, the buffer is flushed to the server.

How it works

Agent makes a governed call.
TapPass is unreachable — circuit breaker is open.
SDK serves a cached or degraded response.
The audit entry is written to a local JSONL file (append-only, fsync per write).
When TapPass recovers (circuit closes), buffered entries are flushed.
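
The write and flush steps above might look roughly like the following. This is a minimal sketch under stated assumptions: the function names `append_audit_entry` and `flush_buffer`, the `send` callback, and the choice to delete the file only after every entry is sent are all hypothetical. Only the append-only JSONL format and the fsync per write come from the description above.

```python
import json
import os


def append_audit_entry(path, entry):
    """Append one entry as a JSON line and fsync so it survives a crash."""
    line = json.dumps(entry, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to disk


def flush_buffer(path, send):
    """Send every buffered entry via `send`; remove the file only if all succeed."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    for entry in entries:
        send(entry)  # an exception here leaves the buffer intact for a later retry
    os.remove(path)
    return len(entries)
```

With this design a failure partway through a flush causes already-sent entries to be sent again on the next attempt, so the receiving side would need to tolerate duplicates.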

Configuration

ResiliencePolicy(
    local_audit_path=".tappass_audit_buffer.jsonl",
    max_offline_requests=100,  # hard cap on degraded-mode calls (0 = unlimited)
)

Parameter	Default	Env var	Description
`local_audit_path`	`.tappass_audit_buffer.jsonl`	`TAPPASS_LOCAL_AUDIT_PATH`	Path for the local audit buffer file
`max_offline_requests`	100	`TAPPASS_MAX_OFFLINE_REQUESTS`	Hard cap on calls allowed in degraded mode (0 = unlimited)

Encryption

The local audit buffer can be encrypted at rest. Set the TAPPASS_AUDIT_BUFFER_KEY environment variable to enable AES-256-GCM encryption:

export TAPPASS_AUDIT_BUFFER_KEY="your-secret-key-here"

Without this variable, the buffer is written as plaintext JSONL (the SDK logs a warning on first write).

Retry strategies

The max_retries parameter on the Agent constructor sets how many times the SDK retries a request after a transient error:

agent = Agent(
    "https://tappass.example.com",
    "tp_...",
    timeout=30,       # per-request timeout in seconds
    max_retries=2,    # retry on transient errors
)

Retries are triggered by:

Connection timeouts
HTTP 502, 503, 504 responses
Network errors

Retries are not triggered by:

Policy blocks (4xx responses)
Authentication failures
Successful responses with pipeline detections
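
The two lists above amount to a small decision rule, sketched here in plain Python. The names `RETRYABLE_STATUS` and `call_with_retries` are hypothetical, and `send` stands in for whatever performs the HTTP call and returns a status code. The sketch retries immediately; whether the SDK waits between attempts is not stated here, so no delay is modeled.

```python
RETRYABLE_STATUS = {502, 503, 504}


def call_with_retries(send, max_retries=2):
    """Call `send()` up to 1 + max_retries times and return the final status.

    Retries on network/timeout errors and on 502/503/504 only. Any other
    status (2xx, or a 4xx policy block or auth failure) is returned at once.
    """
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        last = attempt == attempts
        try:
            status = send()
        except (ConnectionError, TimeoutError):
            if last:
                raise  # out of retries: surface the transient error
            continue
        if status in RETRYABLE_STATUS and not last:
            continue
        return status
```

For example, a call that returns 503 twice and then 200 yields 200 after three attempts with `max_retries=2`, while a 403 is returned after a single attempt.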

Cache behavior

The SDK caches successful TapPass responses for potential fallback during outages. This is a safety net, not a performance cache.

How caching works

Cache key: SHA-256(model + last_user_message[:500])
Max entries: 50 (LRU eviction)
TTL: 300 seconds (configurable via cache_ttl_seconds)
When served: Only when TapPass is unreachable AND mode is FAIL_OPEN_CACHED
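
As a rough illustration of the rules above, the following sketch builds the cache key and a small LRU cache with a TTL. It is not the SDK's code. The class name `FallbackCache` is hypothetical, the key is assumed to be a plain concatenation of the model name and the first 500 characters of the message, and the real SDK may join or truncate the fields differently.

```python
import hashlib
import time
from collections import OrderedDict


def cache_key(model, last_user_message):
    """SHA-256 over the model name plus the first 500 characters of the message."""
    material = model + last_user_message[:500]  # assumed concatenation, no separator
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class FallbackCache:
    """LRU cache with per-entry TTL (hypothetical stand-in for the SDK cache)."""

    def __init__(self, max_entries=50, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    def put(self, key, response):
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired entries are never served
            return None
        self._entries.move_to_end(key)
        return response
```

Because the key covers only the model and the truncated last user message, two conversations that end with the same message map to the same entry, which is consistent with treating the cache as a safety net instead of a performance cache.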

Degraded responses

When the SDK serves a cached or degraded response, it adds metadata:

response = agent.chat("Analyze Q3 revenue")

if response.degraded:
    print(response.degraded_reason)
    # "TapPass unreachable — ConnectionError (serving from cache)"

The response includes:

Field	Description
`response.degraded`	`True` if served from cache or in degraded mode
`response.degraded_reason`	Human-readable explanation
`response.tappass.cache_age_seconds`	How old the cached response is
`response.tappass.classification`	`"DEGRADED"` for ungoverned responses

Environment variable configuration

All resilience settings can be configured via environment variables (useful for containerized deployments):

export TAPPASS_FAIL_MODE=fail_open_cached
export TAPPASS_CACHE_TTL=300
export TAPPASS_MAX_OFFLINE_REQUESTS=100
export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl
export TAPPASS_CIRCUIT_FAILURE_THRESHOLD=3
export TAPPASS_CIRCUIT_RECOVERY_TIMEOUT=30
export TAPPASS_AUDIT_BUFFER_KEY=your-encryption-key

Then load the policy from the environment:

from tappass import Agent
from tappass.resilience import ResiliencePolicy

agent = Agent(
    "https://tappass.example.com",
    "tp_...",
    resilience=ResiliencePolicy.from_env(),
)
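
To make the mapping from variables to settings concrete, here is a sketch of how such a loader could parse the environment. It does not reproduce `ResiliencePolicy.from_env()`: the `ResilienceSettings` dataclass and `settings_from_env` function are hypothetical stand-ins, and the string `"fail_closed"` for the default mode is assumed by analogy with `"fail_open_cached"`. The variable names and default values are taken from the tables and listing above.

```python
import os
from dataclasses import dataclass


@dataclass
class ResilienceSettings:
    """Hypothetical stand-in, not tappass.resilience.ResiliencePolicy."""
    mode: str = "fail_closed"  # assumed spelling of the default mode
    cache_ttl_seconds: int = 300
    max_offline_requests: int = 100
    local_audit_path: str = ".tappass_audit_buffer.jsonl"
    circuit_failure_threshold: int = 3
    circuit_recovery_timeout: float = 30.0


def settings_from_env(environ=None):
    """Build settings from TAPPASS_* variables, falling back to the defaults."""
    env = os.environ if environ is None else environ
    d = ResilienceSettings()
    return ResilienceSettings(
        mode=env.get("TAPPASS_FAIL_MODE", d.mode).lower(),
        cache_ttl_seconds=int(env.get("TAPPASS_CACHE_TTL", d.cache_ttl_seconds)),
        max_offline_requests=int(
            env.get("TAPPASS_MAX_OFFLINE_REQUESTS", d.max_offline_requests)),
        local_audit_path=env.get("TAPPASS_LOCAL_AUDIT_PATH", d.local_audit_path),
        circuit_failure_threshold=int(
            env.get("TAPPASS_CIRCUIT_FAILURE_THRESHOLD", d.circuit_failure_threshold)),
        circuit_recovery_timeout=float(
            env.get("TAPPASS_CIRCUIT_RECOVERY_TIMEOUT", d.circuit_recovery_timeout)),
    )
```

Passing a plain dict as `environ` makes the parsing easy to exercise without touching the process environment.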

Architecture diagram

Agent Request
    │
    ▼
[1] TapPass Primary (circuit breaker: 3 failures → open)
    │  ↓ timeout / connection error
    ▼
[2] SDK Local Cache (last-known-good response, configurable TTL)
    │  ↓ cache miss or expired
    ▼
[3] Fail-Closed (raise TapPassConnectionError)
    │
    ▼
[Audit] All degraded calls logged locally, synced on recovery
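
The chain in the diagram can be read as a single function, sketched below for the FAIL_OPEN_CACHED case. Everything here is illustrative: `resolve`, its arguments, and the audit entry fields are hypothetical, and the `TapPassConnectionError` class is a local stand-in for the SDK exception of the same name.

```python
class TapPassConnectionError(Exception):
    """Local stand-in for the SDK exception of the same name."""


def resolve(call_primary, cached_response, circuit_open, audit_log):
    """Walk the chain: [1] primary, then [2] local cache, then [3] fail closed.

    Returns (response, degraded). Degraded calls are recorded in `audit_log`.
    """
    error = None
    if not circuit_open:
        try:
            return call_primary(), False  # [1] normal governed path
        except (ConnectionError, TimeoutError) as exc:
            error = exc
    if cached_response is not None:
        # [2] last-known-good response; record the degraded call locally
        audit_log.append({"classification": "DEGRADED", "source": "cache"})
        return cached_response, True
    # [3] nothing safe to serve
    raise TapPassConnectionError("TapPass unreachable and no cached response") from error
```

When the circuit is open the primary is skipped entirely, which is what keeps an outage from turning into a stream of doomed requests.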