SDK Resilience
TapPass should not become a single point of failure in your AI agent hot path. The SDK includes a resilience layer that handles TapPass outages gracefully.
Quick start
    from tappass import Agent
    from tappass.resilience import ResiliencePolicy, FailMode

    policy = ResiliencePolicy(
        mode=FailMode.FAIL_OPEN_CACHED,
        cache_ttl_seconds=300,
        max_offline_requests=100,
    )

    agent = Agent("https://tappass.example.com", "tp_...", resilience=policy)

When TapPass is unreachable, the agent continues with the last-known-good policy and queues audit entries locally.
Fail modes
| Mode | Behavior | Use case |
|---|---|---|
| FAIL_CLOSED | Agent stops. Raises TapPassConnectionError. | Regulated environments where ungoverned operation is unacceptable. Default. |
| FAIL_OPEN_CACHED | Agent continues with last-known-good cached response. Audit entries queued locally. | Production agents that must remain available. |
| FAIL_OPEN_LOGGED | Agent continues ungoverned. Every call logged locally as DEGRADED. | Development and low-risk agents. |
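To make the difference between the three modes concrete, here is a minimal sketch of how one call could be handled during an outage. It is not the SDK's code: `FailMode` and `TapPassConnectionError` are local stand-ins for the SDK names, the enum values other than `fail_open_cached` are guesses, and `handle_outage`, `call_ungoverned`, and the audit entry fields are hypothetical.

```python
from enum import Enum


class FailMode(Enum):
    # Stand-in for the SDK's FailMode; only "fail_open_cached" appears in the docs.
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN_CACHED = "fail_open_cached"
    FAIL_OPEN_LOGGED = "fail_open_logged"


class TapPassConnectionError(Exception):
    """Stand-in for the SDK exception of the same name."""


def handle_outage(mode, cached_response, call_ungoverned, audit_log):
    """Decide what to do with one call while TapPass is unreachable."""
    if mode is FailMode.FAIL_CLOSED:
        raise TapPassConnectionError("TapPass unreachable; refusing to run ungoverned")
    if mode is FailMode.FAIL_OPEN_CACHED:
        if cached_response is None:
            # Nothing known-good to serve, so behave like FAIL_CLOSED.
            raise TapPassConnectionError("TapPass unreachable and no cached response")
        audit_log.append({"event": "served_from_cache"})  # hypothetical entry shape
        return cached_response
    # FAIL_OPEN_LOGGED: proceed ungoverned, but record the call as DEGRADED.
    audit_log.append({"classification": "DEGRADED"})
    return call_ungoverned()
```

The cache-miss branch follows the architecture diagram at the end of this page, where a miss falls through to fail-closed.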
Choosing the right mode
- Financial, healthcare, legal agents: Use FAIL_CLOSED. Better to stop than to operate without governance.
- Customer support, internal tools: Use FAIL_OPEN_CACHED. Availability matters; cached policy is acceptable for short outages.
- Development, testing: Use FAIL_OPEN_LOGGED. Maximum availability, post-incident audit is sufficient.
Circuit breaker
The SDK includes a circuit breaker that tracks consecutive failed requests to TapPass. Once the circuit opens, calls go straight to the fallback instead of waiting on the server, which prevents request storms during outages.
States
    CLOSED ──── requests flow normally
      │
      │  N consecutive failures (default: 3)
      ▼
    OPEN ────── use fallback (cache or fail-closed)
      │
      │  recovery timeout expires (default: 30s)
      ▼
    HALF_OPEN ── probe one request
      │
      ├── success → CLOSED (reset)
      └── failure → OPEN (restart timer)

Configuration

    ResiliencePolicy(
        mode=FailMode.FAIL_OPEN_CACHED,
        circuit_failure_threshold=3,    # failures before circuit opens
        circuit_recovery_timeout=30.0,  # seconds before probing recovery
    )

| Parameter | Default | Env var | Description |
|---|---|---|---|
| circuit_failure_threshold | 3 | TAPPASS_CIRCUIT_FAILURE_THRESHOLD | Consecutive failures before the circuit opens |
| circuit_recovery_timeout | 30.0 | TAPPASS_CIRCUIT_RECOVERY_TIMEOUT | Seconds before transitioning from OPEN to HALF_OPEN |
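The state machine and the two parameters above can be sketched as a small class. This is an illustrative reimplementation under simple assumptions (single-threaded use, a monotonic clock), not the SDK's internal breaker; the class and method names are hypothetical.

```python
import time


class CircuitBreaker:
    """Illustrative three-state breaker; not the SDK's implementation."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self.state = "closed"
        self.consecutive_failures = 0
        self._opened_at = None

    def allow_request(self):
        """Return True if a request may be sent to the server right now."""
        if self.state == "closed":
            return True
        if self.state == "half_open":
            return False  # a probe is already in flight; wait for its result
        # state == "open": allow a single probe once the recovery timeout expires
        if self._clock() - self._opened_at >= self.recovery_timeout:
            self.state = "half_open"
            return True
        return False

    def record_success(self):
        """A request succeeded: close the circuit and reset the counter."""
        self.state = "closed"
        self.consecutive_failures = 0

    def record_failure(self):
        """A request failed: open on a failed probe or when the threshold is hit."""
        self.consecutive_failures += 1
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()  # start or restart the recovery timer
```

A caller would check `allow_request()` before each call to the server, use the fallback when it returns False, and report the outcome with `record_success()` or `record_failure()`.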
Checking circuit state
    status = agent.resilience.status()
    # {
    #   "mode": "fail_open_cached",
    #   "circuit": {"state": "closed", "consecutive_failures": 0, ...},
    #   "cache_size": 12,
    #   "buffered_audit_entries": 0,
    #   ...
    # }

Local audit buffer
When TapPass is unreachable, the SDK writes audit entries to a local JSONL file. When connectivity recovers, the buffer is flushed to the server.
How it works
- Agent makes a governed call.
- TapPass is unreachable — circuit breaker is open.
- SDK serves a cached or degraded response.
- The audit entry is written to a local JSONL file (append-only, fsync per write).
- When TapPass recovers (circuit closes), buffered entries are flushed.
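The write and flush steps above can be sketched with the standard library alone. This is an assumption-laden illustration, not the SDK's buffer: the function names, the `send` callback, and the choice to delete the file only after every entry has been sent are all hypothetical.

```python
import json
import os


def append_audit_entry(path, entry):
    """Append one entry as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())  # fsync per write, as described above


def flush_audit_buffer(path, send):
    """Send every buffered entry via send(); return how many were sent."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    for entry in entries:
        send(entry)  # an exception here leaves the buffer file in place
    os.remove(path)  # only reached once every entry was accepted
    return len(entries)
```

Because the file is removed only after all sends succeed, a failed flush can resend entries that were already delivered, so the receiving side would need to tolerate duplicates in a design like this.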
Configuration
    ResiliencePolicy(
        local_audit_path=".tappass_audit_buffer.jsonl",
        max_offline_requests=100,  # hard cap on degraded-mode calls (0 = unlimited)
    )

| Parameter | Default | Env var | Description |
|---|---|---|---|
| local_audit_path | .tappass_audit_buffer.jsonl | TAPPASS_LOCAL_AUDIT_PATH | Path for the local audit buffer file |
| max_offline_requests | 100 | TAPPASS_MAX_OFFLINE_REQUESTS | Hard cap on calls allowed in degraded mode (0 = unlimited) |
Encryption
The local audit buffer can be encrypted at rest. Set the TAPPASS_AUDIT_BUFFER_KEY environment variable to enable AES-256-GCM encryption:

    export TAPPASS_AUDIT_BUFFER_KEY="your-secret-key-here"

Without this variable, the buffer is written as plaintext JSONL (the SDK logs a warning on first write).
Retry strategies
The max_retries parameter on the Agent constructor controls how many times the SDK retries a request after a transient error:

    agent = Agent(
        "https://tappass.example.com",
        "tp_...",
        timeout=30,     # per-request timeout in seconds
        max_retries=2,  # retry on transient errors
    )

Retries are triggered by:
- Connection timeouts
- HTTP 502, 503, 504 responses
- Network errors
Retries are not triggered by:
- Policy blocks (4xx responses)
- Authentication failures
- Successful responses with pipeline detections
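The two lists above amount to a small predicate plus a bounded loop. The sketch below is a hypothetical illustration, not the SDK's retry code: `is_retryable`, `call_with_retries`, the `(status, body)` return shape, and the optional exponential backoff are assumptions.

```python
import time

RETRYABLE_STATUS = {502, 503, 504}


def is_retryable(status=None, error=None):
    """True for the transient failures listed above, False otherwise."""
    if error is not None:
        # TimeoutError and ConnectionError are both subclasses of OSError.
        return isinstance(error, OSError)
    return status in RETRYABLE_STATUS


def call_with_retries(send, max_retries=2, backoff_seconds=0.0):
    """Call send() up to 1 + max_retries times; send() returns (status, body)."""
    for attempt in range(max_retries + 1):
        try:
            status, body = send()
        except Exception as exc:
            if attempt == max_retries or not is_retryable(error=exc):
                raise
        else:
            if attempt == max_retries or not is_retryable(status=status):
                return status, body
        # Backoff between attempts is an assumption; the docs do not describe one.
        time.sleep(backoff_seconds * (2 ** attempt))
```

A policy block such as a 403 is returned on the first attempt, while a 503 is retried until the attempts run out and the last response is returned.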
Cache behavior
The SDK caches successful TapPass responses for potential fallback during outages. This is a safety net, not a performance cache.
How caching works
- Cache key: SHA-256(model + last_user_message[:500])
- Max entries: 50 (LRU eviction)
- TTL: 300 seconds (configurable via cache_ttl_seconds)
- When served: Only when TapPass is unreachable AND mode is FAIL_OPEN_CACHED
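As a rough sketch of the rules above, the key derivation and an LRU cache with a TTL fit in a few lines of standard-library Python. The details are assumptions rather than the SDK's implementation: the model and message are concatenated without a separator, the 500-character slice is taken before UTF-8 encoding, and `FallbackCache` is a hypothetical name.

```python
import hashlib
import time
from collections import OrderedDict


def cache_key(model, last_user_message):
    """SHA-256 over the model name plus the first 500 characters of the message."""
    return hashlib.sha256((model + last_user_message[:500]).encode("utf-8")).hexdigest()


class FallbackCache:
    """LRU cache with a per-entry TTL; illustrative only."""

    def __init__(self, max_entries=50, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    def put(self, key, response):
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)
        return response
```

One consequence of truncating the message at 500 characters is that two long prompts sharing the same first 500 characters map to the same key, which is one reason to treat the cache as an outage fallback only.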
Degraded responses
When the SDK serves a cached or degraded response, it adds metadata:

    response = agent.chat("Analyze Q3 revenue")

    if response.degraded:
        print(response.degraded_reason)
        # "TapPass unreachable — ConnectionError (serving from cache)"

The response includes:
| Field | Description |
|---|---|
| response.degraded | True if served from cache or in degraded mode |
| response.degraded_reason | Human-readable explanation |
| response.tappass.cache_age_seconds | How old the cached response is, in seconds |
| response.tappass.classification | "DEGRADED" for ungoverned responses |
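One possible way for application code to use these fields is a small gate that decides whether to act on a response. The sketch below is hypothetical: `accept_response` and its `max_cache_age_seconds` threshold are not part of the SDK, and the `SimpleNamespace` objects only imitate the field names from the table so the example runs without a TapPass server.

```python
from types import SimpleNamespace


def accept_response(response, max_cache_age_seconds=60):
    """Return True if the caller should act on this response."""
    if not response.degraded:
        return True  # governed normally
    meta = response.tappass
    if getattr(meta, "classification", None) == "DEGRADED":
        return False  # ungoverned output; this sketch chooses not to act on it
    # Served from cache: accept only if the cached response is recent enough.
    return getattr(meta, "cache_age_seconds", float("inf")) <= max_cache_age_seconds


# Stand-in objects; a real response would come from agent.chat().
fresh = SimpleNamespace(degraded=False, degraded_reason=None, tappass=SimpleNamespace())
stale = SimpleNamespace(
    degraded=True,
    degraded_reason="TapPass unreachable",
    tappass=SimpleNamespace(cache_age_seconds=240, classification=None),
)
```

With the default threshold, `fresh` is accepted and `stale` is rejected; raising the threshold to 300 seconds would accept both.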
Environment variable configuration
All resilience settings can be configured via environment variables (useful for containerized deployments):

    export TAPPASS_FAIL_MODE=fail_open_cached
    export TAPPASS_CACHE_TTL=300
    export TAPPASS_MAX_OFFLINE_REQUESTS=100
    export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl
    export TAPPASS_CIRCUIT_FAILURE_THRESHOLD=3
    export TAPPASS_CIRCUIT_RECOVERY_TIMEOUT=30
    export TAPPASS_AUDIT_BUFFER_KEY=your-encryption-key

Then load from environment:

    from tappass import Agent
    from tappass.resilience import ResiliencePolicy

    agent = Agent(
        "https://tappass.example.com",
        "tp_...",
        resilience=ResiliencePolicy.from_env(),
    )

Architecture diagram
    Agent Request
        │
        ▼
    [1] TapPass Primary (circuit breaker: 3 failures → open)
        │  ↓ timeout / connection error
        ▼
    [2] SDK Local Cache (last-known-good response, configurable TTL)
        │  ↓ cache miss or expired
        ▼
    [3] Fail-Closed (raise TapPassConnectionError)
        │
        ▼
    [Audit] All degraded calls logged locally, synced on recovery
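Read top to bottom, the diagram describes a fallback chain. A minimal sketch of that control flow follows, assuming the primary call signals an outage with `ConnectionError` or `TimeoutError`. The function `governed_call`, its callback parameters, and the audit entry fields are hypothetical, and `TapPassConnectionError` is a local stand-in for the SDK exception.

```python
class TapPassConnectionError(Exception):
    """Stand-in for the SDK exception named in the diagram."""


def governed_call(request, call_primary, cache_lookup, audit_buffer):
    """Walk the chain: [1] primary, [2] local cache, [3] fail closed."""
    try:
        return call_primary(request)  # [1] normal, governed path
    except (ConnectionError, TimeoutError) as exc:
        reason = type(exc).__name__

    cached = cache_lookup(request)  # [2] last-known-good response, or None
    if cached is None:
        # [3] nothing safe to serve: record the attempt, then fail closed
        audit_buffer.append({"request": request, "outcome": "failed_closed", "reason": reason})
        raise TapPassConnectionError(f"TapPass unreachable ({reason}), no cached response")

    # [Audit] every degraded call is recorded locally for later sync
    audit_buffer.append({"request": request, "outcome": "served_from_cache", "reason": reason})
    return cached
```

In this sketch the audit entry is written before the response is returned or the error is raised, so a crash immediately afterwards still leaves a record of the degraded call.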