SDK Resilience

TapPass should not become a single point of failure in your AI agent hot path. The SDK includes a resilience layer that handles TapPass outages gracefully.

from tappass import Agent
from tappass.resilience import ResiliencePolicy, FailMode

policy = ResiliencePolicy(
    mode=FailMode.FAIL_OPEN_CACHED,
    cache_ttl_seconds=300,
    max_offline_requests=100,
)
agent = Agent("https://tappass.example.com", "tp_...", resilience=policy)

With this policy, when TapPass is unreachable the agent serves the last-known-good cached response and queues audit entries locally.

| Mode | Behavior | Use case |
| --- | --- | --- |
| FAIL_CLOSED | Agent stops. Raises TapPassConnectionError. | Regulated environments where ungoverned operation is unacceptable. Default. |
| FAIL_OPEN_CACHED | Agent continues with last-known-good cached response. Audit entries queued locally. | Production agents that must remain available. |
| FAIL_OPEN_LOGGED | Agent continues ungoverned. Every call logged locally as DEGRADED. | Development and low-risk agents. |

  • Financial, healthcare, legal agents — Use FAIL_CLOSED. Better to stop than to operate without governance.
  • Customer support, internal tools — Use FAIL_OPEN_CACHED. Availability matters; cached policy is acceptable for short outages.
  • Development, testing — Use FAIL_OPEN_LOGGED. Maximum availability, post-incident audit is sufficient.

The SDK includes a circuit breaker that tracks consecutive failures to TapPass. This prevents request storms during outages.

CLOSED ──── requests flow normally
│ N consecutive failures (default: 3)
OPEN ────── use fallback (cache or fail-closed)
│ recovery timeout expires (default: 30s)
HALF_OPEN ── probe one request
├── success → CLOSED (reset)
└── failure → OPEN (restart timer)

ResiliencePolicy(
    mode=FailMode.FAIL_OPEN_CACHED,
    circuit_failure_threshold=3,    # failures before circuit opens
    circuit_recovery_timeout=30.0,  # seconds before probing recovery
)

| Parameter | Default | Env var | Description |
| --- | --- | --- | --- |
| circuit_failure_threshold | 3 | TAPPASS_CIRCUIT_FAILURE_THRESHOLD | Consecutive failures before the circuit opens |
| circuit_recovery_timeout | 30.0 | TAPPASS_CIRCUIT_RECOVERY_TIMEOUT | Seconds before transitioning from OPEN to HALF_OPEN |

status = agent.resilience.status()
# {
# "mode": "fail_open_cached",
# "circuit": {"state": "closed", "consecutive_failures": 0, ...},
# "cache_size": 12,
# "buffered_audit_entries": 0,
# ...
# }
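
The state diagram above can be read as a small state machine. The following is a minimal sketch of that behaviour using only the standard library; the class and method names (`CircuitBreaker`, `allow_request`, `record_success`, `record_failure`) and the injectable `clock` are illustrative assumptions, not the SDK's internals.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative consecutive-failure breaker (hypothetical, not the SDK class)."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timer can be tested
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0
        self._opened_at = None

    def allow_request(self):
        """Return True if the caller may send a request to the server."""
        if self.state is CircuitState.CLOSED:
            return True
        timed_out = (
            self.state is CircuitState.OPEN
            and self._clock() - self._opened_at >= self.recovery_timeout
        )
        if timed_out:
            self.state = CircuitState.HALF_OPEN
            return True  # this caller is the single probe
        return False  # OPEN and waiting, or HALF_OPEN with a probe in flight

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self.state = CircuitState.CLOSED
        self.consecutive_failures = 0

    def record_failure(self):
        """Open the circuit on a failed probe or once the threshold is reached."""
        self.consecutive_failures += 1
        failed_probe = self.state is CircuitState.HALF_OPEN
        if failed_probe or self.consecutive_failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()  # (re)start the recovery timer
```

While `allow_request()` returns False, a caller would go straight to the fallback for the configured fail mode instead of contacting the server.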

When TapPass is unreachable, the SDK writes audit entries to a local JSONL file. When connectivity recovers, the buffer is flushed to the server.

  1. Agent makes a governed call.
  2. TapPass is unreachable — circuit breaker is open.
  3. SDK serves a cached or degraded response.
  4. The audit entry is written to a local JSONL file (append-only, fsync per write).
  5. When TapPass recovers (circuit closes), buffered entries are flushed.

ResiliencePolicy(
    local_audit_path=".tappass_audit_buffer.jsonl",
    max_offline_requests=100,  # hard cap on degraded-mode calls (0 = unlimited)
)

| Parameter | Default | Env var | Description |
| --- | --- | --- | --- |
| local_audit_path | .tappass_audit_buffer.jsonl | TAPPASS_LOCAL_AUDIT_PATH | Path for the local audit buffer file |
| max_offline_requests | 100 | TAPPASS_MAX_OFFLINE_REQUESTS | Hard cap on calls allowed in degraded mode |
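
To make the write-then-flush cycle concrete, here is a rough sketch of an append-only JSONL buffer with an fsync per write. The function names `append_audit_entry` and `flush_buffer`, the `send` callback, and the choice to delete the file after a successful flush are assumptions for illustration; the SDK's actual buffer handling may differ.

```python
import json
import os


def append_audit_entry(path, entry):
    """Append one entry as a JSON line and fsync before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the line to disk, not just the OS cache


def flush_buffer(path, send):
    """Send every buffered entry through `send`, then remove the file.

    Returns the number of entries sent. If `send` raises, the file is left
    in place so nothing is lost (entries may be re-sent on the next attempt).
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    for entry in entries:
        send(entry)
    os.remove(path)
    return len(entries)
```

The fsync per write trades throughput for durability: an entry survives a process crash immediately after the call returns, which matters more than speed for an audit trail.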

The local audit buffer can be encrypted at rest. Set the TAPPASS_AUDIT_BUFFER_KEY environment variable to enable AES-256-GCM encryption:

export TAPPASS_AUDIT_BUFFER_KEY="your-secret-key-here"

Without this variable, the buffer is written as plaintext JSONL (the SDK logs a warning on first write).

The SDK retries transient errors up to max_retries times. Set it on the Agent constructor:

agent = Agent(
    "https://tappass.example.com",
    "tp_...",
    timeout=30,     # per-request timeout in seconds
    max_retries=2,  # retry on transient errors
)

Retries are triggered by:

  • Connection timeouts
  • HTTP 502, 503, 504 responses
  • Network errors

Retries are not triggered by:

  • Policy blocks (4xx responses)
  • Authentication failures
  • Successful responses with pipeline detections
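
The two lists above amount to a simple classification rule. The sketch below shows one way to express it; `is_retryable` and `call_with_retries` are hypothetical helpers, and it assumes that max_retries counts retries after the first attempt (so 2 means at most 3 attempts). Backoff between attempts is omitted because the page does not describe one.

```python
RETRYABLE_STATUS = {502, 503, 504}


def is_retryable(status=None, error=None):
    """Timeouts, network errors, and 502/503/504 are retryable; 4xx and 2xx are not."""
    if error is not None:
        return isinstance(error, (TimeoutError, ConnectionError))
    return status in RETRYABLE_STATUS


def call_with_retries(send, max_retries=2):
    """Call `send()` until it yields a non-retryable outcome or retries run out.

    `send` returns an HTTP status code or raises. The final status is
    returned; a final exception is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            status, error = send(), None
        except Exception as exc:  # classified below, re-raised if final
            status, error = None, exc
        if not is_retryable(status, error) or attempts > max_retries:
            if error is not None:
                raise error
            return status
```

A policy block (for example a 403) returns after a single attempt, since repeating the same request would be blocked again.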

The SDK caches successful TapPass responses for potential fallback during outages. This is a safety net, not a performance cache.

  • Cache key: SHA-256(model + last_user_message[:500])
  • Max entries: 50 (LRU eviction)
  • TTL: 300 seconds (configurable via cache_ttl_seconds)
  • When served: Only when TapPass is unreachable AND mode is FAIL_OPEN_CACHED
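
As an illustration of the parameters listed above, the following sketch combines the SHA-256 key, LRU eviction, and TTL expiry in one small class. `FallbackCache`, `cache_key`, plain string concatenation, and UTF-8 encoding are assumptions; the page does not specify how the SDK joins or encodes the key material.

```python
import hashlib
import time
from collections import OrderedDict


def cache_key(model, last_user_message):
    """SHA-256 over the model name plus the first 500 characters of the message."""
    material = model + last_user_message[:500]
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class FallbackCache:
    """Illustrative LRU cache with a TTL (hypothetical, not the SDK class)."""

    def __init__(self, max_entries=50, ttl_seconds=300, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    def put(self, model, message, response):
        key = cache_key(model, message)
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, model, message):
        key = cache_key(model, message)
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired entries are never served
            return None
        self._entries.move_to_end(key)
        return response
```

Because only the first 500 characters of the last user message feed the key, two long prompts that share that prefix map to the same entry, which is acceptable for a fallback but is one reason not to treat this as a performance cache.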

When the SDK serves a cached or degraded response, it adds metadata:

response = agent.chat("Analyze Q3 revenue")
if response.degraded:
    print(response.degraded_reason)
    # "TapPass unreachable — ConnectionError (serving from cache)"

The response includes:

| Field | Description |
| --- | --- |
| response.degraded | True if served from cache or in degraded mode |
| response.degraded_reason | Human-readable explanation |
| response.tappass.cache_age_seconds | How old the cached response is |
| response.tappass.classification | "DEGRADED" for ungoverned responses |

All resilience settings can be configured via environment variables (useful for containerized deployments):

export TAPPASS_FAIL_MODE=fail_open_cached
export TAPPASS_CACHE_TTL=300
export TAPPASS_MAX_OFFLINE_REQUESTS=100
export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl
export TAPPASS_CIRCUIT_FAILURE_THRESHOLD=3
export TAPPASS_CIRCUIT_RECOVERY_TIMEOUT=30
export TAPPASS_AUDIT_BUFFER_KEY=your-encryption-key

Then load from environment:

from tappass import Agent
from tappass.resilience import ResiliencePolicy

agent = Agent(
    "https://tappass.example.com",
    "tp_...",
    resilience=ResiliencePolicy.from_env(),
)
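
For readers curious what loading from the environment involves, here is a standalone sketch that maps the TAPPASS_* variables listed above onto typed settings. It does not reproduce `ResiliencePolicy.from_env()`; `EnvResilienceSettings` and `settings_from_env` are hypothetical names, and the string `"fail_closed"` is an assumed spelling of the default mode.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvResilienceSettings:
    """Hypothetical container; defaults follow the tables on this page."""

    fail_mode: str = "fail_closed"  # assumed string form of FailMode.FAIL_CLOSED
    cache_ttl_seconds: int = 300
    max_offline_requests: int = 100
    local_audit_path: str = ".tappass_audit_buffer.jsonl"
    circuit_failure_threshold: int = 3
    circuit_recovery_timeout: float = 30.0


def settings_from_env(env=None):
    """Build settings from TAPPASS_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    d = EnvResilienceSettings()
    return EnvResilienceSettings(
        fail_mode=env.get("TAPPASS_FAIL_MODE", d.fail_mode).lower(),
        cache_ttl_seconds=int(env.get("TAPPASS_CACHE_TTL", d.cache_ttl_seconds)),
        max_offline_requests=int(
            env.get("TAPPASS_MAX_OFFLINE_REQUESTS", d.max_offline_requests)
        ),
        local_audit_path=env.get("TAPPASS_LOCAL_AUDIT_PATH", d.local_audit_path),
        circuit_failure_threshold=int(
            env.get("TAPPASS_CIRCUIT_FAILURE_THRESHOLD", d.circuit_failure_threshold)
        ),
        circuit_recovery_timeout=float(
            env.get("TAPPASS_CIRCUIT_RECOVERY_TIMEOUT", d.circuit_recovery_timeout)
        ),
    )
```

Passing a plain dict as `env` makes the parsing easy to check in a unit test without touching the real process environment.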

Agent Request
[1] TapPass Primary (circuit breaker: 3 failures → open)
│ ↓ timeout / connection error
[2] SDK Local Cache (last-known-good response, configurable TTL)
│ ↓ cache miss or expired
[3] Fail-Closed (raise TapPassConnectionError)

[Audit] All degraded calls logged locally, synced on recovery
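
The fallback chain above can be sketched as a single function. This is an illustration under assumptions, not the SDK's code: `governed_call` and its three callbacks are hypothetical, the exception class is a local stand-in for the SDK's TapPassConnectionError, and the sketch logs only calls that are actually served from cache.

```python
class TapPassConnectionError(Exception):
    """Local stand-in for the SDK exception of the same name."""


def governed_call(request, call_primary, cache_lookup, log_degraded):
    """Try the primary server, then the local cache, then fail closed."""
    try:
        # [1] Primary path: the normal governed call.
        return {"body": call_primary(request), "degraded": False}
    except (TimeoutError, ConnectionError) as exc:
        # [2] Local cache: last-known-good response, if one is still valid.
        cached = cache_lookup(request)
        if cached is None:
            # [3] Cache miss or expired entry: fail closed.
            raise TapPassConnectionError(
                "TapPass unreachable and no cached response"
            ) from exc
        # [Audit] Record the degraded call locally for later sync.
        log_degraded(request)
        return {"body": cached, "degraded": True}
```

Keeping the three stages as injected callables makes each one replaceable in tests, for example by passing a primary that always raises ConnectionError.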