AegisAI
Design invariants

The four properties everything else is downstream of.

Deterministic

Same identity, same intent, same context, same data — same response. No probabilistic security decisions.

Defense in depth

Seven independent layers can each deny. A single bypass cannot leak data on its own.

Identity-propagating

The JWT's sub and tenant_id flow all the way into the SAP / IAM calls. No service-account sleight of hand.

Fail-closed by default

An infrastructure outage (Redis, audit DB, JWKS) denies requests rather than serving them unchecked. Degraded mode requires an explicit environment opt-in and is intended for non-production.

Request pipeline

Nine stages. Any one of them can deny.

1. Rate-limit middleware

Per-user and per-tenant fixed-window counters in Redis. 429 with accurate Retry-After. Fails open on Redis hiccup — the trust system is the real gate.
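A minimal sketch of a fixed-window counter with a Retry-After value, under stated assumptions: the class name, the key format, and the in-memory dict are all hypothetical stand-ins for the Redis-backed counters described above, and expired windows are never evicted here.

```python
import math
import time
from typing import Dict, Optional, Tuple


class FixedWindowLimiter:
    """In-memory stand-in for a Redis fixed-window counter (illustrative only)."""

    def __init__(self, limit: int, window_s: int = 60) -> None:
        self.limit = limit
        self.window_s = window_s
        # (key, window index) -> request count; old windows are not evicted here.
        self._counts: Dict[Tuple[str, int], int] = {}

    def check(self, key: str, now: Optional[float] = None) -> Tuple[bool, int]:
        """Count one request and return (allowed, retry_after_seconds)."""
        now = time.time() if now is None else now
        window = int(now // self.window_s)
        bucket = (key, window)
        self._counts[bucket] = self._counts.get(bucket, 0) + 1
        if self._counts[bucket] <= self.limit:
            return True, 0
        # Retry-After is the time left until the current window rolls over.
        window_end = (window + 1) * self.window_s
        return False, max(1, math.ceil(window_end - now))
```

Per-user and per-tenant limits would be two calls with different keys (for example "user:alice" and "tenant:acme"), denying with 429 if either call returns False.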

2. Request-ceiling middleware

ASGI-level caps on body size (default 64 KB), URL length (default 2 KB), and per-request wall time (default 10 s). 413 / 414 / 504 before the app runs.
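The ceiling checks can be sketched as a single decision function. This is an illustration only: the function name, the check order, and the reading of "KB" as 1024 bytes are assumptions, and a real ASGI middleware would enforce the wall-time cap with a timeout around the app call rather than an elapsed-time argument.

```python
from typing import Optional

MAX_BODY_BYTES = 64 * 1024   # default body cap from the text (KB assumed to be 1024 bytes)
MAX_URL_BYTES = 2 * 1024     # default URL cap from the text
MAX_WALL_SECONDS = 10.0      # default per-request wall time from the text


def ceiling_status(url: str, content_length: int, elapsed_s: float = 0.0) -> Optional[int]:
    """Return the HTTP status to reject with, or None if the request is within limits."""
    if len(url.encode("utf-8")) > MAX_URL_BYTES:
        return 414  # URI Too Long
    if content_length > MAX_BODY_BYTES:
        return 413  # Content Too Large
    if elapsed_s > MAX_WALL_SECONDS:
        return 504  # Gateway Timeout
    return None
```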

3. Authenticate

JWT verification with HS256 / RS256 / ES256 / PS256. JWKS rotation cached process-wide. iss, aud, exp, nbf, signature, and configured leeway all enforced.

4. Authorise

SAP BAPI_USER_GET_DETAIL walks ACTIVITYGROUPS and PROFILES; per-profile auth-objects fetched via SUSR_GET_PROFILE_AUTH_OBJECTS. Cross-cloud IAM (AWS / Azure / GCP) checked for any system the request touches.

5. Adaptive trust

Tenant-scoped Redis ledger. Frequency, scope-expansion, coverage-growth, cross-user coordination, revisit ratio. Trust score → trust level → rate-limit band and request restrictions.

6. Policy

Priority-weighted expression set evaluated by a safe AST whitelist. Rejects any Call outside {len,any,all,min,max,abs,str,int,float,bool}, and also rejects lambdas, comprehensions, imports, and dunders. Deny-by-default; deny-wins-on-tie.
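One way such an AST whitelist could look, using Python's ast module. This is a sketch, not the product's evaluator: the set of allowed node types and the function names are assumptions, and priority weighting and deny-wins-on-tie are not modelled. Only the call whitelist comes from the text above.

```python
import ast

# Call whitelist taken from the text; everything else here is illustrative.
SAFE_CALLS = {"len": len, "any": any, "all": all, "min": min, "max": max,
              "abs": abs, "str": str, "int": int, "float": float, "bool": bool}

# Node types the sketch accepts. Lambdas and comprehensions are rejected
# simply because they are not listed.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.In, ast.NotIn, ast.Name, ast.Load, ast.Constant, ast.Call,
    ast.Attribute, ast.List, ast.Tuple,
)


def validate(expr: str) -> ast.Expression:
    """Parse expr and raise ValueError if it uses anything outside the whitelist."""
    try:
        # mode="eval" accepts a single expression, so import statements fail to parse.
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid expression") from exc
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"{type(node).__name__} is not allowed")
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in SAFE_CALLS:
                raise ValueError("call to a non-whitelisted function")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attribute access is not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
    return tree


def evaluate(expr: str, context: dict) -> bool:
    """Validate, then evaluate with no builtins beyond the whitelisted calls."""
    code = compile(validate(expr), "<policy>", "eval")
    return bool(eval(code, {"__builtins__": {}, **SAFE_CALLS}, dict(context)))
```

For example, evaluate("clearance >= 3 and len(fields) <= 10", {"clearance": 3, "fields": ["a"]}) passes validation, while "__import__('os')" is rejected before anything is evaluated.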

7. Plan

Intent compiler validates and structures the request; QueryPlanner emits a SafeQuery with :named placeholders. Tenant isolation rendered as a row policy on every entity.
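A hedged sketch of what a planner emitting :named placeholders might look like. The SafeQuery fields, the ALLOWED_COLUMNS registry, and the plan function are hypothetical, and tenant isolation is shown as an always-appended WHERE predicate, which is a simplification of the row policy described above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical schema registry: entity -> columns a caller may select or filter on.
ALLOWED_COLUMNS = {"orders": {"id", "customer", "amount"}}


@dataclass(frozen=True)
class SafeQuery:
    sql: str                    # contains only :named placeholders, never values
    params: Dict[str, object]   # values are bound by the driver, not interpolated


def plan(entity: str, fields: List[str], filters: Dict[str, object],
         tenant_id: str) -> SafeQuery:
    """Build a parameterised SELECT; identifiers come only from the registry."""
    if entity not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown entity {entity!r}")
    if not fields:
        raise ValueError("at least one field is required")
    unknown = (set(fields) | set(filters)) - ALLOWED_COLUMNS[entity]
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    where = [f"{col} = :{col}" for col in sorted(filters)]
    where.append("tenant_id = :tenant_id")  # tenant predicate is always appended
    sql = f"SELECT {', '.join(fields)} FROM {entity} WHERE {' AND '.join(where)}"
    return SafeQuery(sql, {**filters, "tenant_id": tenant_id})
```

Because filter values only ever travel in params, a value such as "acme' OR '1'='1" is compared as a literal string and matches nothing.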

8. Execute

MODE-gated dispatch. MODE=PRODUCTION uses the real SAP adapter (or RDS / Synapse / BigQuery). Simulation and in-memory fixtures refuse to run in production.
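The production guard could be sketched as follows. The class names, the select_adapter helper, and any mode value other than PRODUCTION are hypothetical; the point is only that the fixture adapter itself refuses to be constructed in production, independently of the dispatcher.

```python
from typing import Dict, List


class SimulationAdapter:
    """In-memory fixture adapter that refuses to be constructed in production."""

    def __init__(self, mode: str) -> None:
        if mode == "PRODUCTION":
            raise RuntimeError("simulation adapter is disabled when MODE=PRODUCTION")
        self._rows: List[Dict[str, object]] = [{"id": 1, "amount": 10}]

    def execute(self, sql: str, params: Dict[str, object]) -> List[Dict[str, object]]:
        return list(self._rows)  # fixture data; the query is ignored in this sketch


class RealAdapter:
    """Placeholder for the real SAP / RDS / Synapse / BigQuery adapter."""

    def execute(self, sql: str, params: Dict[str, object]) -> List[Dict[str, object]]:
        raise NotImplementedError("wire up the real backend here")


def select_adapter(mode: str):
    """Dispatch on MODE; only non-production modes can reach the fixtures."""
    if mode == "PRODUCTION":
        return RealAdapter()
    return SimulationAdapter(mode)
```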

9. Mask

Schema-driven response firewall. Per-field classification × user clearance × auth-object gate. Drop / redact / hash / partial / aggregate. PII detected in untagged fields fails the response.
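A simplified sketch of schema-driven masking. Everything named here is an assumption for illustration: the SCHEMA layout, the clearance numbers, the mask formats, and the email-only PII regex. The aggregate action and the auth-object gate are left out, and a real system would likely use a keyed hash rather than plain SHA-256.

```python
import hashlib
import re
from typing import Dict, Tuple

# Hypothetical schema: field -> (minimum clearance to see it in clear, action otherwise).
SCHEMA: Dict[str, Tuple[int, str]] = {
    "order_id": (0, "drop"),
    "email": (2, "partial"),
    "iban": (3, "hash"),
    "salary": (4, "redact"),
    "ssn": (5, "drop"),
}
# Deliberately crude PII detector used only for untagged fields in this sketch.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def mask_value(value: object, action: str) -> str:
    text = str(value)
    if action == "redact":
        return "***"
    if action == "hash":
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if action == "partial":
        return text[:2] + "*" * max(0, len(text) - 2)
    raise ValueError(f"unknown action {action!r}")


def mask_row(row: Dict[str, object], clearance: int) -> Dict[str, object]:
    """Apply the per-field action when the caller's clearance is too low."""
    out: Dict[str, object] = {}
    for field, value in row.items():
        if field not in SCHEMA:
            # Untagged field: fail the whole response if it looks like PII.
            if isinstance(value, str) and EMAIL_RE.search(value):
                raise ValueError(f"PII detected in untagged field {field!r}")
            out[field] = value
            continue
        min_clearance, action = SCHEMA[field]
        if clearance >= min_clearance:
            out[field] = value
        elif action != "drop":  # "drop" omits the field entirely
            out[field] = mask_value(value, action)
    return out
```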

Audit

Tamper-evident HMAC chain. SHA-256 of prev_hash || canonical_json(payload), HMAC-SHA256-signed under AEGIS_AUDIT_HMAC_KEY, written under a row-level lock on the previous row. Postgres-backed; SQLite fallback in dev.

Audit chain

Every decision is a row in a hash chain you can re-walk on demand.

Postgres-backed in production, SQLite-fallback in dev. Row-level lock on append; HMAC-signed for integrity beyond just hash chaining.

Per-row computation

row_hash = sha256(prev_hash || canonical_json(payload))
hmac_sig = hmac_sha256(AEGIS_AUDIT_HMAC_KEY, row_hash)

An attacker who writes directly to the DB cannot extend the chain undetected — they would need the HMAC key, which lives in your KMS / Vault, not the database.
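The per-row computation can be sketched with hashlib and hmac. The hex encoding of hashes, the sorted-key compact form used for canonical_json, and the compute_row helper are assumptions; only the two formulas come from the text.

```python
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> bytes:
    """Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def compute_row(prev_hash: str, payload: dict, key: bytes) -> dict:
    """Compute row_hash and hmac_sig for one audit row (hashes as hex strings)."""
    # row_hash = sha256(prev_hash || canonical_json(payload))
    row_hash = hashlib.sha256(prev_hash.encode("ascii") + canonical_json(payload)).hexdigest()
    # hmac_sig = hmac_sha256(AEGIS_AUDIT_HMAC_KEY, row_hash)
    hmac_sig = hmac.new(key, row_hash.encode("ascii"), hashlib.sha256).hexdigest()
    return {"prev_hash": prev_hash, "payload": payload,
            "row_hash": row_hash, "hmac_sig": hmac_sig}
```

Someone with database write access can recompute row_hash for a forged row, but cannot produce a matching hmac_sig without the key.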

Verification

# Admin-gated end-to-end re-walk:
GET /api/audit/verify
  → {"ok": true, "entries_checked": 18342}

# Unauthenticated low-info probe for k8s:
GET /api/audit/integrity
  → {"ok": true, "entries_checked": 18342}

A break returns the first offending row id. The Helm chart ships a CronJob that runs this every hour and pages on ok=false.
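A re-walk of the chain might look like the sketch below. The in-memory list of rows, the all-zero genesis hash, the integer row ids, and the helper names are assumptions for illustration; the response shape follows the examples above, with first_break_id added on failure.

```python
import hashlib
import hmac
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed prev_hash for the first row


def row_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def sign(key: bytes, digest: str) -> str:
    return hmac.new(key, digest.encode("ascii"), hashlib.sha256).hexdigest()


def append_row(chain: List[Dict], payload: dict, key: bytes) -> None:
    """Append one row linked to the previous row's hash."""
    prev = chain[-1]["row_hash"] if chain else GENESIS
    digest = row_hash(prev, payload)
    chain.append({"id": len(chain) + 1, "prev_hash": prev, "payload": payload,
                  "row_hash": digest, "hmac_sig": sign(key, digest)})


def verify_chain(chain: List[Dict], key: bytes) -> Dict:
    """Recompute every hash and signature in order; stop at the first mismatch."""
    prev = GENESIS
    for checked, row in enumerate(chain):
        digest = row_hash(prev, row["payload"])
        intact = (row["prev_hash"] == prev
                  and row["row_hash"] == digest
                  and hmac.compare_digest(row["hmac_sig"], sign(key, digest)))
        if not intact:
            return {"ok": False, "entries_checked": checked, "first_break_id": row["id"]}
        prev = digest
    return {"ok": True, "entries_checked": len(chain)}
```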

Threat model

What we mitigate, and how.

Threat | Mitigation
Forged JWT | iss / aud / exp / nbf / signature verification; JWKS rotation
Token replay after expiry | exp + short leeway + clock sync
Cross-tenant read | tenant_id in JWT + row policy on every query
SQL injection via intent | Parameterised SafeQuery; no string interpolation
Scope expansion via broad intent | Planner subset check + LOW-trust aggregation deny
Field-level PII leak | ResponseFirewall mask from FieldTag
Inference via repeated queries | Trust coverage ratio + revisit + coordination
Coordinated cross-user attack | Global coverage-growth + density signals
Audit tamper | HMAC hash chain + verify endpoint
Trust ledger DoS (Redis down) | Fail-closed by default
Oversize / slow-loris | Ceiling middleware (body / path / timeout)
Request-rate abuse | Rate limit middleware (per user + per tenant)
Unauthenticated admin actions | require_admin dependency
Config drift between subsystems | Single DataSchema registry
Default-secret deployment | Startup-time refusal in PRODUCTION when readiness blockers exist

Not yet mitigated, and explicitly on the roadmap: cross-region key management for encryption at rest, a formal pen-test, upstream DDoS protection, and an automatic kill-switch on chain-break.

Operational runbook

What ops does when something goes wrong.

Audit chain break detected

GET /api/audit/verify returns ok:false. Immediately disable writes, capture first_break_id, isolate the DB, compare row_hash against the last known-good backup. Do not truncate.

Redis outage

Trust system denies every request by default. Bring Redis back; no data loss (trust state regenerates). To serve traffic during the outage, accept the risk and set AEGIS_TRUST_FAIL_OPEN_ON_REDIS_OUTAGE=1 on the degraded cluster only.
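The fail-closed default with an explicit opt-in can be sketched as below. The trust_gate function, the MIN_SCORE threshold, and the returned strings are hypothetical; only the environment variable name comes from the text.

```python
from typing import Callable, Mapping

FAIL_OPEN_FLAG = "AEGIS_TRUST_FAIL_OPEN_ON_REDIS_OUTAGE"
MIN_SCORE = 0.5  # hypothetical threshold, for illustration only


def trust_gate(fetch_score: Callable[[], float], env: Mapping[str, str]) -> str:
    """Deny on a ledger outage unless the operator has explicitly opted in."""
    try:
        score = fetch_score()
    except ConnectionError:
        # Fail closed by default; serve degraded only when the flag is exactly "1".
        return "allow-degraded" if env.get(FAIL_OPEN_FLAG) == "1" else "deny"
    return "allow" if score >= MIN_SCORE else "deny"
```

Passing os.environ as env reads the flag from the process environment; taking it as an argument keeps the decision easy to test.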

Postgres outage (audit store)

In PRODUCTION the pipeline returns 503 because AEGIS_AUDIT_STRICT is implicitly enabled. Fix the DB before re-enabling traffic.

JWT rotation

Rotate IdP keys; AegisAI picks up the new keys via JWKS without a restart. Keep the old and new keys published together for a 24-hour overlap so that in-flight requests do not fail.

Rate-limit spike

Check aegis.requests{status=429} in your OTel backend. If per-user, investigate the account. If tenant-wide, raise AEGIS_RATE_LIMIT_PER_TENANT_RPM or shard the tenant.

Default secret deployed

The gateway refuses to start in PRODUCTION when readiness blockers exist. Rotate JWT_SECRET and AEGIS_AUDIT_HMAC_KEY via the configured secrets provider, then redeploy.

Want a deeper architecture review?

Bring your security team. We'll walk the threat model, the audit chain, and the SAP authority surface against your specific deployment shape.
