Dependency Reliability

Most production systems depend on databases, queues, third-party APIs, identity providers, and shared platforms. A service's reliability depends on how those dependencies fail, how quickly they recover, and how the service behaves in the meantime.

Dependency inventory

For each critical dependency, record:

Owner and support path.
Expected availability and latency.
Timeout and retry behavior.
Quotas, rate limits, and concurrency limits.
Failure mode and customer impact.
Fallback, cache, queue, or degraded behavior.
Monitoring and alert ownership.
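
The inventory above can be kept as structured data so that gaps are easy to spot. The following is a minimal sketch, assuming one record per dependency; the class name `DependencyRecord`, its field names, and the helper `missing_fields` are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass, fields


@dataclass
class DependencyRecord:
    """One inventory entry. Field names are illustrative, not a standard schema."""

    name: str
    owner: str = ""                    # owning team
    support_path: str = ""             # how to escalate or open a ticket
    expected_availability: str = ""    # for example "99.9% monthly"
    expected_latency: str = ""         # for example "p99 under 200 ms"
    timeout_and_retry: str = ""        # configured timeout and retry behavior
    limits: str = ""                   # quotas, rate limits, concurrency limits
    failure_mode_and_impact: str = ""  # how it fails and what customers see
    degraded_behavior: str = ""        # fallback, cache, queue, or degraded mode
    alert_owner: str = ""              # who owns monitoring and alerts


def missing_fields(record: DependencyRecord) -> list[str]:
    """Return the names of inventory fields that are still blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```

Calling `missing_fields` on each record during a review gives a quick list of what still needs an answer for that dependency.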

Failure controls

Common controls include:

Timeouts shorter than user-facing deadlines.
Retries with jitter and bounded attempts.
Circuit breakers for repeated failures.
Bulkheads for resource isolation.
Queues for async buffering.
Caches for safe stale reads.
Feature flags to disable optional dependency paths.
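
As an illustration of the first two controls, the sketch below retries a call with a bounded number of attempts and jittered exponential backoff, while keeping the total time inside a caller-supplied deadline. It is one possible shape, not a definitive implementation: `call_with_retries`, its parameters, and the `operation` callable (assumed to accept a timeout in seconds and raise on failure) are all hypothetical names.

```python
import random
import time


def call_with_retries(operation, *, deadline_s, attempt_timeout_s, max_attempts=3,
                      base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Call operation(timeout) until it succeeds, attempts run out, or the deadline nears.

    `operation` is a hypothetical callable that takes a timeout in seconds
    and raises an exception on failure.
    """
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            # Never give one attempt more time than the caller has left.
            return operation(min(attempt_timeout_s, remaining))
        except Exception as exc:  # in real code, catch only retryable errors
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no sleep after the final attempt
        # Full jitter: sleep a random amount up to the capped exponential backoff.
        backoff = min(max_delay_s, base_delay_s * (2 ** attempt))
        delay = rng() * backoff
        remaining = deadline_s - (clock() - start)
        if delay >= remaining:
            break  # sleeping would use up the rest of the deadline
        sleep(delay)
    raise TimeoutError("dependency call failed within the deadline") from last_error
```

Passing the remaining time into each attempt is the design choice that keeps per-attempt timeouts shorter than the user-facing deadline; the `sleep`, `clock`, and `rng` parameters exist only so the behavior can be tested without real waiting.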

Watchouts

Default SDK retries can exceed user-facing timeouts.
Shared dependencies can create correlated failures.
Optional dependencies often become required accidentally.
Third-party SLAs do not replace internal failure design.
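
The first watchout can be checked with simple arithmetic: if every attempt times out, the total time is the per-attempt timeout multiplied by the number of attempts, plus the backoff waits in between. The sketch below assumes capped exponential backoff; the function name and the example numbers are hypothetical and do not describe any particular SDK's defaults.

```python
def worst_case_seconds(attempt_timeout_s, max_attempts, base_delay_s, max_delay_s):
    """Upper bound on elapsed time when every attempt runs to its timeout."""
    # One backoff wait sits between each pair of consecutive attempts.
    waits = sum(min(max_delay_s, base_delay_s * 2 ** i) for i in range(max_attempts - 1))
    return attempt_timeout_s * max_attempts + waits


# Hypothetical client settings: 10 s timeout, 3 attempts, 0.5 s base backoff, 5 s cap.
total = worst_case_seconds(attempt_timeout_s=10, max_attempts=3, base_delay_s=0.5, max_delay_s=5)
user_deadline_s = 5  # assumed user-facing deadline
print(total, total > user_deadline_s)
```

With these example settings the worst case is 30 seconds of attempts plus 1.5 seconds of backoff, far beyond a 5 second user-facing deadline, which is why retry settings belong in the inventory alongside timeouts.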
