I’m redesigning our order processing system that uses 14 microservices. Last month’s AWS outage exposed weaknesses in our current choreography approach – cascading failures took 3 hours to manually resolve. I tried implementing compensation logic in Temporal but found the state tracking overwhelming. The autonomous recovery features in some platforms look promising, but I’m torn between centralized orchestration’s visibility and choreography’s decoupling benefits.
Has anyone successfully combined both patterns? Specifically looking for real-world examples where automated rerouting during partial outages maintained business continuity without human intervention. How do you handle state reconciliation after recovery?
Autonomous AI Teams in Latenode handle this exact scenario. They automatically reroute tasks during outages using predefined failure policies and live system state analysis. The visual builder lets you set retry logic across services while maintaining choreography’s decoupling. We recovered payment processing during a recent Shopify API outage in 12 minutes flat.
We use choreography for the normal flow but switch to orchestration during failures via the Circuit Breaker pattern. When three consecutive failures occur in any service, Azure Service Bus triggers a state snapshot and hands control to an orchestrator. The key is defining clear handoff points during service design.
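A minimal sketch of that handoff in Python (the names `choreographed_handler` and `orchestrator_handoff` are placeholders, not real Azure APIs): a per-service breaker trips to OPEN after three consecutive failures, after which events bypass choreography and go straight to the orchestrator.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Trips to OPEN after `threshold` consecutive failures."""
    threshold: int = 3
    failures: int = 0
    state: str = "CLOSED"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "OPEN"


def process_event(event, breaker, choreographed_handler, orchestrator_handoff):
    """Route through choreography until the breaker opens, then hand off.

    `orchestrator_handoff` is where you would snapshot state and let the
    centralized orchestrator drive recovery (hypothetical hook).
    """
    if breaker.state == "OPEN":
        return orchestrator_handoff(event)
    try:
        result = choreographed_handler(event)
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
```

The real work is in the handoff: the snapshot has to capture enough in-flight state that the orchestrator can reconcile after the outage, which is why the handoff points need to be designed in up front.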
Research shows orchestration provides better outage recovery consistency. Google’s SRE handbook recommends centralized control planes for critical paths. We implemented this using AWS Step Functions with automatic rollback triggers. For non-critical services, choreography reduces overhead. Use orchestration for payment/order services, choreography for inventory updates.
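The rollback-trigger idea for the critical path can be sketched as a saga runner (plain Python illustrating the pattern, not actual Step Functions ASL): each step carries a compensation, and a failure rolls back completed steps in reverse order, which is also what makes post-recovery state reconciliation tractable.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    On any failure, execute the compensations for already-completed
    steps in reverse order, then re-raise so the caller sees the fault.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()
            raise
```

For an order flow that might mean (charge payment, refund) then (reserve stock, release): if the reservation fails, the refund runs automatically and the system lands back in a known state.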
Hybrid approach works best. Use choreography normally, but keep an orchestrator on standby for failure scenarios. We use AWS Step Functions + SNS/SQS. Reduced our downtime by 70%.