⚠️ This post links to an external website. ⚠️
Happy New Year to my APAC friends! In my previous post, we traced the evolution from synchronous request-reply to enterprise event delivery platforms. We built recipient lists, webhook registrations, and bundled integrations. But we glossed over a critical question: what happens when delivery fails?
In production, failure isn’t exceptional. It’s constant. Endpoints go down. Networks partition. Rate limits trigger. Servers restart. A webhook platform that doesn’t handle failure gracefully isn’t a platform at all.
The telephone network solved these problems decades ago. When you dialed a number and got a busy signal, the network didn’t keep hammering the line. When trunk lines overloaded, calls were routed to overflow groups. When entire exchanges failed, traffic was isolated so one neighborhood’s outage didn’t take down the city.
Today we’re building the reliability layer using patterns the telecom industry perfected: retry with exponential backoff, dead letter queues, circuit breakers, bulkheads for tenant isolation, claim check for bandwidth optimization, and batching for efficiency.
continue reading on james-carr.org
If this post was enjoyable or useful for you, please share it! If you have comments, questions, or feedback, you can reach me at my personal email. To get new posts, subscribe using the RSS feed.