alphalist.CTO Podcast – For CTOs and Technical Leaders

#113 - Faster Incident Response feat. Tim Armandpour // CTO @ PagerDuty

50 min • 5 December 2024
Planning AND Practice: The Secret to Incident Response

Plan and PRACTICE for better incident response with insights from Tim Armandpour, CTO of PagerDuty. Learn the secrets to resilience from the team that mitigated the impact of a major outage—handling a 250% traffic surge while delivering on their SLA.

Listen to find out:

  • 🛠️ Why planning AND practice are both critical for incident response.
  • 🚧 How to practice for incident response (e.g., Failure Fridays with chaos engineering).
  • 🧑‍🤝‍🧑 Ownership: Why tech AND business teams must join post-mortems.
  • ☁️ How to mitigate the impact of your cloud provider’s lower SLA.
  • ⚓ Which architectural patterns are more resilient?
  • ⚖️ WARNING: “bend” the CAP theorem at your own risk

Timestamps:

(00:00:00) Introduction to Alphalist Podcast
(00:01:00) Meet Tim Armandpour
(00:01:56) Tim's Early Career
(00:06:22) Handling Major Incidents at PagerDuty
(00:09:21) The Importance of Preparedness
(00:13:54) Practicing Failure Scenarios
(00:18:16) Resilient Infrastructure and Architectural Patterns
(00:22:44) Standardization and Data Management
(00:25:48) Exploring Infrastructure Resilience
(00:26:20) Achieving High Availability with Lower SLA Cloud Platforms
(00:29:38) Defining Meaningful SLIs
(00:32:15) Assessing Incident Readiness
(00:35:15) The Importance of Ownership
(00:41:30) Continuous Improvement
(00:43:53) Lessons from a Yogurt Business
(00:48:18) Final Thoughts and Takeaways
