
Kiran Patel

Site Reliability Engineer

Mumbai, India | your.email@example.com | +91 98XXX XXXXX | linkedin.com/in/your-profile

Professional Summary

Site Reliability Engineer with 5 years of experience owning production reliability for high-traffic financial platforms. At Payvault Technologies, has maintained 99.95% uptime on a payment processing platform handling 1M+ daily transactions, exceeding a published 99.9% SLA for 18 consecutive months. Reduced P1/P2 incident frequency by 60% through SLO-driven alert remediation and error budget governance; cut monthly toil from 40 hours to 7.5 hours through targeted automation. Takes a data-driven approach: measures error budgets, replaces noisy metric-threshold alerts with symptom-based alerting, and prioritises proactive reliability work over reactive firefighting.

Skills and Expertise

Reliability Engineering: SLO / SLI / Error Budget definition, Incident command and postmortems, Toil measurement and elimination, Chaos engineering (AWS FIS), On-call rotation management, Runbook authoring
Observability: Prometheus, Grafana, Alertmanager, Loki, Jaeger, PagerDuty, Opsgenie
Kubernetes and Infrastructure: EKS (operations and upgrades), Helm, ArgoCD, Terraform, CloudFormation
Cloud - AWS: EC2, RDS, S3, CloudWatch, VPC, IAM, ALB
Automation: Python, Bash, Go (basic), AWS Lambda automation

Certifications

  • Certified Kubernetes Administrator (CKA) | Feb 2023 - Feb 2026
  • AWS Certified Solutions Architect - Associate | May 2022 - May 2025
  • HashiCorp Certified: Terraform Associate | Aug 2022 - Aug 2025

Work Experience

Site Reliability Engineer - Payvault Technologies
Sep 2022 - Present
  • Owns production reliability for a payment processing platform handling 1M+ daily transactions (peak 600 TPS during settlement windows); maintained 99.95% monthly uptime across 18 consecutive months, consistently exceeding the published 99.9% SLA and avoiding an estimated INR 4.2 crore in SLA penalty exposure.
  • Defined SLOs and SLIs for 6 critical services in collaboration with product owners; error budget tracking shifted 30% of sprint capacity from reactive firefighting to proactive reliability work within 2 quarters, contributing to a 60% reduction in P1/P2 incident frequency over 12 months.
  • Manages 20+ services across EKS clusters including version upgrades (v1.24 to v1.29), HPA tuning, pod disruption budget configuration, and node group scaling; completed 4 cluster upgrade cycles with zero unplanned service disruptions.
  • Rebuilt alerting across the Prometheus/Grafana/Alertmanager stack, replacing 120 metric-threshold alerts with 28 symptom-based alerts on user-facing error rates and latency; on-call wake-ups fell from 11 per week to under 3, with zero missed genuine incidents.
  • Led incident command for 6 Severity 1 incidents over 18 months; ran blameless postmortems and tracked all action items to closure; mean time to resolve P2 incidents fell from 90 minutes to 35 minutes, measured over 40 incidents.
  • Eliminated 32.5 hours of monthly toil (40 hours down to 7.5) by automating 6 recurring operational tasks in Python and Bash, including RDS failover validation, certificate renewal checks, deployment health gates, stale resource audits, and snapshot lifecycle management; the recovered engineering capacity was reallocated to SLO improvement work.
  • Ran a chaos engineering exercise on 3 services using AWS Fault Injection Simulator (pod deletion, network latency injection, and AZ impairment scenarios); uncovered 3 resilience gaps (missing retry logic, absent circuit breakers, misconfigured readiness probes) and closed all 3 before they caused a production incident.
DevOps / SRE Engineer - Skybridge India
Jun 2020 - Sep 2022
  • Shared on-call rotation for a B2B SaaS product serving 800 enterprise customers; reduced first-response time to production alerts from 22 minutes to under 6 minutes by improving alert routing and on-call escalation policies.
  • Migrated monitoring from CloudWatch-only to a Prometheus/Grafana stack; set up scraping and dashboards for 8 microservices, giving the team per-service visibility into latency and error rates for the first time.
  • Authored Terraform modules for EKS, ALB, RDS, and VPC; reduced infrastructure provisioning time for new environments from 4 days to 3 hours, with the modules used to provision 5 new environments in the first 6 months.
  • Reduced EKS compute costs by 22% by implementing Cluster Autoscaler and rightsizing pod resource requests based on 30-day Prometheus utilisation data, without any increase in OOM-kill or throttling incidents.
Junior Systems Engineer - Infracore Corp
Jul 2019 - Jun 2020
  • Managed EC2, RDS, and load balancer infrastructure for a legacy web application serving 50,000 monthly users; maintained 99.7% availability across the year.
  • Migrated a self-managed MySQL instance to RDS Multi-AZ; eliminated a single point of failure that had caused 3 unplanned outages in the previous year, each lasting 45-90 minutes.

Education

Bachelor of Engineering (BE) - Electronics and Telecommunications - Mumbai Metropolitan University
2015 - 2019

Customise this template: Replace the name, contact details, company names, and bullet points with your own experience. The structure and phrasing are designed to read well to both recruiters and hiring managers in cloud and DevOps roles.