Act as Incident Commander for SEV-1 and SEV-2 incidents, coordinating responders across multiple teams, owning communications to stakeholders, and driving the incident to resolution.
Serve as a senior escalation point in the on-call rotation for both application services and data platform alerts; provide technical leadership when the primary on-call needs support.
Make rapid, high-stakes decisions on mitigation strategies (rollback, failover, traffic shifting, kill-switches) based on incomplete information.
Reliability Engineering & Remediation
Design and implement durable fixes, working alongside core engineering teams to harden services and pipelines against recurrence.
Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets across critical services and data products.
Drive measurable improvements in MTTR, MTTD, incident frequency, and on-call load through systematic engineering effort.
Post-Incident Review & Knowledge Sharing
Lead blameless post-mortems for major incidents; produce high-quality written analyses covering timeline, root cause, contributing factors, and corrective actions.
Track post-incident action items to completion and ensure systemic learnings translate into platform improvements.
Mentor mid-level and junior SREs on triage technique, debugging methodology, and incident communication.
Tooling, Automation & Platform Work
Refine alerting rules, dashboards, and runbooks to improve signal-to-noise ratio and reduce on-call fatigue.
Design and build internal tooling that automates triage, diagnostics, and remediation workflows.
Contribute to the observability stack (metrics, logs, traces) to ensure incidents can be diagnosed quickly and confidently.
Deployment & Production Readiness
Own the deployment of infrastructure and applications across environments, ensuring correct configuration, security posture, and adherence to deployment standards.
Define and enforce production readiness criteria for new services and data pipelines before they are accepted into the on-call rotation.
Champion safe deployment practices including progressive rollouts, automated rollback, and pre-production validation.
Requirements
Experience
7+ years of professional experience in Site Reliability Engineering, Production Engineering, DevOps, or a Software Engineering role with significant production ownership.
Working knowledge of at least one major cloud platform (AliCloud, AWS, GCP, or Azure) and modern infrastructure-as-code practices (e.g., Terraform).
Proficiency with observability tooling such as Prometheus, Grafana, Datadog, the ELK stack, or equivalents.
Advanced SQL skills and experience diagnosing data correctness, latency, and pipeline failure issues.
Solid understanding of distributed systems concepts: consistency, availability, partitioning, queueing, backpressure, and failure propagation.
Nice to Have
Experience with Kubernetes and container orchestration in production.
Experience defining SLO frameworks or error-budget policies from scratch.