
Senior Specialist, Site Reliability Engineer

Kuala Lumpur, Malaysia

Permanent

Duties & Responsibilities

  • Incident Command & Response
    • Act as Incident Commander for SEV-1 and SEV-2 incidents, coordinating responders across multiple teams, owning communications to stakeholders, and driving the incident to resolution.
    • Serve as a senior escalation point in the on-call rotation for both application services and data platform alerts; provide technical leadership when the primary on-call needs support.
    • Make rapid, high-stakes decisions on mitigation strategies (rollback, failover, traffic shifting, kill-switches) based on incomplete information.
  • Reliability Engineering & Remediation
    • Design and implement durable fixes, working alongside core engineering teams to harden services and pipelines against recurrence.
    • Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets across critical services and data products.
    • Drive measurable improvements in MTTR, MTTD, incident frequency, and on-call load through systematic engineering effort.
  • Post-Incident Review & Knowledge Sharing
    • Lead blameless post-mortems for major incidents; produce high-quality written analyses covering timeline, root cause, contributing factors, and corrective actions.
    • Track post-incident action items to completion and ensure systemic learnings translate into platform improvements.
    • Mentor mid-level and junior SREs on triage technique, debugging methodology, and incident communication.
  • Tooling, Automation & Platform Work
    • Refine alerting rules, dashboards, and runbooks to improve signal-to-noise ratio and reduce on-call fatigue.
    • Design and build internal tooling that automates triage, diagnostics, and remediation workflows.
    • Contribute to the observability stack (metrics, logs, traces) to ensure incidents can be diagnosed quickly and confidently.
  • Deployment & Production Readiness
    • Own the deployment of infrastructure and applications across environments, ensuring correct configuration, security posture, and adherence to deployment standards.
    • Define and enforce production readiness criteria for new services and data pipelines before they are accepted into the on-call rotation.
    • Champion safe deployment practices including progressive rollouts, automated rollback, and pre-production validation.

Requirements

  • Experience
    • 7+ years of professional experience in Site Reliability Engineering, Production Engineering, DevOps, or a Software Engineering role with significant production ownership.
    • Working knowledge of at least one major cloud platform (AliCloud, AWS, GCP, or Azure) and modern infrastructure-as-code practices (e.g., Terraform).
    • Proficiency with observability tooling such as Prometheus, Grafana, Datadog, the ELK stack, or equivalents.
    • Advanced SQL skills and experience diagnosing data correctness, latency, and pipeline failure issues.
    • Solid understanding of distributed systems concepts: consistency, availability, partitioning, queueing, backpressure, and failure propagation.
  • Nice to Have
    • Experience with Kubernetes and container orchestration in production.
    • Experience defining SLO frameworks or error-budget policies from scratch.