Description
- Position: Site Reliability Engineer
- Role summary: Senior SRE responsible for ensuring service availability, performance, and scalability across distributed systems.
- Core focus: Design and operate production systems with emphasis on reliability, automation, and measurable SLIs/SLOs.
- Day‑to‑day: Build runbooks, automate toil, own incident response, perform root cause analysis (RCA), and foster a postmortem culture.
- Technical skills: Deep expertise in Linux, networking, containers, Kubernetes, CI/CD, and cloud platforms (AWS/Azure/GCP).
- Observability: Implement and tune logging, metrics, tracing, and alerting to reduce mean time to detect (MTTD) and mean time to recover (MTTR).
- Automation: Write infrastructure as code, build self‑healing patterns, and run automated capacity and chaos tests.
- Performance and capacity: Lead capacity planning, cost optimization, and performance tuning for high‑traffic services.
- Security and compliance: Integrate security into pipelines, manage secrets, and enforce compliance controls in production.
- Architecture influence: Collaborate with dev teams to design fault‑tolerant, scalable architectures and release strategies.
- Leadership: Mentor junior SREs, run on‑call rotations, and shape SRE practices and hiring for the team.
- Advanced responsibilities: Drive platform engineering, SRE tooling, reliability roadmaps, and cross‑org reliability SLAs.
- Impact metrics:




