Automation. SREs are obsessed with automation tooling.
System Architecture, including upstream and downstream dependencies
Deployment & Change Management, including the Canary and Release process
Resiliency strategy, such as Load and Failure testing.
Capacity Planning, Turn-ups and Turn-downs.
Performance, Efficiency & Scaling, including Availability and Latency
Instrumentation, Monitoring, Alerting & Reporting on key metrics and SLAs
Incident Response (improving the on-call experience, tools, and procedures) and Postmortem follow-up to honor the SLA
Operational Readiness, such as Runbooks and other Documentation, Escalation Paths, and Incident Response Training exercises.
Partner with fellow engineers to architect and build mission-critical software and systems that can stand the test of scale and availability, while limiting operational overhead.
Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis.
Participate in an on-call rotation and be available for escalations.
Grit, drive and a strong feeling of ownership.
BS or MS in Computer Science or a related technical discipline. Equivalent practical experience is a reasonable substitute.
Experience with AWS, GCP or Azure
Good programming skills in at least one of Go, Java, C/C++, Python, .NET, or PHP, and the ability to pick up new languages.
Expert-level Linux knowledge and a good understanding of its fundamentals and internals: kernel, filesystems, modern memory management, threads and processes, the user/kernel-space divide, network stack, etc.
A good understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems.
Configuration management knowledge: Puppet, Chef, etc.
Good scripting skills in at least one of Bash, Perl, etc.
Working knowledge of the TCP/IP stack, internet routing and load balancing.