

Principal Site Reliability Engineer - Tools and Infrastructure
Job Description
As a Site Reliability Engineer, you will work within a rapidly growing team to ensure the high-quality delivery of our software by building the frameworks and tools that software engineers and data scientists need to thoroughly validate, deploy, and monitor their code. In this role, you will be a champion for best practices and a quality mentor to the rest of the engineering organisation. A successful candidate will be a well-rounded engineer with experience in systems automation and tool development for container-based infrastructure, who thrives when solving diverse technical problems in a fast-paced environment.
Responsibilities
- Develop and maintain tools used by engineering for monitoring, event collection, and downtime tracking
- Develop deployment and automation tools to manage a rapidly growing number of services
- Drive improvements in security, scalability, reliability, and performance
- Troubleshoot large-scale distributed systems (hardware, software, applications, and network)
- Develop continuous integration systems and test frameworks
- Work with other teams to gather requirements and share them with the rest of the team
Job Requirements
- Extensive development using Python
- Architected and implemented scalable and secure solutions on AWS
- Operated container orchestration platforms, e.g. AWS ECS/Fargate/EKS
- Built and maintained self-service platforms and tooling for developers
- Developed reusable infrastructure-as-code modules for provisioning secure cloud infrastructure, e.g. Terraform/CloudFormation
- Deployed and operated logging and metrics infrastructure based on Elasticsearch, Kibana, and Grafana
- Developed and operated serverless functions using AWS Lambda, with experience of common deployment frameworks, e.g. AWS SAM or serverless.com
- Designed and implemented reliable CI/CD pipelines for deploying containers, serverless functions, and cloud infrastructure