

Site Reliability Engineer, Cloud Native
Location
Bengaluru, India
Level
Senior
Department
Engineering
Type
Full-Time
Salary
Job Description
Posted on:
April 6, 2023
SambaNova is hiring a Senior Site Reliability Engineer, Cloud Native Platform. As a site reliability engineer on this team, you will work closely with the platform engineering team to deploy and manage our Kubernetes-based platform at global scale. You will lead multiple initiatives to enhance our capabilities and provide a reliable, scalable service for customers in a hybrid deployment model.
Responsibilities
- Assume broad responsibility for the successful delivery of SambaNova services in a hybrid model, including but not limited to deployment, configuration, integrations, and ongoing operations
- Deploy, administer, and manage multiple Kubernetes clusters, both on-prem and in private cloud environments
- Lead efforts to triage, debug, and fix issues related to network, storage, scheduling, applications, and systems, covering both proactive and reactive incident resolution and root cause analysis.
- Develop and continuously improve platform capabilities for observability, monitoring, notifications, logging, tracing, and continuous delivery with reduced toil
- Develop standard solutions that enable consistency in service delivery and proactively engage with multiple cross-functional teams to solve problems that impact service levels.
- Collaborate with the platform engineers for continuous automation of fleet-wide infrastructure and application deployments
- Determine and set SLOs for the service, build the processes and tools to measure and implement them, and prevent recurring problems and undesirable service conditions.
- Participate in on-call rotation responsibilities
Job Requirements
- Bachelor's and/or Master's degree in CS/EE or a related field
- 5+ years of hands-on experience as an SRE with a focus on cloud-native technologies
Additional Required Qualifications
- Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters and components.
- Strong experience configuring and administering Linux systems in cloud/SaaS production environments.
- A systematic problem-solving approach to troubleshooting, and the desire to solve the root cause of common problems in 24x7 environments
- Software programming experience in one or more languages, including Go and Python
- Experience delivering infrastructure as code: Ansible, Terraform, Git, Jenkins, Helm, ArgoCD.
- Good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and of troubleshooting network performance issues
- Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, and ELK, and the ability to identify new technologies as appropriate
- Experience tuning and optimizing storage solutions including Object Storage and NFS.
- Knowledge of virtualization and multiple hypervisor technologies, as well as cloud computing platforms such as AWS, Azure, and GCP.
- Experience configuring and maintaining web servers, load balancers, databases, storage systems, and messaging systems
- Good understanding of test-driven development, continuous integration, and delivery
- A passion for designing for high availability and scale, with the discipline and desire for extensive automation.
- Strong communication skills, with the ability and willingness to work with diverse teams and customers across multiple time zones.