Site Reliability Engineer, Cloud Native

Location

Bengaluru, India

Level

Senior

Department

Engineering

Type

Full-Time

Job Description

Posted on:

April 6, 2023

SambaNova is hiring a Senior Site Reliability Engineer, Cloud Native Platform. As a site reliability engineer on this team, you will work closely with the platform engineering team to deploy and manage our Kubernetes-based platform at global scale. You will lead multiple initiatives to enhance our capabilities and provide a reliable, scalable service for customers in a hybrid deployment model.

Responsibilities

Assume broad responsibility for the successful delivery of SambaNova services in a hybrid model, including but not limited to deployment, configuration, integrations, and ongoing operations
Deploy, administer, and manage multiple Kubernetes clusters, both on-prem and in private cloud environments
Lead efforts to triage, debug, and fix issues related to network, storage, scheduling, applications, and systems, covering both proactive and reactive incident resolution and root cause analysis.
Develop and continuously improve platform capabilities for observability, monitoring, notifications, logging, tracing, and continuous delivery with reduced toil
Develop standard solutions that enable consistency in service delivery and proactively engage with multiple cross-functional teams to solve problems that impact service levels.
Collaborate with platform engineers on continuous automation of fleet-wide infrastructure and application deployments
Define SLOs for the service, build the processes and tools to measure and meet them, and prevent recurring problems and undesirable service conditions.
Participate in the on-call rotation

Job Requirements

Bachelor's and/or Master's degree in CS/EE or a related field
5+ years of hands-on experience as an SRE with a focus on cloud-native technologies

Additional Required Qualifications

Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters and components.
Strong experience configuring and administering Linux systems in cloud/SaaS production environments.
A systematic problem-solving approach to troubleshooting, and the desire to address the root cause of common problems in 24x7 environments
Software programming experience in one or more languages, including Go and Python
Experience delivering infrastructure as code with tools such as Ansible, Terraform, Git, Jenkins, Helm, and ArgoCD.
Good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and of troubleshooting network performance issues
Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, and ELK, and the ability to identify and adopt new technologies as appropriate
Experience tuning and optimizing storage solutions including Object Storage and NFS.
Knowledge of virtualization and multiple hypervisor technologies, as well as cloud computing platforms such as AWS, Azure, and GCP.
Experience configuring and maintaining web servers, load balancers, databases, storage systems, and messaging systems
Good understanding of test-driven development, continuous integration, and continuous delivery
A passion for designing for high availability and scale, with the discipline and desire for extensive automation.
Strong communication skills, with the ability and willingness to work with diverse teams and customers across multiple time zones.

Apply now
