Site Reliability Engineer, Cloud Native

Job Description

Posted on: April 6, 2023

SambaNova is hiring a Senior Site Reliability Engineer, Cloud Native Platform. As a site reliability engineer on this team, you will work closely with the platform engineering team to deploy and manage our Kubernetes-based platform at global scale. You will lead multiple initiatives to enhance our capabilities and provide a reliable, scalable service for customers in a hybrid deployment pattern.


  • Assume broad responsibilities for the successful delivery of our SambaNova services in a hybrid model, including, but not limited to, deployment, configuration, integrations, and ongoing operations
  • Deploy, administer, and manage multiple Kubernetes clusters, both on-prem and in private cloud environments
  • Lead efforts to triage, debug, and fix issues related to network, storage, scheduling, applications, and systems, covering proactive and reactive incident resolution and root cause analysis.
  • Develop and continuously improve platform capabilities for observability, monitoring, notifications, logging, tracing, and continuous delivery with reduced toil
  • Develop standard solutions that enable consistency in service delivery and proactively engage with multiple cross-functional teams to solve problems that impact service levels.
  • Collaborate with the platform engineers for continuous automation of fleet-wide infrastructure and application deployments
  • Determine and set SLOs for the service, build the processes and tools to measure and implement them, and prevent recurring problems and undesirable service conditions.
  • Participate in on-call rotation responsibilities

Job Requirements

  • Bachelor's and/or Master's degree in CS/EE or a related field
  • 5+ years of hands-on experience as an SRE with a focus on cloud-native technologies

Additional Required Qualifications

  • Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters and components.  
  • Strong experience configuring and administering Linux systems in cloud/SaaS production environments.
  • A systematic problem-solving approach to troubleshooting, and the desire to solve the root cause of common problems in 24x7 environments
  • Software programming experience in one or more languages, including Go and Python
  • Experience delivering infrastructure as code with tools such as Ansible, Terraform, Git, Jenkins, Helm, and ArgoCD.
  • Good understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, as well as troubleshooting network performance issues
  • Experience with monitoring and logging systems such as Prometheus, Grafana, Nagios, and ELK, and the ability to identify new technologies as appropriate
  • Experience tuning and optimizing storage solutions including Object Storage and NFS.
  • Knowledge of virtualization and multiple hypervisor technologies, as well as cloud computing platforms such as AWS, Azure, and GCP.
  • Experience configuring and maintaining web servers, load balancers, databases, storage systems, and messaging systems
  • Good understanding of test-driven development, continuous integration, and delivery
  • A passion to design for high availability and scale, with the discipline and desire for extensive automation.
  • Strong communication skills with the ability and willingness to work with diverse teams and customers across multiple time zones.