DeepMind

Senior Software Engineer, Large Scale Infrastructure, Platform

Job Description

Posted on: January 11, 2023

Your job is to develop the supercomputing infrastructure that maps the execution of these experiments onto our supercomputers efficiently and reliably. There are three key areas of focus: scalability of the overall distributed system, failover management, and performance optimisation. You will analyse and improve existing systems, and also propose and implement new components and whole systems. This work spans the entire software stack, from the research code (JAX) on top, through the distributed runtime coordinating the overall supercomputer system, all the way down to the accelerator runtime. It includes partitioning and distributing JAX and Python computations, managing and recovering from node failures, efficient communication across nodes, the compilation stack for accelerator computations, and performance optimisation.

In addition to building on top of existing hardware, your job is also to provide feedback on future generations of our hardware systems. This work is done in close collaboration with various teams across Google and Alphabet.

Responsibilities

  • Correctness, robustness, and performance analysis of AI workloads, correlating functional errors or system inefficiencies with bugs or missed optimisation opportunities.
  • Designing and implementing system improvements to address scaling or performance inefficiencies.
  • Designing and implementing new components and systems that improve the reliability, scalability, and performance of our future software stack.
  • Thought leadership: driving the future direction of our supercomputer infrastructure and directly influencing researchers to shape their ML models so that they scale and adapt to future infrastructure and hardware capabilities.
  • Mentoring and leading junior engineers. There is no expectation to take on formal people management responsibilities.
  • Close collaboration and coordination with teams across Alphabet and external partners.
  • Active participation in future generations of our hardware systems.

Job Requirements

  • SWE interpersonal skills, such as discussing technical ideas effectively with colleagues (e.g. through whiteboard sessions, design docs, and presentations) in a highly collaborative environment.
  • Experience in building software tools that require leveraging low-level OS kernel functionality (e.g. a virtual file system or a memory management system).
  • Experience in working with compilers and/or program transformation, preferably aiming at improving the performance and/or reducing memory usage for CPU and/or ML accelerator workloads.
  • Programming hardware systems to achieve high levels of throughput and efficiency.
  • Profiling and benchmarking software systems using appropriate tools and techniques to find performance issues and bottlenecks.
  • Experience in scaling a large, complex, potentially distributed system that involves hundreds or thousands of individual tasks/processes.
  • You have an interest in our mission and AI / Machine Learning.

In addition, the following would be an advantage:

  • Bachelor’s, Master’s, or Ph.D. degree in Computer Science, a related technical field, or equivalent practical experience.
  • Experience with large systems (software and HW) design and development.
  • Experience with programming in modern C++ and Python.
  • Experience with GPUs or other hardware accelerators.
  • Experience with compilers such as LLVM, MLIR, XLA.
  • Experience with parallel programming and high-performance computing.
  • Familiarity with the key components of machine learning algorithms (e.g. Stochastic Gradient Descent).
  • Interest in AI and basic knowledge of ML approaches (supervised learning, reinforcement learning, unsupervised learning, etc.).
