MLOps Lead, Central Technology
About the position Responsibilities • Provide technical MLOps leadership for a team of MLOps Engineers, managing and leading the team in operating AI training and inference systems. • Drive the application of MLOps and DevOps principles across multiple platforms, ensuring peak operational efficiency. • Define end to end metrics program including full proactive monitoring and alerting systems for the MLOps team. • Facilitate model training through collaboration with AI Researchers to ensure best practices in machine learning and deep learning. • Optimize Kubernetes based AI Lifecycle platform through IAC practices and integrate with On-Prem HPC systems. • Collaborate on Data systems for AI model training with Data Infrastructure Eng team and Science data teams. • Lead MLOps team supporting on-call rotation with a focus on automation and proactive alerting. Requirements • BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience. • 7+ years of relevant coding and systems experience. • 5+ years of systems Architecture and Design experience, with a broad range of MLOps experience. • Proven technical leadership in SRE and MLOps related experience. • Strong experience scaling containerized applications on Kubernetes or Mesos. • Cloud Platform proficiency with AWS, GCP, or Microsoft Azure. • MLOps experience working with medium to large scale GPU clusters in Kubernetes. • Working knowledge of Nvidia CUDA and AI/ML custom libraries. • Knowledge of Linux systems optimization and administration. • Solid Coding experience with a systems language such as Rust, C/C++, C#, Go, Java, or Scala. • Expertise with a scripting language such as Python, PHP, or Ruby. • Experience in integrating Data with the AI Lifecycle. • AI/ML Platform Operations experience in an environment integrated with challenging data and systems platform challenges. • Large scale Streaming data systems integration experience. • Experience with Hadoop, Spark, and/or Kafka deployments. • Workflow scheduling tools experience such as Apache Airflow, Dagster, or Apache Beam. • Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms. Nice-to-haves • Experience with PyTorch, Keras, or Tensorflow. • Experience with HPC and Slurm. Benefits • Generous employer match on employee 401(k) contributions. • Annual benefit for employees that can be used for housing, student loan repayment, childcare, commuter costs, or other life needs. • CZI Life of Service Gifts awarded to employees to support causes closest to them. • Paid time off to volunteer at an organization of your choice. • Funding for select family-forming benefits. • Relocation support for employees moving to the Bay Area. Apply tot his job