Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
