[Remote] Site Reliability Engineer (SRE) — AI Training & Inference Infrastructure

Remote, USA Full-time
Note: This is a remote job open to candidates in the USA.

STACK Construction Technologies builds software that helps teams plan, build, and operate with clarity and speed. They are seeking a Site Reliability Engineer to own reliability for their model training and inference platforms, with a focus on operating and evolving GPU-enabled clusters and improving developer experience for AI workloads.

Responsibilities
• Build and operate AI compute platforms
• Design, provision, and scale GPU-backed clusters for training and inference (Kubernetes-based and/or HPC-style schedulers)
• Own cluster lifecycle management: provisioning, bootstrapping, upgrades, autoscaling/capacity scaling, and decommissioning
• Build reliable abstractions so training jobs can run across multiple clusters/environments with minimal friction
• Define and track SLIs/SLOs for training and inference systems (job success rate, queue latency, throughput, tail latency, GPU utilization, etc.)
• Lead incident response and root-cause analysis; drive permanent fixes and “never again” automation
• Improve recovery and maintenance workflows (e.g., reducing restart/upgrade times; safer rollouts)
• Implement end-to-end monitoring across compute, networking, storage, and accelerators
• Build dashboards, alerting, and anomaly detection that catch issues early, before they derail long runs
• Tune performance and cost: GPU utilization, scheduling efficiency, I/O bottlenecks, and network hotspots
• Partner with vendors and internal stakeholders on firmware/driver alignment and node health
• Provide paved paths for training: reproducible environments, job templates, secure secrets, artifact storage, and dataset access patterns
• Collaborate closely with ML researchers/engineers to understand workload needs and remove infrastructure bottlenecks

Skills
• 5+ years building/operating production infrastructure as an SRE, infrastructure engineer, or systems engineer
• Strong Kubernetes experience (cluster operations, upgrades, networking, storage, and troubleshooting)
• Proficiency in at least one programming/scripting language (Python, Go, etc.) for automation and tooling
• Experience with Infrastructure-as-Code (Terraform preferred) and CI/CD for infra or platform components
• Solid Linux/Unix fundamentals (performance, debugging, kernel/userland tooling)
• Strong operational mindset: you care about reliability, safe change management, and measurable outcomes

Company Overview
• Stack Construction Technologies is a construction company. It was founded in 2010 and is headquartered in Cincinnati, Ohio, USA, with a workforce of 51-200 employees.
