DevOps + MLOps Engineer (GPU Workloads, AWS, Production Pipelines)

Remote, USA Full-time
We are hiring a DevOps and MLOps Engineer to help us build and operate a production-grade cloud setup for an AI-heavy application. This role is hands-on and execution-focused. You will own infrastructure, deployment pipelines, observability, cost controls, and GPU workload operations.

We will share full product details and architecture on a call. For now, assume the platform includes a web app, backend services, media storage, async processing workers, and AI integrations (LLM, TTS, STT), plus GPU-based generation workloads.

What you will do

- Design and implement cloud infrastructure on AWS for a modern backend stack.
- Set up CI/CD for multiple services and environments (dev, staging, production).
- Build an event-driven processing system using queues and worker pools.
- Operate GPU workloads end-to-end, including provisioning, scheduling, scaling, and cost control.
- Implement monitoring, alerting, and dashboards for API latency, queue depth, worker health, GPU utilization, failure rates, and spend.
- Create secure access patterns for secrets and data encryption.
- Define operational runbooks, incident response, and reliability playbooks.
- Help the team ship fast without breaking production, with clear guardrails and measurable SLOs.

Required experience (must have)

- You have run GPU workloads in production, not just experiments.
- Hands-on experience with GPU providers such as RunPod and at least one of CoreWeave or Salad (Lambda or Vast also acceptable), including:
  - spinning up GPU instances
  - packaging and deploying GPU services
  - managing concurrency
  - autoscaling strategies
  - handling preemption and failures
  - monitoring GPU health and utilization
  - hard cost caps and budget guardrails
- Strong AWS fundamentals, including IAM, VPC, S3, CloudWatch, Secrets Manager, KMS, and AWS Budgets.
- Solid Docker experience and production CI/CD setup.
- Infrastructure as Code experience, Terraform preferred.
- Comfortable setting up queues, background workers, and async pipelines.
- Strong security mindset and ability to implement least privilege and audit trails.

Nice to have

- Kubernetes GPU scheduling experience, or deep ECS/Fargate patterns.
- Experience building cost meters per job or per request in AI systems.
- Experience with ML lifecycle tooling like MLflow or Weights and Biases.
- Experience with streaming and real-time pipelines.

Deliverables in the first 2 to 4 weeks

- Working AWS environments (dev/staging/prod) with secure networking and access controls.
- CI/CD pipelines that deploy backend and workers reliably.
- Queue + worker infrastructure with autoscaling policies.
- GPU execution setup on RunPod and a second provider (CoreWeave or Salad preferred) with monitoring and a fallback strategy.
- Observability dashboards and alerting with clear runbooks.
- Cost controls and spend visibility by component.

How we work

- Short sprints with frequent demos.
- Clear scope and strong ownership.
- You will work closely with engineering and product.

To apply, include

- A short description of your most recent production GPU workload: provider used, GPU type, workload type (inference/rendering), concurrency, scaling approach, failure handling, monitoring, and monthly spend range.
- Links or examples of infrastructure work you have done (GitHub, writeups, diagrams, or sanitized screenshots are fine).

Screening questions

- Which GPU providers have you used in production, and for what workloads?
- What was your approach to scaling and cost caps during peak traffic?
- How do you handle GPU job failures, retries, and preemption safely?
- What’s your preferred AWS stack for queues, workers, secrets, and monitoring?

If you look strong on paper, we will do a short call to share context and validate fit quickly.

Important Notice: Please do not message Tim directly. All applications and questions must be sent to Ahmed, the hiring lead, through this Upwork job post and message thread only. Anyone who reaches out via LinkedIn or any other third-party channel will be rejected and reported on Upwork.