Lead Site Reliability Engineer, Observability - Remote

Remote, USA Full-time
About the position

Responsibilities
• Design, deploy, and scale our Prometheus architecture to handle 100+ million active series and beyond.
• Deploy and operate large, high-performance Elasticsearch clusters holding 2000+ TB of data.
• Deploy and grow high-throughput data pipelines built on Kafka, handling hundreds of thousands of events per second.
• Design and build an alerting system that allows engineering teams to construct alerts from multiple data sources and alerting workflows.
• Write libraries and APIs that give engineers self-service access to our monitoring, logging, and other observability systems.
• Use Terraform to deploy public and private cloud infrastructure.

Requirements
• 5+ years of experience designing, deploying, and operating mid-size to large distributed systems on VMs or bare-metal machines running Linux (we run Debian and Ubuntu).
• 2+ years of experience developing with languages like Ruby, Python, Go, Scala, or Bash.
• Excited by the challenge of solving difficult problems in large distributed systems that deal with huge amounts of data.
• Desire to work on a highly autonomous team that cares deeply about quality and customer experience.
• Curious, a fast learner, and comfortable diving into unfamiliar code and systems to solve problems.
• Understands the value of observability and can work with other teams to help them better monitor their services.
• Willing to be part of a production on-call rotation.
• Direct experience with technologies such as the Elasticsearch, Logstash, Kibana (ELK) stack, Kafka, Prometheus/Thanos/Cortex, Graphite, Ansible, Terraform, and Consul.
• Strong experience building solutions based on software engineering best practices.

Benefits
• Quality medical, dental, and vision insurance.
• 401(k) plan with a Cisco matching contribution.
• Short and long-term disability coverage.
• Basic life insurance.
• Numerous wellbeing offerings.
• Up to twelve paid holidays per calendar year, including one floating holiday.
• Paid time off for birthdays.
• Vacation time off policy with flexible limits for exempt employees.
• Sick time off policy with 80 hours provided on hire date and annually thereafter.
• Paid time to volunteer and give back to the community.
