QA Engineer - Load Testing Specialist (2-month contract)

Position Overview

Monolith AI is seeking an experienced QA Engineer to lead load testing efforts for a critical system release focused on improving concurrency and high request load handling. This fast-paced, short-term engagement requires someone who can quickly understand complex distributed systems, design comprehensive load tests, and work collaboratively with a rapidly growing engineering team to ensure our new environment meets performance requirements.

Primary Responsibilities

• Design and Implement Automated Load Testing Framework
  ◦ Develop comprehensive load tests for FastAPI endpoints, Temporal workflows/activities, and AWS service interactions
  ◦ Create realistic test scenarios simulating concurrent workflow execution patterns, including graph-based workflow orchestration
  ◦ Build automated test suites that measure system behavior under varying concurrency levels and request loads
• Performance Analysis and Bottleneck Identification
  ◦ Monitor and analyze system performance across the entire stack (API layer, Temporal workers, AWS services)
  ◦ Identify concurrency limitations in Temporal workflow execution, AWS service limits (Athena, ECS), and inter-component communication
  ◦ Document performance characteristics including response times, throughput limits, and failure modes under load
• Collaborate on Non-Functional Requirements (NFR) Definition
  ◦ Work with Customer Success and Product teams to understand business requirements and translate them into measurable performance criteria
  ◦ Iterate on acceptable concurrency thresholds, latency targets, and throughput requirements
  ◦ Validate that proposed NFRs are realistic and achievable given architectural constraints
• System Documentation and Knowledge Extraction
  ◦ Build an understanding of the existing system through code review, discussions with the development team, and exploratory testing
  ◦ Create clear documentation of test methodologies, results, and recommendations for future testing
• Recommendation and Optimization Guidance
  ◦ Provide actionable recommendations for removing identified bottlenecks
  ◦ Suggest configuration optimizations for Temporal (worker pools, task queues) and AWS services (Athena concurrency, ECS capacity)
• Rapid Communication and Status Reporting
  ◦ Maintain daily/frequent communication with the Tech Lead regarding project progress, blockers, and findings
  ◦ Quickly escalate issues that could impact the aggressive timeline
  ◦ Present findings and recommendations to technical and non-technical stakeholders
• Cross-Component Integration Testing
  ◦ Test complex scenarios involving graph execution triggering node workflows across multiple system boundaries
  ◦ Validate S3 read/write operations under concurrent load
  ◦ Ensure inter-component communication (API → Temporal, Temporal Activity → API triggers) performs reliably at scale

Key Performance Indicators

• Test Coverage and Execution
  ◦ Complete an automated load test suite covering all critical components within the first 3 weeks
  ◦ Execute baseline and progressive load tests identifying maximum sustainable concurrency levels
• Bottleneck Identification and Impact
  ◦ Identify and document the top 5-7 performance bottlenecks with clear impact analysis
  ◦ Provide actionable remediation recommendations with estimated effort and impact for each bottleneck
• NFR Definition and Validation
  ◦ Collaborate with stakeholders to define measurable NFRs within the first 2 weeks
  ◦ Validate that the system meets the agreed NFR criteria, or document the gaps, by project end
• Documentation and Knowledge Transfer
  ◦ Deliver comprehensive test documentation, results analysis, and system performance characteristics
  ◦ Conduct knowledge transfer sessions ensuring the team can maintain and extend the testing framework
• Project Velocity and Communication
  ◦ Meet weekly milestone targets in this fast-paced 2-month engagement
  ◦ Maintain a proactive communication rhythm (daily standups, weekly detailed reports to the Tech Lead)

Required Qualifications

Experience:
• 4+ years of experience in QA/performance testing roles
• 2+ years of hands-on experience with load testing distributed systems and microservices architectures
• Proven experience with load testing tools (e.g., k6, JMeter, Locust, Gatling, Artillery)
• Experience testing workflow orchestration systems (Temporal, Airflow, Prefect, or similar)
• Demonstrated ability to test systems integrating with AWS services (particularly Athena, ECS, S3)

Technical Skills:
• Strong proficiency in Python (required for test automation and working with FastAPI/Temporal)
• Experience with REST API testing and performance validation
• Understanding of distributed systems concepts: concurrency, queueing, backpressure, rate limiting
• Familiarity with AWS infrastructure and service limits
• Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or similar)
• Proficiency with Git and CI/CD pipelines
• Ability to read and understand code in order to design effective tests

Immediate Availability:
• Ability to start in early January 2025 and commit to a focused 2-month engagement
• Availability for full-time contract work during the project duration

Preferred Qualifications

• Direct experience with Temporal (workflows, activities, workers)
• Experience with containerized workloads and Docker/ECS
• Prior work in fast-paced startup or scale-up environments
• Experience with infrastructure-as-code (Terraform, CloudFormation)
• Background in Site Reliability Engineering (SRE) or DevOps practices
• Familiarity with data processing pipelines and analytics systems
• Previous contract/consulting experience with rapid knowledge acquisition
• Experience with graph-based workflow systems or DAG execution engines
• Knowledge of AWS service limits and optimization strategies

Essential Soft Skills

Self-Direction and Initiative:
• Ability to operate independently in an ambiguous, fast-moving environment with minimal documentation
• Proactive problem-solving mindset; doesn't wait for perfect information before taking action
• Comfortable making pragmatic decisions quickly in a time-constrained project

Communication and Collaboration:
• Exceptional communication skills for extracting knowledge through conversations with existing team members
• Ability to translate technical findings into clear, actionable recommendations for diverse audiences
• Comfortable asking clarifying questions and challenging assumptions respectfully
• Strong written communication for documentation and status updates

Adaptability and Learning Agility:
• Quick learner who can rapidly understand complex, poorly documented systems
• Flexible and comfortable with changing priorities in a 15-person team that's doubling in size
• Thrives in fast-paced environments with aggressive timelines

Pragmatism and Results Orientation:
• Focused on delivering practical, actionable outcomes within tight timeframes
• Understands the balance between thoroughness and speed in a 2-month engagement
• Comfortable with "good enough" when perfect isn't achievable within constraints

Stakeholder Management:
• Skilled at managing expectations with technical leadership about realistic timelines and trade-offs
• Diplomatic when delivering difficult news about performance limitations or bottlenecks
• Collaborative approach when working with CS and Product on NFR definition

Key Challenges in This Role

• Rapid Knowledge Acquisition with Limited Documentation
  ◦ The existing system lacks comprehensive documentation, requiring you to quickly build understanding through code review, system exploration, and frequent discussions with the development team
  ◦ Success requires comfort with ambiguity and strong investigative skills
• Aggressive Timeline with High Impact
  ◦ A 2-month timeline to design tests, execute comprehensive load testing, identify bottlenecks, and deliver actionable recommendations is extremely tight
  ◦ Must balance thoroughness with pragmatism; prioritize ruthlessly to ensure critical areas are covered
• Complex Distributed System with Multiple Integration Points
  ◦ The system involves multiple layers (FastAPI, Temporal, AWS services) with complex inter-component communication patterns (graph → node workflows)
  ◦ Must understand the entire stack sufficiently to design realistic, comprehensive load tests that expose real-world bottlenecks