Senior Prompt and Benchmark Engineer, Evaluation of World Models

Remote, USA Full-time

At NVIDIA, we're not just building the future, we're generating it! Our Cosmos generative AI engineering team is pushing the boundaries of what’s possible across multimodal learning, video generation, synthetic data, intelligent simulation, and agentic systems. We are looking for exceptionally driven engineers and applied scientists with deep experience in generative modeling to help improve the quality of our world models!

What You’ll Be Doing

• Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially generation and understanding world models that reason about video, simulation, and physical environments.
• Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
• Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
• Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
• Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
• Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
• Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
• Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.

Note: This is a 100% evaluation-focused role. You do not need to build pipelines or engineering infrastructure! You will be supported by engineers who will implement the systems around your designs.

What We Need To See

• Demonstrated experience with prompt engineering, including crafting, refining, and optimizing prompts.
• Strong attention to detail in designing natural language questions and formatting structured evaluations.
• Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
• Experience crafting or contributing to benchmarks or evaluation datasets, especially for multimodal or agentic systems.
• Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
• Excellent communication and collaboration skills: you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
• A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.
• 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
• BS, MS, or equivalent background. Prior experience in AI evaluation, annotation workflows, or research is highly valued.

Ways To Stand Out From The Crowd

• Hands-on experience with multiple LLMs or VLMs (e.g., GPT, Claude, Gemini, Flamingo, Kosmos, IDEFICS, etc.) to compare outputs and engineer task-specific prompts.
• Prior work designing benchmarks for robotics, simulation, AV, or agentic tasks, especially in multimodal or video-based settings.
• Experience working with human annotation teams, building clear instructions and QA processes for large-scale labeling campaigns.
• Familiarity with using VLMs as evaluators, leveraging models for response scoring, ranking, or consensus aggregation.
• Deep curiosity about model behavior and a drive to test, interrogate, and stretch the limits of generative systems.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until November 22, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

JR2008327
