[Remote] Research Engineer (Machine Learning)
Note: This is a remote position open to candidates in the USA.

Aldea is a multi-modal foundational AI company focused on advancing the scaling laws of intelligence. The Research Engineer (Machine Learning) will build and optimize the infrastructure for multi-modal AI research, enabling the team to experiment with next-generation architectures in language and speech domains.

Responsibilities
• Build and maintain distributed training infrastructure supporting researchers across language and speech domains at a billion-plus-parameter scale.
• Optimize training and inference performance across the stack, delivering significant speedups through framework optimization, custom kernels, and system-level improvements.
• Design experiment infrastructure including automated evaluation pipelines, experiment tracking, and monitoring systems that enable rapid iteration.
• Scale infrastructure from single-node to multi-node distributed training and deploy production inference systems for real-time applications.
• Support researchers with fast turnaround on infrastructure issues and maintain high reliability across all systems.
• Collaborate with research scientists, data engineers, and leadership to define technical priorities and the infrastructure roadmap.

Skills
• Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience.
• 3+ years of experience with PyTorch and distributed training frameworks (DDP, FSDP, DeepSpeed, or similar).
• Experience training large-scale deep learning models at 1B+ parameters.
• Deep understanding of training optimization techniques including mixed precision, gradient checkpointing, and memory management.
• Proven ability to build production-grade ML infrastructure with high reliability.
• Track record of delivering significant performance optimizations in ML training or inference systems.
• Experience with custom kernel development (CUDA, Triton) or GPU optimization.
• Hands-on experience with large-scale pretraining (100B+ tokens, ideally trillion+ scale).
• Experience optimizing inference for production: quantization, vLLM, TensorRT, or custom serving engines.
• Familiarity with speech/audio ML systems and real-time inference constraints.
• Experience building automated evaluation frameworks and experiment tracking systems.
• Knowledge of profiling tools and multi-node training across 8-32+ GPUs.
• Exposure to job orchestration systems (SLURM, Kubernetes, Ray).
• Master's or PhD in Computer Science, Machine Learning, or related field.

Benefits
• Competitive base salary
• Performance-based bonus aligned with research and model milestones
• Equity participation
• Comprehensive health, dental, and vision coverage
• Flexible paid time off

Company Overview
• Aldea is a foundational AI company building next-generation voice and language models that power how people and software communicate. It is headquartered in Miami, FL, US, with a workforce of 11-50 employees.