Software Engineer, SystemML - Scaling / Performance

Menlo Park, CA, United States

In this role, you will be a member of the Network.AI Software team and part of the larger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and is on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta Production goes through the software stack the team owns. At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the team's focuses is building customized features, software benchmarks, performance tuners and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI/LLM training) from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI/LLM scaling reliability and performance.

Software Engineer, SystemML - Scaling / Performance Responsibilities

Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling

Minimum Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.

Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch).

Preferred Qualifications

PhD in Computer Science, Computer Engineering, or relevant technical field

Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand

Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow

Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel

Experience in AI framework and trainer development for accelerating large-scale distributed deep learning models

Experience in HPC and parallel computing

Knowledge of GPU architectures and CUDA programming

Knowledge of ML, deep learning and LLMs
