Software Engineer, SystemML - Scaling / Performance

Menlo Park, CA, United States

In this role, you will be a member of the Network.AI Software team and part of the bigger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and is on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta Production goes through the SW stack the team owns. At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the team's focuses is building customized features, SW benchmarks, performance tuners and SW stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI/LLM training) from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI/LLM scaling reliability and performance.

Software Engineer, SystemML - Scaling / Performance Responsibilities

Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling

Minimum Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.

Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch).

Preferred Qualifications

PhD in Computer Science, Computer Engineering, or relevant technical field

Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand

Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow

Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel

Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models

Experience in HPC and parallel computing

Knowledge of GPU architectures and CUDA programming

Knowledge of ML, deep learning and LLMs

