HPC Infrastructure Engineer

Palo Alto, CA, United States

About Arc Institute

The Arc Institute is a new scientific institution that conducts curiosity-driven basic science and technology development to understand and treat complex human diseases. Headquartered in Palo Alto, California, Arc is an independent research organization founded on the belief that many important research programs will be enabled by new institutional models. Arc operates in partnership with Stanford University, UCSF, and UC Berkeley.

While the prevailing university research model has yielded many tremendous successes, we believe in the importance of institutional experimentation as a way to make progress. Arc's experiments include:

Funding: Arc will fully fund Core Investigators' (PIs') research groups, liberating scientists from the typical constraints of project-based external grants.

Technology: Biomedical research has become increasingly dependent on complex tooling. Arc Technology Centers develop, optimize and deploy rapidly advancing experimental and computational technologies in collaboration with Core Investigators.

Support: Arc aims to provide first-class operational, financial, and scientific support that will enable scientists to pursue long-term, high-risk, high-reward research that can meaningfully advance progress in disease cures, including neurodegeneration, cancer, and immune dysfunction.

Culture: We believe that culture matters enormously in science and that excellence is difficult to sustain. We aim to create a culture that is focused on scientific curiosity, a deep commitment to truth, broad ambition, and selfless collaboration.

Arc scaled to nearly 100 people in its first year. With $650M+ in committed funding and a state-of-the-art new lab facility in Palo Alto, Arc will continue to grow quickly to several hundred people in the coming years.

About the position

We are seeking an HPC Infrastructure Engineer to join our Software Infrastructure team, working a hybrid on-site weekly schedule at our facility in Palo Alto. In this role, you will be responsible for administering and optimizing our High-Performance Computing (HPC) cluster orchestrated by Slurm. You will work closely with researchers, developers, and IT professionals to ensure the availability, reliability, and performance of our HPC infrastructure. Your work will fuel the development of biological foundation models (e.g., Evo, Arc's recently released DNA foundation model), the Virtual Cell Initiative, and other cutting-edge bioinformatics projects in the context of Institute-wide efforts.

About you

You lead with empathy. You know that successful systems are more about the user than the tool. You enjoy building relationships and credibility with your colleagues.

You enjoy solving problems. Any new project is an interesting puzzle. So is a tricky troubleshooting issue. You get satisfaction from helping someone get to resolution.

You’re curious. You like to keep track of the latest developments in your field, and to learn about the substance behind your employer’s mission.

In this position you will

Manage and maintain the Slurm-based HPC cluster, ensuring high availability and performance.

Monitor system performance, identify bottlenecks, and implement optimizations.

Develop and implement strategies for system automation and configuration management.

Troubleshoot and resolve hardware, software, and network issues.

Collaborate with researchers and developers to understand their computational needs and provide appropriate resources and support.

Perform regular system updates, patches, and security enhancements.

Manage user access, quotas, and job scheduling policies.

Develop and maintain documentation for system configurations, procedures, and policies.

Participate in on-call rotations to provide high-availability support for critical issues.

Be based at our Palo Alto facility working a hybrid on-site schedule.

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field.

Proven experience in administering HPC clusters and managing Slurm or similar workload managers (Kubernetes, Grid Engine, Torque, etc.).

Strong knowledge of Linux operating systems (CentOS, Ubuntu, etc.).

Experience with configuration management tools such as Ansible, Puppet, or Chef.

Proficiency in scripting languages like Python, Bash, or Perl.

Familiarity with network protocols, storage systems, and high-speed interconnects (InfiniBand, Ethernet).

Experience with monitoring tools like Nagios, Prometheus, or Grafana.

Proficiency in software installation, configuration, and development (make, bazel, gcc, gdb, conda, pip).

Experience developing and maintaining software that interacts with NVIDIA GPUs, including drivers and diagnostic tools (CUDA, nvcc, NCCL, etc.).

Understanding of security best practices and experience implementing security measures.

Excellent problem-solving skills and the ability to work under pressure.

Strong communication and collaboration skills.

Ability to work hybrid on-site at our facility in Palo Alto.

The base salary range for this position is $122,250 - $146,050. These amounts reflect the range of base salary that the Institute reasonably would expect to pay a new hire or internal candidate for this position. The actual base compensation paid to any individual for this position may vary depending on factors such as experience, market conditions, education/training, skill level, and whether the compensation is internally equitable, and does not include bonuses, commissions, differential pay, other forms of compensation, or benefits. This position is also eligible to receive an annual discretionary bonus, with the amount dependent on individual and institute performance factors.


