
Senior Site Reliability Engineer - Storage Platform

Santa Clara, CA, United States

Site Reliability Engineering (SRE) is an engineering discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and bring expertise in domains such as systems, networking, storage, coding, database management, capacity management, and continuous delivery and deployment, as well as open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Their responsibilities include ensuring reliable storage solutions, managing data efficiently, and providing related services that support the overall stability and performance of production systems.

SRE at NVIDIA ensures that our internal and external-facing GPU cloud services deliver the reliability and uptime promised to users, while enabling developers to change existing systems through careful preparation and planning, with an eye on capacity, latency, and performance. SRE is also a mindset and a set of engineering approaches to running and optimizing production systems. Much of our software development focuses on eliminating manual work through automation, performance tuning, and growing the efficiency of production systems. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to tackle a broad spectrum of problems.

Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to product quality and to interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem-solving, and openness is important to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.

What You Will Be Doing:

- Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
- Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows that are otherwise hard to understand.
- Work closely with peers on the team to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, including by leveraging machine learning models.
- Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Be part of an on-call rotation to support production systems.

What We Need To See:

- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- At least 5 years of practical experience.
- Experience with algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
- Experience with one or more of the following: C/C++, Java, Python, Go, Perl, Ruby, or AI/ML frameworks and methodologies.
- Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
- Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic stack.

Ways to stand out from the crowd:

- A demonstrated SRE mindset, a customer-first approach, and a passion for customer satisfaction and success.
- Experience with Git, code review, pipelines, and CI/CD.
- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Strong debugging skills with a systematic problem-solving approach to identifying complex problems.
- Thriving in collaborative environments and enjoying work with various teams.
- Experience using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker.
- Flexibility in adapting to different working styles.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and talented people on the planet working for us. If you're creative and autonomous, we want to hear from you!

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
