Hardware Systems Engineer
San Francisco, CA, United States
About the department
Cloudflare’s Infrastructure group is responsible for building our global network. Our Hardware Engineering team helps research, develop, test, and deploy new equipment enabling 20% of the world’s Internet traffic to be served smoothly. Deployed across 285 cities in 100+ countries, the hardware we select helps improve the security, reliability, and performance of the Internet.
About the Role
We need to make thoughtful infrastructure choices affecting a significant portion of the Internet. Hardware we work with includes servers, routers, switches, optical equipment, power distribution units, cables, optics, and more. As a Hardware Systems Engineer, you will work with colleagues on the Hardware Engineering, Product, and Hardware Sourcing teams to troubleshoot and maintain Cloudflare’s worldwide fleet of storage and compute servers.
What you'll do
Develop and maintain automation tools to update firmware on servers and components in Cloudflare’s fleet
Work with software teams to validate bug fixes and performance of new firmware revisions
Test and deploy firmware updates to the fleet, monitoring the progress of the rollout for compliance and reliability
Work with server and component vendors to obtain, debug, and maintain the latest updates
Work with our Site Reliability Engineering teams to triage bug reports
Support our Data Centre Engineering teams in resolving hardware issues
Communicate your results and updates through blog posts, internal talks, and tickets
Examples of desirable skills, knowledge and experience
Bachelor’s degree in Computer Engineering, Electrical Engineering, or Computer Science
Desire to learn about the Cloudflare hardware used by almost 20% of all websites
Desire to learn how a diverse server fleet is managed at scale
Desire to learn the tools Cloudflare uses to maintain and monitor our hardware
Knowledge of PXE booting
Knowledge of configuration management; in particular, we use Salt to manage our fleet
Knowledge of Redfish, IPMI, and other server remote management protocols
Knowledge of running mission-critical production systems
Bonus Points
Familiarity with server hardware architecture
Knowledge of debugging server hardware faults and the ability to engage with our sourcing team and vendors to improve quality
Experience of managing large fleets comprising thousands of servers
Experience of observability and monitoring tools such as Prometheus and Grafana, and the ability to observe trends over time
Experience scripting and programming, in particular Python and Bash
Experience with software development tools and processes such as Git, Bitbucket, TeamCity, and Jira