Senior Site Reliability Engineer
Palo Alto, CA, United States
POSITION SUMMARY:
Velocity Global seeks a Senior Site Reliability Engineer (SRE) with extensive observability experience. In this role, you will help lead the automation and support of our cloud infrastructure, identify strategies to improve our full-stack telemetry and monitoring capabilities, and mentor other SREs who contribute to observability-related work.
SREs work cross-functionally with DevOps and Engineering teams, combining operations work with software engineering principles to enable high availability of production systems. You will serve as a partner to our Engineering organization to help make their services more performant, scalable, observable, and reliable. Every engineering team at Velocity Global should be responsible for the software they build. SREs are critical in providing the tools, practices, and expertise to make that happen.
We are growing and evolving the SRE team to help meet Velocity Global’s product-first reliability goals for 2023 and beyond.
Responsibilities include:
Automating observability and alerting across an ever-changing landscape of microservices
Automated Service Reliability Scorecards and Production Readiness Standards
Chaos Engineering and Game Day Simulations to discover and test fixes for weak spots that would otherwise not be identified until a real-life production incident occurred
Software engineering project work, proposed and driven by individual SRE team members, to remove operational bottlenecks and increase velocity in ways we've never considered before
Expand and improve our observability and monitoring footprint
Collaborate with the Engineering and DevOps teams to create architectural plans, define project requirements, and establish technical standards
Review the work of other team members, help them get unblocked, and provide mentoring
Address common operational challenges by building tools and automation scripts
Serve as the on-call incident commander to help debug and drive resolution of production reliability issues, contribute to the postmortem, and work to prevent recurrence
Participate in design and production reviews for new features, products, or infrastructure
Audit and tune the configuration of systems owned by other engineering teams
Plan for the growth of Velocity Global’s infrastructure and infrastructure reliability/resiliency
Designing and implementing High Availability architecture underlying Velocity Global’s platform
Creating Disaster Recovery solutions, including backups, redundant systems, and emergency response processes
This individual will report to the Manager, Site Reliability Engineering.
The team this role is a part of is primarily based out of the United States.
Qualifications/Skills
SREs combine some level of experience in both software engineering and operations. They may hail from various backgrounds and job titles, including production or application engineers, software developers with a strong DevOps mindset, SysAdmins with solid systems and programming skills, and Cloud Infrastructure or DevOps engineers. We are looking for someone with the following experience:
5+ years working in a relevant role, including 2+ years of technical leadership experience mentoring more junior engineers
3+ years of experience architecting and administering observability stacks, either managed or self-hosted (e.g., Datadog, New Relic, Prometheus, Elastic Stack/ELK, AppDynamics)
Solid experience and understanding of AWS cloud services
Operation of containerized microservices running on public cloud, asynchronous event processing, and databases
Strong understanding of Linux, GitLab, and CI/CD pipelines
On-call support of highly available production systems
Design and build new tools to automate repetitive tasks, prevent incidents, or improve TTR using an object-oriented programming language such as Python
Infrastructure as Code using tools like Terraform, Terragrunt, or CloudFormation
Understand how application components interact and contribute to architectural discussions
Unwavering commitment to operational security and best practices
Identify problems and propose solutions, then go out and implement them, from submitting a merge request on another team's repository to scoping out a new reliability project
Motivated to help other teams improve their service reliability through reviews, pair programming, hands-on training, and continuous improvement of tooling and services
In the spirit of winning together, the position will be based in Palo Alto and in-office collaboration is required for at least one day per week.
Our job titles may span more than one career level. The base pay depends upon many factors, such as training, transferable skills, work experience, business needs, and market demands. The base pay range is subject to change and may be modified. This role is eligible for annual performance-based bonuses, flexible time off, health care benefits, retirement savings, and employee incentive plans.
Pay Range
$140,300 to $172,000 USD
GO FARTHER WITH VELOCITY
At Velocity Global, we’re building a dream team made up of the world’s best talent. We’re looking for people like you to join us as we make opportunity borderless for people everywhere.
About Velocity Global
At Velocity Global, our values represent who we are and the company we want to be. We harness the power of unity, diversity, and collaboration, drive for impact, and win as a team - bringing our unique talents together to achieve our common goals. In partnership with our customers and ourselves, we are better together, and together, we win.
Please refer to our present benefits offering here.