Systems Reliability Engineer - SME
Herndon, VA, United States
Job Description
HTS (iNovex) was built on the principle that people matter first and foremost. We believe in providing a strong work/life balance by investing in our employees and encouraging professional and personal growth. We do this by offering exceptional benefits, flexible schedules, and the tools necessary to achieve success through paid training, mentoring, and the opportunity to work alongside top-notch technical professionals.
Your effort and expertise are crucial to the success of this impactful mission: improving, protecting, and defending our Nation's security through Systems Engineering, Network Engineering, Systems Integration, and Software Engineering & Development.
We are looking for an experienced Systems Engineer/Site Reliability Engineer (SRE) to join our technology-based program supporting a key Government customer. The Systems Engineer/SRE provides subject matter expertise and guidance to IT developers during the software development life cycle, overseeing the development, testing, and implementation of technical solutions and determining whether those solutions meet defined requirements. The SRE may also provide Agile DevOps support to mission-critical systems, and may have the opportunity to build strong systems, software, and cloud environments and to provide operations and maintenance for critical systems.
The candidate will provide technical expertise and support in the design, development, implementation, and testing of customer tools and applications. Working within a DevOps framework, the candidate will participate in and/or direct major project deliverables through all aspects of the software development life cycle, including scope and work estimation, architecture and design, coding, and unit testing.
The Systems Engineer will support the team in the following activities (including but not limited to):
Ensuring reliability, getting systems back to steady-state as quickly as possible
Eliminating toil, automating wherever possible
Driving better cross-team collaboration
Gaining full visibility into IT systems and services for system health
Identifying system deficiencies and recommending solutions
Developing Service Level Indicators (SLI) for IT systems and services
Developing Service Level Objectives (SLO) for IT systems and services
Developing Service Level Agreements (SLA) for IT systems and services
Maintaining and continuously improving processes, standards, policies, working methods, and tools
Ensuring appropriate tools and processes are in place so that development and production environments are reliable and reproducible
Ensuring tool configuration consistency across Development, Testing, Integration, and Production environments
Participating in ongoing production support and end-user support
Researching, understanding, and developing with new technologies and standards as needed
Evaluating the interface between hardware and software, operational requirements, and the characteristics of the overall system
Required Education, Experience, & Skills:
A minimum of sixteen (16) years of relevant experience with a Bachelor's or Master's degree
Knowledge of Incident Management, including organizing Incident Response Teams, communicating with stakeholders, and devising strategies for resolving incidents
Good understanding of how the incident response role is structured and of incident response concepts, in order to automate the complex processes required for rapid, effective incident resolution
Knowledge of SLOs, to help Operations define, provide, and improve SLOs that set specific reliability targets for IT systems
CI/CD implementation expertise
Scripting to automate tasks and extract information, on both the front end and the back end, in languages such as MS PowerShell, Python, JavaScript, Ruby, PHP, etc.
Ability to efficiently and appropriately estimate work effort requirements
Ability to communicate effectively through written and verbal methods
Ability to handle multiple tasks and meet deadlines
Ability to work independently and in a team environment
Able to adapt to a constantly changing environment
Aptitude and willingness to work with a variety of new and emerging technologies and tools as opportunities demand
Ability to deliver enhanced functionality, aid with new implementations, and provide continuous support within the scheduled time while preserving system integrity
High degree of initiative, creativity, and technical ability to function on this team
Ability to identify issues and implement corrective actions
Preferred Education, Experience, & Skills
The ideal candidate would also have IT project management experience, and be familiar with Scrum, Lean, Agile and DevOps.
Experience with Java, Ruby, DevOps and DevSecOps
Knowledge of IT Operations Management (ITOM) software, meaning the tools needed to manage the provisioning, capacity, performance, and availability of computing, networking, and application resources, as well as the overall quality, efficiency, and experience of their delivery
Understanding of Quality Assurance and Test Automation for software pre-deployment
Good understanding of DevOps concepts and best practices
Issue troubleshooting experience
Understanding of Networking concepts
Linux/Unix Concepts
ServiceNow knowledge, including developing products using JavaScript and other coding applications
Database Administration (Oracle or MySQL) experience
We're an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.