Manager, Site Reliability Engineering - Commerce Platforms
Job description
Company Background:
Genuine Parts Company (“GPC” or the “Company”), founded in 1928 and based in Atlanta, Georgia, is a leading specialty distributor engaged in the distribution of automotive and industrial replacement parts and value-added services. The Company operates a global portfolio of businesses with more than 10,000 locations across the world. GPC has approximately 50,000 global employees. The Company has operations in the United States, Canada, Mexico, Australia, New Zealand, Indonesia, Singapore, France, the U.K., Germany, Poland, the Netherlands, Belgium, Spain and China.
Position Purpose:
We are seeking a highly motivated, experienced Manager, Site Reliability Engineering to join the world’s leading distributor of automotive and industrial replacement parts and value-added services, operating 5,500+ locations and servicing more than 20,000 locations in the U.S. and Canada. This role will report to the Director, Platform Engineering.
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that GPC’s services, both internally critical and externally visible, have reliability and uptime appropriate to users' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on system capacity and performance.
The SRE team is responsible for driving service uptime and quality in 24x7 environments. The team will enhance observability, troubleshoot applications, support cloud-based transformations, develop tools and automation, and influence the design of the future platform architecture. Additionally, the team will develop support standards for all applications and adhere to those standards to provide the necessary level of production SLOs, SLIs, and SLAs.
This role supports multiple technology areas and requires partnerships with teams from multiple locations, skill sets, and backgrounds. As such, we are seeking a leader with strong communication skills in addition to a solid foundation of technical skills, analytical abilities, and end-to-end troubleshooting techniques.
Responsibilities:
- Lead a team of Software/Systems Engineers on projects for users and be directly responsible for uptime.
- Own end-to-end availability and performance of key services; track process efficiency and service availability using established Key Performance Indicators (KPIs)
- Proactively detect, monitor, and alert on stability or reliability issues, and manage incidents appropriately within defined SLAs
- Lead incident resolution and problem management
- Direct and manage escalation and resolution calls with members from various teams
- Communicate progress and resolution to appropriate stakeholders and leadership
- Conduct post-incident reviews, document findings, and take action on learnings
- Review incident trends, identify repeating issues, perform root cause analysis, and build automation to prevent problem recurrence. Automate response to all non-exceptional service conditions
- Lead by example, mentor the team and establish credibility through quality technical execution.
- Manage and optimize on-call rotations across continents, using a follow-the-sun model.
- Recommend application changes to improve application performance, reliability, and cost to operate
- Work with Engineering to transition applications from one platform to another
- Review existing processes and recommend changes or institute new processes as necessary, including observability, alerting, operations, engineering and system tuning, etc.
- Generate high-quality documentation detailing platform-to-application architectures and common patterns, including runbooks, SOPs, and the knowledge base
- Manage a highly technical employee base and ensure we maintain a high bar for performance and culture
Location:
- GPC has two work locations to choose from: the Duluth office or the Atlanta office.
- We offer a Flexible Work Policy that permits eligible employees to work remotely
Desired Qualifications & Experiences:
- 10+ years of relevant work experience in software engineering & technology
- At least 5 years’ experience in an SRE or very similar leadership role
- Deep expertise in the mentality, processes, and tools needed to deliver SRE principles
- Cloud Services experience with Google Cloud / Azure / AWS
- Experience with high-throughput / low-latency / highly available microservice-based architectures
- Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation.
- Architecture-level knowledge of Windows, Linux, and infrastructure systems
- Experience with production deployment, monitoring and operational support for enterprise-class applications
- Experience working with Continuous Integration/Continuous Deployment (CI/CD) tools
- Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning, performance monitoring
- A strong mix of software engineering and operations support skills.
- Eager to learn new technologies and platform patterns
- Strong customer service orientation with a focus on managing and exceeding customer expectations
- Degree in Computer Science or Engineering fields, or equivalent experience