SRE Tech Lead, Hewlett Packard Enterprise

Location:

United States, Houston

Category:

IT - Administration

Contract Type:

Employment contract

Salary:

148,000 - 340,500 USD / Year

Job Description:

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our team and drive our technical agenda. You will play a key role in ensuring the reliability, scalability, and performance of our systems and those of our customers. As a Senior SRE, you will be responsible for influencing the design, implementation, and maintenance of robust infrastructure, automating operational tasks, and enhancing system observability. You will work closely with development, operations, and security teams to create resilient, high-performing systems that support business growth.

Job Responsibility:

Be an advocate for highly available, scalable, and resilient systems in cloud or hybrid environments
Work with development and support teams to improve implementations, delivering a better customer experience and lower operating costs
Define and manage Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) to ensure system reliability
Proactively identify performance bottlenecks and implement optimizations to improve system efficiency
Automate repetitive tasks and operational workflows, and champion automation across teams, to reduce toil and improve system efficiency
Audit, verify, and improve incident response procedures, including runbooks and post-incident reviews
Collaborate with security teams to ensure compliance with best practices in cloud security, access control, and vulnerability management
Mentor junior SREs and software engineers, fostering a culture of reliability and operational excellence
Work closely with development teams to build resilient applications with best-in-class reliability and performance
Advocate for SRE best practices across the organization, promoting a culture of shared responsibility for system reliability

Requirements:

12+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Operations, or Software Engineering
Experience with cloud platforms (AWS, Azure), hypervisors (VMware, KVM), and containers and container orchestration (Docker, Kubernetes)
Proficiency in programming/scripting languages such as Python, Go, or Bash
Hands-on experience with monitoring & logging tools (Prometheus, Grafana, ELK stack, OpsRamp)
Solid understanding of networking, security best practices, and Linux systems administration
Strong problem-solving skills and ability to troubleshoot complex distributed systems
Excellent communication skills and ability to work in a collaborative, distributed and multi-cultural team environment

Nice to have:

Experience with distributed systems, microservices architectures, and chaos engineering
Familiarity with machine learning-based anomaly detection for observability
Contributions to open-source projects or active participation in the SRE/DevOps community

What we offer:

Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
Career development programs
Inclusive work environment

Additional Information:

Job Posted:

March 20, 2025

Expiration:

June 30, 2025

Employment Type:

Full-time

Work Type:

Hybrid work
