SRE Tech Lead

Hewlett Packard Enterprise

Location:
United States, Houston

Category:
IT - Administration

Contract Type:
Employment contract

Salary:
148,000.00 - 340,500.00 USD / Year

Job Description:

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our team and drive our technical agenda. You will play a key role in ensuring the reliability, scalability, and performance of our systems and those of our customers. As a Senior SRE, you will be responsible for influencing the design, implementation, and maintenance of robust infrastructure, automating operational tasks, and enhancing system observability. You will work closely with development, operations, and security teams to create resilient, high-performing systems that support business growth.

Job Responsibilities:

  • Be an advocate for highly available, scalable, and resilient systems in cloud or hybrid environments
  • Work with development and support teams to improve implementations, delivering a better customer experience and lower operating costs
  • Define and manage Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs) to ensure system reliability
  • Proactively identify performance bottlenecks and implement optimizations to improve system efficiency
  • Automate repetitive tasks and operational workflows, and drive their automation across teams, to reduce toil and improve system efficiency
  • Audit, verify, and improve incident response procedures, including runbooks and post-incident reviews
  • Collaborate with security teams to ensure compliance with best practices in cloud security, access control, and vulnerability management
  • Mentor junior SREs and software engineers, fostering a culture of reliability and operational excellence
  • Work closely with development teams to build resilient applications with best-in-class reliability and performance
  • Advocate for SRE best practices across the organization, promoting a culture of shared responsibility for system reliability

Requirements:

  • 12+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, Operations, or Software Engineering
  • Experience with cloud platforms (AWS, Azure), hypervisors (VMware, KVM), and containers and container orchestration (Docker, Kubernetes)
  • Proficiency in programming/scripting languages such as Python, Go, or Bash
  • Hands-on experience with monitoring & logging tools (Prometheus, Grafana, ELK stack, OpsRamp)
  • Solid understanding of networking, security best practices, and Linux systems administration
  • Strong problem-solving skills and ability to troubleshoot complex distributed systems
  • Excellent communication skills and ability to work in a collaborative, distributed, and multicultural team environment

Nice to have:

  • Experience with distributed systems, microservices architectures, and chaos engineering
  • Familiarity with machine learning-based anomaly detection for observability
  • Contributions to open-source projects or active participation in the SRE/DevOps community

What we offer:

  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • Career development programs
  • Inclusive work environment

Additional Information:

Job Posted:
March 20, 2025

Expiration:
June 30, 2025

Employment Type:
Full-time

Work Type:
Hybrid work