Site Reliability Engineer or Service Availability Manager

Marriott Bonvoy

Location:
United States, Bethesda

Category:
IT - Administration

Contract Type:
Employment contract

Salary:

44.90 - 66.11 USD / Hour

Job Description:

The SRE Service Availability Manager plays a key role in ensuring the peak performance and availability of our Enterprise IT infrastructure and services. This position combines proactive site reliability engineering with adept incident command to lead our efforts in minimizing service disruptions and enhancing our technology landscape. With a focus on automation, cloud technologies, and continuous process improvement, the ideal candidate brings a mix of technical expertise and leadership skills, aimed at delivering exceptional service reliability. This role demands a proactive problem-solver with extensive experience in IT operations and a passion for innovation, ready to tackle challenges in a dynamic, 24x7x365 environment.

Job Responsibility:

  • Serve as Incident Commander during major incidents, leading response efforts to restore services and minimize impact on business and consumer operations
  • Design and implement automation tools to reduce manual intervention, improve system performance, and prevent incidents
  • Assess application architectures to identify key monitoring points and performance indicators
  • Develop and maintain comprehensive monitoring and alerting frameworks to detect and address anomalies before they escalate to incidents
  • Collaborate closely with development, operations, and support teams for continuous improvement of service reliability and incident response processes
  • Conduct thorough post-mortems to analyze incidents, identify root causes, and implement preventative measures to avoid recurrence
  • Effectively communicate incident status, impact, and post-incident reports to stakeholders at all levels of the organization
  • Stay informed on the latest industry trends, technologies, and practices in site reliability engineering and incident management.

Requirements:

  • 5+ years of experience in an information technology environment
  • 3 years of experience in information technology focused on IT Operations that include troubleshooting complex network, server, storage, and/or application issues
  • 2 years minimum operations experience involving incident, problem, change, and release management that included leading calls and documenting outcomes
  • Undergraduate degree or equivalent experience/certification
  • Ability to cover shifts and on-call responsibilities in a 24x7x365 environment
  • Proficiency in scripting languages (Python, Shell) and familiarity with automation tools (such as Ansible, Jenkins)
  • Experience with cloud platforms (AWS, Azure, GCP), infrastructure as code, and containerization technologies
  • Experience in incident command or incident management in a technology environment
  • Strong problem-solving, organizational, and analytical skills.

Nice to have:

  • ITIL Foundations v3+ Certification
  • Demonstrated experience with ITSM suites, e.g., ServiceNow
  • Demonstrated experience with various monitoring, performance, or capacity tools
  • Experience with continuous integration/continuous deployment (CI/CD) pipelines and DevOps practices
  • Familiarity with Site Reliability Engineering principles and concepts
  • Strong leadership qualities, including decisiveness and the ability to motivate teams, along with the ability to manage stressful situations calmly and effectively
  • Ability to create constructive relationships, influence, and communicate with varying levels of associates and management
  • Ability to solve complex, cross-functional issues
  • Strong knowledge of Server, Storage, Network, Middleware, Application and Cloud technologies
  • A high degree of curiosity and a drive to seek more efficient ways of delivering service.

What we offer:

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Health care flexible spending account
  • Dependent care flexible spending account
  • Life insurance
  • Disability insurance
  • Accident insurance
  • Adoption expense reimbursements
  • Paid parental leave
  • 401(k) plan
  • Stock purchase plan
  • Discounts at Marriott properties
  • Commuter benefits
  • Employee assistance plan
  • Childcare discounts.

Additional Information:

Job Posted:
March 22, 2025

Employment Type:
Full-time
Work Type:
Hybrid work