Senior AI Infrastructure Engineer, T-Mobile

Location:

United States: Bothell, Overland Park, Bellevue

Category:

IT - Software Development

Contract Type:

Employment contract

Salary:

113,600 - 205,000 USD / Year

Job Description:

This role is responsible for designing, deploying, and maintaining high-performance computing environments optimized for AI and machine learning workloads. It involves building scalable infrastructure, ensuring efficient workload management, providing self-service and on-demand tooling, and collaborating with teams to support AI-driven applications. The engineer will drive operational excellence and work with diverse hardware and software solutions to improve the performance and reliability of our on-premises AI/ML infrastructure.

Job Responsibilities:

Technical System Expertise: Understands system protocols, how systems operate, and how data flows
Technical Engineering Services: Drives engineering projects by actively contributing to the application of engineering techniques
Innovation: Contributes to designs that implement new ideas to improve existing and new systems, processes, and services
Technical Writing: Writes basic documentation on how technology works
Technical Leadership: Collaborates with technical teams and utilizes system expertise to deliver technical solutions
Technology Strategy: Contributes to new and existing technology options that support business goals

Requirements:

5+ years of technical engineering experience, preferably in multiple technology focus areas
Expert understanding of AI/ML infrastructure components or GPU-based systems, preferably in a high-availability, large-scale environment
Hands-on experience with NVIDIA DGX servers, BasePOD architectures, and advanced GPU technologies
Proficient in Linux/UNIX environments, including scripting/automation tools (Bash, Python, Ansible, Terraform)
Understanding of AI infrastructure security best practices
Experience with container orchestration (Kubernetes, Docker) and GPU workload management tools
Strong knowledge of networking (InfiniBand/Ethernet) and storage solutions in AI/ML contexts

Nice to have:

Understanding of CI/CD pipelines using tools such as Git, Artifactory, Jenkins, etc.
Experience with AI/ML pipelines (PyTorch, TensorFlow, RAPIDS AI, or other deep learning frameworks)
Experience with configuring and using monitoring tools (e.g., Prometheus, Grafana, NVIDIA DCGM)

What we offer:

Competitive base salary and compensation package
Annual stock grant
Employee stock purchase plan
401(k)
Access to free, year-round money coaches
Medical, dental and vision insurance
Flexible spending account
Paid time off
Paid holidays
Paid parental and family leave
Family building benefits
Back-up care
Enhanced family support
Childcare subsidy
Tuition assistance
College coaching
Short- and long-term disability
Voluntary AD&D coverage
Voluntary accident coverage
Voluntary life insurance
Voluntary disability insurance
Voluntary long-term care insurance
Mobile service & home internet discounts
Pet insurance
Access to commuter and transit programs

Additional Information:

Job Posted:

April 05, 2025

Employment Type:

Full-time

Work Type:

On-site work
