ML Infrastructure Service Reliability Engineer - Apple Services Engineering
Lex
Software Engineering, Other Engineering, Data Science
Bengaluru, Karnataka, India
Posted on May 22, 2026
Summary
At Apple, we don’t just build products — we create transformative experiences
that have reshaped entire industries. Our innovation is driven by the diversity of
our people and their ideas, inspiring everything we do. Imagine the impact you
could make. Join Apple and help us leave the world better than we found it.
The ML Infrastructure team is responsible for managing Apple’s largest ML
compute platform, multi-cloud storage abstraction and caching platform, which
supports critical machine learning training workloads that power user-facing
features across the Apple ecosystem. Operating across both first-party and
third-party cloud environments brings complex and unique challenges.
As a Site Reliability Engineer (SRE) on the ML Infrastructure team, you’ll be
expected to address these challenges through a strong foundation in cloud
object storage, data analysis, automation, collaboration, and advanced
expertise in Kubernetes. Our team oversees the full infrastructure stack — from
low-level nodes to the complete network architecture — ensuring our platform
remains highly available, resilient, and efficient at scale.
Description
We are seeking an experienced Software and Systems Engineer to join our
dynamic team. This role demands a proactive mindset, technical excellence,
and a collaborative spirit.
The Ideal Candidate Will Demonstrate
Strong critical thinking and a high degree of individual accountability
Effective communication and collaboration skills
A genuine passion for Infrastructure as a Service (IaaS)
A commitment to automation and operational efficiency
Ownership of projects from design through delivery
A solutions-oriented approach, coupled with the ability to gain alignment on
technical direction
Consistent and timely execution of design implementations aligned with
project objectives
The ability to provide constructive technical feedback, fostering team-wide
growth and continuous improvement
Responsibilities
- Participates in a rotating on-call schedule, including occasional weekend
  coverage when necessary
- Currently headquartered in Cupertino, with active expansion in Bangalore to
  support global operations across time zones
- Leverages a diverse stack including open-source tools, commercial solutions,
  and internally developed systems
- Encourages open dialogue, values strong ideas, and recognizes impactful
  results
- 5+ years of experience building, operating, and scaling a large application
  in a private, public, or hybrid cloud environment
- Deep expertise in Kubernetes, with hands-on experience using platforms such
  as Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service
  (EKS)
- Proficient in designing, developing, and releasing code in languages such as
  Python, Go, or Rust
- Practical experience with object storage technologies, including Amazon S3
  or Google Cloud Storage (GCS)
- Strong background in designing and troubleshooting complex networking
  issues in both public and private cloud infrastructures
- Solid understanding of Linux internals, standard networking protocols, and
  distributed systems architecture
- Proven drive to automate manual operations and enhance processes through
  continuous iteration
- Strong understanding of best practices for deploying large-scale, distributed
  applications
- Hands-on experience managing diverse system environments using
  configuration management tools or software delivery platforms such as
  Spinnaker, Helm, or Flux
- Demonstrated expertise in deploying, supporting, and monitoring both new
  and existing services, platforms, and application stacks
- Solid familiarity with container orchestration and management using
  Kubernetes
Learn about accessibility in Apple’s workplace
Role Number: 200664425-0321