About Company
Etisalat by e& is a global technology and investment group headquartered in Abu Dhabi, United Arab Emirates, committed to driving digital transformation and connectivity across the region and beyond. For over four decades, e& has been at the forefront of the telecommunications industry, evolving into a diversified technology group. Our portfolio now spans digital services, infrastructure, cybersecurity solutions, and B2B and B2C offerings serving millions of customers. We aim to anticipate future technology demands and to provide a reliable, secure digital experience for all. At e&, we cultivate an inclusive, forward-thinking work environment built on innovation, collaboration, and continuous learning. Join our team and help shape the future of digital connectivity and technology.
Job Description
We are seeking an experienced Senior Platform Reliability Engineer to join our team in Umm Al Quwain. In this role, you will be responsible for the availability, performance, and scalability of our mission-critical digital platforms and services. You will champion Site Reliability Engineering (SRE) principles and apply engineering best practices to our operations, with a focus on automation, proactive monitoring, and resilient incident management. The ideal candidate has a strong understanding of complex distributed systems and modern cloud-native architectures, along with a passion for operational excellence. You will work closely with our development, operations, and product teams to design, implement, and maintain resilient systems that withstand failures and scale under varying loads. Your expertise will be essential in diagnosing complex technical issues, implementing preventative measures, and continuously improving infrastructure and application reliability to meet or exceed industry benchmarks. This is an opportunity to shape the stability and performance of services used by millions.
Key Responsibilities
- Architect, implement, and maintain highly scalable, reliable, and performant infrastructure and application components.
- Develop and deploy monitoring, alerting, and logging solutions that detect issues early and support rapid resolution.
- Lead and manage critical incident response, conduct thorough root cause analysis (RCA), and facilitate comprehensive post-mortem processes to prevent recurrence.
- Automate operational tasks, streamline deployment pipelines, and optimize infrastructure provisioning using advanced configuration management tools and scripting.
- Collaborate extensively with development teams to embed reliability, scalability, and performance considerations into the early design phases of new features and services.
- Define, implement, and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for key services.
- Manage, optimize, and secure cloud infrastructure (e.g., AWS, Azure, GCP) for cost-efficiency, performance, and a robust security posture.
- Participate actively in on-call rotations to provide expert support for critical production systems and troubleshoot complex, high-priority issues.
- Drive continuous improvement initiatives focused on enhancing system reliability, operational efficiency, and developer productivity across the organization.
Required Skills
- Minimum of 5 years of progressive experience in Site Reliability Engineering (SRE), DevOps, or a closely related platform engineering role.
- Demonstrated strong proficiency in major cloud platforms (AWS, Azure, or GCP), including extensive experience with compute, networking, storage, and database services.
- Expert-level capability in scripting and automation languages such as Python, Go, or Bash.
- Extensive hands-on experience with containerization technologies, particularly Docker and Kubernetes.
- Strong proficiency with CI/CD pipelines and associated tools (e.g., GitLab CI, Jenkins, Argo CD, Spinnaker).
- Deep understanding of monitoring and alerting systems (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- Solid knowledge of Linux operating systems, networking protocols (TCP/IP, DNS, HTTP), and system internals.
- Proven experience with infrastructure as code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Exceptional problem-solving, analytical, and debugging skills applied to complex distributed systems.
Preferred Qualifications
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a closely related technical field.
- Relevant professional cloud certifications (e.g., AWS Certified DevOps Engineer – Professional, Azure DevOps Engineer Expert, GCP Professional Cloud DevOps Engineer).
- Extensive experience with large-scale microservices architectures, event-driven systems, and message queues (e.g., Kafka, RabbitMQ).
- Familiarity with database administration and optimization for both SQL and NoSQL databases.
- In-depth knowledge of cybersecurity best practices, compliance standards (e.g., ISO 27001), and vulnerability management.
- Proven track record of mentoring junior engineers, leading technical projects, and fostering a culture of technical excellence.
Perks & Benefits
- Highly competitive, tax-free salary package.
- Comprehensive health and wellness insurance coverage for employees and eligible dependents.
- Generous annual leave and public holidays.
- Exceptional professional development opportunities, including sponsored certifications and training programs.
- Access to cutting-edge technologies and participation in highly impactful, innovative projects.
- A vibrant, multicultural, and collaborative work environment that encourages growth.
- Robust employee wellness programs and initiatives.
- Relocation assistance for international candidates, where applicable.
How to Apply
Interested candidates are encouraged to apply through the link below. Please make sure your resume is up to date and clearly highlights your experience with SRE principles, major cloud platforms, and automation. We look forward to reviewing your application and exploring how your expertise can contribute to our work.
