Company Description

Publicis Sapient is a digital transformation partner helping established organizations get to their future, digitally enabled state, both in the way they work and the way they serve their customers. We help unlock value through a start-up mindset and modern methods, fusing strategy, consulting and customer experience with agile engineering and problem-solving creativity. United by our core values and our purpose of helping people thrive in the brave pursuit of next, our 20,000+ people in 53 offices around the world combine experience across technology, data sciences, consulting, and customer obsession to accelerate our clients’ businesses through designing the products and services their customers truly value.

Job Description

The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, scalability, and availability of services across cloud and on-prem platforms, with a focus on OpenShift and Grafana. The role combines expertise in automation, observability, and infrastructure management to optimize resource allocation and maintain service uptime. The ideal candidate will have experience working with both cloud (GCP) and on-prem environments, particularly in managing AI/ML and GPU-based workloads.

Responsibilities:

• Automation & Scripting: Use tools like Ansible and Python to automate provisioning, monitoring, and scaling tasks.
Create reusable automation scripts for efficient infrastructure management.
• Observability & Monitoring: Set up Grafana dashboards and Prometheus alerts to track service health, uptime, and performance metrics across platforms.
• Infrastructure Management: Deploy and manage applications on OpenShift or other Kubernetes-based platforms, ensuring efficient application lifecycle management.
• Platform & Service Monitoring: Implement and automate monitoring for both cloud and on-prem environments, ensuring compliance with SLA requirements.
• Capacity Planning & Resource Management: Monitor and optimize GPU and CPU utilization, ensuring resources are allocated efficiently across workloads.
• Collaboration & Sprint Planning: Participate in Agile/Scrum sprint planning, collaborating with other teams to ensure tasks are delivered on time and aligned with service-level objectives.
• Process Automation: Automate manual processes such as resource requests, tenant onboarding, and lifecycle management for AI/ML platforms and other workloads.

Qualifications

• Strong experience with automation tools like Ansible and Python scripting for infrastructure management.
• Proficiency in Grafana and Prometheus for monitoring and setting up alerting mechanisms.
• Hands-on experience managing applications in OpenShift or other Kubernetes-based platforms.
• Ability to automate service monitoring and infrastructure scaling in both cloud and on-prem environments, ensuring SLA compliance.
• Experience with infrastructure management for cloud (GCP) and hybrid environments.
• Experience with infrastructure as code (IaC) tools (Terraform).

Additional Information

• Flexible vacation policy; time is not limited, allocated, or accrued
• 16 paid holidays throughout the year
• Generous parental leave and new parent transition program
• Tuition reimbursement
• Corporate gift matching program

Base Pay Range: USD 75,000 - 146,000 (varies depending on experience)

The range shown represents a grouping of relevant ranges currently in use
at Publicis Sapient. Actual range for this position may differ, depending on location and specific skillset required for the work itself.

As part of our dedication to an inclusive and diverse workforce, Publicis Sapient is committed to Equal Employment Opportunity without regard for race, color, national origin, ethnicity, gender, protected veteran status, disability, sexual orientation, gender identity, or religion. We are also committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, you may contact us at or you may call us at +1-617-621-0200.