Site Reliability Engineer GenAI

Company: Publicis Sapient
Location: Irving
Closing Date: 18/10/2024
Salary: £100 - £125 Per Annum
Hours: Full Time
Type: Permanent

Job Description

The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, scalability, and availability of services across cloud and on-prem platforms, with a focus on OpenShift and Grafana. The role combines expertise in automation, observability, and infrastructure management to optimize resource allocation and maintain service uptime. The ideal candidate will have experience working with both cloud (GCP) and on-prem environments, particularly in managing AI/ML and GPU-based workloads.

Responsibilities:

  • Automation & Scripting: Use tools like Ansible and Python to automate provisioning, monitoring, and scaling tasks. Create reusable automation scripts for efficient infrastructure management.
  • Observability & Monitoring: Set up Grafana dashboards and Prometheus alerts to track service health, uptime, and performance metrics across platforms.
  • Infrastructure Management: Deploy and manage applications on OpenShift or other Kubernetes-based platforms, ensuring efficient application lifecycle management.
  • Platform & Service Monitoring: Implement and automate monitoring for both cloud and on-prem environments, ensuring compliance with SLA requirements.
  • Capacity Planning & Resource Management: Monitor and optimize GPU and CPU utilization, ensuring resources are allocated efficiently across workloads.
  • Collaboration & Sprint Planning: Participate in Agile/Scrum sprint planning, collaborating with other teams to ensure tasks are delivered on time and aligned with service-level objectives.
  • Process Automation: Automate manual processes such as resource requests, tenant onboarding, and lifecycle management for AI/ML platforms and other workloads.

Qualifications:

  • Strong experience with automation tools like Ansible and Python scripting for infrastructure management.
  • Proficiency in Grafana and Prometheus for monitoring and setting up alerting mechanisms.
  • Hands-on experience managing applications in OpenShift or other Kubernetes-based platforms.
  • Ability to automate service monitoring and infrastructure scaling in both cloud and on-prem environments, ensuring SLA compliance.
  • Experience with infrastructure management for cloud (GCP) and hybrid environments.
  • Experience with infrastructure as code (IaC) tools (Terraform).

Additional Information:

  • Flexible vacation policy; time is not limited, allocated, or accrued.
  • 16 paid holidays throughout the year.
  • Generous parental leave and new parent transition program.
  • Tuition reimbursement.
  • Corporate gift matching program.

Base Pay Range: USD 75,000 - 146,000 (varies depending on experience).

The range shown represents a grouping of relevant ranges currently in use at Publicis Sapient. Actual range for this position may differ, depending on location and specific skillset required for the work itself.

As part of our dedication to an inclusive and diverse workforce, Publicis Sapient is committed to Equal Employment Opportunity without regard for race, color, national origin, ethnicity, gender, protected veteran status, disability, sexual orientation, gender identity, or religion. We are also committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, you may contact us at or you may call us at +1-617-621-0200.
