We are looking for a Principal Site Reliability Engineer (SRE) with 10-plus years of industry experience to join our Sovereign Cloud Operations team. This role is responsible for ensuring the reliability and availability of our sovereign cloud production systems and for driving automation and tooling enhancements for our operators. To achieve this, you will work closely with the Oracle Cloud Infrastructure service team and our Operability Improvement organization to maintain a high level of system hygiene and to identify and address potential issues that affect the experience of our cloud customers.
We are looking for a candidate who is passionate about operations and willing to take ownership of our systems' performance. You should be comfortable working in a fast-paced environment and able to quickly identify and address issues. You must be a strong collaborator who builds solid partnerships across the business to deliver outcomes for our customers. Experience with cloud infrastructure architecture and interaction is required to succeed in this role.
Primary Responsibilities:
- Serve as a technical leader for OCI cloud services across the operations teams servicing sovereign realms.
- Deep dive into complex customer issues and assist customer support, sovereign cloud operators, and customer account managers in resolving them.
- Decompose operational issues impacting sovereign cloud operators’ efficiency and help facilitate solutions.
- Collaborate with the Operability Improvement organization to drive tooling and automation to improve change safety and reduce operator toil.
- Provide rapid ad hoc solutions (e.g., scripting/coding) that deliver near-term operational improvements as a stop-gap measure while long-term solutions are developed.
- Establish yourself as a technical leader and operational champion for the sovereign cloud operator. Passion and love for operations, as an engineering discipline, are essential to success in this role.
Qualifications:
- U.S. Citizenship Required.
- Bachelor’s degree or higher in Computer Science or a related field.
- 10+ years of SRE/DevOps experience (operations-focused).
- Experience operating services in one of the major clouds, such as AWS, OCI, or Azure.
- Knowledge of or experience working with government clients to deliver IT services.
- Strong knowledge of cloud infrastructure, distributed systems, and network architecture.
- Proven track record of supporting large, complex, scalable systems/applications in an agile environment.
- Knowledge of change management, continuous integration, and deployment best practices.
- Strong problem-solving and troubleshooting skills, with the ability to analyze complex systems and identify areas for improvement.
- Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.
- Proficiency in scripting or programming languages like Python, Go, or Bash.
- Experience with automation and configuration management tools like Terraform, Ansible, or Chef.
- Familiarity with monitoring and alerting tools such as Prometheus or Grafana.
- Ability to adapt to a fast-paced, dynamic environment and manage multiple tasks and priorities effectively.
Career Level - IC5