Company: Saxon Global
Location: Dallas
Closing Date: 03/11/2024
Hours: Full Time
Type: Permanent
Job Requirements / Description
Job Summary:
We are looking for a Site Reliability Engineer (SRE) who will be responsible for ensuring the reliability, availability, and performance of our production systems. As an SRE, you will work closely with cross-functional development and engineering teams to design and implement tools and processes that automate the deployment, observability, and troubleshooting of our applications and infrastructure supporting the deployment of new Android tablets to the stores.
This individual must be skilled in, and have professional experience with, the core functions of Site Reliability Engineering, including deployments, observability, monitoring, telemetry, and automation.
In your resume, please be sure to call out your experience in these areas and how your technical experience matches the requirements below.
Responsibilities:
Ensure the reliability, availability, and performance of our production systems as we scale
Develop and maintain monitoring and alerting systems to detect and respond to incidents in a timely manner
Occasionally support planned deployment rollouts that may require working off-hours during store closures (there is no on-call rotation)
Work with cross-functional teams to plan and execute scaling initiatives
Develop and maintain documentation of processes, procedures, and technical configurations
Requirements:
Strong written and verbal communication skills with peers, technical leads, project managers and product owners
Must be able to collaborate with customers and cross-functional teams to design, test, and validate deliverables that meet or exceed expectations
Well-organized, highly motivated self-starter
Bachelor's degree in Computer Science or related field
5+ years of experience as a Site Reliability Engineer
Strong experience with automation tools and experience with automation scripting in Python
Experience with containerization technologies such as Docker and Kubernetes
Experience with cloud platforms such as Azure or AWS
Experience with monitoring and logging tools such as Datadog, Prometheus, Grafana or Splunk
Strong understanding of networking, security, and systems administration
Excellent problem-solving skills and attention to detail
Must be available to work core hours PST.
Preferred qualifications:
Experience with distributed systems and supporting a large retail business
Experience with infrastructure as code tools such as Terraform or CloudFormation
Experience with CI/CD tools such as Jenkins
Experience with incident ticketing systems such as ServiceNow, and with Jira for tracking stories
Familiarity with Agile/Scrum methodologies and DevOps principles
If you are passionate about ensuring the reliability and availability of systems in our stores and enjoy collaborating with cross-functional teams to solve complex problems, we encourage you to apply for this exciting opportunity as an SRE.