Senior Engineer, Site Reliability Engineering (SRE)

Company: Balyasny Asset Management L.P.

Location: Chicago

Closing Date: 26/10/2024

Salary: £150 - £200 Per Annum

Hours: Full Time

Type: Permanent

Job Requirements / Description

We are looking for a Senior Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up.

As a Senior Site Reliability Engineer within the Platform group, you will lay the groundwork for our SRE infrastructure. Your role will entail driving standards and fostering adoption across our technology teams, whilst closely partnering with our DevOps and Cloud teams.

With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

As a Site Reliability Engineer at BAM, you will:

Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.
Create and maintain thorough documentation for SRE processes, systems design, and incident post-mortems to foster a culture of learning and improvement.
Drive adoption of SRE principles across various technology teams, acting as a mentor and advisor to embedded SREs.
Implement end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and AWS CloudWatch, ensuring high visibility into application performance and infrastructure health.
Utilize and build standards around Sentry for application monitoring and error tracking to proactively identify and address reliability issues.
Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost and reliability.
Develop automation and tooling to improve efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.
Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices like blameless post-mortems and service level objectives (SLOs).
Conduct regular reviews of infrastructure and application metrics, logs, and traces to spot and address potential issues before they affect customers.
Introduce a reliability-by-default approach to software delivery.

Core Tech Stack:

Languages: Python, Java, Node.js, C#, Shell
Public cloud: AWS
CI/CD: TeamCity, Octopus, Jenkins
Configuration Management: Puppet, Ansible
Infrastructure Code: Terraform, CloudFormation
Application Management: Kubernetes, Docker, Helm
OS: Linux and Windows
Observability: Prometheus, Amazon CloudWatch, Sentry, Grafana, Loki

To be considered a good cultural fit, you must be:

An ambitious self-starter
Hungry to learn
Driven towards success
A very strong and efficient communicator
Able to multi-task and excel in a fast-paced trading environment
A problem solver; able to develop quick and sound solutions to complex problems

To be considered a good fit, you must have:

5+ years of experience in SRE or similar roles within complex, distributed systems environments.
A Bachelor’s degree in engineering, computer science, information systems, or equivalent experience
Proficiency with key SRE technologies such as Prometheus, Grafana, Loki, AWS CloudWatch, and Sentry.
Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.
Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.
Proven ability to script in languages like Python, Bash, or Go, to automate routine tasks and deployment pipelines.
Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.
Excellent troubleshooting and problem-solving skills, with a systematic approach to handling unexpected situations.
High level of initiative, passion for reliability engineering, detail orientation, and follow-through capabilities.
Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.
Experience with immutable infrastructure, infrastructure automation and provisioning tools, such as AWS CloudFormation or Terraform
Strong knowledge of Linux administration, particularly RHEL and CentOS
Strong knowledge of distributed systems concepts, including best practices and troubleshooting
Knowledge of Windows Server administration and automation with PowerShell
Operational understanding of networking concepts, architecture, and best practices, especially as they relate to hybrid cloud integration
Analytical skills: ability to troubleshoot, logically assess problems, and determine solutions
Detailed documentation skills: ability to represent ideas, requirements, reference architectures, and problems in clear, concise, and business-friendly documents

Bonus points for:

Experience in a high-throughput, low-latency environment
Experience with successful SRE team build-outs
Experience with security patterns and distributed authentication
Experience managing high-pressure incident response
Experience with Chaos Engineering technologies
Contributions to open source libraries, projects, or communities
Any AWS, Azure, or GCP resource specializations or certifications
Any Kubernetes resource specializations or certifications

Don’t have all the skills listed above? Have extra skills you think are important that we haven’t thought of? Please, let us know by applying and telling us a bit more about yourself and why you think you’re qualified!
