Senior Engineer, Site Reliability Engineering (SRE)

Company: Balyasny Asset Management L.P.
Location: Chicago
Closing Date: 26/10/2024
Salary: £150 - £200 Per Annum
Hours: Full Time
Type: Permanent
Job Requirements / Description

We are looking for a Senior Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up.

As a Senior Site Reliability Engineer within the Platform group, you will lay the groundwork for our SRE infrastructure. Your role will entail driving standards and fostering adoption across our technology teams, whilst closely partnering with our DevOps and Cloud teams.

With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

As a Site Reliability Engineer at BAM, you will:

  • Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.
  • Create and maintain thorough documentation for SRE processes, systems design, and incident post-mortems to foster a culture of learning and improvement.
  • Drive adoption of SRE principles across various technology teams, acting as a mentor and advisor to embedded SREs.
  • Implement end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and AWS CloudWatch, ensuring high visibility into application performance and infrastructure health.
  • Utilize and build standards around Sentry for application monitoring and error tracking to proactively identify and address reliability issues.
  • Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost and reliability.
  • Develop automation and tooling to improve efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.
  • Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices like blameless post-mortems and service level objectives (SLOs).
  • Conduct regular reviews of infrastructure and application metrics, logs, and traces to spot and address potential issues before they affect customers.
  • Introduce a reliability-by-default approach to software delivery.

Core Tech Stack:

  • Languages: Python, Java, Node.js, C#, Shell
  • Public cloud: AWS
  • CI/CD: TeamCity, Octopus, Jenkins
  • Configuration Management: Puppet, Ansible
  • Infrastructure as Code: Terraform, CloudFormation
  • Application Management: Kubernetes, Docker, Helm
  • OS: Linux and Windows
  • Observability: Prometheus, Amazon CloudWatch, Sentry, Grafana, Loki

To be considered a good cultural fit, you must be:

  • An ambitious self-starter
  • Hungry to learn
  • Driven towards success
  • A very strong and efficient communicator
  • Able to multi-task and excel in a fast-paced trading environment
  • A problem solver; able to develop quick and sound solutions to complex problems

To be considered a good fit, you must have:

  • 5+ years of experience in SRE or similar roles within complex, distributed systems environments.
  • A Bachelor’s degree in engineering, computer science, information systems, or equivalent experience
  • Proficiency with key SRE technologies such as Prometheus, Grafana, Loki, AWS CloudWatch, and Sentry.
  • Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.
  • Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.
  • Proven ability to script in languages like Python, Bash, or Go, to automate routine tasks and deployment pipelines.
  • Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.
  • Excellent troubleshooting and problem-solving skills, with a systematic approach to handling unexpected situations.
  • High level of initiative, passion for reliability engineering, detail orientation, and follow-through capabilities.
  • Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.
  • Experience with immutable infrastructure, infrastructure automation and provisioning tools, such as AWS CloudFormation or Terraform
  • Strong knowledge of Linux administration, particularly RHEL and CentOS
  • Strong knowledge of distributed systems concepts, including best practices and troubleshooting
  • Knowledge of Windows Server administration and automation with PowerShell
  • Operational understanding of networking concepts, architecture, and best practices, especially as they relate to hybrid cloud integration
  • Analytical skills – ability to troubleshoot, logically assess problems, and determine solutions
  • Detailed documentation skills – ability to represent ideas, requirements, reference architecture and problems in clear, concise, and business-friendly documents

Bonus points for:

  • Experience in a high-throughput, low-latency environment
  • Experience with successful SRE team build-outs
  • Experience with security patterns and distributed authentication
  • Experience managing high-pressure incident response
  • Experience with Chaos Engineering technologies
  • Contributions to open source libraries, projects, or communities
  • Any AWS, Azure, or GCP resource specializations or certifications
  • Any Kubernetes resource specializations or certifications

Don’t have all the skills listed above? Have extra skills you think are important that we haven’t thought of? Please, let us know by applying and telling us a bit more about yourself and why you think you’re qualified!
