Staff Site Reliability Engineer

Company:  Moloco, Inc.
Location: Seattle
Closing Date: 23/10/2024
Salary: £150 - £200 Per Annum
Hours: Full Time
Type: Permanent
Job Requirements / Description

About the Role

Moloco is a machine learning company that operates at massive scale (we ingest 10 petabytes of training data per day), builds blazingly fast models (they return predictions in 10 milliseconds or less), and is a profitable unicorn (we are valued at $2 billion and have been profitable for the last 17+ quarters).

We are looking for an exceptional Senior Site Reliability Engineer to help us build a state-of-the-art machine learning (ML) model serving infrastructure for our mobile advertising platform. You will be part of an engineering team that manages the infrastructure that serves deep neural network ML models to clients and the CI/CD infrastructure that deploys infrastructure updates in real time. The team also develops infrastructure tools and platforms that improve the productivity of other engineering teams.

We are looking for someone who is passionate about solving infrastructure problems with software engineering skills and who brings a desire to grow and learn new technologies, a love of working in collaborative teams, and a commitment to customer service.

What you'll do

  • Partner with engineering teams to drive company-wide adoption of shared infrastructure and standard methodologies
  • Contribute to technical direction and decisions across the organization by conducting and leading research with other technical leaders
  • Own traditional SRE and operational support areas such as tooling and automation, monitoring, workflow management, maintaining and improving data pipelines, and CI/CD
  • Actively participate in and contribute to code reviews and technical design documents to identify performance and reliability bottlenecks.
  • Partner with and support other engineering teams with operational guidance and expertise on various project initiatives.
  • Participate in capacity planning and scaling
  • Ensure that Moloco's services are delivered with high performance and can handle viral traffic spikes.
  • Collaborate with others in SRE and SWE to leverage tools, processes and techniques to improve service reliability.
  • Reduce business risk in areas such as infrastructure and configuration management, provisioning, capacity modeling and planning, and incident handling, mitigation, root cause analysis, and post-mortems.
  • Identify common patterns in the challenges of operating services in production, and work with others in SRE and SWE to design and implement reusable solutions and/or other cross-functional work that reduces the complexity, difficulty, cost, and risk of operating the business.

What you’ll need to succeed

  • Hands-on experience working with GCP or other cloud platforms (e.g. AWS, Azure)
  • Practical, proven knowledge of a high-level language (e.g. Go, Python)
  • Experience working with infrastructure-related software (e.g. Kubernetes, Helm, Terraform)
  • Experience developing infrastructure, configuration, and deployment scripting and automation for large-scale, high-complexity services in a microservices environment
  • At least 5 years of experience in large-scale software development
  • Passion for operational excellence and the ability to thrive in an environment that demands extremely high levels of customer support
  • Tenacious problem solver who takes ownership of issues from end-to-end to full resolution