Sr. SRE / Kubernetes Engineer

Company: Stratitech Services LLC
Location: San Francisco
Closing Date: 29/10/2024
Hours: Full Time
Type: Permanent
Job Requirements / Description

Job Title: Sr. SRE / Kubernetes Engineer

Location: San Francisco, CA (Hybrid, 2 days in-office). Must be currently local to the SF Bay Area.


About the Role:

StratITech is seeking a Sr. Site Reliability Engineer / Kubernetes Engineer for our client based in San Francisco, CA. This is a full-time position offering competitive pay and stock options. We are only accepting applicants who are US Citizens or US Permanent Residents/Green Card Holders. No C2C or third-party applications will be considered. Applicants must be local to the SF Bay Area; we do not offer relocation.


In this hybrid role, you will work in-office two days a week as part of a dynamic team responsible for deploying, managing, optimizing, and upgrading the systems that support innovative software solutions.


The successful candidate must be excited about working in an interrupt-driven startup environment. The ideal candidate will be passionate about learning new technologies, solving complex problems, and embracing Infrastructure as Code (IaC) to automate infrastructure processes. Your role will involve collaborating closely with team members to address architectural challenges and ensure the reliability and efficiency of the client’s cloud infrastructure.


Implementation is key in this role, as you’ll be directly responsible for turning ideas into reliable and scalable solutions.


Key Responsibilities:

  • Cloud Operations: Apply DevOps principles to provide technical operational support, including production support, for cloud infrastructure serving internal and external customers.
  • Tool Development & CI/CD: Write CI/CD pipelines from scratch and build tools that support internal platforms, improving stability, reliability, and efficiency.
  • Feature Flags & Modifications: Implement and manage feature flags, enabling or modifying features as necessary to support platform flexibility and customer requirements.
  • Troubleshooting: Diagnose and resolve complex system problems across the entire technology stack, including CI/CD pipelines, container-based systems, networking, operating systems, cloud resources, and databases. Must have very strong troubleshooting skills.
  • Monitoring & Alerting: Implement and manage monitoring and alerting infrastructure for critical services, ensuring stability and performance across all platform components.
  • Automation & Runbooks: Create, revise, and test operational runbooks and automation scripts to maintain infrastructure efficiently and securely.
  • Operational Innovation: Proactively seek opportunities for innovation to enhance operational processes, increasing reliability, availability, and performance while promoting a security-first culture.
  • On-Call Support: Participate in an on-call rotation (7am-7pm, 7 days a week, rotating every three weeks) to support 24/7 operations and ensure system availability.
  • Documentation: Author technical documentation for designs, workflows, processes, and best practices.
  • External Customer Focus: Provide direct support for external customer requirements, ensuring that solutions align with customer needs and expectations.
  • Quality & Security: Embody a Quality-first & Security-first culture in all that you do.


Must-Have Requirements:

  • 5+ years of experience with Azure (or AWS/GCP) for cloud infrastructure.
  • Strong experience with Terraform for infrastructure automation.
  • Strong experience with Kubernetes in production.
  • Proficiency in Helm for managing Kubernetes applications.
  • 5+ years of coding experience in Python.
  • Experience using Infrastructure as Code (IaC) and CI/CD tools like FluxCD, Jenkins, Terraform, or GitHub.
  • Strong experience with Linux operating systems.
  • Solid working knowledge of networking (TCP/IP, DNS) and cloud infrastructure performance.
  • Operational experience with monitoring/alerting systems such as Sentry, Opsgenie, or Prometheus.
  • Must have production operations and client-facing experience.
  • Willingness to mentor junior team members and contribute to technical documentation for workflows and best practices.
  • Hands-on problem-solver with the ability to balance risk and impact to customers.


These skills are a plus:

  • Experience with elements of the current tech stack: FluxCD, Prometheus, Elasticsearch, Java, Kafka, Postgres, and Jenkins.
  • Previous experience or a keen interest in industrial IoT, analytics, or manufacturing.
