Job ID
APN7446
Company Overview
We are a global leader in e-commerce, known for providing cutting-edge fashion and lifestyle products. We are committed to fostering an innovative and dynamic work environment, offering our employees opportunities for growth and collaboration across various teams. Our mission is to scale efficiently while maintaining seamless operations, ensuring our customers experience uninterrupted access to our services.
Job Summary
We are seeking a Site Reliability Engineer (SRE) with experience in large-scale, mission-critical environments that require zero downtime. As part of our SRE team, you’ll work on maintaining and optimizing a robust database infrastructure, leveraging automation to ensure reliability, performance, and security. You'll design scalable solutions that meet our expanding data and business needs. This is a hands-on role that requires collaboration across multiple teams and participation in an on-call rotation to support our production systems.
Responsibilities and Duties
- Collaborate with cross-functional teams to ensure the right toolsets are in place for generating, collecting, analyzing, and visualizing operational data, and for alerting on it.
- Own and operate critical open-source services, including Elasticsearch, Kafka, RabbitMQ, and Redis.
- Design and build tools that improve observability, system resiliency, and platform performance.
- Proactively manage and triage site availability incidents, working to minimize mean time to recovery (MTTR) for critical customer-impacting events.
- Partner with service owners to define and implement Service Level Metrics (SLMs) and Service Level Objectives (SLOs).
- Document technical processes, network diagrams, and runbooks to enhance efficiency and improve the reliability of the infrastructure.
- Participate in a 24/7/365 on-call rotation to ensure continuous system availability.
Qualifications and Skills
- Bachelor’s degree in Computer Science, Information Systems, or a related field (or foreign equivalent).
- Bilingual proficiency in Mandarin and English is required because of the company's global presence and the need for international communication.
- 4-7 years of experience with mission-critical, real-time, high-traffic applications in cloud environments.
- Proficiency in cloud systems, continuous integration, Java, SQL/NoSQL databases, and observability tools such as Grafana, Prometheus, or Zabbix.
- Experience in scripting/programming (Python, Go) and container technologies such as Docker, Kubernetes, or Mesos.
- Knowledge of open-source technologies (Elasticsearch, Kafka, Redis) is essential.
Benefits and Perks
- Competitive bonus and RSU offerings.
- Comprehensive healthcare (medical, dental, vision, prescription).
- Health Savings Account with employer contributions.
- Flexible Spending Accounts for healthcare and dependent care.
- Company-paid life and disability insurance.
- Voluntary benefits (Critical Illness, Accident, Hospital Indemnity).
- Employee Assistance Program (EAP) and Business Travel Accident Insurance.
- 401(k) plan with discretionary company match and financial advisory access.
- Generous paid time off (vacation, holidays, sick days, and floating holidays).
- Employee discounts and free weekly catered lunch.
- Dog-friendly office and gym access (in select locations).
- Free snacks, beverages, and company swag.
- Invitations to company events and annual holiday parties.