Who We Are
At The Trade Desk, we recognize that a seamless customer experience is driven by operational excellence. To keep improving the reliability of our platform, we are establishing a global Systems Operations team. This team's core mission will be to vigilantly monitor The Trade Desk platform services, refine our incident response methodologies, and deliver a robust and highly available customer experience. If you're passionate about ensuring system reliability, improving processes, and making a meaningful impact on customers, we invite you to play a critical role in this next evolution of our on-call experience.
What You'll Do
- Act as a technical expert and advisor to more junior Associate Systems Operations Engineers
- At an escalated tier, monitor the state and stability of platform services via telemetry and alerts; triage issues and escalate to engineering teams as needed
- Work collaboratively with development teams to facilitate issue remediation
- Manage remediation task workflow
- Proactively update and improve Systems Operations documentation and runbooks
- Increase the effectiveness of the incident response process by defining and measuring relevant metrics
- Provide periodic weekend coverage, which may be required
Who We Are Looking For
- Bachelor’s degree from a four-year university or equivalent relevant experience
- 6+ years of relevant work experience in Technical and/or Application Support with strong knowledge of services support and troubleshooting
The Systems Operations Engineer will either possess or be excited to learn a number of skills:
Technical Proficiency:
- Understanding of large-scale distributed system architectures (e.g., databases, web services, application services).
- Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios).
- Ability to configure and fine-tune alerts.
- Proficiency in, or ability to learn, programming languages including C# and SQL.
Incident Management and Troubleshooting:
- Ability to prioritize and manage incidents based on severity, with a focus on customer impact.
- Ability to remain calm under pressure and quickly diagnose issues.
- Understanding of system logs, metrics, and telemetry.
Communication Skills:
- Ability to communicate effectively with stakeholders during an incident.
- Clear and concise documentation skills.
- Ability to maintain and update troubleshooting guides (TSGs) and operational documentation.
- Ability to explain complex technical issues and platform outages to non-technical stakeholders.
Automation & Scripting:
- Ability to automate repetitive tasks.
- Proficiency in scripting languages (e.g., Python, Bash) is a plus.