Production Engineers at Covariant play a mission-critical role in ensuring our services' seamless operation and future scalability. In this role, you'll be embedded within our production and research teams, at the forefront of every significant engineering endeavor. As a production engineer, you will drive innovation and efficiency in our projects by applying your expertise in AWS, Docker, Kubernetes, Puppet, and Terraform to architect scalable and resilient infrastructure for our AI robotics systems.
AREAS OF FOCUS
- Own and orchestrate large GPU clusters across different cloud providers using infrastructure as code (IaC) and scripts to provide researchers with a single cohesive interface
- Help other teammates architect and build scalable tooling for our edge robot fleet
- Collaborate with brilliant researchers to evolve our training and inference tooling to be state-of-the-art
YOU WILL
- Design, build, manage and monitor the infrastructure we use to deploy our AI software and robotics solutions
- Develop and evolve software engineering and operational practices for the unique needs of distributed AI-powered cyber-physical systems
- Identify and establish healthy engineering and operational culture and processes
- Deliver previously impossible robotics capabilities that solve real needs for our partners and customers
- Collaborate with, learn from, and support a diverse and cross-functional team, including mechanical, electrical, and robotics engineers, AI/ML researchers, and business development
YOU HAVE
- Substantial previous experience in operating and automating production systems in both cloud and bare metal, deploying and administering Linux systems and/or wide-area networks, and building new tools and/or extending existing tools to add new capabilities
- A track record of accelerating developer productivity through improved tooling, automation, and education
- A track record of partnering with stakeholders to deliver solutions throughout the development process
- A solid foundation in Python, Linux, and networking
- Commitment to continuous learning and willingness to pick up new languages or technologies as needed, to solve real problems and deliver business impact
NICE TO HAVES
- Desire to work with a small collaborative team, with a high degree of autonomy and responsibility
- Motivation to work on challenging real-world engineering problems without prior solutions
- Excitement to join coworkers who strive to be inclusive, thoughtful, and down-to-earth
- Self-direction and enjoyment in figuring out which problem is the most important to work on
- Previous experience with one or more of the following: deploying client-side software, including protecting source code, establishing secure licensing, and performing release engineering; setting up and scaling developer tooling and CI/CD systems; building ML or IoT data pipelines that process images and metadata from live deployments; or managing high-bandwidth deep learning or supercomputing hardware
SAMPLE WEEK IN THE LIFE
- Monday: Start the week with a team meeting to discuss ongoing projects and explore potential collaborations. Resume work on the rollout of BigProxy v2 in the development environment, refining probing tests to enhance its reliability. Also, schedule a discussion with our Tailscale account representative to renew our contract.
- Tuesday: Address an urgent issue with the networking backplane of one of our GPU clusters not performing optimally. Conduct a troubleshooting session with the cluster provider to adjust the NCCL topology file, following unexpected changes on their end.
- Wednesday: Develop a new alert in Datadog to monitor the performance of the GPU cluster backplane, ensuring it is adaptable for use with various providers.
- Thursday: Collaborate with a colleague on deploying a PyPI server in our cloud infrastructure. Continue the implementation and testing of BigProxy v2, which was paused on Tuesday.
- Friday: Lead a presentation at the weekly engineering deep dive to discuss the features and potential rollout of BigProxy v2, which consolidates all connections from remote deployments to the cloud through a single channel and simplifies SSH access to GPU clusters outside AWS/GCP. Gather and incorporate feedback from the team to finalize the deployment strategy.
SALARY RANGE: $165,000 - $210,000 a year
Base pay is one element of our total rewards package, which may also include comprehensive benefits and equity, depending on eligibility. The annual base salary range for this position is $165,000 to $210,000. The actual base pay offered will be determined by factors such as years of relevant experience, skills, and education. Decisions are made on a case-by-case basis.