Senior Principal Software Developer - Cluster Networks (JoinOCI-SDE)

Company:  Oracle
Location: Santa Clara
Closing Date: 09/11/2024
Salary: £125 - £150 Per Annum
Hours: Full Time
Type: Permanent
Job Requirements / Description

The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads. This is your opportunity to join the AI revolution and design systems that allow customers to scale from tens to thousands of GPUs without compromising on performance.

This team will be responsible for designing, developing, and performance-tuning the software and hardware stack required to run distributed AI/ML/HPC workloads across thousands of GPUs, leveraging libraries like NCCL on a high-performance network.

This is your opportunity to build innovative solutions for our customers from the ground up. These are exciting times: our team is still young, growing fast, and working on ambitious new initiatives. We are looking for adaptable, self-motivated engineers with the ability to learn quickly. You should be both a rock-solid developer and a distributed systems generalist, able to dive deep into any part of the stack and low-level systems, as well as design broad distributed system interactions. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited to learn.

Career Level - IC5


Basic Qualifications:

  • 10+ years of experience with software (systems/application) development
  • 2+ years of experience with collective communications libraries such as NCCL, RCCL, and MPI, and with GPU frameworks such as CUDA and ROCm
  • 2+ years of experience with ML training frameworks such as PyTorch or TensorFlow
  • Proficient at programming in any two of C/C++, Python, Java, Scala, and Go
  • Proficient with data structures, algorithms, operating systems
  • Excellent organizational, verbal, and written communication skills
  • Bachelor's degree in Computer Science and Engineering or a related engineering field

Preferred Qualifications:

  • Master's or PhD degree in Computer Science or a related engineering field
  • Experience with RDMA programming, including but not limited to GPUDirect RDMA
  • Experience with distributed workload managers like Slurm or K8s
  • Experience with Linux performance tools
  • Experience in SDN, NFV, Cloud Networking
  • Experience in Infrastructure-as-a-Service platforms such as OpenStack, AWS, GCP, or Azure
