Job Description
HPC Systems Engineer
Remote - but must be local to DC Metro area for some onsite meetings
Work with a GPU-focused 4,000+ core HPC cluster and a 1,500+ core HPC cluster, supporting the hardware and operating system environments
Ability to translate technical concepts in HPC and research computing for scientists and other non-technical personnel
Ability to determine meaningful metrics and usage data for leadership
Supporting bioinformatics applications for a large and diverse research community with needs in genomics, cryo-electron microscopy, and AI/ML
Monitor the portfolio of software applications and be proactive in planning upgrades and license renewals
Monitor and report on cluster performance and generate data to show usage and trends
Triage support requests from the research community and work with others in the Scientific Infrastructure team to resolve issues and complete service requests
Collaborate with researchers to guide them in effective use of the HPC resources, such as job scheduler submission, data formats, and building data workflows
Engage with researchers to understand their HPC needs, including data life cycle management, integration of scientific instruments with HPC, storage capacity, and compute requirements
Provide input to the Scientific Infrastructure team leader for setting priorities for cluster operations, scheduling policies, resources needed, etc.
Attend and actively participate in daily standup meetings to provide progress updates, discuss obstacles, and coordinate tasks with other team members
Education:
BS/BA (or equivalent)
Required Experience:
Five years of related experience
Required Technical Skills:
Minimum of five years of experience with servers, datacenters, networking, and related technologies
Minimum of five years of experience managing Linux systems
Experience with the Spack package manager, including creating packages from PyPI, R, and GitHub sources
Experience installing and packaging GPU applications and optimizing job submission scripts used for ML model training, data mining operations, or high-resolution graphics rendering
Experience with Python scripting
Experience using Git distributed workflows
Experience with Ansible for managing system configuration
Experience with Terraform for provisioning systems