Senior Systems Engineer (High Performance Computing) at Paige
Paige is a software company helping pathologists and clinicians make faster, more informed diagnostic and treatment decisions by mining decades of data from the world’s experts in cancer care. We are leading a digital transformation in pathology by leveraging advanced Artificial Intelligence (AI) technology to create value for the oncology clinical team.
We are the first company to develop clinical-grade AI tools for the pathologist, which resulted in our receiving FDA Breakthrough Device designation for our first product. Paige has also received FDA clearance for our digital viewer, FullFocus™. We have also established multiple relationships with biopharma companies, laboratories, and equipment manufacturers that enable Paige to develop an ecosystem ready to help patients receive better diagnoses and treatment.
We’re seeking an experienced Senior Systems Engineer (HPC) to administer and support our High Performance Computing cluster. You will work closely with engineering and data management teams on cutting-edge technologies.
This is an extraordinary opportunity to be part of a high-performing team and to pursue a life-changing mission with unique technical challenges!
Responsibilities
- Design, plan, test, and implement innovative hardware architectures for an HPC environment
- Implement, support, and provide technical guidance for engineering team initiatives and projects
- Build automation for infrastructure provisioning, configuration management, and account access (emphasis on SaltStack)
- Install, provision, and support a complex Cisco Nexus HPC switching environment (RoCE)
- Own the design, structure, and maintenance of Pure Storage and Qumulo enterprise network-attached storage (NAS) systems
- Regularly evaluate and recommend new tools and technologies for use in existing and future clusters
- Deploy patches and updates to operating systems and application software
Required Skills and Experience
- Master’s degree in Computer Science, Engineering, Information Systems, or a related field, or equivalent years of experience
- 8+ years of experience in a systems engineering role
- Deep knowledge of server components (CPU, SSD, GPU, networking)
- Deep knowledge of High Performance Computing (HPC) / cluster technologies with high-speed interconnect fabrics using Ethernet/RoCE and InfiniBand
- Expert knowledge of SAN and NAS services (iSCSI, NFS, CIFS)
- Expert knowledge of TCP/IP networking, network security, and DNS (BIND, Windows)
- Expert knowledge of Linux (Ubuntu, CentOS), common UNIX services, and shell scripting
- Strong understanding of high-speed HPC interconnects
- Strong knowledge of parallel GPU computing, MPI, and RDMA within containerized environments
- Strong knowledge of the NVIDIA software environment (NCCL, NGC, GPU tools)
- Strong experience operating and administering workload schedulers such as Slurm, LSF, or PBS
- Strong knowledge of virtualization technologies such as KVM/libvirt/QEMU
- Experience working with configuration management tools like SaltStack, Chef, or Puppet
- Working knowledge of Kubernetes and Docker containers within an on-prem HPC cluster
- Understanding of data pipelines, including ETL and the streaming of log or tool/sensor data into indexes (EMR)
- Understanding of cloud platforms and services, particularly AWS
- Understanding of Jupyter Notebook technology
- Understanding of CI/CD pipelines
- Understanding of Agile development methodologies