Job description
Hybrid Schedule; Onsite at either Needham or Marlborough Data Centers when needed
General Summary/Overview Statement
Summarize the nature and level of work performed
Mass General Brigham, Division of Research Computing, Enterprise Research Infrastructure and Services (ERIS http://rc.partners.org) is immediately seeking a candidate for the position of High-Performance Computing (HPC) System Engineer. Our goal is to support the computational and analytics requirements of our user community of over 3000 academic researchers and clinicians at Mass General Brigham member hospitals. Within the ERIS Scientific Computing team, this position is responsible for supporting users of our Linux HPC clusters. The role is challenging and varied, requiring technical, interpersonal and problem-solving abilities.
The successful candidate will be responsible for supporting research and clinical workloads, scientific applications and data sources in the computing cluster environment, and for assisting the user community in its use. The ERISOne/ERISTwo platform provides numerous scientific applications and large-memory systems, as well as the ERISXdl NVIDIA GPU Linux cluster, which is equipped for machine learning applications. The candidate will be an expert in key technical areas of Linux system administration, scientific software installation, and monitoring and troubleshooting software-related performance issues. Extensive experience using comparable computing clusters is essential. The role also involves expanding the implementation of solutions for the different pipelines and workloads of hundreds of groups across the MGB system, and working with the Information Security and Privacy teams and the HPC Systems Operations Administrator to ensure that systems and procedures adhere to organizational security standards, values, HIPAA, GCDP and other guidelines.
The position requires deep knowledge of the technologies used by the Scientific Computing department, versatility and breadth in technical skills, a demonstrated ability to work independently, effective communication with team and management, outstanding customer service skills, and the ability to help teams and projects successfully accomplish their research objectives. Ideal candidates thrive on variety and innovation in their daily work, on interaction with customers who are world-renowned leaders in their scientific fields, and on working with a wide range of technologies in a decentralized, non-standard (academic) environment.
ERIS enables and supports the highly successful and innovative research programs of the largest teaching hospitals in the nation, namely Massachusetts General Hospital (MGH), Brigham and Women’s (BWH), McLean and Spaulding Rehabilitation Hospitals, with their more than 3200 grant-sponsored programs in the biomedical sciences, from basic to clinical and applied research.
Principal Duties and Responsibilities
Indicate key areas of responsibility, major job duties, special projects and key objectives for this position. These items should be evaluated throughout the year and included in the written annual evaluation.
- Cluster and Systems Administration: Manage and administer production systems used by researchers and Research Centers.
- Ansible Automation: Refactor code to deploy and maintain systems and applications using Ansible templates.
- Analyze the results of server monitoring and implement changes to improve performance, processing and utilization. Propose, maintain and enforce policies, practices and security procedures.
- Work with users to deploy required applications, including Docker/Singularity applications.
- Analyze and resolve customer and technical problems, such as cluster scheduling parameter tuning, memory/CPU contention, and scientific application compilation and run-time issues.
- Develop and maintain system documentation as well as user-facing knowledge base articles and how-to guides.
- Use the Mass General Brigham values to govern decisions, actions and behaviors. These values guide how we get our work done: Patients, Affordability, Accountability & Service Commitment, Decisiveness, Innovation & Thoughtful Risk; and how we treat each other: Diversity & Inclusion, Integrity & Respect, Learning, Continuous Improvement & Personal Growth, Teamwork & Collaboration.
- Evaluate, select and deploy hardware and/or cloud solutions for research scientific computing. This includes CPU- and GPU-based compute, high-speed networking and data storage.
- Perform other duties as required by the situation and circumstances.
- Work comfortably within an Agile team (Scrum).
Qualifications (MUST be realistic, neither overstated nor understated, and related to the essential function of the job).
- BA/BS degree in engineering, a quantitative field or system administration required, or an equivalent combination of skills/experience.
- Minimum of 5 years of experience in systems administration of Linux environments for a scientific domain, including NVIDIA GPU implementations.
- 3+ years of experience with automation and configuration management using Ansible.
- 3+ years of experience with Docker and Kubernetes.
- A combination of education and experience may be substituted for requirements.
Skills/Abilities/Competencies Required (MUST be realistic, neither overstated nor understated, and related to the essential function of the job).
- Demonstrated ability to administer up to several hundred Linux servers in an on-premises environment.
- Hands-on experience writing and maintaining Ansible code.
- Strong skills writing Linux shell scripts (Bash).
- Experience with monitoring software such as Open XDMoD or Prometheus.
- Experience with server deployment technologies (kickstart, PXE, IPMI).
- Understanding of DHCP, DNS, TCP/IP, NFS, SMB and HTTP network protocols.
- Strong verbal and written communication, ability to write clear technical documentation.
- High level of initiative and eagerness to learn new technologies.
- Familiarity with information technology security and data privacy considerations applicable to a healthcare environment is advantageous.
- Knowledge of HPC job scheduling platforms like LSF or Slurm.
- Experience with Git and Jira.
- Ability to multitask and prioritize work requirements, keeping team and management informed.
- Experience with Kerberos authentication.
- Experience providing support to research investigators with diverse computing needs.
Working Conditions
Describe the conditions in which the work is performed
- Standard office environment.
- Travel to remote buildings required, consisting of onsite work around the Massachusetts General/Brigham and Women’s/McLean Hospitals campuses and the MGH Data Centers at Needham and Marlborough.
- As projects and priorities dictate, may be required to work occasional non-standard hours to support major projects.