| SRE with GPU : TX at Dallas, Texas, USA |
| Email: [email protected] |
|
http://bit.ly/4ey8w48 https://jobs.nvoids.com/job_details.jsp?id=2096570&uid=
From: Chandra N, Siri Info [email protected]
Reply to: [email protected]
Role name: Engineer
| Role Description:
- This SRE role will primarily involve learning GPU clusters, assisting in bringing up these systems, and developing automation to keep them operational, as well as working with various other DC GPU teams to incorporate requirements and address any issues on the systems.
- Specific responsibilities:
  o Working with the platform engineering team to develop and automate management of an infrastructure control plane and deployment system for GPU clusters.
  o Working with the release engineering team to automate the application of updates and system configuration management tools.
  o Resolving problem tickets reported by internal and external customers for GPU cluster systems.
  o Developing and enhancing internal and 3rd party network and cluster management tools, applications, and processes that enable internal teams and clusters to build, test, and optimize high performance networks supporting large scale GPU clusters.
  o Assisting in developing the software ecosystem needed for at-scale cluster operations, providing cluster as a service for internal and customer access systems. This responsibility includes some involvement with rack-and-stack data center operations, at-scale software install and configuration management, and at-scale system provisioning.
  o Helping to build and operate an on-prem cloud service for internal stakeholders that forms a model for customer adoption.
  o Helping to create an enterprise class operational model for internal cluster systems that provides reliable, secure, automated infrastructure for rapid response to changing requirements, efficient use of assets, and a reference template for customer adoption.
  o Participating in a strong customer-centric culture focused on meeting commitments.
| Competencies: Digital : Python, Digital : Site Reliability Engineering (SRE)
| Experience (Years): 8-10
| Essential Skills:
Job Description
The SRE will be responsible for helping to create and automate processes that bring up deployed GPU cluster systems and keep them running. This position will be focused on the operational aspects of large-scale GPU-accelerated AI and HPC cluster systems. The SRE will work closely with CPE and DCOps teams as internal and external systems are brought up for customers.
Experience & Qualifications
- 10+ years of experience in high performance networks, platform hardware, firmware, and system management solutions at scale.
- Strong Linux admin knowledge and skills around installation, configuration, package management, and system management across multiple OS distributions. Related skill in system performance tuning at user and kernel mode is a plus.
- Experience with virtualization and containerization, including systems like KVM, Docker, Podman, OpenShift, and Kubernetes.
- Strong experience with system automation and configuration management at scale using tools like Ansible, Salt, Chef, Puppet, Bash, and Python.
- Experience working with dev teams developing and maintaining our CI/CD pipeline development environment.
- Experience using common industry tools to fix software issues and automate operational processes.
- Strong networking knowledge
| Desirable Skills: Same as Essential Skills.
| Country: United States
| Branch | City | Location: TCS - Dallas, TX Plano Plano, TX
| Keywords: continuous integration, continuous deployment, artificial intelligence, Texas |
| 04:56 PM 21-Jan-25 |