Head of System and Operation

  • Full Time, onsite
  • Millennium Technology Services Sdn Bhd
  • Kuala Lumpur, Malaysia
RM 10,000 - RM 12,999 per month

Position Overview:

We are a GPU computing service provider seeking a technical management professional with a strong technical background and proven management capabilities. The ideal candidate is a technical expert who can lead and manage a small, high-level technical team of fewer than 10 members, primarily composed of mid-level technical staff. This role requires hands-on involvement in key technical tasks and focuses on the construction and operation of compute center infrastructure, emphasizing the design and management of the underlying architecture. It is not a position dedicated to optimizing infrastructure for LLM models or accelerating AI training and inference communications. Familiarity with NVIDIA GPUs and InfiniBand is a strong plus, but not a mandatory requirement.

Job Requirements:

1. Team Management and Hands-On Capability
  • Proven experience managing small to medium-sized teams and leading complex projects.
  • Must have hands-on capabilities to delve into technical details and solve real-world problems.

2. Experience and Background
  • Must have practical operational experience and the ability to execute tasks independently.
  • Must have hands-on experience deploying and managing x86 server Kubernetes clusters with a minimum scale of 1,000 servers within the last 1-2 years.
  • Experience with deploying NVIDIA GPU clusters is preferred.

3. Technical Expertise
  • Technical skills are prioritized over management capabilities.
  • Proficiency in English and technical expertise are equally important, enabling international communication and collaboration.
  • Must be familiar with RoCE technology.
  • Familiarity with the following technical areas is preferred: InfiniBand, data center design.
  • Experience with operating Tier 3 (T3) or higher data centers (IDCs) is preferred.
  • Proficient in writing and maintaining operational scripts (Python preferred).

Key Responsibilities:

  • Oversee the deployment, management, and optimization of compute center infrastructure and server clusters.
  • Participate deeply in data center design and drive the implementation of technical architecture.
  • Write and maintain operational scripts (Python preferred) to ensure efficient system performance.
  • Resolve complex technical issues and provide guidance to the team.
  • Act as the core technical expert within a small team, ensuring the efficient and stable operation of compute center infrastructure.