AI Infrastructure & Systems Specialist

RM 5,000 - RM 8,000 / month

We are seeking a highly skilled and versatile AI Infrastructure & Systems Specialist to drive the deployment, orchestration, and management of our AI computing infrastructure. This role is critical in ensuring the smooth operation of our GPUaaS platform, optimizing AI workloads, and supporting customers in their AI adoption journey. The ideal candidate should have expertise in GPU systems, AI orchestration platforms, software engineering for AI, and technical leadership.

Key Responsibilities

1. GPU Systems Specialist

· Deploy, configure, and manage NVIDIA H100-powered GPU servers and networking infrastructure (InfiniBand).

· Optimize GPU performance for AI/ML workloads and troubleshoot hardware/software issues.

· Collaborate with NVIDIA and Dell teams for system optimization and technical support.

· Validate GPU fractions, and check and upgrade GPU device drivers and NVIDIA libraries as needed.

2. Orchestration & Virtualization Specialist

· Implement and manage Run:ai, Kubernetes, and NVIDIA MIG to optimize GPU resource allocation.

· Ensure seamless multi-tenant AI workload management and scaling strategies.

· Automate AI/ML pipeline orchestration for efficient resource utilization.

· Monitor the services exposed through Kubernetes, enable a GitOps model for AI/ML workflows, and ensure the K8s cluster runs optimally.

· Monitor cluster usage, GPU quotas, and storage utilization, and build regular performance reports to assess how well the K8s GPU stack is functioning.

3. AI Software Engineer

· Work with AI teams to enable model training and fine-tuning using PyTorch, TensorFlow, and RAPIDS.

· Develop and optimize AI/ML workflows in high-performance computing (HPC) environments.

· Integrate AI frameworks with cloud and on-prem GPU clusters.

4. Technical Manager

· Serve as the primary technical advisor to customers, ensuring seamless AI deployment.

· Conduct technical workshops, bootcamps, and onboarding sessions for users.

· Collaborate with universities, enterprises, and startups to drive AI adoption.

Requirements

  • Bachelor’s/Master’s degree in Computer Science, AI, Data Science, or related fields.
  • 3-5 years of hands-on experience in AI infrastructure, GPU computing, or cloud platforms.
  • Strong expertise in NVIDIA GPUs, Kubernetes, and workload orchestration.
  • Experience with AI model training, MLOps, and software engineering.
  • Excellent troubleshooting, automation, and scripting skills (Python, Bash, Terraform, etc.).
  • Strong communication and leadership skills to work with both technical and non-technical stakeholders.
  • Applicants must be willing to work at 3 Two Square, Petaling Jaya.

Nice-to-Have

  • Experience with Dell PowerEdge servers, the NVIDIA AI Enterprise (NVAIE) suite, and InfiniBand networking.
  • Knowledge of data engineering, storage solutions (Dell PowerScale), and AI analytics.
  • Prior experience in a tech leadership role or customer-facing AI solutioning.

Benefits:

  • ESOS
  • Overtime
  • EPF, SOCSO, and EIS
  • Annual leave of up to 30 days, medical leave, and medical claims
  • Staff purchase scheme
  • Product training
  • Overseas trips: Asia, America, Australia, Europe
  • Salary review twice per year

Job Types: Full-time, Permanent

Pay: RM5,000.00 - RM8,000.00 per month

Benefits:

  • Maternity leave
  • Opportunities for promotion
  • Professional development

Schedule:

  • Monday to Friday

Supplemental Pay:

  • Overtime pay

Application Question(s):

  • Are you able to work from Monday to Friday, 8.30am to 6.30pm?

Experience:

  • AI infrastructure and GPU computing: 3 years (Preferred)

Work Location: In person