Site Reliability Engineer

Full Time, onsite
TECH SOLUTIONS & DEVELOPMENT SDN BHD
Kuala Lumpur, Malaysia

RM 10,000 - RM 12,999 / month


Job Scope and Responsibilities

- Serve as the primary point of responsibility for the overall health, performance, and capacity of the Great Eastern VMware Cloud Foundation platform.
- Function well in a fast-paced, rapidly changing environment where issues need to be resolved quickly.
- Experience with VMware virtualisation is a must (vSphere, NSX-T, vSAN, VCF, vROps, vRNI, vRLI).
- Experience using vROps, vRNI and vRLI for troubleshooting and analysis of incidents.
- Understanding of NSX-T configuration and of using NSX-T for incident troubleshooting.
- Knowledge of and ability to use the NSX-T Load Balancer.
- Knowledge of renewing certificates in NSX-T.
- Able to use and configure hardware alerts using available tools (VMware/HP/Palo Alto).
- Able to understand and use VM functions, datastores and backup applications.
- Knowledge and understanding of storage functions in VMs, and the ability to manage allocation and distribution of presented storage (e.g. vSAN), is required.
- Prior experience with any one of the cloud platforms vCA, AWS or Azure.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of internal customer needs, and innovating for continual improvement.
- Experience with general performance tuning and optimization of all aspects of platforms and services (systems, network).
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding (via vROps, vRLI, vRNI).
- Enforce best practices for metrics gathering, monitoring, and alerting.
- Participate in platform management, capacity planning and incident recovery.
- Provide network administration and troubleshooting via vROps and NSX-T.
- Perform deep dives into both systemic and latent reliability issues.
- Create sustainable systems and services through automation and uplifts.
- Networking knowledge is a plus.
- Exposure to Tanzu and Kubernetes is a plus.
- Willing to work both during and outside working hours when required (standby).