Senior Site Reliability Engineer

Salary undisclosed

Position: Site Reliability Engineer

Location: Bangsar South

Type: Permanent

Mode: Hybrid

Benefits: EPF, SOCSO, Medical, Bonus, VISA

Job Description

• Support and oversee the availability, reliability, resilience, performance, security and monitoring of applications on Azure Cloud and various supporting platforms to ensure business operational SLAs are met

• Work closely with the service delivery function to undertake incident management, operational cost management, service improvement and ongoing application health monitoring

• Create a link between development and operations by applying a software engineering mindset to application reliability activities, and instil this culture in agile development teams

• Maintain and improve the resiliency of core applications and the infrastructure platform through a continuous improvement backlog

• Provide continued improvement to the platform infrastructure through automation and standardisation

• Drive best-practice operational excellence for secure, high-performing, resilient and efficient infrastructure, and for cost-optimised applications and workloads

• Maintain existing automation infrastructure used to identify risks in areas such as performance, reliability, capability and scalability

• Possess a modern approach aligned to practices such as Infrastructure as Code, Configuration as Code, and DevOps

• Be responsible for the deployment, maintenance, and enhancement of a tribe's DevSecOps automation workflows, working closely with developers

• Conduct root cause analysis on incidents and provide code fixes for permanent remediation

• Document automation processes and runbooks across all environments and technical administration tasks

• Champion the adoption and culture change required for continuous improvement in application reliability, and embed these in day-to-day development practices

Requirement

• You are a broadly skilled engineer with an interest in service reliability, automation, monitoring, scalability, high-availability systems and/or capacity and cost planning. You also have the breadth of knowledge necessary to support a wide variety of software and systems

• You have an analytical mindset, natural curiosity, initiative and willingness to think outside the box to solve problems, using engineering approaches to running better production systems

• You have a sustained track record of making significant, self-directed, end-to-end design and implementation contributions to DevSecOps engineering practices and cloud infrastructure management using IaC, as well as to development practices

• You have experience in running a support function for customer-facing products and services, handling incidents under a service management framework and agile methodologies

• You understand how modern DevSecOps architectures are applied and why Docker and containers are useful. You can use traditional configuration management tools such as Chef, Ansible, or Terraform as well as modern infrastructure schedulers like Kubernetes. You enjoy working with the latest monitoring and metrics platforms such as Prometheus/Grafana

• You have experience coding in various languages; when a problem needs a software solution, you roll up your sleeves and get to work. You take a well-rounded, pragmatic and “technology agnostic” approach to choosing the best tool for the situation

• You have experience in driving technology, people, process and culture change to instil site reliability in development practices

• Experience with the following technologies will be highly advantageous:

o DevSecOps - Prometheus, Grafana, Azure Cloud Management, Azure Monitor, Azure App Insights, Splunk, SolarWinds, Azure DevOps, Docker, Kubernetes, Ansible, YAML, Atlassian toolchain, SonarCloud, Microsoft SQL Server, Networking (DNS, HTTP, WAF, Load Balancing, Reverse Proxy), Web Servers (Nginx, Apache/Tomcat)

o Development - Angular.js, Node.js, SQL, Apache Kafka