Site Reliability Engineer (Bilingual - English and Mandarin)
- Design and implement resilient system architectures that support high availability and scalability.
- Develop automation tools and scripts to enhance operational efficiency and reduce manual effort.
- Define, track, and analyze SLOs and SLIs to ensure reliability and performance meet business needs.
- Conduct thorough post-mortem analyses following incidents, driving continuous improvement through root cause identification and solution implementation.
- Collaborate with development and operations teams to establish best practices in system reliability and incident management.
- Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures, including diagnosing problems at the underlying platform level (e.g., Kubernetes, virtual machines).
- Ensure that issues are resolved within the stipulated Service Level Agreements (SLAs), maintaining high standards of service delivery.
- Identify and troubleshoot performance bottlenecks across systems, providing actionable recommendations for enhancements.
- Maintain detailed documentation of processes and incident responses to support knowledge sharing and compliance.
Benefits:
- EPF
- SOCSO
- Medical insurance
- Bonus
- Annual and medical leave in accordance with the prevailing labour law