
Chaos Engineering Specialist

  • Full Time, onsite
  • Payment Network Malaysia
  • Kuala Lumpur, Malaysia
  • Engineering - Software (Information & Communication Technology)
Salary undisclosed


A) SUMMARY OF RESPONSIBILITIES

We are looking for a skilled and passionate Platform Engineer with expertise in Chaos Engineering and resiliency testing. The ideal candidate will have a strong background in distributed systems, cloud infrastructure, and container orchestration. You will be responsible for designing, implementing, and managing chaos experiments to test the resilience of our platform. Your work will directly contribute to our platform's ability to withstand and recover from unexpected failures, ensuring continuous and reliable service for our clients.

B) KEY AREAS OF RESPONSIBILITIES

  • Develop and implement chaos engineering strategies to test the resilience of our platform infrastructure.
  • Design, execute, and automate chaos experiments using tools such as Gremlin, Chaos Mesh, Litmus, or similar.
  • Collaborate with platform engineering and DevOps teams to identify critical systems and components for testing.
  • Build and maintain a robust monitoring and observability framework to analyze the impact of chaos experiments.
  • Identify weaknesses in the current infrastructure and provide recommendations for improvement.
  • Integrate chaos engineering practices into CI/CD pipelines using GitOps tools like ArgoCD and Atlantis.
  • Contribute to the development and maintenance of Kubernetes clusters, AWS EMR, AWS MSK (Kafka), and vSphere environments.
  • Utilize Terraform for infrastructure as code (IaC) to manage cloud resources.
  • Participate in the on-call rotation and assist with incident management and root cause analysis.
  • Stay up to date with the latest trends and best practices in chaos engineering, resiliency testing, and cloud infrastructure.

C) FUNCTIONAL COMPETENCIES

  • Strong understanding of Kubernetes, Docker, and container orchestration.
  • Proficiency in AWS services, including EMR and MSK (Kafka), and experience with vSphere.
  • Experience with infrastructure as code (IaC) tools, particularly Terraform.
  • Familiarity with GitOps practices and tools such as ArgoCD and Atlantis.
  • Hands-on experience with chaos engineering tools (e.g., Gremlin, Chaos Mesh, Litmus).
  • Solid understanding of distributed systems, microservices architecture, and cloud-native technologies.
  • Excellent problem-solving skills and a proactive approach to identifying and addressing potential issues.
  • Strong communication skills and the ability to work effectively in a collaborative team environment.

D) QUALIFICATIONS & EXPERIENCE

Minimum Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
  • 3+ years of experience in platform engineering, site reliability engineering (SRE), or DevOps roles, with a focus on chaos engineering.