Site Reliability Engineer

  • Full Time, onsite
  • Smart Teq Solution Sdn Bhd
  • Kawasan Sekitar Kuala Lumpur, Malaysia
Salary undisclosed

Responsibilities

  • Ensure all our infrastructure runs in optimal condition.
  • Deploy, patch, and update all services running on public cloud and on premises.
  • Identify and resolve support tickets related to our infrastructure and services.
  • Work closely with developers to produce complete, up-to-date, and readable documentation.
  • Write documentation for SRE tasks for future reference and better traceability.
  • Monitor our services using Grafana and identify any bottlenecks. Take immediate action and troubleshoot when necessary.
  • Maintain and enhance our monitoring system, including but not limited to Grafana, VictoriaMetrics, and Alertmanager.
  • Work closely with other departments to update and patch our services using our CI/CD tools.
  • Analyze system logs to better understand service outages and issues.
  • Perform preventive maintenance on our systems and infrastructure.
  • Always be willing to learn new technologies and tools.

Requirements

  • At least 1 year of experience in DevOps, network engineering, SRE, or a related field is required.
  • Familiar with Linux and networking.
  • Able to work and solve problems independently when required.
  • Hands-on experience with Bash scripting.
  • Basic understanding of how cloud infrastructure (Alicloud, AWS, GCP, and others) works.
  • Able to work on call.
  • Willing to learn new technologies such as Grafana, Terraform, GitLab CI/CD, ArgoCD, and Ansible.

Nice to have

  • Understanding of how Docker and Kubernetes work.
  • Programming experience (Python and Go).
  • Basic understanding of Terraform, Ansible, and Packer is a plus.
  • Hands-on knowledge of cloud computing, Kubernetes, GitLab, etc.
  • Hands-on knowledge of Terraform and Ansible.
