Epicareer Might not Working Properly
Learn More

Site Reliability Engineer

Salary undisclosed

Apply on


Original
Simplified

Description

  • System Administration: Manage and maintain server environments, ensuring high availability and performance. Troubleshoot and resolve system issues promptly.
  • Coding and Automation: Develop and maintain automation scripts and tools in Python to improve operational efficiency and reduce manual intervention.
  • UNIX Scripting: Write and optimize UNIX shell scripts for monitoring, alerting, and automation of routine tasks.
  • Collaboration: Work closely with development teams to implement reliable systems and applications, providing insights on best practices for reliability and performance.
  • Incident Response: Participate in on-call rotations and respond to incidents, performing root cause analysis and implementing preventive measures.
  • Documentation: Create and maintain documentation for system architecture, processes, and troubleshooting procedures.
  • Continuous Improvement: Identify opportunities for process improvement and implement solutions to enhance system reliability and performance.

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Proven experience in system administration and managing Linux/UNIX environments.
  • Strong coding skills in Python and experience with scripting languages.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
  • Excellent problem-solving skills and ability to troubleshoot complex systems.
  • Strong communication and collaboration skills.