Site Reliability Engineer
Salary undisclosed
Apply on
Original
Simplified
- System Administration: Manage and maintain server environments, ensuring high availability and performance. Troubleshoot and resolve system issues promptly
- Coding and Automation: Develop and maintain automation scripts and tools in Python to improve operational efficiency and reduce manual intervention
- UNIX Scripting: Write and optimize UNIX shell scripts for monitoring, alerting, and automation of routine tasks
- Collaboration: Work closely with development teams to implement reliable systems and applications, providing insights on best practices for reliability and performance
- Incident Response: Participate in on-call rotations and respond to incidents, performing root cause analysis and implementing preventive measures
- Documentation: Create and maintain documentation for system architecture, processes, and troubleshooting procedures
- Continuous Improvement: Identify opportunities for process improvement and implement solutions to enhance system reliability and performance
- Bachelor's degree in Computer Science, Information Technology, or a related field
- Proven experience in system administration and managing Linux/UNIX environments
- Strong coding skills in Python and experience with scripting languages
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack)
- Excellent problem-solving skills and ability to troubleshoot complex systems
- Strong communication and collaboration skills
Similar Jobs