Site Reliability Engineer

Salary undisclosed

Apply on

Original

Simplified

System Administration: Manage and maintain server environments, ensuring high availability and performance. Troubleshoot and resolve system issues promptly
Coding and Automation: Develop and maintain automation scripts and tools in Python to improve operational efficiency and reduce manual intervention
UNIX Scripting: Write and optimize UNIX shell scripts for monitoring, alerting, and automation of routine tasks
Collaboration: Work closely with development teams to implement reliable systems and applications, providing insights on best practices for reliability and performance
Incident Response: Participate in on-call rotations and respond to incidents, performing root cause analysis and implementing preventive measures
Documentation: Create and maintain documentation for system architecture, processes, and troubleshooting procedures
Continuous Improvement: Identify opportunities for process improvement and implement solutions to enhance system reliability and performance

Requirements

Bachelor's degree in Computer Science, Information Technology, or a related field
Proven experience in system administration and managing Linux/UNIX environments
Strong coding skills in Python and experience with scripting languages
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack)
Excellent problem-solving skills and ability to troubleshoot complex systems
Strong communication and collaboration skills

Similar Jobs

1d ago

ExxonMobil