Location: Uptime.com - Remote anywhere in Canada

Job Overview: At Uptime.com, we are seeking a highly skilled and experienced Site Reliability Engineer to join our team. As an SRE, you will be instrumental in enhancing the reliability, availability, and performance of our SaaS systems. Your role will focus on designing, implementing, and maintaining scalable, resilient, and automated infrastructure. You will collaborate with diverse teams, driving innovation and continuous improvement, and apply your expertise in monitoring, incident management, infrastructure as code, and cloud technologies to keep our platform reliable and well optimized.

Key Responsibilities:

Design, implement, and maintain highly available, scalable SaaS systems including web servers, background processing, and databases.
Develop and maintain infrastructure as code and containerized workloads using tools like Terraform, Docker, and Kubernetes.
Ensure high availability and reliability of systems through effective monitoring, incident management, and proactive troubleshooting.
Collaborate with teams to implement and manage centralized logging systems for comprehensive environment visibility.
Identify and resolve performance bottlenecks and scalability issues, enhancing system performance.
Implement security measures against DDoS attacks and other threats, adhering to industry best practices.
Design and execute disaster recovery plans for business continuity.
Stay abreast of SRE trends and technologies, fostering a culture of improvement and innovation.
Independently manage complex infrastructure projects, demonstrating ownership and initiative.

Requirements

Educational Background: Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or a related field.
Professional Experience: Minimum of 3 years of experience in a Site Reliability Engineer role.
Technical Expertise:
Proficiency in infrastructure as code tools (e.g., Terraform, Ansible).
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Strong understanding of cloud platforms (e.g., AWS, Azure, GCP) and their services.
Knowledge of scripting and automation using languages such as Python, Bash, or PowerShell.
Familiarity with CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI).
System and Network Administration Skills: Experience in administering Linux/UNIX systems, network protocols, and services (DNS, HTTP, etc.).
Monitoring and Incident Management: Experience with monitoring tools (e.g., Prometheus, Grafana) and incident response strategies.
Security Practices: Understanding of security best practices, especially in securing web applications and mitigating DDoS attacks.
Performance Tuning: Ability to identify and resolve performance bottlenecks in high-traffic environments.
Disaster Recovery: Experience in designing and implementing disaster recovery plans.
Problem-Solving Skills: Strong analytical and problem-solving abilities to address complex technical challenges.
Communication and Collaboration: Excellent verbal and written communication skills; ability to collaborate effectively with cross-functional teams.
Continuous Learning: Commitment to staying up to date with the latest technologies and industry trends in site reliability engineering.
Certifications (optional but beneficial): Certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or similar credentials related to cloud, networking, or system administration.

Benefits

How We Will Support Your Growth and Success:

Collaborative Environment: Engage in meaningful collaborations with executives, leadership, and cross-functional teams, including engineering, marketing, and business operations. This exposure provides a comprehensive understanding of our business and fosters a holistic approach to problem-solving.
Innovative Industry Exposure: Dive into the dynamic world of monitoring, observability, and SRE.
Supportive Team Culture: Join a team of passionate, dedicated professionals united in our goal to build the best monitoring service in the world. Our supportive culture encourages knowledge sharing, mutual respect, and collective success.
Fully Remote Work Arrangements: Embrace the flexibility of working from home, anywhere in the world.
Unlimited Paid Time Off: Enjoy the freedom of unlimited paid time off, including vacation, sick days, and public holidays – a benefit extended to all employees, including our global contractors.
Family Leave Policies: Benefit from comprehensive family leave policies, including maternity and paternity leave.
Diverse and Inclusive Workplace: Be part of a company that values diversity and fosters an inclusive environment where everyone feels valued and empowered.