SRE / AI Ops Engineer (Toronto)

17 Apr | Themesoft | Toronto

Job Title: SRE / AI Ops Engineer

Location: Toronto, ON (3 or 4 days onsite a week)

Duration: Long-term contract

Job Description:

Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with modern AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools (Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python) to create proactive, self‑healing, AI‑enhanced workflows that improve system reliability and reduce manual toil.

This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.

Key Responsibilities

AI‑Driven Observability & Monitoring

- Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  - Detect anomalies
  - Predict incidents
  - Correlate events across distributed systems
  - Reduce alert noise through intelligent clustering

AI Ops Workflow Engineering

- Design and build AI‑powered operational workflows that automate:
  - Incident detection
  - Root cause analysis
  - Remediation actions
  - Post‑incident insights
- Integrate AI insights from observability platforms into automated pipelines and runbooks.

Incident Response & Automation

- Configure and manage PagerDuty for intelligent alerting, escalation policies, and automated incident response.
- Build self‑healing automation using Ansible, Python, and GitHub Actions.
- Develop automated remediation playbooks triggered by AI‑driven events.

Platform Reliability & SRE Practices

- Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.
- Improve system reliability through automation, performance tuning, and proactive engineering.
- Reduce operational toil by designing scalable, automated solutions.

DevOps & CI/CD Integration

- Use Git and GitHub Actions to build automated pipelines that integrate:
  - Observability signals
  - AI‑driven quality gates
  - Automated rollback and recovery workflows

Python Scripting & Tooling

- Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.
- Build integrations between monitoring platforms, ticketing systems, and automation engines.

Required Skills & Experience

Core Technical Skills

- Hands‑on experience with:
  - Dynatrace (including Davis AI)
  - Splunk (ITSI, Machine Learning Toolkit preferred)
  - Moogsoft AIOps
  - PagerDuty
  - Ansible
  - Git & GitHub Actions
  - Python scripting

AI Ops & Automation

- Experience leveraging AI/ML features within observability and incident‑management tools.
- Ability to design automated workflows that use AI insights for:
  - Event correlation
  - Predictive alerting
  - Automated remediation
  - Intelligent routing

SRE Expertise

- Solid understanding of distributed systems, cloud infrastructure, and reliability engineering.
- Experience with SLO/SLI design, error budgets, and performance optimization.
- Familiarity with containerized environments (Kubernetes, Docker) is a plus.

Soft Skills

- Strong analytical mindset with a passion for automation and continuous improvement.
- Excellent communication and cross‑team collaboration abilities.
- Ability to translate operational challenges into scalable engineering solutions.

Preferred Qualifications

- Experience with the Red Hat OpenShift cloud platform.
- Exposure to LLM‑based automation or generative AI for operational workflows.
- Background in building or integrating with ChatOps frameworks.
- Knowledge of event‑driven architectures and message queues.

What You’ll Achieve

In this role, you will help transform traditional application and infrastructure operations into a modern, AI‑enhanced reliability ecosystem. You’ll build systems that not only detect and respond to issues but also learn from them, making operations predictive, automated, and intelligent.

Thanks & Regards,

Vignesh
