17 Apr | Galent | Toronto
Apply on Kit Job: kitjob.ca/job/2fxao0
Role: SRE with AI Ops
Location: Toronto, ON, Canada (onsite from day 1)
Job Description: SRE / AI Ops Engineer
Overview
We are seeking a highly skilled Site Reliability Engineer (SRE) / AI Ops Engineer to design, build, and operate intelligent, automated reliability solutions across our production environments. This role blends deep operational expertise with current AI‑driven observability, monitoring, and automation practices. You will work with industry‑leading tools (Dynatrace, Splunk, Moogsoft, PagerDuty, Ansible, Git/GitHub Actions, and Python) to create proactive, self‑healing, AI‑enhanced workflows that improve system reliability and reduce manual toil.
This is a hands‑on engineering role for someone who thrives at the intersection of SRE, automation, and AI‑powered operations.
Key Responsibilities
AI‑Driven Observability & Monitoring
- Implement and optimize monitoring solutions using Dynatrace, Splunk, and Moogsoft, leveraging their AI/ML capabilities (e.g., Davis AI, Splunk ITSI, Moogsoft AIOps) to:
  - Detect anomalies
  - Predict incidents
  - Correlate events across distributed systems
  - Reduce alert noise through intelligent clustering
AI Ops Workflow Engineering
- Design and build AI‑powered operational workflows that automate:
  - Incident detection
  - Root cause analysis
  - Remediation actions
  - Post‑incident insights
- Integrate AI insights from observability platforms into automated pipelines and runbooks.
Incident Response & Automation
- Configure and manage PagerDuty for intelligent alerting, escalation policies, and automated incident response.
- Build self‑healing automation using Ansible, Python, and GitHub Actions.
- Develop automated remediation playbooks triggered by AI‑driven events.
Platform Reliability & SRE Practices
- Apply SRE principles such as SLOs, SLIs, error budgets, and chaos testing.
- Improve system reliability through automation, performance tuning, and proactive engineering.
- Reduce operational toil by designing scalable, automated solutions.
DevOps & CI/CD Integration
- Use Git and GitHub Actions to build automated pipelines that integrate:
  - Observability signals
  - AI‑driven quality gates
  - Automated rollback and recovery workflows
Python Scripting & Tooling
- Develop Python‑based automation, data processing, and AI‑enhanced operational tooling.
- Build integrations between monitoring platforms, ticketing systems, and automation engines.
Required Skills & Experience
Core Technical Skills
- Hands‑on experience with:
  - Dynatrace (including Davis AI)
  - Splunk (ITSI, Machine Learning Toolkit preferred)
  - Moogsoft AIOps
  - PagerDuty
  - Ansible
  - Git & GitHub Actions
  - Python scripting
AI Ops & Automation
- Experience leveraging AI/ML features within observability and incident‑management tools.
- Ability to design automated workflows that use AI insights for:
  - Event correlation
  - Predictive alerting
  - Automated remediation
  - Intelligent routing
SRE Expertise
- Strong understanding of distributed systems, cloud infrastructure, and reliability engineering.
- Experience with SLO/SLI design, error budgets, and performance optimization.
- Familiarity with containerized environments (Kubernetes, Docker) is a plus.
Soft Skills
- Strong analytical mindset with a passion for automation and continuous improvement.
- Excellent communication and cross‑team collaboration abilities.
- Ability to translate operational challenges into scalable engineering solutions.
Preferred Qualifications
- Experience with the Red Hat OpenShift cloud platform.
- Exposure to LLM‑based automation or generative AI for operational workflows.
- Background in building or integrating with ChatOps frameworks.
- Knowledge of event‑driven architectures and message queues.