19 Apr | HCLTech | Toronto
Apply on Kit Job: kitjob.ca/job/2g9cg7
Join our SRE L2 squad supporting ~1000 AWS-hosted services. You’ll own operational reliability, rapid triage, and proactive maintenance across production and non-prod, partnering closely with Cloud Engineering, SOC, and application teams.
Key Responsibilities
Deliver 24×7 monitoring, incident response, and problem management; drive MTTA/MTTR reduction and SLO/SLI adherence.
Perform preventive health checks; analyze ticket trends to implement continual service improvements and automation to reduce toil.
Execute blameless postmortems and high-quality RCA; maintain SOPs/runbooks and reliability dashboards.
Configure/tune observability (Dynatrace, CloudWatch, ELK); enable self-healing workflows and workload optimizations.
Support change/service requests within agreed SLAs; collaborate during transitions and onboard current AWS services.
Core Skills & Tools
AWS: Lambda, ECS/Fargate/EC2, API Gateway, SNS/SQS, Kinesis, RDS; IAM/KMS foundations.
Observability & ITSM: Dynatrace, CloudWatch, ELK; ServiceNow for incidents/changes; SLI/SLO dashboards.
Toil Reduction & Reliability Practices: Error budgets, capacity/performance benchmarking, automation/runbook execution, FinOps awareness.
Qualifications
5+ years SRE/DevOps or L2 operations for cloud-native stacks; strong AWS production experience.
Proven incident/change/problem management in 24×7 environments; adept at RCA and postmortems.
Hands‑on with observability tooling and operational automation; excellent collaboration and documentation skills.
Shift Coverage & Locations
Follow-the-sun model with overlapping handoffs across Canada/India to ensure continuous support.
Success is measured by uptime, MTTR/MTTD, change failure rate, error‑budget consumption, SLO adherence, RCA quality, and CSI throughput.
📌 Site Reliability Engineer (Toronto)