19 Apr | Tenstorrent | Toronto
Apply on Kit Job: kitjob.ca/job/2g91wc
Join a pioneering AI technology team as a Site Reliability Engineer. You will maintain system reliability and operational health in a hybrid environment, drawing on expertise in Linux and observability tools.
This position combines site reliability, infrastructure operations, and customer engineering. You will ensure the performance and observability of our cutting-edge AI solutions across both internal and customer environments. You will work closely with diverse engineering teams to resolve production incidents and improve system integrity.
Key Responsibilities:
• Ensure reliability of AI systems across environments
• Troubleshoot complex compute and networking issues
• Collaborate with engineering teams on incident resolution
• Design effective monitoring and alerting systems
• Build automation to enhance system reliability
Requirements:
• Experience in site reliability or systems engineering
• Strong Linux systems troubleshooting skills
• Familiarity with Prometheus and Grafana
• Scripting experience in Python, Go, or similar
• Understanding of networking fundamentals at scale
Contribute to the future of AI infrastructure by leveraging your skills in operational health and system reliability.
📌 Site Reliability Engineer in AI Systems (Toronto)
🏢 Tenstorrent
📍 Toronto