Site Reliability Engineer in AI Systems (Toronto)

19 Apr | Tenstorrent | Toronto

Join a pioneering AI technology team as a Site Reliability Engineer. In a hybrid environment, you will apply your expertise in Linux and observability tools to keep systems reliable and operationally healthy.

This position combines site reliability, infrastructure operations, and customer engineering. You will ensure the performance and observability of our cutting-edge AI solutions across both internal and customer environments. Collaborating with diverse engineering teams will be crucial in resolving production incidents and improving system integrity.

Key Responsibilities:

• Ensure reliability of AI systems across environments
• Troubleshoot complex compute and networking issues
• Collaborate with engineering teams on incident resolution
• Design effective monitoring and alerting systems
• Build automation to enhance system reliability

Requirements:

• Experience in site reliability or systems engineering
• Robust Linux systems troubleshooting skills
• Familiarity with Prometheus and Grafana
• Scripting experience in Python, Go, or similar
• Understanding of networking fundamentals at scale

Contribute to the future of AI infrastructure by leveraging your skills in operational health and system reliability.
