Site Reliability Engineer for Cloud Infrastructure and GenAI Systems (Toronto)

17 Apr | apptoza | Toronto

Join a progressive team as an SRE focused on cloud infrastructure for GenAI systems. Drawing on 8+ years of experience, you’ll drive automation and implement robust monitoring strategies.
This position involves scaling and supporting infrastructure for cutting-edge GenAI applications. You will automate GPU cluster operations, define SLOs and SLAs, and maintain a strong focus on incident response. Your understanding of networking and systems engineering will be vital to achieving operational excellence.
Key Responsibilities:
• Scale and automate GPU cluster operations
• Define SLOs and SLAs for system reliability
• Implement monitoring solutions and incident response processes
• Optimize cloud infrastructure for performance
• Drive security and disaster recovery initiatives
Requirements:
• Minimum 8 years in Site Reliability Engineering
• Expertise with monitoring tools like Datadog and ELK
• Strong background in networking and systems
• Experience with finance or security regulations is a plus
• AI/ML infrastructure knowledge is beneficial
Make an impact by enhancing system reliability and security for advanced GenAI applications.