Join a progressive team as a Site Reliability Engineer (SRE) focusing on cloud infrastructure for GenAI systems. With 8+ years of experience, you’ll drive automation and implement robust monitoring strategies.
This position involves scaling and supporting infrastructure for cutting-edge GenAI applications. You will automate GPU clusters and define SLOs and SLAs while maintaining a strong focus on incident response. Your comprehensive understanding of networking and systems engineering will be vital to achieving operational excellence.
Key Responsibilities:
• Scale and automate GPU cluster operations
• Define SLOs and SLAs for system reliability
• Implement monitoring solutions and incident response processes
• Optimize cloud infrastructure for performance
• Drive security and disaster recovery initiatives
Requirements:
• Minimum 8 years in Site Reliability Engineering
• Expertise with monitoring tools like Datadog and ELK
• Strong background in networking and systems
• Experience with financial or security regulations is a plus
• AI/ML infrastructure knowledge is beneficial
Make an impact by enhancing system reliability and security for advanced GenAI applications.
Apply on Kit Job: kitjob.ca/job/2fs6km
📌 Site Reliability Engineer for Cloud Infrastructure and GenAI Systems (Toronto)
🏢 apptoza
📍 Toronto