Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC) Job at Atlantis IT group, Canada

  • Atlantis IT group
  • Canada

Job Description

Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)

Location - Montreal, QC

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation

Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and solid networking and systems engineering knowledge.
Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability


