Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC) Job at Atlantis IT group, Canada

  • Atlantis IT group
  • Canada

Job Description

Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)

Location - Montreal, QC

Production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation




Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and solid networking and systems engineering knowledge.
Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability


