Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC) Job at Atlantis IT group, Canada

  • Atlantis IT group
  • Canada

Job Description

Role - Site Reliability Engineer (SRE / GenAI Infrastructure / Kubernetes / IaC)

Location - Montreal, QC

Production experience in SRE / infrastructure / operations for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Experience with infrastructure-as-code tools (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
Excellent communication, documentation, and cross-team collaboration skills
Proven track record of reducing operational toil via automation




Experience: 8+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and knowledge of networking and systems engineering.
Roles and Responsibilities:
Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability


