MLOps / DevOps Engineer
Poland, Senior
We are growing fast - and we invite you to grow with us. At Innowise, you can not only develop as an expert in your field, solve complex problems, and influence the result, but also see how the finished project affects the world around you. We are a close-knit team of professionals who have already implemented 1600+ cases for clients from the USA, Denmark, Germany, etc. We need someone who will strengthen our team and become part of the community!
You need proven experience in the following:
- Infrastructure & IaC: Minimum 250 person-days managing server/cloud infrastructure (Public/Private), IaC tools (Terraform, Bicep, etc.), Docker, and Kubernetes
- CI/CD: Minimum 300 person-days designing and maintaining CI/CD solutions in production environments
- MLOps / ALM: Minimum 200 person-days deploying and utilizing MLflow, Kubeflow, ClearML, or similar platforms
- Incident Management: Minimum 150 person-days in root cause analysis and stabilizing critical systems
- Strong Linux systems engineering background (RHEL/Rocky/SLES)
- Proficiency in Python and Bash for automation
- Experience with relational databases (PostgreSQL), NFS, and Object stores (S3-compatible)
- Experience with Data Analytics & Data Analysis (working with Databricks)
- Strong analytical mindset with an implementation-oriented approach
- Ability to translate business requirements into scalable technical solutions
- Excellent cross-functional collaboration (with Data Scientists, Developers, and IT operations)
Would be a plus:
- Deep experience with HPC schedulers (PBS Professional, Torque, Slurm) and building integrations (hooks, prolog/epilog scripts)
- Experience bridging traditional HPC schedulers with modern cloud-native platforms (Kubernetes, MLOps stacks) and configuring dynamic scaling (cloud bursting)
- Scripting abilities in Go or Rust
- Proficiency in SQL and PowerShell
- Familiarity with MPI workloads (OpenMPI, MPICH) and GPU scheduling (NVIDIA stack, MIG/MPS)
- Experience with parallel file systems (Lustre strongly preferred)
- Configuration management experience (Ansible, Puppet, or similar)
Key Responsibilities:
- Design, deploy, and support resilient infrastructure for machine learning platforms and data pipelines using Python and SQL
- Implement Application Lifecycle Management (ALM) for machine learning, automating training, versioning, and deployment processes (MLflow, Kubeflow, ClearML, or similar enterprise ML platforms)
- Ensure reliability, scalability, and high availability of the MLOps infrastructure and backend services
- Design and manage distributed compute environments (bare metal, VM, private/public cloud)
- Containerize ML services and applications using Docker and Kubernetes, orchestrating smooth production rollouts
- Automate infrastructure provisioning, cluster lifecycle, and configuration using Infrastructure as Code (Terraform, Bicep, ARM, etc.)
- Build, integrate, and maintain robust CI/CD pipelines (GitLab CI, GitHub Actions, Jenkins)
- Implement comprehensive observability (logging, metrics, dashboards) for overall cluster health
- Diagnose bottlenecks, resolve node/network failures, and conduct Root Cause Analysis (RCA) as part of proactive incident management
We offer
Flexible work schedule
Experience of working with clients all over the world
Financial assistance
Medical insurance
Want to join the team?
Email us
Any questions about the job?
Send them to our recruiters by email:
job@innowise.com