Site Reliability Engineer (Night Shifts)

Middle

Full time

Vacancy is currently inactive

Position overview

We’re looking for a Site Reliability Engineer (SRE) to join the Fintech.Core platform team and take ownership of L3-level customer support and system reliability for a white-label fintech product deployed on customer-managed infrastructure.

The role focuses on monitoring, incident response, troubleshooting production issues, and close communication with clients during night-time support hours.

You will work with an AWS-based, Kubernetes-driven fintech platform built using Terraform, Helm, and modern observability tooling. As part of the role, you will also participate in configuring and maintaining monitoring dashboards and alerting to ensure system stability and fast incident detection. This position combines classic SRE practices with deep operational support responsibilities in a regulated fintech environment.

Project description

A modular white-label fintech platform based on a microservices architecture, designed for financial institutions (e.g. challenger banks, PSPs, trading and payment platforms). The platform is cloud-agnostic by design and currently runs on AWS, using Kubernetes, .NET Core, PostgreSQL, and Traefik, with observability tools like Prometheus, Grafana, and Loki. Infrastructure is provisioned via Terraform and follows a parametric model, enabling full environment deployment within a business day.

Current Use Case: A financial platform for a regulated institution with EMI and VASP licenses in the Philippines. The solution will combine digital banking and cryptocurrency services, including secure wallets, payments, remittances, and card issuing. A key part of the project is an in-house crypto trading module designed to provide a seamless and compliant experience for a growing global user base.

The size and the structure of the team

DevOps Engineer
Backend Engineer
ReactJS Engineer
iOS Engineer
Android Engineer
Business Analyst
QA Engineer
Project Manager

YOUR BACKGROUND

Hands-on experience with AWS, primarily EKS-based environments
Strong knowledge of Kubernetes, including workloads, networking, ingress controllers, and Helm
Experience working with monitoring and observability stacks: Grafana, Prometheus, Loki, Tempo
Experience troubleshooting production incidents in distributed systems
Solid understanding of Linux/Unix systems and operational debugging
Experience administering and troubleshooting PostgreSQL
Familiarity with CI/CD pipelines (GitHub Actions)
Ability to read and understand Infrastructure as Code (Terraform, Helm)
Confident English communication (written and spoken, B2 or higher) for direct customer interaction
Ability to work independently during off-hours and make technical decisions

Skills that will be a plus:

Experience designing alerting strategies (SLOs, SLIs, noise reduction)
Familiarity with Crashlytics or similar mobile monitoring tools
Hands-on experience with Terraform
Experience with Traefik ingress controller
Experience with RabbitMQ
Basic scripting skills (Python, Bash)

Responsibilities

Provide L3 technical support for white-label customer environments deployed on their infrastructure
Monitor production systems and respond to incidents during night shift coverage (approx. 02:00 - 10:00 EET)
Act as the primary technical escalation point for customer-reported issues via dedicated communication channels
Investigate and troubleshoot incidents across Kubernetes, infrastructure, CI/CD, networking, and databases
Work actively with Grafana dashboards, alerts, and logs to identify root causes and prevent recurrence
Design, configure, and maintain monitoring and alerting dashboards (Grafana, Prometheus, Loki, Tempo)
Monitor mobile application stability via Crashlytics and correlate client-side crashes with backend incidents
Collaborate with development teams to escalate bugs, propose fixes, and improve platform reliability
Participate in incident postmortems and contribute to reliability improvements and runbooks
Improve automation, observability, and self-healing capabilities of the platform
Document incidents, troubleshooting steps, and operational best practices

WE OFFER

20 billable days off in the first year of cooperation, and 25 billable days off in each following year
Fair and competitive compensation
Friendly team and enjoyable working environment
Clearly documented company business processes that actually work
Regular updates on company news, Q&A sessions with top management
Flexible work schedule
Remote work mode
Ability to transfer unused vacation to the next year
Partial coverage of co-working costs
Regular online team-building events
