Site Reliability Engineer (Night Shifts)

Middle

Full time

Position overview

We’re looking for a Site Reliability Engineer (SRE) to join the Fintech.Core platform team and take ownership of L3-level customer support and system reliability for a white-label fintech product deployed on customer-managed infrastructure.

The role focuses on monitoring, incident response, troubleshooting production issues, and close communication with clients during night-time support hours.

You will work with an AWS-based, Kubernetes-driven fintech platform built using Terraform, Helm, and modern observability tooling. As part of the role, you will also participate in configuring and maintaining monitoring dashboards and alerting to ensure system stability and fast incident detection. This position combines classic SRE practices with deep operational support responsibilities in a regulated fintech environment.

Project description

A modular white-label fintech platform built on a microservices architecture and designed for financial institutions (e.g. challenger banks, PSPs, trading and payment platforms). The platform is cloud-agnostic by design and currently runs on AWS, using Kubernetes, .NET Core, PostgreSQL, and Traefik, with observability tools such as Prometheus, Grafana, and Loki. Infrastructure is provisioned via Terraform and follows a parametric model, enabling full environment deployment within a business day.

Current Use Case: A financial platform for a regulated institution with EMI and VASP licenses in the Philippines. The solution will combine digital banking and cryptocurrency services, including secure wallets, payments, remittances, and card issuing. A key part of the project is an in-house crypto trading module designed to provide a seamless and compliant experience for a growing global user base.

The size and the structure of the team

  • DevOps Engineer
  • Backend Engineer
  • ReactJS Engineer
  • iOS Engineer
  • Android Engineer
  • Business Analyst
  • QA Engineer
  • Project Manager

YOUR BACKGROUND

  • Hands-on experience with AWS, primarily EKS-based environments
  • Strong knowledge of Kubernetes, including workloads, networking, ingress controllers, and Helm
  • Experience working with monitoring and observability stacks: Grafana, Prometheus, Loki, Tempo
  • Experience troubleshooting production incidents in distributed systems
  • Solid understanding of Linux/Unix systems and operational debugging
  • Experience administering and troubleshooting PostgreSQL
  • Familiarity with CI/CD pipelines (GitHub Actions)
  • Ability to read and understand Infrastructure as Code (Terraform, Helm)
  • Confident English communication (written and spoken, B2 or higher) for direct customer interaction
  • Ability to work independently during off-hours and make technical decisions

Skills that will be a plus:

  • Experience designing alerting strategies (SLOs, SLIs, noise reduction)
  • Familiarity with Crashlytics or similar mobile monitoring tools
  • Hands-on experience with Terraform
  • Experience with Traefik ingress controller
  • Experience with RabbitMQ
  • Basic scripting skills (Python, Bash)

Responsibilities

  • Provide L3 technical support for white-label customer environments deployed on customer-managed infrastructure
  • Monitor production systems and respond to incidents during night shift coverage (approx. 02:00 - 10:00 EET)
  • Act as the primary technical escalation point for customer-reported issues via dedicated communication channels
  • Investigate and troubleshoot incidents across Kubernetes, infrastructure, CI/CD, networking, and databases
  • Work actively with Grafana dashboards, alerts, and logs to identify root causes and prevent recurrence
  • Design, configure, and maintain monitoring and alerting dashboards (Grafana, Prometheus, Loki, Tempo)
  • Monitor mobile application stability via Crashlytics and correlate client-side crashes with backend incidents
  • Collaborate with development teams to escalate bugs, propose fixes, and improve platform reliability
  • Participate in incident postmortems and contribute to reliability improvements and runbooks
  • Improve automation, observability, and self-healing capabilities of the platform
  • Document incidents, troubleshooting steps, and operational best practices

WE OFFER

  • 20 billable days off in the first year of cooperation and 25 billable days off in each following year
  • Fair and competitive compensation
  • Friendly team and enjoyable working environment
  • Clearly documented business processes that work in practice
  • Regular updates on company news, Q&A sessions with top management
  • Flexible work schedule
  • Remote work mode
  • Ability to transfer unused vacation to the next year
  • Partial coverage of co-working costs
  • Regular online team-building events