Site Reliability Engineer (Night Shifts)
Middle
Full time
Position overview
We’re looking for a Site Reliability Engineer (SRE) to join the Fintech.Core platform team and take ownership of L3-level customer support and system reliability for a white-label fintech product deployed on customer-managed infrastructure.
The role focuses on monitoring, incident response, troubleshooting production issues, and close communication with clients during night-time support hours.
You will work with an AWS-based, Kubernetes-driven fintech platform built using Terraform, Helm, and modern observability tooling. As part of the role, you will also participate in configuring and maintaining monitoring dashboards and alerting to ensure system stability and fast incident detection. This position combines classic SRE practices with deep operational support responsibilities in a regulated fintech environment.
Project description
A modular white-label fintech platform based on a microservices architecture, designed for financial institutions (e.g. challenger banks, PSPs, trading and payment platforms). The platform is cloud-agnostic by design and currently runs on AWS, using Kubernetes, .NET Core, PostgreSQL, and Traefik, with observability tools such as Prometheus, Grafana, and Loki. Infrastructure is provisioned via Terraform and follows a parametric model, enabling full environment deployment within a business day.
Current Use Case: A financial platform for a regulated institution with EMI and VASP licenses in the Philippines. The solution will combine digital banking and cryptocurrency services, including secure wallets, payments, remittances, and card issuing. A key part of the project is an in-house crypto trading module designed to provide a seamless and compliant experience for a growing global user base.
The size and the structure of the team
- DevOps Engineer
- Backend Engineer
- ReactJS Engineer
- iOS Engineer
- Android Engineer
- Business Analyst
- QA Engineer
- Project Manager
YOUR BACKGROUND
- Hands-on experience with AWS, primarily EKS-based environments
- Strong knowledge of Kubernetes, including workloads, networking, ingress controllers, and Helm
- Experience working with monitoring and observability stacks: Grafana, Prometheus, Loki, Tempo
- Experience troubleshooting production incidents in distributed systems
- Solid understanding of Linux/Unix systems and operational debugging
- Experience administering and troubleshooting PostgreSQL
- Familiarity with CI/CD pipelines (GitHub Actions)
- Ability to read and understand Infrastructure as Code (Terraform, Helm)
- Confident English communication (written and spoken, B2 or higher) for direct customer interaction
- Ability to work independently during off-hours and make technical decisions
Skills that will be a plus:
- Experience designing alerting strategies (SLOs, SLIs, noise reduction)
- Familiarity with Crashlytics or similar mobile monitoring tools
- Hands-on experience with Terraform
- Experience with Traefik ingress controller
- Experience with RabbitMQ
- Basic scripting skills (Python, Bash)
Responsibilities
- Provide L3 technical support for white-label customer environments deployed on their infrastructure
- Monitor production systems and respond to incidents during night shift coverage (approx. 02:00 - 10:00 EET)
- Act as the primary technical escalation point for customer-reported issues via dedicated communication channels
- Investigate and troubleshoot incidents across Kubernetes, infrastructure, CI/CD, networking, and databases
- Work actively with Grafana dashboards, alerts, and logs to identify root causes and prevent recurrence
- Design, configure, and maintain monitoring and alerting dashboards (Grafana, Prometheus, Loki, Tempo)
- Monitor mobile application stability via Crashlytics and correlate client-side crashes with backend incidents
- Collaborate with development teams to escalate bugs, propose fixes, and improve platform reliability
- Participate in incident postmortems and contribute to reliability improvements and runbooks
- Improve automation, observability, and self-healing capabilities of the platform
- Document incidents, troubleshooting steps, and operational best practices
WE OFFER
- 20 billable days off in the first year of cooperation and 25 billable days off in each following year
- Fair and competitive compensation
- Friendly team and enjoyable working environment
- Clearly described business processes in the company that really work
- Regular updates on company news, Q&A sessions with top management
- Flexible work schedule
- Remote work mode
- Ability to transfer unused vacation to the next year
- Partial coverage of co-working costs
- Regular online team-building events
APPLICATION FORM
Apply for this position now!
Send us your CV - we’ll contact you.





