Ensuring High Availability and Fault Tolerance in Fintech Applications

Artur Morozov, Android Team Lead at DashDevs

Artur Morozov is an Android Team Lead with over 6 years of experience in programming, databases, and mobile hardware. His expertise spans Android app architecture, design principles, and Kotlin development. Artur is well-versed in Agile, Scrum, and Kanban practices, known for his analytical thinking and attention to detail. He excels in collaborative settings, consistently delivering successful project outcomes.

FEBRUARY 22, 2025

Mobile app downtime can cost large organizations as much as $9,000 per minute. With costs like that in mind, stakeholders, business owners, and managers use multiple metrics to evaluate the performance of an application. Two such metrics are high availability and fault tolerance.

In this post, you’ll explore the concepts of high availability and fault tolerance in fintech app development. You’ll find out what they mean for app performance and business, as well as how to measure them. You’ll also review expert practices for improving high availability design and fault tolerance, illustrated with concrete case studies by DashDevs.

What Are High Availability and Fault Tolerance?

High availability is the ability of a system to remain operational and accessible with minimal downtime, even during failures or high demand.

In mobile fintech apps, high availability means, among other things, the following:

Users can access financial services 24/7 without disruptions.
Critical functions like payments, transfers, and authentication always work.
The system automatically handles failures through redundancy and load balancing.
Service continuity is maintained even during updates or infrastructure issues.

At the same time:

Fault tolerance is a measure of how well a system can keep operating through disruptions and recover from failures, as well as its ability to inform users about errors and help resolve them.

In mobile fintech apps, fault tolerance means:

Errors are handled in a user-friendly manner, e.g., clear messages, retry mechanisms, offline mode, etc.
Redundant systems automatically take over in case of failures.
Self-recovery mechanisms prevent small issues from escalating.

While availability refers to how well an app performs and avoids disruptions in general, fault tolerance is more centered around resolving issues that already exist. Oftentimes, availability is addressed on the back-end development level, while fault tolerance is ensured by means of front-end development.

Pro insight: Frankly speaking, high availability and fault tolerance correlate with each other and work towards the same goal: making the app usable and user-friendly. However, both require development resources, so they sometimes compete for them, forcing decision-makers to prioritize one over the other.

How Do High Availability and Fault Tolerance Impact an Android Fintech App?

I’ve probably highlighted the role of high availability and fault tolerance enough already. But which business aspects exactly do these two rather technical metrics affect? The list goes as follows:

User satisfaction. Minimizing disruptions and helping users handle errors lead to fewer complaints and higher satisfaction.
Performance. Users expect an app to operate in a smooth and streamlined manner regardless of load. High availability and fault tolerance are exactly what fulfills that demand.
Competitiveness. Fintech customers expect seamless financial transactions, and frequent downtime can push them toward competitors.
Revenue. Downtime and transaction failures lead to lost revenue, refund requests, and lower customer retention. A resilient app, in contrast, minimizes financial losses.
Scalability potential. Maintaining high availability and fault tolerance allows the app to handle increased traffic and transaction volume without performance degradation. This supports future expansion.

Important: At this point, you may have concluded that ensuring full availability is a must, especially for an already released, well-maintained application. However, that’s not entirely correct. While developers strive for the highest availability possible, making an app 100% fault-proof for 100% of users is not realistic.

Even the most well-built applications maintained by industry leaders face critical faults, not to mention minor back-end issues. For example, ChatGPT has faced multiple major outages, and OpenAI runs a dedicated status page so users can tell whether an issue is on their side or on the developers’ side. And that’s just the tip of the iceberg.

How Can High Availability and Fault Tolerance Be Measured?

When it comes to high availability vs fault tolerance, what data should be gathered, and which metrics should be calculated to support decision-making? Let’s find out:

Application Uptime: Availability Metric

Uptime is the percentage of time a system, service, or application is available and operational without interruptions.

Uptime is typically measured over a given period (e.g., a month or a year) and is calculated using the following formula:

Uptime (%) = (Total time - Downtime) / Total time × 100%

When an app is in development, uptime is typically calculated as part of load tests. When monitoring an already published application, developers typically use Firebase Performance Monitoring by Google.

A good uptime value lies within the 96-97% range, while 99% or higher is desirable. Here at DashDevs, we ensure that uptime in our applications is no less than 99%.
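To make the arithmetic concrete, here is a minimal Kotlin sketch. It assumes the usual definition of uptime as the share of the observation period without downtime. The function names `uptimePercent` and `allowedDowntime` are illustrative and not part of any monitoring tool.

```kotlin
import java.time.Duration
import kotlin.math.roundToLong

/** Uptime as a percentage of the observation period. */
fun uptimePercent(period: Duration, downtime: Duration): Double {
    require(!period.isZero && !period.isNegative) { "period must be positive" }
    require(!downtime.isNegative && downtime <= period) { "downtime must fit in the period" }
    val upMillis = period.toMillis() - downtime.toMillis()
    return upMillis.toDouble() / period.toMillis() * 100.0
}

/** Downtime budget that a target uptime leaves over the same period. */
fun allowedDowntime(period: Duration, targetUptimePercent: Double): Duration {
    require(targetUptimePercent in 0.0..100.0) { "target must be a percentage" }
    val millis = period.toMillis() * (100.0 - targetUptimePercent) / 100.0
    return Duration.ofMillis(millis.roundToLong())
}

fun main() {
    val month = Duration.ofDays(30)
    // Three hours of downtime in a 30-day month.
    println("Uptime: %.2f%%".format(uptimePercent(month, Duration.ofHours(3))))
    // How much downtime a 99% target tolerates per 30-day month.
    println("Budget at 99%: ${allowedDowntime(month, 99.0)}")
}
```

Expressing a target as a downtime budget is often easier to reason about: over 30 days, a 99% target leaves roughly seven hours of downtime.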

Time to Recover (TTR): Availability Metric

Time to Recover (TTR), also known as Mean Time to Recover (MTTR), is the average time required to restore a system to full functionality after a failure or downtime incident.

The formula for TTR is the following:

MTTR = Total downtime / Number of incidents

The number of incidents, whether critical only or all incidents, can be recorded in numerous ways, both manual and automated. Developers often treat the number of new support tickets created over a particular period as the number of incidents, or analyze data from Firebase Crashlytics.
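As an illustration, the calculation can be sketched in Kotlin, assuming MTTR is total downtime divided by the number of incidents. The `Incident` record is hypothetical. In practice, the timestamps would come from support tickets or monitoring data.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical incident record: when the service went down and when it was restored.
data class Incident(val start: Instant, val resolved: Instant) {
    val downtime: Duration get() = Duration.between(start, resolved)
}

/** MTTR = total downtime / number of incidents; null when there were no incidents. */
fun meanTimeToRecover(incidents: List<Incident>): Duration? {
    if (incidents.isEmpty()) return null
    val total = incidents.fold(Duration.ZERO) { acc, incident -> acc.plus(incident.downtime) }
    return total.dividedBy(incidents.size.toLong())
}

fun main() {
    val t0 = Instant.parse("2025-02-01T10:00:00Z")
    val incidents = listOf(
        Incident(t0, t0.plus(Duration.ofMinutes(30))),                           // 30 min outage
        Incident(t0.plus(Duration.ofDays(5)), t0.plus(Duration.ofDays(5)).plus(Duration.ofMinutes(90))), // 90 min outage
    )
    println("MTTR: ${meanTimeToRecover(incidents)}")
}
```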

Mean Time Between Failures (MTBF): Fault Tolerance Metric

Mean Time Between Failures (MTBF) is a reliability metric that measures the average time between system failures in a given period.

MTBF is calculated as follows:

MTBF = Total operating time / Number of failures

Similar to the case with measuring TTR, data for MTBF is retrieved either manually by counting the number of tickets or by retrieving data from Firebase Crashlytics. When the app is still in development, the number of failed test cases is used instead.

Pro tip: An important aspect to consider here is that some teams may need to calculate several different MTBF metrics: one for all failures and a separate one for critical failures only. This approach provides a more complete insight, facilitating high-level decision-making.

Depending on the nature and complexity of an application, good MTBF calculated for critical failures shouldn’t be less than 1 year.
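The split between all failures and critical failures can be sketched as follows, assuming MTBF is total operating time divided by the number of failures. The `Severity` enum and the sample figures are illustrative only.

```kotlin
import java.time.Duration

// Hypothetical severity labels, e.g. assigned when triaging tickets or crash reports.
enum class Severity { MINOR, CRITICAL }

/** MTBF = total operating time / number of failures; null when nothing failed. */
fun mtbf(operatingTime: Duration, failureCount: Int): Duration? =
    if (failureCount <= 0) null else operatingTime.dividedBy(failureCount.toLong())

fun main() {
    val operatingTime = Duration.ofHours(10_000)
    val failures = listOf(
        Severity.MINOR, Severity.CRITICAL, Severity.MINOR, Severity.MINOR, Severity.MINOR,
    )
    // One figure for all failures, another for critical failures only.
    println("MTBF (all): ${mtbf(operatingTime, failures.size)?.toHours()} h")
    println("MTBF (critical): ${mtbf(operatingTime, failures.count { it == Severity.CRITICAL })?.toHours()} h")
}
```

Returning `null` for a period without failures avoids a division by zero and makes "no failures observed" explicit instead of reporting an arbitrary large number.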

Crash-Free Rate: Fault Tolerance Metric

Crash-Free Rate is the percentage of users or sessions that do not experience a crash within a given period.

Two common ways to measure the crash-free metric are to measure the percentage of crash-free users or the percentage of crash-free sessions:

Crash-free users (%) = (Total users - Users who experienced a crash) / Total users × 100%
Crash-free sessions (%) = (Total sessions - Sessions with a crash) / Total sessions × 100%

For both crash-free metrics, a good value lies within the 96-97% range. Here at DashDevs, again, we maintain this metric above 99% for all our applications.

For a more complete picture, some analysts may divide crashes into categories, like critical or non-critical, or even group them into types. Again, doing so may provide more value for decision-making sessions.
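Both variants of the metric share the same arithmetic, as this short Kotlin sketch shows. It assumes the crash-free rate is the share of users (or sessions) without a crash. The counts are made-up sample values, not data from a real app.

```kotlin
/** Share of units (users or sessions) that saw no crash, as a percentage. */
fun crashFreeRate(total: Long, crashed: Long): Double {
    require(total > 0) { "total must be positive" }
    require(crashed in 0..total) { "crashed must be between 0 and total" }
    return (total - crashed).toDouble() / total * 100.0
}

fun main() {
    // The same function serves both variants; only the unit being counted differs.
    println("Crash-free users: %.2f%%".format(crashFreeRate(total = 20_000, crashed = 150)))
    println("Crash-free sessions: %.2f%%".format(crashFreeRate(total = 120_000, crashed = 480)))
}
```

The two numbers usually differ, because one user can have many sessions, so it helps to state which variant a target refers to.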

High availability largely depends on the software testing instruments in use. Take a read of our article on the best test automation frameworks for complete insight.

Best Practices to Ensure High Levels of Availability and Fault Tolerance

Above all, in every DashDevs project, we use mechanisms focused on ensuring high availability and fault tolerance, and we measure the metrics as part of standard testing procedures. However, to keep this best-practice section both useful and concise, I provide concrete examples from only two of our many success stories and refer to them on multiple occasions. The example cases are:

#1 AI.ko: An application service for a transportation company

For AI.ko, the DashDevs team created a foolproof software solution for sobriety detection. We implemented a centralized system for alcometer and AR tool management. The app also supports advanced biometrics and facial recognition technologies, making it perfect for automated checks.

The liveness check integrated into the app, as well as other real-time functionalities, especially those connected via Bluetooth, are considered complex features, which means they are vulnerable to disruptions. That’s why in this success story, we invested substantial resources to maximize availability and fault tolerance for optimal user experience.

#2 An all-in-one Super Taxi and Delivery application

In this case, which is essentially a super app development project, we created a new design system for the brand and software, and combined functionalities from two separate applications by the same company. We also developed a range of modules on top of it.

Throughout the project, our team conducted a substantial rework of how the app looks and feels, including how well it handles failures and errors. We also included some of the industry-best fault tolerance mechanisms, as detailed below.

Finally, here are our pro practices on high availability and fault tolerance:

How to Ensure High Availability in Mobile Android Apps

#1 Choose system architecture wisely

High availability depends more on the back-end design than on the front end. It’s hard to claim that one architecture pattern is superior to another. However, in my experience, most fintech apps that are built to withstand failures have a microservices architecture.

In the abovementioned Super Taxi & Delivery App, we use the microservices architecture pattern. It helped us ensure high system availability, resulting in uptime of >99.9%.

An additional piece of advice, for microservices projects only, is to keep critical microservices independent of one another to prevent cascading failures.
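One common way to keep a failing dependency from dragging down its callers is a circuit breaker. The article does not state which technique was used in the project, so the Kotlin sketch below is a generic, single-threaded illustration. `CircuitBreaker`, its parameters, and `CircuitOpenException` are hypothetical names.

```kotlin
class CircuitOpenException : RuntimeException("circuit is open, call rejected")

// After `failureThreshold` consecutive failures the breaker opens and rejects
// calls immediately until `openMillis` has passed. Not thread-safe: a sketch only.
class CircuitBreaker(
    private val failureThreshold: Int = 3,
    private val openMillis: Long = 30_000,
    private val clock: () -> Long = System::currentTimeMillis,
) {
    private var consecutiveFailures = 0
    private var openedAt: Long? = null

    fun <T> call(block: () -> T): T {
        val opened = openedAt
        // While the cool-down is running, fail fast instead of calling the dependency.
        if (opened != null && clock() - opened < openMillis) throw CircuitOpenException()
        return try {
            val result = block()      // closed, or a single trial call after the cool-down
            consecutiveFailures = 0
            openedAt = null
            result
        } catch (e: Exception) {
            consecutiveFailures++
            // Re-open after a failed trial call, or open once the threshold is reached.
            if (opened != null || consecutiveFailures >= failureThreshold) openedAt = clock()
            throw e
        }
    }
}

fun main() {
    var now = 0L                      // fake clock so the demo needs no real waiting
    val breaker = CircuitBreaker(failureThreshold = 2, openMillis = 1_000, clock = { now })
    repeat(2) { runCatching { breaker.call<String> { error("payments service down") } } }
    println(runCatching { breaker.call { "ok" } }.exceptionOrNull()) // rejected while open
    now = 1_000
    println(breaker.call { "ok" })    // trial call after the cool-down goes through
}
```

Failing fast keeps the caller responsive and gives the struggling service time to recover instead of piling more requests onto it.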

#2 Include offline mode support

It’s often best to support the option to use local data for some functionalities if real-time updates are not crucial. In the AI.ko transportation app, we integrated local caching, enabling offline functionality for most features. As a result, user experience improved considerably under poor connectivity conditions.
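The idea of falling back to local data can be sketched in Kotlin as below. This is a simplified illustration, not the implementation from the project: `CachedRepository` and `Loaded` are hypothetical names, and an in-memory map stands in for on-device storage such as a database.

```kotlin
import java.io.IOException

// Tells the UI whether the data is fresh or came from the local cache.
data class Loaded<T>(val value: T, val fromCache: Boolean)

class CachedRepository<K, V : Any>(private val fetchRemote: (K) -> V) {
    private val cache = mutableMapOf<K, V>()   // stand-in for persistent local storage

    fun load(key: K): Loaded<V> = try {
        val fresh = fetchRemote(key)
        cache[key] = fresh                     // remember the last successful response
        Loaded(fresh, fromCache = false)
    } catch (e: IOException) {
        val cached = cache[key] ?: throw e     // nothing cached yet: surface the error
        Loaded(cached, fromCache = true)
    }
}

fun main() {
    var online = true
    val repo = CachedRepository<String, String> { id ->
        if (online) "trip history for $id" else throw IOException("no connection")
    }
    println(repo.load("driver-1"))   // fetched remotely and cached
    online = false
    println(repo.load("driver-1"))   // served from the cache while offline
}
```

Exposing `fromCache` lets the screen label possibly stale data, which matters wherever users might mistake cached values for live ones.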

#3 Test app load and utilize load balancing mechanisms

One of the best ways to minimize disruptions is to distribute traffic efficiently to prevent overloads. Measures aimed at this goal are known as load balancing. Naturally, the efficiency of such measures needs to be tested.

In both the AI.ko and Super Taxi & Delivery apps, we conducted load testing along with automation, integration, and unit testing, and implemented strategies to ensure smooth traffic distribution.
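Load balancing itself usually lives in the infrastructure rather than in app code, but the core idea fits in a few lines. The sketch below shows round-robin selection that skips backends marked as unhealthy. It is only one of several possible strategies, the article does not say which one was used, and `RoundRobinBalancer` is a hypothetical name.

```kotlin
class RoundRobinBalancer(private val backends: List<String>) {
    private val unhealthy = mutableSetOf<String>()
    private var next = 0

    fun markDown(backend: String) { unhealthy += backend }
    fun markUp(backend: String) { unhealthy -= backend }

    /** Returns the next healthy backend in rotation, or null when none is available. */
    fun pick(): String? {
        repeat(backends.size) {
            val candidate = backends[next]
            next = (next + 1) % backends.size
            if (candidate !in unhealthy) return candidate
        }
        return null
    }
}

fun main() {
    val balancer = RoundRobinBalancer(listOf("node-a", "node-b", "node-c"))
    balancer.markDown("node-b")                 // e.g. a failed health check
    println(List(4) { balancer.pick() })        // traffic alternates between healthy nodes
}
```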

#4 Prioritize security from the beginning

An often overlooked aspect contributing to high availability is cybersecurity. Protection against malware and other attacks is a solid measure to use when availability is a priority.

You may be interested in exploring banking cybersecurity threats and challenges in another blog post by DashDevs.

While we can’t explicitly disclose the security measures we use in any particular project, we can say that we leverage the entire spectrum of security and dedicated failure detection mechanisms, from KYC and secure authentication to built-in cyber attack protection.

Pro tip: I highly recommend treating only critical or most frequent faults as a priority for performance improvement and fault tolerance initiatives. It’s especially important for smaller-scale projects with tight budgets not to overspend development resources. At the MVP stage, any app inevitably faces occasional errors, and the real question is what to fix to balance resource use without compromising performance.

How to Increase Fault Tolerance in Mobile Android Apps

#1 Implement optimizations in the front-end

In contrast with high availability, fault-tolerant apps are mostly created through front-end development and UX design. The scope of possible actions here is vast, from optimizing user flow to using correct button sizes, so I won’t elaborate on this aspect any further. Let me just remind you that having proper software instruments to retrieve data here is essential.

Explore how to optimize the design in software solutions and platforms for the best user experience and fault tolerance in a corresponding blog post by DashDevs experts.

Returning to DashDevs’ success stories, we integrated Firebase Crashlytics in both projects. Thanks to the insights we gained and the improvements we made, we achieved the following results:

A crash-free rate of >98% for the AI.ko application.
A crash-free rate of >99%, including non-fatal error tracking, for the Super Taxi & Delivery App.

#2 User-friendly messaging

This practice is simple to understand yet difficult to apply. It boils down to the notion that developers are expected to provide clear, localized error messages explaining the issue. For example, if an app fails to connect to the server because of a connectivity issue, the user should be shown a notification that says exactly that.

#3 Guided error resolution

If an error is fixable, then for the best fault tolerance, an app should suggest actions to resolve it. Otherwise, an app should provide an opportunity to contact the support center within the integrated messenger in just a few clicks.

Continuing the example with the server connection issue, a user should not only be shown the issue notification but also given instructions on how to resolve it. For instance, the app can advise the user to check their internet connection, or even guide them to their device “Settings” so they can check whether the internet connection is turned on.

Here at DashDevs, we extensively implement error handling and refresh mechanisms so any failures users may experience are less disruptive.
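User-friendly messaging and guided resolution can be combined in one mapping from technical failures to what the user sees. The Kotlin sketch below is an illustration under assumptions: `UserError`, `Action`, and the message texts are hypothetical, and a real Android app would load localized strings from resources rather than hard-code them.

```kotlin
import java.io.IOException
import java.net.SocketTimeoutException
import java.net.UnknownHostException

// What the UI needs to render: a clear message plus one suggested next step.
enum class Action { RETRY, OPEN_NETWORK_SETTINGS, CONTACT_SUPPORT }

data class UserError(val message: String, val action: Action)

// More specific exception types come first, because both network exceptions
// below are subclasses of IOException.
fun toUserError(cause: Throwable): UserError = when (cause) {
    is UnknownHostException -> UserError(
        "No internet connection. Check that Wi-Fi or mobile data is turned on.",
        Action.OPEN_NETWORK_SETTINGS,
    )
    is SocketTimeoutException -> UserError(
        "The server is taking too long to respond. Please try again.",
        Action.RETRY,
    )
    is IOException -> UserError(
        "We couldn't reach the server. Please try again.",
        Action.RETRY,
    )
    else -> UserError(
        "Something went wrong. Contact support and we'll help you sort it out.",
        Action.CONTACT_SUPPORT,
    )
}

fun main() {
    println(toUserError(UnknownHostException("api.example.com")))
    println(toUserError(IllegalStateException("unexpected response")))
}
```

Keeping the mapping in one place makes it easier to review the wording and to ensure that every fixable error offers an action, while everything else leads to support.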

#4 Retry mechanisms

Temporary failures are a frequent occurrence in most fintech mobile applications that rely on real-time server updates. Normally, they are caused by short internet connection disruptions. In that regard, I highly recommend implementing automatic and manual retry mechanisms to mitigate the issue.

For example, in the Super Taxi & Delivery App, we implemented automatic background refreshes, ensuring seamless fault recovery without interrupting the user experience.
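An automatic retry can be sketched in Kotlin as a small helper with exponential backoff. This is a generic illustration, not the mechanism from the project: `retryWithBackoff` and its parameters are hypothetical, and the injected `sleep` function exists only so the demo runs without waiting. In a fintech context, automatic retries are safest for idempotent requests such as reads, since blindly repeating a payment request could submit it twice.

```kotlin
import java.io.IOException

fun <T> retryWithBackoff(
    maxAttempts: Int = 3,
    initialDelayMillis: Long = 200,
    factor: Double = 2.0,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    block: () -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delay = initialDelayMillis
    repeat(maxAttempts - 1) {
        try {
            return block()
        } catch (e: IOException) {
            // Only I/O failures are treated as temporary; wait, then try again.
            sleep(delay)
            delay = (delay * factor).toLong()
        }
    }
    return block() // last attempt: any exception propagates to the caller
}

fun main() {
    var calls = 0
    val result = retryWithBackoff(sleep = {}) {   // no real waiting in this demo
        calls++
        if (calls < 3) throw IOException("temporary network failure")
        "rates loaded"
    }
    println("$result after $calls attempts")
}
```

When the automatic attempts are exhausted, the exception reaches the caller, which is the natural place to show a manual "Try again" option.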

Pro tip: I recommend not allowing technical debt, which arises on the back end in many teams, to accumulate in fault-tolerant systems. Additionally, continuous UX design improvement is also a strong measure to enhance fault tolerance.

Final Take

When it comes to fintech app development, fault tolerance and high availability are both crucial. While high availability minimizes downtime through redundancy and load balancing, fault tolerance ensures systems recover from failures efficiently. Prioritizing both requires a strategic balance of backend and frontend optimizations. While achieving 100% reliability is unrealistic, proactive monitoring, error handling, and automated recovery mechanisms significantly enhance system resilience.

Having the right team is half the battle. Here at DashDevs, we possess substantial knowledge of how to design and develop mobile apps with high availability and fault tolerance in mind. With more than 15 years on the market and over 500 projects under our belt, we may offer you consulting assistance as well as flawless tech execution.

FAQ

What is high availability and fault tolerance?

High availability (HA) is a metric showing whether a system operates with minimal downtime by using redundancy, load balancing, and failover mechanisms. Fault tolerance (FT) is an indicator of a system’s capability to continue functioning despite hardware or software failures by implementing real-time error detection and automatic recovery. Both improve system reliability but differ in approach and complexity.

What is the difference between fault tolerance and high availability?

High availability (HA) ensures a system operates with minimal downtime by using redundancy, load balancing, and failover mechanisms. Fault tolerance (FT) allows a system to continue functioning despite hardware or software failures by implementing real-time error detection and automatic recovery. Both improve system reliability but differ in approach and complexity.

What are the two types of high availability?

High availability can be active-active or active-passive. In an active-active setup, multiple nodes operate simultaneously, sharing the load and ensuring continuous service if one fails. In an active-passive setup, a primary system runs operations while a backup remains idle until a failure occurs, then takes over.

How do you calculate MTBF?

Mean Time Between Failures (MTBF) is calculated by dividing the total operating time by the number of failures. If a system runs for 10,000 hours and experiences five failures, the MTBF is 10,000 divided by 5, which equals 2,000 hours. A higher MTBF indicates better system reliability.

How to ensure high availability and fault tolerance in mobile Android apps?

Ensuring high availability and fault tolerance in Android apps requires redundant servers, automated failover, and load balancing to minimize downtime. Error handling, checkpointing, and self-healing mechanisms allow the app to recover from failures. Cloud-based services and data replication further protect against service disruptions.