
Anees Noor · Sep 22, 2025
Industry: Banking
Within a complex system architecture, significant slowdowns occurred in the critical login service during routine operations. The objective was to analyze and resolve these performance bottlenecks to ensure system stability and a reliable user experience.
The root cause of the login service degradation was initially unknown. Analysis revealed that the service itself was not at fault; instead, a seemingly independent internal service triggered a cascading effect by sharing a database resource, thereby compromising the critical login process.
We conducted a comprehensive root cause analysis using load tests across various scenarios (isolated and simultaneous), formulated and validated a hypothesis, and then implemented an effective solution: targeted rate limiting (throttling) of the non-critical service, combined with continuous monitoring of system metrics.
The login response time was improved by 30%, and critical errors (500/503) were reduced by over 60%. Database utilization was stabilized, enhancing overall system stability and ensuring that business-critical workflows remain reliable even under heavy load.
In complex system architectures, performance bottlenecks often occur in unexpected places. This case study documents the analysis and resolution of significant slowdowns in the critical login service, AuthService. The root cause analysis pointed not to a defect in the service itself, but to a seemingly independent internal service that triggered a cascade effect by sharing a database resource. The following report describes the methodological approach used to identify the problem, the strategy used to validate the hypothesis, and the implementation of an effective solution through targeted rate limiting.
Slowdowns were observed in the system's login service, AuthService, during routine operation. Upon further investigation, it was determined that the problem was related to an internal service called ActivityQueryService, which is frequently used by employees to retrieve customer activity logs.
Although both services operate independently at the application level, they access a shared backend resource: the audit database.
As concurrent use of ActivityQueryService increased, it triggered a cascading effect on system performance that noticeably degraded login response times in particular. While ActivityQueryService itself remained functional, its behavior under load affected more critical workflows such as user authentication.
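To make this failure mode concrete, here is a minimal, purely illustrative model; the pool size and timings are invented for the sketch, and the real services are of course not Python coroutines. Two otherwise independent code paths draw from one connection pool, so a burst of long report queries starves the short login queries:

```python
import asyncio
import time

# Illustrative model only: one semaphore stands in for the shared
# audit-DB connection pool that both services draw from.
POOL = asyncio.Semaphore(10)

async def query_audit_db(seconds: float) -> None:
    # Both services compete for the SAME pool - the hidden dependency.
    async with POOL:
        await asyncio.sleep(seconds)  # stand-in for a DB round trip

async def login() -> float:
    # AuthService: a short, latency-sensitive audit lookup during login.
    start = time.monotonic()
    await query_audit_db(0.05)
    return time.monotonic() - start

async def fetch_activity_log() -> None:
    # ActivityQueryService: a long report query that holds a connection.
    await query_audit_db(2.0)

async def main() -> None:
    # A burst of report queries occupies every pool slot, so logins queue
    # behind them even though the services share no application code.
    reports = [asyncio.create_task(fetch_activity_log()) for _ in range(50)]
    await asyncio.sleep(0.1)  # let the burst claim the pool first
    latencies = await asyncio.gather(*(login() for _ in range(20)))
    print(f"avg login latency: {sum(latencies) / len(latencies):.2f}s")
    await asyncio.gather(*reports)

asyncio.run(main())
```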
During the investigation, the following symptoms were observed (a diagnostic sketch follows this list):
Audit database connection pool exhaustion: The DB connection pool reached its limits during peak utilization and blocked incoming requests.
Increased 500/503/504 errors: These error responses rose significantly during peak loads.
Cross-service performance degradation: Login flows, although not directly dependent on ActivityQueryService, were affected by noticeable timeouts and slowdowns.
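The case study does not name the audit database vendor. Assuming for illustration a PostgreSQL instance and a hypothetical monitoring DSN, a periodic poll of pg_stat_activity like the sketch below makes pool exhaustion visible per calling application:

```python
import time

import psycopg2  # assumes a PostgreSQL audit DB; adapt for other vendors

DSN = "host=audit-db.internal dbname=audit user=monitor"  # hypothetical DSN

def sample_connections(conn) -> None:
    # Count audit-DB connections per application and state, to see which
    # service is actually filling the pool during a slowdown.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT application_name, state, COUNT(*) "
            "FROM pg_stat_activity GROUP BY application_name, state "
            "ORDER BY COUNT(*) DESC"
        )
        stamp = time.strftime("%H:%M:%S")
        for app, state, count in cur.fetchall():
            print(f"{stamp}  {app or '<unnamed>'}  {state}: {count}")

if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    conn.autocommit = True  # avoid holding an idle-in-transaction session
    while True:             # sample every few seconds during a load test
        sample_connections(conn)
        time.sleep(5)
```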
To validate and isolate the issue, a comprehensive series of load tests was designed and executed:
Scenarios before and after the fix (pre-fix vs. post-fix)
Isolated execution of AuthService
Isolated execution of ActivityQueryService
Simultaneous execution of both services
Monitored metrics
Average response time per API
Maximum and average number of open connections to the audit database
Error rates and types
Throughput (requests/sec)
All tests were conducted in a controlled environment to ensure repeatability. Metrics and results were monitored using monitoring dashboards and backend logs.
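For illustration, a simple asyncio client can drive the isolated and simultaneous scenarios and report the same metric categories. This sketch uses the third-party aiohttp library; the staging URLs and request counts are placeholders, not the project's real hosts or tooling:

```python
import asyncio
import time

import aiohttp  # third-party HTTP client used for this sketch

# Placeholder endpoints - the real hosts and routes are internal.
AUTH_URL = "https://staging.example.com/auth/login"
ACTIVITY_URL = "https://staging.example.com/activity/logs"

async def hit(session: aiohttp.ClientSession, url: str, stats: dict) -> None:
    # Fire one request and record its status code and latency.
    start = time.monotonic()
    try:
        async with session.get(url) as resp:
            stats[resp.status] = stats.get(resp.status, 0) + 1
    except aiohttp.ClientError:
        stats["client_error"] = stats.get("client_error", 0) + 1
    stats.setdefault("latencies", []).append(time.monotonic() - start)

async def scenario(urls: list[str], requests_per_url: int) -> dict:
    # One scenario run: isolated (a single URL) or simultaneous (both).
    stats: dict = {}
    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(hit(session, u, stats) for u in urls for _ in range(requests_per_url))
        )
    elapsed = time.monotonic() - start
    latencies = stats.pop("latencies", [])
    stats["avg_latency_s"] = round(sum(latencies) / len(latencies), 3)
    stats["throughput_rps"] = round(len(latencies) / elapsed, 1)
    return stats

async def main() -> None:
    print("AuthService only:   ", await scenario([AUTH_URL], 200))
    print("ActivityQuery only: ", await scenario([ACTIVITY_URL], 200))
    print("Both simultaneously:", await scenario([AUTH_URL, ACTIVITY_URL], 200))

asyncio.run(main())
```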
A fixed limit on concurrent ActivityQueryService requests per node was implemented. Requests that exceeded this threshold were either queued or answered with controlled error responses; a minimal sketch of this mechanism follows the list below.
This adjustment ensured that:
Connections to the audit database were maintained and not monopolized by a single service.
Resource access for critical services such as login was guaranteed.
System-wide contention was proactively mitigated.
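As promised above, here is a minimal sketch of such a per-node throttle. The class name, limits, and timings are invented for illustration, and the production services are not implemented in Python; the point is the mechanism, namely bounded concurrency, a bounded queue, and a controlled rejection beyond that:

```python
import asyncio

class NodeThrottle:
    # Hypothetical per-node throttle: at most `max_concurrent` requests
    # run at once, at most `max_waiting` may queue behind them, and any
    # further request fails fast with a controlled error instead of
    # monopolizing the shared audit-DB connection pool.
    def __init__(self, max_concurrent: int, max_waiting: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._waiting = 0
        self._max_waiting = max_waiting

    async def run(self, fn, *args):
        if self._waiting >= self._max_waiting:
            raise RuntimeError("throttled: queue full, retry later")
        self._waiting += 1
        try:
            await self._slots.acquire()  # queue here until a slot frees up
        finally:
            self._waiting -= 1
        try:
            return await fn(*args)
        finally:
            self._slots.release()

throttle = NodeThrottle(max_concurrent=5, max_waiting=20)  # invented limits

async def fetch_activity_log(request_id: int) -> str:
    async def query() -> str:
        await asyncio.sleep(1.0)  # stand-in for the audit-DB query
        return f"activity log {request_id}"
    return await throttle.run(query)

async def main() -> None:
    results = await asyncio.gather(
        *(fetch_activity_log(i) for i in range(40)), return_exceptions=True
    )
    rejected = sum(isinstance(r, RuntimeError) for r in results)
    print(f"{len(results) - rejected} served, {rejected} throttled")

asyncio.run(main())
```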
The login response time was improved by 30%.
The use of the audit DB was stabilized.
The maximum number of connections used fell by 50%.
The average number of connections used fell by over 45%.
The error distribution improved:
500/503 errors at login were reduced by over 60%.
A slight increase in ActivityQueryService errors was observed due to the introduced limit (expected behavior).
Although the response times of ActivityQueryService almost doubled under load, this trade-off was deemed acceptable to ensure overall system stability. The increase in response time was an expected consequence of throttling and queuing, and it did not affect any business-critical workflows.
Throttling is not the same as slowing down: Intelligent rate limiting helps to maintain core functionalities and avoid overload.
Shared backend resources can act as hidden dependencies: Understanding shared components is essential for a robust architecture and to avoid side effects.
Tests should reflect realistic user flows: Simulating concurrent, real-world usage scenarios yields far more insight than running each test in isolation.
Controlled errors are preferable to system-wide failures: Allowing non-essential processes to fail in a controlled manner maintains system integrity.
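To illustrate that last point, the hypothetical handler below (again a Python sketch using aiohttp, with an invented route and limit) degrades the non-critical endpoint in a controlled way: when all slots are busy, it answers immediately with HTTP 429 and a Retry-After header instead of letting requests pile up on the shared pool until login times out as collateral damage.

```python
import asyncio

from aiohttp import web

SLOTS = asyncio.Semaphore(5)  # invented per-node limit for the sketch

async def activity_handler(request: web.Request) -> web.Response:
    if SLOTS.locked():
        # Controlled degradation: the non-critical endpoint fails fast
        # with an explicit, retryable error instead of queueing on the
        # shared audit-DB pool and stalling login as a side effect.
        return web.Response(
            status=429,
            headers={"Retry-After": "5"},
            text="activity log service busy, please retry",
        )
    async with SLOTS:
        await asyncio.sleep(1.0)  # stand-in for the audit-DB query
        return web.json_response({"logs": []})

app = web.Application()
app.add_routes([web.get("/activity/logs", activity_handler)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
```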
The key message of this case study is that sometimes the most effective way to accelerate system performance is to strategically slow down certain components.