In March of this year, DevOps fwdays’23, an influential online conference dedicated to DevOps practices and tools, took place. The speakers were developers and engineers from Softserve, Spotify, Luxoft, Snyk, Xebia | Xpirit, Solidify, Zencore, Mondoo, and other companies. Mykyta Savin, DevOps Infrastructure Architect at P2H, delivered the presentation “How we block production. Triangulate issue, fix and postmortem” on how a small mistake can block production and what to do in such cases. We are sharing this case in the hope that the information comes in handy (although it’s better if it doesn’t).

Briefly about the product:

P2H develops an E-Government platform for a client from Saudi Arabia. The platform facilitates interaction with the country’s labor market, and its target audience is Saudi citizens and businesses. Development has been ongoing for several years, and the platform is constantly changing and expanding with new services. It is built on an asynchronous architecture to accommodate the peculiarities of working with integration points in the Saudi Arabian government.

Tech Stack and processes:

  • Microservice architecture.
  • Front end: Vue.js, React.js.
  • Back end: Ruby, Ruby on Rails, Java, PHP.
  • Message broker: RabbitMQ.
  • Global cache: Elasticsearch.
  • Infrastructure: Docker.
  • Monitoring, observation, and tracing: Grafana, Grafana Loki, Grafana Tempo, Prometheus, OpenTelemetry, Vector.
  • Integrations: IBM APP Connect, IBM API Connect, Absher, Unifonic, mada, SADAD, and more.

The project is based on a microservice architecture. Over a hundred microservices are currently in production, most of them written in Ruby; new microservices are being launched in Java. The Enterprise Service Bus (ESB) pattern and the RabbitMQ message broker were chosen to implement the project’s asynchronous communication. The storage layer is built on Elasticsearch and PostgreSQL, while the infrastructure uses Docker, Docker Compose, and a local provider in Saudi Arabia to meet the data locality requirements of the government regulator. The Grafana stack is used for monitoring, and there are numerous integration points with various ministries and private institutions. RabbitMQ runs as a cluster of four nodes accessible through the prod-rabbit-new-lb load balancer.
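To make the setup concrete, here is a minimal sketch of how a Ruby service might publish a message onto the bus through the load balancer using the Bunny gem. Apart from the load balancer hostname mentioned above, the exchange name, routing key, and credentials are assumptions for illustration, not values from the project.

```ruby
# Minimal publisher sketch; exchange name, routing key, and credentials are assumptions.
require "bunny"
require "json"

connection = Bunny.new(
  host: "prod-rabbit-new-lb",                    # services go through the load balancer
  user: ENV.fetch("RMQ_USER", "guest"),
  password: ENV.fetch("RMQ_PASSWORD", "guest")
)
connection.start

channel  = connection.create_channel
exchange = channel.direct("esb.events", durable: true)

# Publish a persistent message onto the bus
exchange.publish(
  { event: "application.submitted", id: 123 }.to_json,
  routing_key: "applications",
  persistent: true
)

connection.close
```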

Problem identification and team actions

The problem was identified through automated Prometheus alerts, specifically:

  • The average page processing time alert fired.
  • The gateway timeout count alert fired.

The operations team immediately verified the alerts manually and found that the website was responding very slowly. As a result, an incident was opened and a war room was formed, including the operations team, L3 support, and representatives of the service owner. The problem escalated rapidly: within 15 minutes of detection, the service had practically stopped responding. The service owner had to switch it to maintenance mode and restrict client access.

Triangulation

As you can imagine, the search took place under pressure from the customer’s management and was quite intense. For problems like this, we keep a short checklist with fairly obvious but effective items.

Anamnesis checklist:

  • Have there been any recent system changes?
  • Have there been any recent deployments?
  • Is RabbitMQ (RMQ) functioning properly without any overloads?
  • Have there been any unusual entries in the logs or monitoring system lately?
  • If needed, systematically check all system components.

Since the project’s architecture is based on an ESB, it is very sensitive to how RMQ behaves, so verifying RMQ is one of the first items on the checklist. Even if RMQ itself appears healthy, it is worth checking whether it is overloaded and what resources it is consuming.
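To make that check concrete, here is a minimal sketch of pulling per-queue depths and rates from the RabbitMQ management HTTP API, assuming the management plugin is reachable on its default port behind the load balancer; the host and credentials are assumptions.

```ruby
# Quick look at queue depths and rates via the RabbitMQ management HTTP API
# (host, port, and credentials are assumptions).
require "net/http"
require "json"

uri = URI("http://prod-rabbit-new-lb:15672/api/queues")
request = Net::HTTP::Get.new(uri)
request.basic_auth(ENV.fetch("RMQ_USER", "guest"), ENV.fetch("RMQ_PASSWORD", "guest"))

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
queues = JSON.parse(response.body)

queues.each do |q|
  stats = q["message_stats"] || {}
  puts "#{q['name']}: ready=#{q['messages_ready']} " \
       "unacked=#{q['messages_unacknowledged']} " \
       "publish/s=#{stats.dig('publish_details', 'rate')} " \
       "deliver/s=#{stats.dig('deliver_get_details', 'rate')}"
end
```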

We immediately noticed that the RMQ load balancer was passing an unusually large amount of traffic, which is not normal. In regular operation the load balancer handles about 30 Mbit of traffic per second, but here it was handling 300 Mbit.

At first, we did not pay attention to how suspiciously “flat” that number was. Instead, we spent time on a (quite feverish) search through the system: what was generating traffic in RMQ, where were the messages coming from, and why was none of this visible in the monitoring?

After 20 minutes of searching for the source of the traffic, we returned to the load balancer and noticed something interesting: for those 20 minutes, traffic had stayed at 300 megabits in both directions. We checked the specifications of the port – bingo! A three-hundred-megabit port. Something was eating up all the bandwidth on the load balancer’s network port!

So we had established that something was saturating the bandwidth available to RMQ, and this was one of the problems that caused the system to fail.

We continued the investigation from this point: the number of messages in the queues looked normal, but the rate of messages being taken from the queues and added to them was close to zero. In other words, the RMQ cluster was busy with something that consumed the entire bandwidth, yet new messages were not being enqueued and old ones were not being removed. It resembled a cache-poisoning scenario, so we started digging deeper. There was nothing unusual in the logs, but the RMQ management panel showed a slightly elevated number of unacknowledged messages, which caught our attention.

As you may know, RMQ supports several message acknowledgment modes. We used the mode that requires acknowledgment from the client: the client processes the message and then informs RMQ that it has done so. If, for some reason, the client fails to acknowledge processing, the message is not deleted from the queue. After a certain period, RMQ returns it to the queue and hands it over to another client. This mechanism ensures that messages are not lost and are eventually processed by someone. Our situation seemed to match this behavior, so to test the hypothesis, we set out to find the service responsible.
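For illustration, this is roughly what a manually acknowledged consumer looks like in Ruby with the Bunny gem; the queue name and connection details are assumptions rather than values from the project. If the process dies before the ack is sent, the broker re-queues the message and delivers it to another consumer.

```ruby
# Consumer sketch with manual acknowledgments; queue name and host are assumptions.
require "bunny"

conn = Bunny.new(host: "prod-rabbit-new-lb")
conn.start

channel = conn.create_channel
channel.prefetch(10)                           # unacked messages allowed in flight
queue = channel.queue("applications", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery_info, _properties, payload|
  handle_message(payload)                      # application-specific work (assumed helper)
  channel.ack(delivery_info.delivery_tag)      # only now is the message removed from the queue
  # If the consumer dies before the ack, RMQ re-queues the message
  # and hands it to another consumer.
end
```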

Problem Resolution

After a thorough review of the Node Exporter metrics, we identified an instance that was generating an unusual amount of traffic, matching the traffic received by the RMQ load balancer. On this instance, we discovered a group of containers belonging to a single service that were constantly restarting. On closer inspection, it turned out the service was not simply restarting: it was being killed by the OOM Killer (out-of-memory killer) and then automatically restarted.

The service was built with Ruby on the Sneakers framework, which by default prefetches multiple messages from RMQ, and it turned out the service was prefetching very large messages. Holding them in memory pushed the container past its Docker memory limit, and the container was killed. The connection to RMQ was then lost, so RMQ re-enqueued the prefetched messages and delivered them to another instance reading from the same queue (or to the restarted instance), which met the same fate. As more and more such messages accumulated, we got a kind of cache poisoning inside RMQ, with the entire bandwidth occupied by delivering and re-enqueuing prefetched messages.
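For context, here is a hedged sketch of a Sneakers worker: the prefetch option controls how many unacknowledged messages the worker holds in memory at once, so even a handful of very large prefetched payloads can push a container past its memory limit. The queue name and values below are assumptions, not the project’s actual configuration.

```ruby
# Sneakers worker sketch; queue name and settings are illustrative assumptions.
require "sneakers"

Sneakers.configure(
  amqp: "amqp://prod-rabbit-new-lb",  # connection goes through the load balancer
  prefetch: 10                        # number of unacked messages held per worker
)

class ApplicationsWorker
  include Sneakers::Worker
  # A lower prefetch limits how many (potentially huge) payloads sit in memory at once.
  from_queue "applications", prefetch: 2, ack: true

  def work(payload)
    handle(payload)   # application-specific processing (assumed helper)
    ack!              # without this ack, the message is re-queued when the worker dies
  end
end
```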

As a result:

  • Short-term solution — manually increasing the memory limits in Docker allowed the system to resume normal operation and recover within 10 minutes. The incident was fully resolved approximately one and a half hours after it began.
  • Long-term solution — further investigation and analysis of the service identified the source of the oversized messages, and the issue was addressed; the fix was implemented the same day (a sketch of the kind of safeguard involved follows below).
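The article does not show the actual fix, but as a rough illustration of the kind of safeguard that keeps oversized payloads off the bus, a publisher can check the serialized size and offload large bodies elsewhere before publishing. The threshold, helper name, and offloading step below are all assumptions.

```ruby
# Illustrative guard against oversized messages; the threshold and helper are assumptions.
require "json"

MAX_MESSAGE_BYTES = 128 * 1024  # assumed threshold, not a value from the article

def publish_event(exchange, routing_key, event)
  body = event.to_json

  if body.bytesize > MAX_MESSAGE_BYTES
    # Keep the bus lightweight: store the large body elsewhere and publish a reference.
    reference = store_large_payload(body)       # assumed helper (e.g., object storage)
    body = { payload_ref: reference }.to_json
  end

  exchange.publish(body, routing_key: routing_key, persistent: true)
end
```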

Conclusions

To prevent similar issues in the future, we made improvements to the monitoring system and started collecting metrics to track:

  • Services experiencing frequent restarts, with corresponding alarms.
  • Docker containers being killed by the OOM Killer.
  • Processes in the system being killed by the OOM Killer.
  • Additionally, we started collecting metrics on message sizes in RMQ and implemented alarms that fire when a message exceeds a certain threshold (see the sketch after this list).
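As a hedged sketch of the message-size metric, a worker can record the size of every consumed payload in a Prometheus histogram using the prometheus-client gem; the metric name, buckets, and the idea of alerting on the largest buckets are assumptions rather than the project’s actual configuration.

```ruby
# Message-size metric sketch using the prometheus-client gem
# (metric name and buckets are assumptions).
require "prometheus/client"

registry = Prometheus::Client.registry

message_bytes = Prometheus::Client::Histogram.new(
  :rmq_message_size_bytes,
  docstring: "Size of messages consumed from RabbitMQ, in bytes",
  buckets: [1_024, 16_384, 131_072, 1_048_576, 8_388_608]
)
registry.register(message_bytes)

# Called from the worker for every consumed message:
def record_message_size(histogram, payload)
  histogram.observe(payload.bytesize)
end
```

An alerting rule can then fire when observations start landing in the largest buckets.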

Since then, we have not encountered similar problems. Container restarts caused by the OOM Killer have stopped showing up in the metrics, allowing us to focus on other system issues, and overall, everything has been running smoothly and without incident.