Backend incident
Incident Report for Yuso
Postmortem

SUMMARY

On the 2nd of July, we experienced issues displaying driver details and statistics via the Yuso main backend. This did not impact the ride dispatch process.

ROOT CAUSE ANALYSIS

The database containing driver statistics started to respond slowly on the 1st of July. One of the indices had grown too large and started overloading one of the database nodes, increasing response times. To cope with that overload, we upgraded the cluster on the 1st of July to bring response times back to normal and give us room to work on the root cause.
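For illustration only, assuming the statistics store is an Elasticsearch-style cluster (the report mentions indices, documents and cluster nodes; the exact technology and the host name below are assumptions on our part), an oversized index and an overloaded node can be spotted with the _cat APIs:

  # Hypothetical diagnostic sketch: list indices by on-disk size and check per-node
  # pressure on an Elasticsearch-style cluster. The host name is made up.
  import requests

  ES = "http://stats-cluster.internal:9200"  # assumed address, not the real one

  # Indices sorted by store size, largest first
  print(requests.get(f"{ES}/_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc").text)

  # Per-node heap, disk and CPU usage to see which node is overloaded
  print(requests.get(f"{ES}/_cat/nodes?v&h=name,heap.percent,disk.used_percent,cpu,load_1m").text)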

Unfortunately, we could not complete the upgrade on the 1st of July because of the oversized index, which left the cluster in a transitional state.

On the 2nd of July, load increased with business activity and the same issue reappeared, leaving our clients unable to display driver details on our main backend.

We first tried to clean up the index to reduce its size, but this operation was too slow. We then switched to a new cluster with a minimal data set to bring response times back to normal, and started transferring the necessary data.
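As a sketch of the "transfer necessary data" step, again assuming an Elasticsearch-style store, a reindex-from-remote call can copy only the recent documents into the new cluster; the host names, index name, field name and 30-day window below are illustrative assumptions, not our actual configuration:

  # Hypothetical sketch: copy only recent documents from the old cluster into the
  # new one with reindex-from-remote. Hosts, index and time window are assumptions.
  import requests

  NEW_ES = "http://stats-cluster-v2.internal:9200"

  body = {
      "source": {
          "remote": {"host": "http://stats-cluster.internal:9200"},
          "index": "driver-stats",
          "query": {"range": {"timestamp": {"gte": "now-30d/d"}}},
      },
      "dest": {"index": "driver-stats"},
  }
  # wait_for_completion=false returns a task id that can be polled while the copy runs
  resp = requests.post(f"{NEW_ES}/_reindex?wait_for_completion=false", json=body)
  print(resp.json())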

STEPS TO RESOLUTION

Steps taken to diagnose, assess, and resolve:

  • 8h30 - Notified about the issue by our error monitoring
  • 8h35 - Started investigation and linked the issue to the cluster state
  • 9h03 - First call from Marcel reporting slowness when displaying driver details
  • 9h30 - Started cleaning up old documents in the index
  • 9h43 - Switched to a new index to offload the existing one
  • 12h58 - Decreased the timeout for driver statistics calls from our main backend (see the sketch after this timeline)
  • 13h07 - Set up a new cluster
  • 13h40 - Switched the microservices to the new cluster and started the data transfer
  • 15h01 - Switched the last microservice to the new cluster
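
The 12h58 mitigation above is a client-side change: cap how long the main backend waits for the statistics service so that driver detail pages degrade gracefully instead of hanging. A minimal sketch, assuming an HTTP statistics service; the URL, the 2-second budget and the helper name are illustrative, not our actual code:

  # Hypothetical sketch of the timeout mitigation: fail fast on statistics calls
  # so driver detail pages render without statistics instead of hanging.
  # The service URL and the 2-second budget are assumptions for this example.
  import requests

  def fetch_driver_stats(driver_id: str):
      try:
          resp = requests.get(
              f"http://stats-service.internal/drivers/{driver_id}/stats",
              timeout=2,  # previously much longer; fail fast and fall back instead
          )
          resp.raise_for_status()
          return resp.json()
      except requests.RequestException:
          return None  # caller renders the driver details page without statistics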

LEARNINGS & NEXT STEPS

This issue was a known risk: the index had been growing steadily since its first release, although we did not expect to face this situation so soon. The feature was released in 2018 with a design flaw that could not be addressed without significant rework. This was not a problem until we approached a critical size threshold.

We now have a clean configuration that is easy to work with, and a smaller set of historical data. We decided to reduce the retention window, since such a long history was of no use to our clients.

The last step is to implement automatic index rollover to simplify and speed up our maintenance process. By working on smaller chunks of data, we will not only decrease overall response times but also greatly reduce maintenance toil.
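As a purely illustrative sketch of such a rolling index, still assuming an Elasticsearch-style store with index lifecycle management, a rollover policy would cap each index by size and age and drop old chunks automatically; the policy name, thresholds and retention below are assumptions:

  # Hypothetical sketch of automatic rollover via an ILM policy: write to a small,
  # bounded index, roll over by size/age, and delete old chunks after a while.
  # Policy name, thresholds and retention below are illustrative assumptions.
  import requests

  ES = "http://stats-cluster-v2.internal:9200"

  policy = {
      "policy": {
          "phases": {
              "hot": {
                  "actions": {"rollover": {"max_size": "30gb", "max_age": "7d"}}
              },
              "delete": {
                  "min_age": "90d",
                  "actions": {"delete": {}},
              },
          }
      }
  }
  requests.put(f"{ES}/_ilm/policy/driver-stats-rollover", json=policy)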

Posted Jul 06, 2020 - 11:08 CEST

Resolved
The issue is now resolved.

Driver acceptance statistics and connection times are nearly up to date as the transfer process reaches its end.
Posted Jul 03, 2020 - 06:33 CEST
Monitoring
Latency issues have been resolved and dashboard operations are now back to normal.
In the meantime, driver acceptance statistics and connection times are unavailable or invalid.
We are currently fixing this issue and monitoring the situation.
Posted Jul 02, 2020 - 14:01 CEST
Identified
The issue was identified very quickly after the investigation started: the database used to store driver statistics is responding very slowly.
We are still working on the root cause, but in the meantime we have deployed a fix to mitigate the slowness encountered by users.
Posted Jul 02, 2020 - 13:14 CEST
Investigating
We're experiencing an incident on our Yuso backend and are currently looking into the issue.

Do not hesitate to contact us for further details at help@yusofleet.com.
Posted Jul 02, 2020 - 10:55 CEST
This incident affected: Back Office (Dashboard).