On the 2nd of July, we experienced issues displaying driver details and statistics via the Yuso main backend. This did not impact the ride dispatch process.
The database containing driver statistics started to respond slowly on the 1st of July. One of the indexes had grown too large and started overloading one of the database nodes, increasing the response time. To cope with this overload, we upgraded the cluster on the 1st of July to bring the response time back to normal and to give us room to work on the root cause.
Unfortunately, we could not complete the upgrade on the 1st of July because of the size of the index. This left the cluster in a temporary state.
On the 2nd of July, the load increased with business activity and we experienced the same issue, which left our clients unable to display driver details on our main backend.
We first tried to clean up the index to reduce its size, but this operation was too slow. We then switched to a new cluster with minimal data to bring the response time back to normal, and started transferring the necessary data.
Steps taken to diagnose, assess, and resolve:
This issue was known, as this index has been growing steadily since its first release, though we did not expect to face this situation so soon. The feature was released in 2018 with a design flaw that we could not address without heavy work. It was not a problem until the index came close to a critical size threshold.
We now have a clean configuration that we can easily work on, and a smaller set of historical data. We decided to shorten our historical time window, since a very long one is of no use to our clients.
The last step is to work on an automatic rolling index to ease and speed up our maintenance process. Working on smaller chunks of data will not only decrease the overall response time but also greatly reduce our maintenance toil.
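
To illustrate the rolling-index idea, here is a minimal sketch. It assumes one index per calendar month and a fixed retention window. The index prefix, the retention value, and all function names are hypothetical and do not describe the actual setup. The sketch only computes names and expiry, and leaves the database calls out.

```python
from datetime import date
from typing import Iterable, List, Tuple

INDEX_PREFIX = "driver-stats"  # hypothetical prefix, for illustration only
RETENTION_MONTHS = 6           # hypothetical retention window


def index_name_for(day: date) -> str:
    """Return the name of the monthly index that a record dated `day` belongs to."""
    return f"{INDEX_PREFIX}-{day.year:04d}.{day.month:02d}"


def parse_month(index_name: str) -> Tuple[int, int]:
    """Extract (year, month) from a name produced by index_name_for."""
    suffix = index_name.rsplit("-", 1)[1]
    year, month = suffix.split(".")
    return int(year), int(month)


def expired_indexes(
    existing: Iterable[str],
    today: date,
    retention_months: int = RETENTION_MONTHS,
) -> List[str]:
    """List the indexes that fall outside the retention window.

    The current month counts as the first retained month, so a window of 6
    keeps the current month plus the 5 before it.
    """
    current = today.year * 12 + (today.month - 1)
    expired = []
    for name in existing:
        year, month = parse_month(name)
        age_in_months = current - (year * 12 + (month - 1))
        if age_in_months >= retention_months:
            expired.append(name)
    return sorted(expired)
```

With this layout, writes always go to the index for the current month, and maintenance becomes dropping whole expired indexes rather than cleaning up one very large index in place, which is the operation that proved too slow during the incident.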