Backend incident
Incident Report for Yuso
Postmortem

SUMMARY

On the 4th of July 2020, we experienced issues dispatching immediate rides for our Taxi/VTC customers. All automatic dispatches performed by our system failed.

ROOT CAUSE ANALYSIS

Our dispatch system relies on one of our providers to compute the ETA for drivers when an immediate ride is created. When the provider cannot answer, the dispatch system fails and the ride is not dispatched automatically to a driver, so a back office agent has to dispatch the ride manually. This behaviour applies to both VTC and Taxi companies.

Because of the provider outage, all calls from our back end to that provider failed, which made the dispatch system fail as well.

For more information on the outage, see:

https://status.here.com/status?id=status_history_details&sys_id=80480f2d1bbd50540bfa42eddc4bcb0c

STEPS TO RESOLUTION

Steps taken to diagnose, assess, and resolve:

  • 16h19 - Call from one of our clients reporting that they had to dispatch immediate rides manually
  • 16h34 - Issue identified: outage ongoing at our maps provider HERE
  • 16h54 - Switch to another provider to re-enable the automatic dispatch system
  • 17h50 - Second call from our client reporting that the issue was occurring again
  • 18h10 - Identification of a regression in our source code
  • 18h16 - Forced switch to our second provider
  • 19h01 - Outage ended at our provider HERE
  • 21h30 - Switch back to HERE

LEARNINGS & NEXT STEPS

Even though our providers have proven to be very reliable, we cannot afford to let our dispatch system fail when it doesn’t get an answer. Our monitoring system alerted us to issues with part of our back end, but we had to investigate further to discover that our provider was experiencing an outage.
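
As an illustration only, provider-level alerting could be as simple as counting consecutive failed calls per provider and notifying once a threshold is crossed. The sketch below is not our production code: the `ProviderHealth` class, the threshold of 3 and the `notify` callback are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProviderHealth:
    """Tracks consecutive failed calls to one external provider (hypothetical helper)."""

    name: str
    threshold: int = 3                      # assumed value, to be tuned
    notify: Callable[[str], None] = print   # assumed hook: pager, chat, e-mail...
    consecutive_failures: int = 0
    alerted: bool = False

    def record_success(self) -> None:
        # Any successful call resets the counter and re-arms the alert.
        self.consecutive_failures = 0
        self.alerted = False

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        # Notify only once per outage, when the threshold is first reached.
        if self.consecutive_failures >= self.threshold and not self.alerted:
            self.alerted = True
            self.notify(
                f"{self.name}: {self.consecutive_failures} consecutive failed calls"
            )
```

The back end would call `record_failure()` or `record_success()` after each request to the provider, so an outage is reported against the provider itself rather than inferred from downstream symptoms.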

When we first switched from one provider to another, we were confident that the dispatch system would operate normally again. However, a regression had been introduced in our source code and changed its behaviour, so the issue reappeared. This should not happen.

We therefore have three tasks to enhance our systems:

  • Add alerting on our providers so that we are notified when an outage occurs and can react more quickly
  • Fix the regression and improve our testing mechanisms
  • Set up a fallback in our dispatch system to prevent it from failing, by either calling another provider or computing a rough ETA
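
To make the fallback idea concrete, here is a minimal sketch, assuming each provider can be wrapped as a function that returns an ETA in seconds and raises an exception when it cannot answer. The names `compute_eta`, `rough_eta_seconds` and `haversine_km`, as well as the 30 km/h average speed, are assumptions made for this example and do not describe our actual implementation. The rough ETA uses the straight-line (haversine) distance, which underestimates real road distance.

```python
import math
from typing import Callable, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees
# A provider returns an ETA in seconds and raises on failure (assumed contract).
EtaProvider = Callable[[Coord, Coord], float]

EARTH_RADIUS_KM = 6371.0
ASSUMED_SPEED_KMH = 30.0  # hypothetical average urban speed, to be tuned


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def rough_eta_seconds(driver: Coord, pickup: Coord) -> float:
    """Last-resort estimate: straight-line distance at a fixed assumed speed."""
    return haversine_km(driver, pickup) / ASSUMED_SPEED_KMH * 3600


def compute_eta(
    driver: Coord, pickup: Coord, providers: Sequence[EtaProvider]
) -> float:
    """Try each provider in order; fall back to a rough ETA if all of them fail."""
    for provider in providers:
        try:
            return provider(driver, pickup)
        except Exception:
            # Broad catch on purpose in this sketch: any provider error
            # (timeout, HTTP error, bad payload) moves on to the next option.
            continue
    return rough_eta_seconds(driver, pickup)
```

With this shape, an outage at the first provider degrades the precision of the ETA instead of blocking the automatic dispatch.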
Posted Jul 06, 2020 - 16:12 CEST

Resolved
This incident has been resolved and all systems are operating normally.
Posted Jul 04, 2020 - 21:33 CEST
Monitoring
We have deployed a fix and switched some of our processes to another provider.

We are monitoring the issue.
Posted Jul 04, 2020 - 17:27 CEST
Identified
One of our providers is currently experiencing a major outage on its system (https://status.here.com/status), on which we rely to compute the ETA for immediate rides.

We are currently working on a fix.
Posted Jul 04, 2020 - 16:42 CEST
Investigating
We're experiencing an incident on our Yuso backend and are currently looking into the issue.

Do not hesitate to contact us for further details at help@yusofleet.com.
Posted Jul 04, 2020 - 16:34 CEST
This incident affected: Back Office (Dispatch System).