On Sunday, November 3rd at 5:26pm CET (UTC+1), all agents using Yuso’s back office started noticing:
- a slow or unresponsive dashboard page
- an inability to book any new ride from any platform
Both symptoms had the same cause: the Here directions API was unresponsive.
After a first period of issues lasting from about 5:26pm to 5:39pm, everything seemed back in order and the Here directions API appeared to be working fine. However, the same issue occurred again at 6:39pm, at which point all users were switched to the Google API (at ~ 6:43pm).
Incident severity level (SLA): S0
Time to detect service interruption:
- First occurrence: 10 minutes
- Second occurrence: ~ 3 minutes
Time to resolution:
- First occurrence: ~ 5 minutes (Here started working again after 15 minutes)
- Second occurrence: ~ 2 minutes
ROOT CAUSE ANALYSIS
The Here directions API started malfunctioning: https://status.here.com/status
STEPS TO RESOLUTION
Steps taken to diagnose, assess, and resolve:
- 5:37: Checking the back office pages; everything seemed OK.
- 5:38: Checking the AWS dashboard and seeing high latencies on one of our services.
- 5:39: Scaling up our service from 3 to 6 processes. At this point, users reported that everything was back to normal.
- 5:40-5:55: Analyzing the latencies on monitoring tools. At this point it became clear that the problem originated from the Here directions API. Preparing a script to switch all offices to Google, and testing the switch on a test account.
- 6:39: Users reported the same issue again.
- 6:40: Checking our monitoring tools: same metrics as when the problem first occurred.
- 6:44: Switching all offices to Google.
LEARNINGS & NEXT STEPS
How do we prevent this issue from happening again?
- Improve alerting when the Here API is unresponsive
- Implement an automatic fallback to another directions provider (such as Google) when the Here API is unresponsive
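As a starting point for discussion, the two next steps above could be combined in a thin wrapper around the directions calls. The sketch below is only an illustration under stated assumptions, not our actual code: `DirectionsWithFallback`, `ProviderError`, the `on_alert` callback and the threshold of 3 are all hypothetical names and values, and each provider function is assumed to enforce its own request timeout and raise `ProviderError` (or `TimeoutError`) when the upstream API does not answer.

```python
from typing import Callable, Optional

# Hypothetical type: a provider takes (origin, destination) and returns a route.
Provider = Callable[[str, str], dict]


class ProviderError(Exception):
    """Raised by a provider function when the upstream API fails or times out."""


class DirectionsWithFallback:
    """Try the primary provider first; on failure, use the fallback provider.

    After `failure_threshold` consecutive primary failures, `on_alert` is
    called once so that the team is notified without waiting for user reports.
    """

    def __init__(
        self,
        primary: Provider,
        fallback: Provider,
        failure_threshold: int = 3,  # assumed value, to be tuned
        on_alert: Optional[Callable[[str], None]] = None,
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.on_alert = on_alert
        self.consecutive_failures = 0

    def route(self, origin: str, destination: str) -> dict:
        try:
            result = self.primary(origin, destination)
        except (ProviderError, TimeoutError):
            self.consecutive_failures += 1
            # Alert exactly once, when the threshold is first reached.
            if (
                self.consecutive_failures == self.failure_threshold
                and self.on_alert is not None
            ):
                self.on_alert(
                    f"primary directions provider failed "
                    f"{self.consecutive_failures} times in a row"
                )
            # Errors from the fallback itself are left to propagate.
            return self.fallback(origin, destination)
        # A successful primary call resets the failure streak.
        self.consecutive_failures = 0
        return result
```

In this sketch, `primary` would wrap the Here call and `fallback` the Google call, and `on_alert` could post to whatever paging or chat channel we use. The wrapper retries the primary provider on every request, so traffic returns to Here as soon as it recovers; a circuit breaker that skips the primary for a cool-down period would be a possible refinement if the added latency during an outage is a concern.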