Our primary database server suffered a hardware failure, and after the server was restarted the database was unable to start.
We restored the database to full operation with no data loss, but this required a slow, manual process.
Typically in a situation like this, one of our backup databases, which run 24/7 with live, up-to-date data, would have taken over. However, the issue that affected the primary database was passed through replication to the backups, corrupting them as well and preventing them from running.
We also take frequent snapshots of all databases, so we could have restored from a recent backup, but this was not required.
We are conducting a full review to determine how we can prevent a similar combination of simultaneous failures from causing problems in the future.
Two separate incidents were identified within a 24-hour period, and both have been resolved. Measures have been put in place to prevent these issues from recurring.
The data center where our servers are located suffered multiple power source failures, including failures of the backup power systems in place. Unfortunately this affected both our primary and backup servers, taking our systems offline.
During the outage we were able to bring temporary systems online using the previous day's offsite backups, and we offered these to customers who required urgent access to view their bookings.
Since this incident we have reviewed and upgraded our hosting infrastructure to include backup systems running 24/7 on three different continents. These systems are synchronised with our primary systems, so they can take over operation instantly with no risk of data loss. This is in addition to the multiple layers of backups that we already operate to keep our customers' data safe.
Read more about this incident here: https://transporters.io/incident-report-11th-april-2017/