Over the past few weeks, there have been three widespread outages affecting all services across the Office 365 suite. Sporadic, limited outages are relatively common for services within the suite, but rarely does the entire service go down for multiple hours at a time.
- Limited information was available during the outage that occurred on September 29. Because the Office 365 Admin Centre had become inaccessible, Twitter became the main source of updates on the outage.
- Some days after the outage, a more thorough investigation by Microsoft found that a code change was the cause. Despite the many systems in place to ensure code quality and prevent such issues, untested code was deployed to the production servers.
- Similar causes have been cited for the two major outages in the weeks following this incident, which calls into question how the software giant could allow basic coding practices to slip and affect the hundreds of millions of users of the service.
While we as end users are powerless to provide remedies when the root of the outage runs so deep, it’s important not to become complacent about the reliability of these cloud services. The same applies to many other aspects of your IT infrastructure.
There’s no doubt that modern systems have improved vastly in many respects and help make your business as efficient and productive as possible. What happens when these systems go down also needs careful consideration when it comes to business processes.
When it comes to something as far-reaching as Office 365, outages can certainly be difficult to plan for. Big players like Microsoft provide an SLA (Service Level Agreement) as a promise of reliability of their service, and compensation can be made available should the service be inaccessible for an extended period. While these “Service Credits” can take some of the sting out of the lost productivity or loss of income caused by extended outages, these are usually inconsequential when compared to the potential disruption from the loss of access to entire systems where workarounds or alternative tasks cannot be found.
For more isolated issues where a workaround is possible, the priority should become communicating this workaround to all users as quickly as you can to ensure minimal loss in productivity. It’s very easy for users to assume nothing can be done once their usual workflow is affected or interrupted.
But above all, software and cloud system providers should be chosen with careful consideration, so that incidents like these arise as rarely as possible.
If you need a second opinion in making the call on who your providers should be, or how to handle any potential outages that may occur, by all means get in touch with the Altitude Innovations team.