Operational Incident beginning 20:40 on Wed 30 Aug 2023

Last updated: Thu 7 Sep 2023 11:00 AEST

On Wed 30 Aug 2023 (beginning ~20:40 NEM time) we experienced a significant disruption to a number of our systems hosted in Azure, which in turn affected a number of our clients.

In the following timeline we highlight the key milestones in the event and link to further information.

Wednesday 30th Aug 2023

18:41

From the Microsoft ‘Preliminary Post Incident Review’ we now know that:

Starting at approximately 08:41 UTC on 30 August 2023, a utility power sag in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones.

Further details from this Microsoft report are mentioned in the timeline below.
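The times in this timeline are in NEM time (AEST, a fixed UTC+10 offset with no daylight saving), whereas Microsoft's reports quote UTC. As a minimal sketch of that conversion, assuming Python and a hypothetical helper named `utc_to_nem` (not part of any Global-Roam or Microsoft tooling), the two UTC timestamps quoted on this page map to the timeline entries as follows:

```python
from datetime import datetime, timedelta, timezone

# NEM time is a fixed UTC+10 offset all year (no daylight saving).
NEM_TIME = timezone(timedelta(hours=10), name="NEM")


def utc_to_nem(utc_text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' as UTC and return the NEM-time equivalent.

    Hypothetical helper, for illustration only.
    """
    utc_dt = datetime.strptime(utc_text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(NEM_TIME)


# Power sag reported by Microsoft at approximately 08:41 UTC
print(utc_to_nem("2023-08-30 08:41").strftime("%H:%M"))  # 18:41

# First Azure update, last updated at 11:30 UTC
print(utc_to_nem("2023-08-30 11:30").strftime("%H:%M"))  # 21:30
```

A fixed offset is used here rather than a named zone such as Australia/Sydney, because NEM time does not shift for daylight saving.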

~20:40

First impacts on our systems detected

20:55

First GR team member becomes aware of these issues and begins to investigate.
More team members subsequently involved in investigations (though the cause was upstream from us).

21:19

Global-Roam status page updated to reflect the interruption to our systems and products, and included the following description from Microsoft:

We are experiencing impact related to a cooling issue for a sub-section of a single data centre in the Australia East region. This is resulting in connectivity and availability issues for some Storage and Compute resources in this region. Additional Azure services with dependencies on these resources may also experience impact related to this. We are actively working onsite to mitigate the cooling issue, and updates will be provided in an hour or as events warrant.

21:30

First update from Azure:

Azure Services - Australia East - Investigating We are currently investigating an issue impacting Azure Services in Australia East. Further details will be provided shortly. This message was last updated at 11:30 UTC on 30 August 2023

21:52

Global-Roam status page updated with a summary of the incident and current state:

We are experiencing a major failure due to upstream Microsoft Azure hosting issues in Australia East region. Many systems and products are currently affected. We have assessed our recovery options and unfortunately we have few options at this time and are awaiting a resolution from Microsoft.

22:35

First email from AEMO:

INC0118077 - Alert - Issue with Azure Australia East Region causing impact to multiple services

… noting that …

AEMO has declared a major incident for the applications hosted Azure Australia East Region. This is an issue at Microsoft end which is being investigated by Microsoft Support.

Thursday 31st Aug 2023

06:20

Global-Roam status page updated with:

Microsoft have been working to restore services but many critical parts of our infrastructure are still down. We are actively working to reassess and work around the issues.

06:36

This article posted on WattClarity® with some details:

Power surge affects Azure and hence AEMO (and GR)

… and circulated via social media channels.

06:40

Global-Roam status page updated with:

A key service has just been restored by Microsoft and data for most GR applications is now flowing again. We are continuing to investigate any flow on effects.

~08:24

Most services restored (some work remained to be completed to ensure no gaps in the data for any client through the affected period).

10:00

Global-Roam status page updated with:

Most services have now fully recovered but we are aware that NEMReview 7 / Trends will have some data gaps from the affected times which we are still resolving. Some Hosted MMS customers may also potentially have missing data from the affected time range.

Friday 1st Sept 2023

15:08 to 16:22

A range of customers identified as possibly directly affected by this issue were contacted via email to follow up, with Subject Line:

GR Case 0000nnnn - Operating Incident (Thu 30th Aug 2023) and its impact on [CLIENT NAME]

With customer-specific Case Numbers attached.

Saturday 2nd Sept 2023

~14:30

Microsoft releases its ‘Preliminary Post Incident Review’

This has been saved as a PDF and made accessible to our clients here.

Thursday 7th Sept 2023

11:00

This Incident Report Page created (including this timeline) as a permanent record.

Thursday 21st September 2023

Microsoft’s Post Incident Review (PIR) made available from this page

Microsoft released its Post Incident Review. It is available at our downloads site for permanent access here.