A week ago today, Optus customers nationwide woke up with no phone service, and no internet either if their home connection is also provided by the ‘Yes’ telco. The outage was mayhem, and the news cycle was absolutely dominated by callbacks to September 2022, when Optus suffered a data breach and made people shitty over how it handled communications at the time.
It was a similar case last week, the whole communications thing, that is. That made sense to a degree: Optus staff mostly use Optus services, so even the company’s CEO had to deliver an apology over WhatsApp. It was bizarre.
As the AFR highlighted, the outage disrupted 10 million customers, shut down Melbourne’s trains, and stopped some people making calls to emergency services. It was a reminder of just how reliant we are on access to phone and internet services.
Comments aside (and I don’t want to engage in the rhetoric around being too reliant on technology), there were real-world issues with people being without access. The Sydney Morning Herald also discussed how the outage was a glimpse into just how vulnerable Australia’s hi-tech infrastructure is, and what might happen if an enemy decided to exploit these weaknesses against us, which is the stark reality of the world we live in.
While the outage was rectified late on Wednesday afternoon, the brand’s reputation (which the telco had clawed back over the last 12 months) was slipping away. Optus apologised and started offering details to some news outlets on what actually caused the outage.
The Australian Financial Review was given an interview with CEO Kelly Bayer Rosmarin, who told the publication a “technical network fault” caused the outage, but would not specify what exactly it was, or where or how it occurred. Gizmodo Australia was told we could not interview the CEO, by the way.
Nonetheless, the takes kept coming, more and more of the ball of string unravelled, and we learned what had caused the massive disruption. On the Optus website, the telco released this statement:
“At around 4.05am Wednesday morning, the Optus network received changes to routing information from an international peering network following a software upgrade. These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.
“The restoration required a large-scale effort of the team and in some cases required Optus to reconnect or reboot routers physically, requiring the dispatch of people across a number of sites in Australia. This is why restoration was progressive over the afternoon.
“Given the widespread impact of the outage, our investigations into the issue took longer than we would have liked as we examined several different paths to restoration. The restoration of the network was at all times our priority and we subsequently established the cause working together with our partners. We have made changes to the network to address this issue so that it cannot occur again.
“We are committed to learning from what has occurred and continuing to work with our international vendors and partners to increase the resilience of our network. We will also support and fully cooperate with the reviews being undertaken by the Government and the Senate.”
A routine software upgrade by a third-party infrastructure provider.
So… what actually happened?
It seems the message from Optus on the root cause is that various routers automatically disconnected themselves after a failsafe mechanism was triggered. So, the software upgrade happened, and the routers then cut themselves off because the failsafe was tripped by an increase in the routing information (addresses) being propagated through the network.
It also seems this occurred as traffic was diverted away from an international peering link in a partner’s network that underwent a software upgrade, which is to say the external upgrade itself is not the root cause. So that’s a little less “a software update caused an outage” and a little more “yikes, this still shouldn’t have happened”.
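If the failsafe idea is hard to picture, here is a toy Python sketch of it. To be clear, everything in it is made up for illustration: the `ToyRouter` class, the `max_routes` limit and the numbers 1000, 800 and 500 are all assumptions. Optus has not published its actual thresholds or router configuration, and this is not how real router software is written. It only shows the general shape of "too much routing information arrives, the router protects itself by disconnecting, and someone has to bring it back".

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Hypothetical router that tracks routes learned from a neighbour."""
    name: str
    max_routes: int  # assumed "preset safety level"; real values are not public
    routes: set = field(default_factory=set)
    connected: bool = True

    def receive(self, announced):
        """Accept route announcements; disconnect if the limit is exceeded."""
        if not self.connected:
            return  # a disconnected router ignores further updates
        self.routes.update(announced)
        if len(self.routes) > self.max_routes:
            # Failsafe: drop everything and stay offline until someone intervenes.
            self.routes.clear()
            self.connected = False

    def manual_reset(self, new_limit=None):
        """Stand-in for an engineer reconnecting or rebooting the router."""
        if new_limit is not None:
            self.max_routes = new_limit
        self.connected = True


def simulate():
    """Feed a normal load, then a sudden flood, and report what happened."""
    router = ToyRouter(name="core-1", max_routes=1000)

    # A normal day: 800 routes, comfortably under the assumed limit.
    router.receive({f"route-{i}" for i in range(800)})
    after_normal = router.connected

    # A sudden burst of 500 extra routes pushes the total past the limit.
    router.receive({f"extra-{i}" for i in range(500)})
    after_flood = router.connected

    return after_normal, after_flood, len(router.routes)


if __name__ == "__main__":
    print(simulate())
```

In this toy version the router stays offline until `manual_reset` is called, which loosely mirrors the part of the Optus statement about having to reconnect or reboot some routers physically.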
The SMH dug a little deeper yesterday, writing that “according to publicly listed information (which may not be exhaustive), the part of Optus’ network affected by last Wednesday’s outage peers with parent company Singtel’s network in Singapore; China Telecom; the US-headquartered global content delivery network Akamai; and Global Cloud Xchange, owned by Jersey-based 3i Infrastructure and formerly known as Flag Telecom.”
Gizmodo Australia would’ve loved to provide you with updates as this all unfolded, but our resourcing constraints unfortunately meant we had to leave the rest of the country’s talented tech writers to handle it. It would be great to probe a little deeper into this explanation, that’s for sure. While we’re not necessarily adding anything to the conversation here, hopefully this puts everything into the one place for you in a way that is relatively easy to consume.
Optus is offering customers an apology in the form of free data, so head over here to work out how to claim it.