Malaysian Airport Incident – A case study

Last updated: 4 September 2019

Acknowledgement

The information in this post was gathered through crowdsourcing, thanks to the IT Security SIG set up by Nigel Rodrigues. Many people contributed, and the candid discussion there inspired me to write this article.

As this incident is still developing, this article will be updated with the latest information; what you see here is a snapshot at the point of writing.

The incident

On 21 August 2019, the KLIA/KLIA2 airports began to experience system and technical difficulties. The failure affected check-in counters, flight information display systems (FIDS), baggage handling, the airport's mobile app, as well as payment systems that rely on its networks. Unsurprisingly, tempers and dissatisfaction among airport users ran high.

Timeline

20 August 2019 – MAHB signed an MOU with Huawei on technology modernization.

21 August 2019 – KLIA/KLIA2 reported system/technical issues affecting multiple systems in the airport. Initial news indicated a failure in the network equipment.

22 August 2019 – The Star reported that MAHB expected the situation to be resolved by 23 August 2019, as it had received new equipment to replace the existing units, with testing to be conducted the same night.

23 August 2019 – MAHB updated their website (as at 6am), explaining that they were in the midst of stabilizing their systems and had deployed additional buses to ferry passengers to their respective terminals.

24 August 2019 – The Malay Mail reported that the situation had improved, with passenger flow described as smooth apart from intermittent disruptions.

24 August 2019 – NACSA issued a statement affirming that no cyber attack was behind the network issue at KLIA/KLIA2.

25 August 2019 – The Malay Mail reported that KLIA/KLIA2 operations had been restored to normal, based on a check by BERNAMA at 0930.

26 August 2019 – The Ministry of Transport announced a panel to investigate the failure of TAMS (Total Airport Management System). NACSA is one of the members of the committee.

26 August 2019 – MAHB was quoted in a statement as saying that it was not dismissing the possibility that malicious intent caused the incident.

26 August 2019 – Airport passengers reported that service had not fully recovered: the information system was still down and the airports were operating with partial system availability.

27 August 2019 – Airlines were reported to be seeking compensation from MAHB over the airport system downtime.

27 August 2019 – MAHB lodged a police report over possible malicious intent as the cause of the downtime.

28 August 2019 – The PM ordered a probe into the airport downtime incident.

29 August 2019 – PDRM was reported to be probing four individuals in relation to the airport system failure, based on the report lodged by the IT division's senior general manager.

30 August 2019 – AirAsia, a Malaysian carrier, was reported to have confirmed that it would not sue MAHB over the recent airport system failure.

2 September 2019 – Police were reported to have recorded statements from 12 MAHB staff over the system failure incident.

3 September 2019 – Four pioneer MAHB IT officers lodged counter police reports against MAHB. They had been suspended and claimed they were falsely accused.

The cause

The details are vague; however, the incident has been attributed to a faulty IP network switch that brought IP network traffic to a grinding halt. The switch in question appears to be the core switch, which processes all the network traffic for the airport.

A core switch is usually responsible for traffic between segments and also acts as an aggregation point. Each area is connected via a smaller access switch, which uplinks to an intermediate or aggregation switch that in turn leads to the core switch. In this case, with the core switch down, the segments were cut off from one another, even though each machine still showed its network connection as connected. Upstream access, such as the Internet link used by the credit card payment gateway, was also interrupted, because traffic stopped at the failed core.
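To make the failure mode concrete, here is a minimal Python sketch (using the networkx library; the node names are purely illustrative and not MAHB's actual design) of a cascaded star topology. Removing the single core node isolates the segments from each other and from the upstream link, even though every host's local link stays up.

```python
# Minimal sketch, assuming a cascaded star: host -> access switch -> aggregation -> single core.
# Node names are made up for illustration; this is not MAHB's real topology.
import networkx as nx

net = nx.Graph()
net.add_edges_from([
    ("checkin-pc", "access-sw-A"), ("access-sw-A", "agg-sw-1"),
    ("fids-screen", "access-sw-B"), ("access-sw-B", "agg-sw-2"),
    ("agg-sw-1", "core-sw"), ("agg-sw-2", "core-sw"),
    ("core-sw", "internet-gw"),   # upstream link used by the payment gateway
])

print(nx.has_path(net, "checkin-pc", "fids-screen"))   # True  - normal operation
print(nx.has_path(net, "checkin-pc", "internet-gw"))   # True  - payments reachable

net.remove_node("core-sw")                             # simulate the core switch failure
print(net.degree("checkin-pc"))                        # 1 - local link still shows as "connected"
print(nx.has_path(net, "checkin-pc", "fids-screen"))   # False - segments isolated from each other
print(nx.has_path(net, "checkin-pc", "internet-gw"))   # False - payment gateway unreachable
```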

Social Media Buzz

A user claiming to be a subcontractor to MAHB said that the network switch was 17 years old and had never been replaced. This is unconfirmed, pending an official statement from MAHB.

A report from Utusan Malaysia also mentioned something similar.

Related news

Just one day before the incident, on 20 August 2019, MAHB signed an MOU with Huawei “to drive MAHB’s digital transformation framework by enhancing connectivity and real-time information by connecting all stakeholders in one fully integrated digital ecosystem. The collaboration would also seek to set up a fully integrated network communication managed platform to manage above technology and integrated data to enable future big data analysis throughout the entire airport, further improving airport operation efficiency and reduce overall ICT cost.”

It’s probably sheer luck that the network equipment failed the very next day, seemingly catapulting the priority of this initiative.

Assessment

At this point, the lack of official news has led to multiple speculations. The first was that the airport was under a cyber attack; this was quickly quashed by NACSA, which confirmed that there was no attack.

Another discussion led to the belief that there should have been sufficient DR (Disaster Recovery) infrastructure to keep the business running as usual. Assuming the social media claim is right, most networks designed at that time would have had a typical star topology, whereby layer-one connectivity cascades back to a single core switch. Using Cisco as an example, a spine-and-leaf architecture would have allowed traffic to be redirected to a different core, had that been the architecture. Spine-and-leaf is still a relatively new concept, and there are other redundant designs an organization can adopt; a small sketch of the idea follows.
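Continuing the earlier sketch (again with made-up node names, and only as an illustration of the concept rather than a recommendation of any specific product), a spine-and-leaf fabric with two spines survives the loss of either one:

```python
# Minimal sketch of a two-spine leaf/spine fabric; every leaf connects to every spine,
# so no single "core" node is a chokepoint. Node names are illustrative only.
import networkx as nx

fabric = nx.Graph()
leaves = ["leaf-1", "leaf-2"]
spines = ["spine-1", "spine-2"]

fabric.add_edges_from((leaf, spine) for leaf in leaves for spine in spines)
fabric.add_edges_from([("checkin-pc", "leaf-1"), ("fids-screen", "leaf-2")])

print(nx.has_path(fabric, "checkin-pc", "fids-screen"))  # True - normal operation

fabric.remove_node("spine-1")                            # one spine fails
print(nx.has_path(fabric, "checkin-pc", "fids-screen"))  # still True, via spine-2
```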

The Good

MAHB mobilized its own staff, running internal initiatives to get them to assist passengers during these trying times. A poster dated 22 August 2019 was seen circulating on social media, asking staff to help with the situation at KUL during peak hours (12 – 2pm & 4 – 10pm).

MAHB also exhibited a strong understanding of airport processes, managing to fall back on manual procedures and sheer manpower to keep airport operations going while the systems were down.

Flipside

Assuming the theory about 17-year-old network equipment is true, there are two possible outcomes. The first: an overzealous CIO might conclude, “We should sweat our assets more; make sure you don’t buy anything new for the next 15 years! (By the way, are we using the same brand as the airport?)” Scary, to say the least! It is worth remembering that computer and network hardware degrade over time, right down to the copper cabling, which is why some data centres make it a point to “re-cable” their infrastructure periodically. Other views include “we’re not an airport, we won’t need to worry about it”.

The second outcome is that investment in IT now becomes justifiable as part of a technology refresh. A more prudent approach to the technology life cycle emerges, and the MAHB story becomes a talking point at Board level, raising the question of whether the assets in use are still (1) maintained, with the necessary support, and (2) ahead of their End-of-Life/End-of-Support dates. This is in line with managing tech debt, ensuring that the compounding interest doesn’t suddenly come due!

Lessons learnt – so far

1. Have manual processes that will stand in if something fails. Can you operate without technology?

2. Understand the implications of tech debt. It’s only a matter of time before it catches up and the organization pays the compounding interest. Reputational damage can be severe and takes time to recover from.

Reference

  1. Malay Mail – https://www.malaymail.com/news/malaysia/2019/08/23/mahb-network-failure-caused-systems-disruption-at-klia/1783638
  2. MAHB Official PR – https://www.malaysiaairports.com.my/media-centre/news/klia-network-disruption
  3. NACSA PR – https://www.nacsa.gov.my/doc/Press_Release_MAHB_KLIA_English.pdf
  4. TheStar MAHB Huawei MOU – https://www.thestar.com.my/business/business-news/2019/08/20/huawei-malaysia-to-support-mahb039s-digital-transformation
  5. Cisco Spine & Leaf Architecture – https://www.cisco.com/c/en/us/products/collateral/switches/nexus-7000-series-switches/white-paper-c11-737022.html
  6. Copper degradation – https://www.quora.com/Does-a-signal-sent-over-a-cable-network-degrade-over-time
  7. Potential malicious intent – https://www.thestar.com.my/news/nation/2019/08/26/mahb-not-ruling-out-malicious-intent-behind-klia-glitch
  8. The Star (22 Aug 2019)  – https://www.thestar.com.my/news/nation/2019/08/22/mahb-expects-klia-glitch-to-be-resolved-by-friday-morning-aug-23
  9. MAHB update (23 Aug 2019) – http://www.malaysiaairports.com.my/media-centre/news/latest-update-systems-disruption-klia-0
  10. The Malay Mail – Day 3 – https://www.malaymail.com/news/malaysia/2019/08/24/klia-systems-still-crippled-but-operations-improving-on-third-day-video/1783804
  11. The Malay Mail – Day 4 – https://www.malaymail.com/news/malaysia/2019/08/25/klia-operations-back-to-normal-after-system-outage/1784006
  12. The Star – https://www.thestar.com.my/business/business-news/2019/08/27/airlines-to-seek-mahb-compensation-for-delays-losses
  13. NST – MAHB lodges police report over KLIA systems disruption – https://www.nst.com.my/news/nation/2019/08/516444/mahb-lodges-police-report-klia-systems-disruption
  14. The Star – PM wants probe into KLIA systems malfunction – https://www.thestar.com.my/news/nation/2019/08/28/pm-wants-probe-into-klia-systems-malfunction
  15. The Malay Mail – Police to probe four over KLIA systems disruption – https://www.malaymail.com/news/malaysia/2019/08/29/report-police-to-probe-four-over-klia-systems-disruption/1785325
  16. NST – AirAsia won't sue MAHB over system glitches at KLIA and KLIA2 – https://www.nst.com.my/business/2019/08/517398/airasia-wont-sue-mahb-system-glitches-klia-and-klia2
  17. The Malay Mail – Police record statements from 12 MAHB staff – https://www.malaymail.com/news/malaysia/2019/09/02/klia-systems-disruption-police-record-statements-from-12-mahb-staff/1786552
  18. NST – 4 MAHB IT officers lodge police reports against their employer – https://www.nst.com.my/news/crime-courts/2019/09/518461/4-mahb-it-officers-lodge-police-reports-against-their-employer-over

Do you need BCP for Cloud?

I woke up feeling very warm. I thought I had missed the alarm, but it was just 3:23 am. I was quite sure I didn’t need a potty break; I was extremely sleepy and obviously upset. I leaned over to check the AC (air conditioning) and found that it was off. It was definitely too warm, and by now the AC should have kicked in. Mumbling, I dragged my already tired and weary body towards the thermostat to see what was happening.

After blinking a few times to get my sight back to normal, I found that the Nest thermostat wasn’t working. Walking back to my bedside table to grab my phone (I know, it’s a bad habit), I checked to see if the internet was down. WiFi seemed up; I checked my public IP (instead of good ol’ ping) and everything seemed okay. Google search loaded fine. Still half asleep, I rummaged through my bedside drawer for the remote and turned the AC on. “This is too much work,” grumbled my half-sleepy head. That was enough for the night.


I woke up in the morning with a sleep hangover (yes, it’s possible when you don’t get enough sleep), trying to figure out what had happened. I turned on Twitter and, true enough, reports of a Google Cloud services failure were trickling in.

URL: https://status.cloud.google.com/incident/compute/19003

The horror! Google Cloud services went down?

*My panicked head screaming – The sky has fallen! The Sky has fallen!*

This pretty much explained why the thermostat went down. I wondered how many threat actors lost their C2 hosted on Google services, how many IoT devices like the Nest thermostat stopped working, and how many other dependent services broke. If I, as an end user, was grumbling about service availability, what about corporate organisations relying on cloud services?


Today’s organizations rely heavily on the cloud. Business runs on the cloud. Social media runs on the cloud. Almost everything runs on the cloud. Whether it’s servers, virtual servers, serverless, or functions (you name it), it runs on the cloud. (Disclaimer: most of my stuff also runs on the cloud…)

But is a cloud outage a rarity? Well, it depends on what you deem rare. The Internet forgives, but never forgets. On 25 August 2013, AWS suffered an outage, bringing down Vine and Instagram with it. On 14 March 2019, Facebook went down, taking WhatsApp with it, in an apparent server configuration change issue.

The impact is obvious: businesses lose revenue when services go down. A local carrier such as AirAsia runs its kit mostly on the cloud, and the impact is devastating; imagine flight bookings going dark. The same goes for many other businesses. This raises an interesting point: what is your business continuity plan if the cloud goes down?

When I had this conversation a few years ago, most CIOs I spoke to boldly claimed that their BCP is the cloud (we never reached the part about cloud and security, because the discussion is most often dominated by the cost debate). In their view there was no need, given the apparent global redundancy of cloud infrastructure. The once-sleeping-soundly-at-night CIOs have now been rudely awakened (just like me, by the broken thermostat) to the fact that the cloud no longer offers the comfort they assumed, after years of CAPEX (capital expenditure) investment and happily paying the cloud providers their monthly dues to keep their services up.

A few points to note for those even thinking about cloud BCP. Yes, it’s time we take the skeletons out of the closet and start talking about this.

Firstly, can your applications and services run on a completely different cloud provider? Let’s look at the layers of services before we answer this question.

XKCD - The Cloud

If you are running server images (compute cloud), it is completely possible to run on a different cloud provider. You will need to be able to replicate the server image across providers. You can capture the setup of your cloud server in scripts, create a repository to host your configuration files, and execute the setup script to bring up the services on a separate cloud provider. The setup and configuration can be hosted in a private git/svn repository and pulled when needed; a rough sketch of the idea follows.
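As a hedged illustration of the “scripted setup in a repository” approach, here is a minimal Python sketch. The repository URL, provider names, and bootstrap script names are placeholders I have invented, and a real setup would likely use a proper infrastructure-as-code tool; the point is simply that the whole build lives in version control and can be replayed against whichever provider is available.

```python
#!/usr/bin/env python3
"""Sketch: rebuild a service on an alternate provider from scripted, version-controlled config."""
import subprocess
import sys

CONFIG_REPO = "git@example.com:ops/server-setup.git"   # hypothetical private repo
BOOTSTRAP = {
    # hypothetical per-provider bootstrap scripts kept in the repo
    "provider-a": "bootstrap_provider_a.sh",
    "provider-b": "bootstrap_provider_b.sh",
}

def failover(target_provider: str) -> None:
    # 1. Pull the scripted setup and configuration files from version control.
    subprocess.run(["git", "clone", "--depth", "1", CONFIG_REPO, "server-setup"], check=True)
    # 2. Replay the setup against the chosen provider.
    script = BOOTSTRAP[target_provider]
    subprocess.run(["bash", script], cwd="server-setup", check=True)

if __name__ == "__main__":
    failover(sys.argv[1] if len(sys.argv) > 1 else "provider-b")
```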

What about data? Most database services provide replication and backup facilities. For “modern” database services, data can be spread across multiple database nodes for better availability and redundancy.

The actual sticking point for hybrid cloud is serverless/function-based hosting. If the organization invests heavily in one particular cloud provider’s technology (without naming any), portability depends on that technology. If something common such as Python is used, portability is pretty much assured. Technologies that are exclusive to one cloud provider will have portability issues across providers.

Another question that needs answering is: how would you “swing” your services across different cloud providers? A common approach for internet-facing availability is to use DNS. Using DNS, the organization can change the location of services by changing the DNS records, allowing a seamless failover without changing the URL. However, the speed of failover is determined by the DNS TTL (time-to-live) configured on the record. Too low, and your DNS servers are constantly hit with queries, but changes propagate almost instantly (usually a low TTL is around 15 to 30 minutes). Too high, and your DNS infrastructure sees less traffic, but it takes a long time before the failover actually takes effect. DNS-based failover also creates an administrative headache for firewall administrators, as they have to move from IP-based to DNS-based access control lists. A small sketch of the TTL trade-off follows.
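As a small, hedged sketch of that TTL trade-off, the snippet below uses the dnspython library (an assumption on my part; any resolver library would do) to read the current TTL of a record and estimate the worst-case window before clients would notice a repointed record. The hostname is a placeholder.

```python
# Requires: pip install dnspython   (assumed third-party library, not in the standard library)
import dns.resolver

HOSTNAME = "www.example.com"   # placeholder for your service's public DNS name

answer = dns.resolver.resolve(HOSTNAME, "A")
ttl_seconds = answer.rrset.ttl

# After the record is repointed to another provider, resolvers may keep serving
# the cached old answer for up to the full TTL, so this is roughly the
# worst-case failover delay seen by clients.
print(f"{HOSTNAME} A record TTL: {ttl_seconds} seconds")
print(f"Worst-case time for clients to reach the new provider: ~{ttl_seconds / 60:.1f} minutes")
```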


The cloud isn’t just hot air. Moving towards Industry 4.0 (now I’m just throwing buzzwords around), cloud adoption is definitely a core component of the technology strategy every organisation needs. As time goes by, we find that even the cloud is fallible, hence a proper approach to the cloud is key to business continuity.

So, what’s your approach towards Cloud Services BCP?