I woke up feeling very warm. I thought I had missed the alarm, but it was just 3:23 am. I was very sure I didn’t need a potty break, extremely sleepy and obviously upset. I leaned over to look at the AC (air conditioner) and found that it was off. I was very sure it was too warm and that the AC should have kicked in by now. Mumbling, I dragged my already tired and weary body out of bed and walked towards the thermostat to see what was happening.
After blinking a few times to get my sight back to normal, I found that the Nest thermostat isn’t working. Walking back to my bedside table to grab my phone (I know, it’s a bad habit), I checked to see if the internet was down. WiFi seems up, checked my public IP (instead of good ol’ ping), everything seems okay. Google search shows up okay. Still with sleep in my head, I rummaged through my bedside drawer for the remote and turned it on. “This is too much work” – grumbled my half sleepy head. That’s enough for the night.
I woke up in the morning with a sleep hangover (yes, it’s possible when you don’t get enough sleep) and tried to figure out what had happened. I turned on Twitter and, true enough, reports of a Google Cloud services failure were trickling in.
The horror! Google Cloud services went down?
*My panicked head screaming – The sky has fallen! The sky has fallen!*
This pretty much explains why the thermostat went down. I wondered how many threat actors lost their C2 hosted on Google services, and how many IoT devices like the Nest thermostat, along with other dependent services, stopped working. If I, as an end user, am grumbling about service availability, what about corporate organisations relying on cloud services?
Today’s organizations rely heavily on cloud. Business today runs on cloud. Social media runs on cloud. Almost everything runs on cloud. Whether it’s servers/virtual servers, serverless or functions (you name it), it runs on cloud. (Disclaimer: most of my stuff also runs on cloud…)
But is a cloud outage a rarity? Well, it depends on what you deem rare. The Internet forgives, but never forgets. On August 25, 2013, AWS suffered an outage, bringing down Vine and Instagram with it. On March 14, 2019, Facebook went down, taking WhatsApp along with it, in what was apparently a server configuration change issue.
The impact is obvious: businesses lose revenue when their services go down. A local franchise such as AirAsia runs its kit mostly on cloud. The impact is devastating; imagine flight bookings going dark. The same goes for a lot of other businesses. This brings up an interesting point: what is your business continuity plan if cloud goes down?
When I had this conversation a few years ago, most CIOs I spoke to boldly claimed that their BCP is the cloud (we never reached the part about cloud and security because the conversation was most often dominated by the cost debate). There was no need for a separate plan, they said, due to the apparent global redundancies of cloud infrastructure. The once-sleeping-soundly-at-night CIOs have now been rudely awakened (just like me, by the broken thermostat) to find that cloud no longer offers the comfort they thought they had bought, after investing years of CAPEX (capital expenditure) and happily paying cloud providers their monthly dues to keep their services up.
A few points to note for those interested in even thinking about cloud BCP. Yes, it’s time we took the skeletons out of the closet and started talking about this.
Firstly, can your applications and services run on a completely different cloud provider? Let’s look at the layers of services before we answer this question.
If you are running server images (compute cloud), it’s entirely possible to run them on a different cloud provider. You’ll need to be able to replicate the server image across cloud providers. You can capture the setup of your cloud server in scripts, create a repository to host your configuration files and execute the setup script to bring up the services on a separate cloud provider. The setup and configuration can be hosted in a private git/svn repository and pulled when needed.
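As a rough illustration of the "setup script in a repository" idea, here is a minimal Python sketch. Everything named in it is an assumption for illustration only: the repository URL, the branch name, the working directory and the `setup.sh` script are hypothetical, and the sketch assumes `git` and `bash` are available on a freshly provisioned server. It defaults to a dry run so nothing is executed by accident.

```python
import subprocess
from typing import List


def build_bootstrap_commands(repo_url: str, ref: str, workdir: str,
                             setup_script: str) -> List[List[str]]:
    """Return the commands needed to rebuild a server from a config repository.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    return [
        # Fetch only the requested branch or tag of the configuration repository.
        ["git", "clone", "--depth", "1", "--branch", ref, repo_url, workdir],
        # Run the provider-neutral setup script kept inside that repository.
        ["bash", f"{workdir}/{setup_script}"],
    ]


def bootstrap(commands: List[List[str]], dry_run: bool = True) -> None:
    """Execute the commands in order, or only print them when dry_run is True."""
    for argv in commands:
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            # check=True stops the bootstrap at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    plan = build_bootstrap_commands(
        repo_url="git@example.com:ops/server-config.git",  # hypothetical repo
        ref="main",
        workdir="/opt/server-config",
        setup_script="setup.sh",  # hypothetical script name
    )
    bootstrap(plan)  # dry run: prints the plan without touching the machine
```

The point of keeping the plan as plain data is that the same repository and script can be pointed at a server on any provider, as long as the script itself avoids provider-specific tooling.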
What about data? Most database services provide replication and data backup. With “modern” database services, data can be spread across multiple databases for better availability and redundancy.
The actual sticking point for hybrid cloud is serverless/function-based hosting. If the organization invests heavily in one particular cloud provider’s technology (without naming any particular provider), then it depends on the portability of that technology. If something common such as Python is used, portability is pretty much assured. Technologies that are exclusive to one cloud provider will have portability issues across different cloud providers.
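One way to keep function-based code portable is to put the business logic in a plain Python function and wrap it in thin, provider-specific entry points. The sketch below is only an illustration under stated assumptions: `quote_fare`, `provider_a_handler` and `provider_b_handler` are hypothetical names, and the two entry-point shapes (an event dictionary with a JSON string in `"body"`, and a request object exposing `get_json`) are loosely modelled on common serverless platforms rather than taken from any specific provider's documentation.

```python
import json


def quote_fare(payload: dict) -> dict:
    """Provider-neutral business logic (hypothetical example)."""
    base = float(payload.get("base_fare", 0))
    passengers = int(payload.get("passengers", 1))
    return {"total": round(base * passengers, 2)}


def provider_a_handler(event, context):
    """Adapter for a platform that passes an event dict and a context object.

    Assumes the request body arrives as a JSON string under "body".
    """
    payload = json.loads(event.get("body") or "{}")
    return {"statusCode": 200, "body": json.dumps(quote_fare(payload))}


def provider_b_handler(request):
    """Adapter for a platform that passes a web-framework style request.

    Assumes the request object offers get_json(silent=True).
    """
    payload = request.get_json(silent=True) or {}
    headers = {"Content-Type": "application/json"}
    return json.dumps(quote_fare(payload)), 200, headers
```

Only the two small adapters would need rewriting when moving between providers; `quote_fare` and its tests stay untouched.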
Another question that needs to be answered is: how would you “swing” your services across different cloud providers? A common approach for internet-facing availability is to use DNS. Using DNS, the organization can change the location of its services by changing the DNS records. This allows failover without having to change the URL. However, the speed of failover is determined by the DNS TTL (time-to-live) configured for that record. Too low, and your DNS servers will be constantly hit with queries, but changes take effect quickly (a low TTL for failover purposes is usually in the range of seconds to a few minutes). Too high, and your DNS infrastructure will see little traffic, but it takes a long time before the failover actually happens. DNS-based failover also creates an administrative headache for firewall administrators, as they have to change their approach from IP-based to DNS-based access control lists.
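The TTL trade-off can be put into rough numbers. The sketch below uses a deliberately simplified model, and all of it is an assumption for illustration: every caching resolver is taken to re-query at most once per TTL window, clients are assumed to honour the TTL, and the resolver count and detection time are made-up inputs. Real resolver behaviour varies, so treat the results as back-of-the-envelope bounds only.

```python
import math

SECONDS_PER_DAY = 86_400


def worst_case_failover_seconds(ttl_seconds: int, detection_seconds: int) -> int:
    """Upper bound on the time until clients see the new record.

    A resolver may have cached the old record just before the change,
    so it can keep serving it for one full TTL after the outage is detected.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return detection_seconds + ttl_seconds


def upper_bound_queries_per_day(ttl_seconds: int, resolver_count: int) -> int:
    """Rough daily query ceiling if each resolver refreshes once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return resolver_count * math.ceil(SECONDS_PER_DAY / ttl_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: 500 resolvers, 90 seconds to detect the outage.
    for ttl in (60, 3600):
        print(
            f"TTL {ttl}s:",
            worst_case_failover_seconds(ttl, detection_seconds=90), "s to fail over,",
            upper_bound_queries_per_day(ttl, resolver_count=500), "queries/day at most",
        )
```

Under these assumptions, a 60-second TTL gives a worst case of 150 seconds to fail over at up to 720,000 queries a day, while a one-hour TTL cuts the load to 12,000 queries a day but stretches the worst case to 3,690 seconds.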
Cloud isn’t all just hot air. Moving towards Industry 4.0 (now I’m just throwing buzzwords around), cloud adoption is definitely a core component of the technology strategy that each organisation needs to have. As time goes by, we find that even cloud is fallible, hence a proper approach towards cloud is key to business continuity.
So, what’s your approach towards Cloud Services BCP?