We have been in this pandemic for over a year. The Malaysian government recently issued a decree for a “total lockdown”, which requires everyone to work from home. Only sectors that have been predefined, or that have approval from the ministry, are allowed to operate.
To obtain approval, one has to fill in the details and make a submission to MITI through its CIMS portal. The portal was developed by MARII, an agency under MITI. In this case study, we look at the portal’s operational effectiveness and what it reveals about governmental online services and their digitalization process.
The lockdown was announced to be effective from 1 June 2021 to 14 June 2021 (ignoring the fact that the Health Director General mentioned it would take 3 to 4 months for the lockdown’s effects to bear fruit). From the outset, doubts shrouded the application process, with multiple ministries involved in approvals. After much ado, it was decided that MITI’s CIMS3.0 would be the single point of application.
CIMS in production
The system was used in the previous lockdown, but little documentation about its effectiveness was made available. Hence, during this “supposedly final” lockdown, as the vaccination drive intensifies, I felt it imperative to document the effectiveness of this system.
The system appears to be tiered in line with current practice. The front end is presumably a Varnish Cache server, used to speed up page delivery to clients. This was inferred from the Varnish errors that appeared constantly throughout use of the system.
The backend appears to be an nginx web server, acting as a proxy to the actual web application or web services. This was likewise evident from the error pages displayed.
One interesting thing to note is that although nginx sat in the backend, its error page displayed the version number 1.16.1. By June 2021 the stable nginx branch was 1.20.x and mainline was 1.21.x, so the 1.16 branch was already legacy. Version 1.16.1 falls within the range reported as affected by CVE-2021-23017, a memory overwrite in the nginx resolver that can potentially lead to RCE (remote code execution); whether a given deployment is actually exposed depends on its configuration. Based on the CVE creation date, the identifier has existed since January 2021.
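As a minimal sketch of the version check described above, the snippet below tests whether a server banner falls inside the version range published for CVE-2021-23017 (affected from 0.6.18 through 1.20.0, fixed in 1.20.1 and 1.21.0). The function names are hypothetical, and a match only means the version is in range; actual exposure also depends on whether the resolver is configured.

```python
import re
from typing import Optional, Tuple

# Range from the public advisory for CVE-2021-23017:
# affected 0.6.18 through 1.20.0, fixed in 1.20.1 (stable) and 1.21.0 (mainline).
FIRST_AFFECTED = (0, 6, 18)
FIRST_FIXED = (1, 20, 1)


def parse_nginx_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from a banner such as 'nginx/1.16.1'."""
    match = re.search(r"nginx/(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def in_affected_range(banner: str) -> bool:
    """Return True if the banner's version lies in the affected range."""
    version = parse_nginx_version(banner)
    if version is None:
        return False  # not an nginx banner, or version hidden
    # Tuples compare element by element, so 1.21.0 correctly sorts above 1.20.1.
    return FIRST_AFFECTED <= version < FIRST_FIXED
```

Calling `in_affected_range("nginx/1.16.1")` returns `True`, which is why exposing the version string on an error page matters: it tells anyone who looks which advisories to try first.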
MITI seems to have had its ear to the ground. Based on the feedback provided, it quickly set up custom error pages to mask the underlying daemon messages. A good first move, but the damage was done and the information was already public.
Looking within the layers
While only the system owners know exactly how the application is tiered, it’s fairly obvious that it follows the standard 3-tier architecture, with front load balancers in the form of Varnish and a backend served by nginx. The architecture may be sound (based on assumptions), but where the system fails is capacity management. This isn’t unique to MITI; other government agencies have the same problem (another example would be the JKJAV/CITF web failure during vaccine registration, and the occasional MySejahtera errors).
Reviewing the errors that were constantly displayed, it can be concluded that CIMS was running out of capacity to handle the sudden workload demanded of it. No actual numbers were published on the amount of traffic or the number of requests received, but the errors indicate that the backend infrastructure was unreachable. At times even the login page was not reachable, suggesting that Varnish itself had run out of steam.
Strategy for service
We’ve seen what happened to the service; now let’s explore what can be done to improve it.
The key issue is that everyone is using the service at the same time: all states, all industries, all at once. One way to prevent overload is to stagger requests based on a set parameter, for example by state. States A, B and C submit their requests on certain days, followed by D, E and F. Alternatively, requests could be divided by industry. This allows effective use of resources without scaling up too much. However, it requires adequate lead time before the lockdown comes into effect. Unfortunately, given the circumstances in play, this could not be done.
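The staggering idea above can be sketched as a small schedule builder. This is an illustration only: the state groups, dates and function names are hypothetical, and a real rollout would need to handle exceptions such as late or urgent applications.

```python
from datetime import date, timedelta


def build_windows(groups, first_day, days_per_group):
    """Map each state to a (start, end) date window, one window per group."""
    schedule = {}
    start = first_day
    for group in groups:
        end = start + timedelta(days=days_per_group - 1)
        for state in group:
            schedule[state] = (start, end)
        start = end + timedelta(days=1)  # next group starts the following day
    return schedule


def may_apply(schedule, state, today):
    """Return True if the state's submission window includes today."""
    window = schedule.get(state)
    if window is None:
        return False  # unknown state: no window assigned
    start, end = window
    return start <= today <= end


# Hypothetical grouping and dates, for illustration only.
groups = [["A", "B", "C"], ["D", "E", "F"]]
schedule = build_windows(groups, date(2021, 5, 25), days_per_group=2)
```

With this schedule, states A, B and C may apply on 25 and 26 May, and D, E and F on 27 and 28 May. The portal would call something like `may_apply` at login and show the applicant their assigned dates when they arrive outside their window.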
Government agencies in Malaysia are managed centrally by MAMPU. Creating a central compute pool (a private cloud) would allow services to be deployed on shared infrastructure. Such a government cloud would let any ministry tap into a larger pool of resources when a service is expected to grow beyond its average use. This cloud could be on-premises or hosted with an existing cloud provider.
It is important that any government digital service is built with scalability at heart. Making an app work is primary, but making it scale is paramount. Service metrics should be introduced to measure performance levels, and scaling capabilities (e.g. triggering an Ansible playbook to provision a new web server and assign it to the default pool) should be treated as primary functionality alongside the user-facing features, covering both scale-up and scale-down. One resource worth tapping is the free book Google published through O’Reilly Media, “Building Secure & Reliable Systems: Best Practices for Designing, Implementing and Maintaining Systems”.
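A minimal sketch of a metric-driven scaling decision follows. All names and numbers here (per-server capacity, the 60% utilisation target, the pool limits) are assumptions for illustration, not measurements from CIMS; the resulting action would be handed to whatever provisioning tool is in use, such as the Ansible playbook mentioned above.

```python
import math


def desired_servers(requests_per_sec, capacity_per_server,
                    target_utilisation_pct=60, min_servers=2, max_servers=20):
    """Return how many servers keep each one at or below the target load."""
    # Usable capacity per server is capacity * pct / 100; divide demand by it.
    needed = math.ceil(
        requests_per_sec * 100 / (capacity_per_server * target_utilisation_pct)
    )
    # Clamp to the pool limits so the service never drops below a safe floor
    # or grows past what the shared pool can provide.
    return max(min_servers, min(max_servers, needed))


def scaling_action(current, desired):
    """Translate the gap between current and desired size into an action."""
    if desired > current:
        return ("scale_up", desired - current)
    if desired < current:
        return ("scale_down", current - desired)
    return ("hold", 0)
```

For example, at 1000 requests per second with servers assumed to handle 100 each, `desired_servers(1000, 100)` asks for 17 servers, and `scaling_action(4, 17)` yields `("scale_up", 13)`. The same two functions cover scale-down once the surge passes.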
This issue isn’t new, and it isn’t going away. For a nation focusing on digitalization, service reliability will be critical to the success of that effort. Proper service architecture, engineering, implementation and operations management are all critical to delivering an “always-available” service to its customers.