The Complexity Behind The Simplicity
In my last blog post, we explored the engineering challenges of building products delivered ‘as a service’. Developers constantly make trade-offs between simplicity (sometimes at the expense of functionality) and functionality (sometimes at the expense of added complexity). As enterprises operate multiple datacenters alongside IaaS, PaaS and SaaS, protecting all of these workloads becomes a very complicated problem, one that can only be solved by a very complicated solution. The key is to deliver that solution ‘as a service’ without passing the cost of that complexity on to the user.
Key Tenets for “As A Service”
Upgrades and troubleshooting are some of the most painful and complicated tasks for systems administrators in the enterprise. At Clumio, we decided to take these tasks away from the administrator completely. This is possible only if the entire system is designed to be restartable and upgradable, and to have extreme observability. These properties cannot be added later as an afterthought; they must be foundational concepts from the beginning of development.
As a reminder, observability means our software stack can and must detect failures and collect metrics and logs to provide an accurate view of the entire system. Given the amount of data and transactions typically involved in backup, this seems like an impossible task. So how do we do it?
The diagram on the right is what we call a workflow. A workflow is the execution plan and runtime that drives an operation. Every shape in the workflow is a step function that executes a piece of work and is idempotent, versioned, configurable and monitored. Each arrow describes the control flow (what happens if a step executes successfully, fails or is canceled).
This particular diagram corresponds to a workflow executing a backup of a VM with 2 disks. Everything that we do (backup, restore, data post-processing, metadata collection, analytics, clean-up, etc) is driven by a workflow.
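To make the idea concrete, here is a minimal Python sketch of a workflow as a set of named steps connected by outcome-based arrows. It is an illustration only, not the actual Clumio implementation: the names Step, Outcome, execute and build_vm_backup, and the step names themselves, are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    CANCELED = "canceled"


@dataclass
class Step:
    name: str
    version: int                            # steps are versioned
    run: Callable[[dict], Outcome]          # the unit of work; should be idempotent
    # The "arrows": which step to go to for each outcome (missing = stop).
    next_on: Dict[Outcome, str] = field(default_factory=dict)


def execute(steps: Dict[str, Step], start: str, context: dict) -> List[Tuple[str, Outcome]]:
    """Follow the arrows from `start` until a step has no arrow for its outcome."""
    history = []
    current: Optional[str] = start
    while current is not None:
        step = steps[current]
        outcome = step.run(context)
        history.append((step.name, outcome))
        current = step.next_on.get(outcome)
    return history


def record(name: str, outcome: Outcome = Outcome.SUCCESS) -> Callable[[dict], Outcome]:
    """Hypothetical stand-in for real work: notes the attempt and returns a fixed outcome."""
    def run(context: dict) -> Outcome:
        # Adding to a set is idempotent: running the step twice changes nothing.
        context.setdefault("attempted", set()).add(name)
        return outcome
    return run


def build_vm_backup(failing_step: Optional[str] = None) -> Dict[str, Step]:
    """A backup of a VM with 2 disks; any failure or cancellation routes to cleanup."""
    order = ["snapshot_vm", "copy_disk_0", "copy_disk_1", "commit_backup"]
    steps = {}
    for i, name in enumerate(order):
        outcome = Outcome.FAILURE if name == failing_step else Outcome.SUCCESS
        arrows = {Outcome.FAILURE: "cleanup", Outcome.CANCELED: "cleanup"}
        if i + 1 < len(order):
            arrows[Outcome.SUCCESS] = order[i + 1]
        steps[name] = Step(name, 1, record(name, outcome), arrows)
    steps["cleanup"] = Step("cleanup", 1, record("cleanup"))
    return steps
```

Calling execute(build_vm_backup(), "snapshot_vm", {}) walks the four backup steps in order, while passing failing_step="copy_disk_1" makes the run branch to cleanup right after the second disk copy fails.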
Don’t worry if this sounds and looks too complicated. The beauty of our secure backup as a service for the enterprise is that all of this is for the Clumio team to understand, so that we can provide a great service experience to our customers. The whole point of our product is that you don’t need to worry about any of it: you simply consume it as a service instead of managing it yourself.
Enabling a New Support Model
Because the workflow is broken down into granular step functions that are individually monitored, the engineering and support teams get an alert if any one of the steps fails. Delivered through Slack and SMS, every alert is issued with the corresponding workflow diagram. Color-coded states allow us to easily identify the exact step that failed, the error message, the inputs and outputs of all previous steps that led to the failure, and the log-id the team uses to search all logs related to the operation across the system.
Basically, everything required to identify the problem and fix it – fast. None of this is exposed to our customers but it is an essential part of the design that allows us to service our customers and their workloads.
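As a rough sketch of what such an alert could carry, the Python below assembles a payload from per-step records: the failed step, its error message, the inputs and outputs of the steps before it, and the log-id. The StepRecord type, the build_alert function and every field name are hypothetical, chosen for illustration; the delivery to Slack or SMS is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepRecord:
    name: str
    state: str                                   # "succeeded", "failed" or "pending"
    inputs: Dict[str, Any] = field(default_factory=dict)
    outputs: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def build_alert(workflow_id: str, log_id: str, records: List[StepRecord]) -> Optional[dict]:
    """Return an alert for the first failed step, or None if nothing failed."""
    for index, rec in enumerate(records):
        if rec.state == "failed":
            return {
                "workflow_id": workflow_id,
                "log_id": log_id,                # key for searching related logs
                "failed_step": rec.name,
                "error": rec.error,
                # Inputs and outputs of every step that ran before the failure.
                "previous_steps": [
                    {"name": r.name, "inputs": r.inputs, "outputs": r.outputs}
                    for r in records[:index]
                ],
            }
    return None


def example_records() -> List[StepRecord]:
    """Made-up records for a VM backup whose first disk copy failed."""
    return [
        StepRecord("snapshot_vm", "succeeded", {"vm": "vm-1"}, {"snapshot": "snap-1"}),
        StepRecord("copy_disk_0", "failed", {"snapshot": "snap-1"}, error="disk unreachable"),
        StepRecord("copy_disk_1", "pending"),
    ]
```

With the example records, build_alert names copy_disk_0 as the failed step and includes the snapshot step's inputs and outputs, which is the context someone on call would need to start investigating.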
Every workflow can be interrupted at any point and resumed from the last known step. For us, that means an upgrade that delivers a new product capability is no different from recovering from a restart – it is all part of the platform. We don’t need to ask our customers for an upgrade window or planned downtime; it just happens automatically, without any downtime or user intervention.
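The resume behavior can be sketched with a simple checkpoint: after each step completes, record how far the workflow got, and on restart pick up from there. This is a minimal illustration under stated assumptions, not the real platform: run_workflow, the Interrupted exception and the in-memory checkpoints dictionary are invented for the example, and a real system would persist both the checkpoint and the workflow state durably.

```python
from typing import Callable, Dict, List, Tuple


class Interrupted(Exception):
    """Stands in for a process restart or an upgrade landing mid-workflow."""


def run_workflow(
    workflow_id: str,
    steps: List[Tuple[str, Callable[[dict], None]]],
    checkpoints: Dict[str, int],
    state: dict,
) -> None:
    """Run steps in order, starting after the last step recorded as complete."""
    start = checkpoints.get(workflow_id, 0)
    for index in range(start, len(steps)):
        _name, work = steps[index]
        work(state)
        # Only record the checkpoint once the step has fully completed.
        checkpoints[workflow_id] = index + 1


def demo() -> Tuple[List[str], Dict[str, int], dict]:
    calls: List[str] = []
    interrupt_once = {"pending": True}

    def snapshot(state: dict) -> None:
        calls.append("snapshot")
        state["snapshot"] = "snap-1"

    def copy(state: dict) -> None:
        calls.append("copy")
        if interrupt_once["pending"]:
            interrupt_once["pending"] = False
            raise Interrupted("restarted during copy")
        state["copied"] = True

    def commit(state: dict) -> None:
        calls.append("commit")
        state["committed"] = True

    steps = [("snapshot", snapshot), ("copy", copy), ("commit", commit)]
    checkpoints: Dict[str, int] = {}
    state: dict = {}
    try:
        run_workflow("backup-1", steps, checkpoints, state)   # interrupted mid-copy
    except Interrupted:
        pass
    run_workflow("backup-1", steps, checkpoints, state)       # resumes at "copy"
    return calls, checkpoints, state
```

In the demo, the snapshot step runs once, while the interrupted copy step is attempted twice. That second attempt is why each step needs to be idempotent: resuming re-runs whatever step was in flight when the interruption hit.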
During our beta cycle, we had a customer that tested the service with an unsupported configuration. Once we received an alert that something had gone wrong, it was obvious that the cause was an unsupported configuration, because we had observability built in. We happened to already have the fix in testing and it was almost ready to be deployed. We went ahead and upgraded the service the next day and re-ran the failed workflows on behalf of the customer. This anecdote illustrates the benefits of an authentic SaaS:
– Clumio was able to detect the failure before the user did
– Clumio was able to root-cause the issue without any user intervention
– Clumio was able to automatically upgrade the service without any user intervention
Leveraging the Cloud the Right Way
Think about how different this experience is compared to what enterprises have been doing for years to design, deploy, operate and upgrade backup systems in the data center. The public cloud is about agility, and therefore complexity becomes the enemy of everything. Simplicity combined with proactive service is the winning formula, but making that a reality is extremely difficult.
In my next blog post, we’ll dig deeper to compare and contrast the cloud backup experience when it is delivered the right way, and when it is not.