We’ve encountered many businesses that told us they had a backup set up, but when asked “do you know it’s working?” they had only assumptions. In many areas of life, a false sense of security is worse than having no security at all, and this is certainly one of them. You assume that the food you eat is safe, and when a recall happens, many people feel uneasy. Had they known the risk existed in the first place, they might have avoided that food, or at least scrutinized it more closely. Maybe you haven’t scrutinized your backup enough.
Resiliency, Not Just Backup
Many will focus only on having a backup procedure run and won’t even think about why they do it. The question that has to be asked is “what am I protecting against?” This is why we call it a resiliency strategy, rather than a backup strategy. The reason these systems exist is so that your team can work. If they’re down all the time, then they don’t serve their purpose well. They need to be resilient so your team can rely on them and remain productive.
The backup solution you choose should protect against several distinct events. These include a site disaster such as a fire, flood, or theft of equipment, disk failures, and accidental deletion of data. Too many times we have encountered a backup solution that doesn’t meet these criteria. Sometimes we arrive soon enough to intervene, but sadly, many times we are the ones who have to clean up a mess that our predecessor left behind.
Backup to Hard Drive
A widely used backup solution relies on backing up to an external hard drive. While this might be sufficient for some if done correctly, it very often isn’t done correctly. To meet the goals of the backup strategy, the disk used in this solution must be swapped regularly, so that a recent copy is always kept away from the site. Many times, the party responsible for swapping it doesn’t take it seriously. It may go for weeks, months, or indefinitely without being swapped. A disk that is never swapped may still protect against a disk failure, but it does not protect against a site disaster. Should a fire occur, that disk will be destroyed along with the rest of the equipment.
Failing to swap the disk regularly also increases the risk of a complete loss. If a disk fails on the system being backed up, your backup disk could literally be the only place where your data exists. What happens if that disk fails during the restore? If the last swap was six months ago, then the newest remaining copy is six months old, and six months’ worth of work would be lost.
Another very real threat is ransomware. The perpetrators behind ransomware know about your backups. The most common types of backup solutions are targeted automatically, and if you’re unlucky enough to have an active intrusion into your system, the people behind it will find your backup disks.
Lastly, how do you know it’s working? Many people assume that it is. It may have been set up properly to begin with, but things break over time. If you don’t actively know that it is working, you might be sitting on a ticking time bomb. Whoever is responsible for the backup should either check it regularly or receive alerts when it succeeds. It might sound strange to get alerts on success rather than on failure, but if the procedure fails to run at all, then you likely wouldn’t get a failure alert. Backup systems can only send a failure alert when they attempt to run and then hit a problem.
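The reasoning behind success alerts can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular backup product: the function name `backup_is_overdue`, the 26-hour allowance for a nightly job, and the sample dates are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_overdue(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed allowance for a nightly job
) -> bool:
    """Return True when no successful backup has been reported within max_age.

    A backup that failed and a backup that never ran look the same here:
    both leave last_success missing or old, so both raise the alarm.
    """
    if last_success is None:
        return True
    return now - last_success > max_age


if __name__ == "__main__":
    # Illustrative timestamps only.
    now = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
    last_night = datetime(2024, 1, 10, 1, 0, tzinfo=timezone.utc)
    last_week = datetime(2024, 1, 3, 1, 0, tzinfo=timezone.utc)
    print(backup_is_overdue(last_night, now))
    print(backup_is_overdue(last_week, now))
    print(backup_is_overdue(None, now))
```

The check looks only at the time of the last reported success, so it does not depend on the backup software being healthy enough to report its own failure.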
Backup should be your plan for when things break down. If your staff relies on a system to do their work, you should have mitigations in place to keep it working, in proportion to how important that system is. A system that 3 people occasionally need might not be worth thinking too hard about, but one that 20 people constantly need deserves a well-thought-out plan. How much does an hour of downtime cost your organization?
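One rough way to put a number on that question is to multiply the people affected by what an hour of their time costs and by how much of their work is actually blocked. The sketch below is only an estimate under stated assumptions: the function name `downtime_cost`, the 50-per-hour figure, and the 25% share for occasional use are illustrative values, not figures from any real organization.

```python
def downtime_cost(
    people: int,
    hourly_cost: float,
    hours_down: float,
    share_of_work_blocked: float = 1.0,
) -> float:
    """Estimate the cost of an outage in lost staff time.

    share_of_work_blocked is the fraction (0.0 to 1.0) of each person's
    work that stops while the system is unavailable.
    """
    if not 0.0 <= share_of_work_blocked <= 1.0:
        raise ValueError("share_of_work_blocked must be between 0.0 and 1.0")
    return people * hourly_cost * hours_down * share_of_work_blocked


if __name__ == "__main__":
    # Assumed figures: 50 per hour per person, one hour of downtime.
    occasional = downtime_cost(3, 50.0, 1.0, share_of_work_blocked=0.25)
    constant = downtime_cost(20, 50.0, 1.0)
    print(occasional)  # 37.5
    print(constant)    # 1000.0
```

This counts only lost staff time. Missed deadlines, lost sales, and recovery work would come on top of it, so the result is best read as a lower bound.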
For resources provided by in-house servers, this is typically accomplished with redundant parts. In any company that we support, we practically demand redundant disks, or at least make management very aware of the risks that come with not having them. The disks in a computer system are the most likely part to fail, and since they’re where all of the data resides, a disk failure without redundancy immediately leads to extended downtime.
Other redundancies might include power supplies, batteries, electrical circuits, and network cards, among others. In most small installations it’s usually sufficient to stop at redundant power supplies, as they’re the second most likely part to fail. The rest certainly can fail, but it’s a much rarer event. Some systems even support redundancy between entire servers, but in many small installations, it’s more economically viable to buy a mission-critical hardware warranty that has replacement server parts delivered in hours, rather than the days a typical warranty allows.
No matter what redundancies you have in place, you need to know when one of the redundant components fails. If you have two drives but don’t know that one of them has failed, then the redundancy only delays a complete failure instead of preventing it.
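The point about unnoticed failures can be illustrated with a small sketch. It assumes a simple mirror-style arrangement in which any one surviving member keeps the system running, which is not true of every redundancy scheme. The function name `redundancy_state` and the drive labels are hypothetical, and a real setup would read the health values from whatever monitoring the hardware provides.

```python
from typing import Dict


def redundancy_state(member_healthy: Dict[str, bool]) -> str:
    """Classify a redundant set from the health of its members.

    Assumes a mirror-like set: it keeps working as long as at least
    one member is healthy.
    """
    if not member_healthy:
        raise ValueError("a redundant set needs at least one member")
    failed = [name for name, healthy in member_healthy.items() if not healthy]
    if not failed:
        return "healthy"
    if len(failed) < len(member_healthy):
        # Still running, but the safety margin is gone: alert now.
        return "degraded"
    return "failed"


if __name__ == "__main__":
    # Hypothetical drive labels.
    print(redundancy_state({"drive-a": True, "drive-b": True}))
    print(redundancy_state({"drive-a": True, "drive-b": False}))
    print(redundancy_state({"drive-a": False, "drive-b": False}))
```

The "degraded" state is the one worth alerting on, because nothing visibly breaks for the staff while the system is running without its safety margin.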
Does your organization rely on internet connectivity? How much internet downtime is acceptable? These questions are easily overlooked, and the consequences can be substantial. Supporting teleworking staff and using cloud services both rely on internet connectivity. If your office’s internet is down, your teleworking staff may be unable to work at all, and your office staff cannot reach any cloud services. In considering how to prevent downtime, you have to weigh your options and decide whether each one is worth its cost.
The responsibility for resiliency in a given cloud service falls either completely to the service provider or to a combination of the service provider and the organization’s IT support. The massive cloud services like Microsoft 365 and G Suite have many built-in resiliencies by default, and also offer granular options to meet the needs of just about any organization. There’s very little need to produce a traditional backup of your data on those systems. The only reason to keep one would be a lack of faith in those organizations, and if you lack that faith, you probably shouldn’t use them to begin with.
Regarding other “cloud services,” it can be much less clear. There’s a running joke in the IT industry that sums up the concern well: “there’s no such thing as the cloud, it’s just someone else’s computer.” This is true of all cloud services, including the massive ones, so it all comes down to how the organization behind the service operates it. There are unfortunately many line-of-business applications offering a “cloud version” that fails to meet a basic standard for resiliency. Some of them even fail to meet a basic standard for security.
When considering services like these, you must ask questions about their resiliency strategy. Everything described above can happen to them as well. Don’t let your company become a victim of a disaster on their side, such as a flood, fire, or disk failure, without proper mitigations in place. Even if the disaster happens on their system, it’s still your data that gets lost, or at least becomes inaccessible, leading to downtime.
Did Any of This Resonate with You?
Some of these situations should sound familiar. If you’re not confident that your important systems are resilient, then you should seek professional IT support to examine them. Take the recommendations seriously and weigh the cost of the mitigations against the cost of the downtime you risk by not having them. All these systems exist so that your company can be productive. If they fail at that, then you need to make changes.