Backup Monitoring: Part 3—Triaging Your Devices

In part one and part two of this backup monitoring series we covered the basic tenets of backup success: for example, how to look at a device Timestamp (TS) to determine if it is online, how to sort devices by OS type (OT) and Creation Date (CD), and how to see when the Total Last Successful Session (TL) occurred. Success metrics such as these are time sensitive. They remain valuable only while they’re still relevant, and they lose value as time passes. But you can use these metrics to identify things you should work to correct now. For example, there is minimal value in troubleshooting why a device had five open file errors when it’s been offline for more than 72 hours. However, you can work to get that device online. Once it is, you can run a new backup job before beginning to troubleshoot failures, error counts, or other configuration issues.
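
As a rough illustration of that idea, the sketch below splits devices into offline and online groups by comparing each device timestamp against a 72-hour threshold. The record layout (`name`, `ts`) and the sample devices are assumptions made for the example, not the export format of any particular backup product.

```python
from datetime import datetime, timedelta, timezone

# Threshold borrowed from the 72-hour example above; adjust to suit your SLAs.
OFFLINE_THRESHOLD = timedelta(hours=72)

def is_stale(last_seen, now, threshold=OFFLINE_THRESHOLD):
    """Return True when the device has not checked in within the threshold."""
    return now - last_seen > threshold

def split_by_connectivity(devices, now):
    """Split device records into (offline, online) lists based on their timestamp."""
    offline = [d for d in devices if is_stale(d["ts"], now)]
    online = [d for d in devices if not is_stale(d["ts"], now)]
    return offline, online

# Hypothetical sample data: a fixed "now" keeps the example reproducible.
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
devices = [
    {"name": "srv-01", "ts": now - timedelta(hours=2)},
    {"name": "lap-07", "ts": now - timedelta(hours=96)},
]
offline, online = split_by_connectivity(devices, now)
```

Here `lap-07` lands in the offline group, so its older error counts can wait until it is reachable again.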

Following this premise, I’ve summarized six steps to help you identify and triage problem backup devices:

Prioritization
Connectivity
Errors
Selections
Synchronization
Configuration

1. Setting priorities

While you should monitor all protected devices to ensure you’re meeting your agreed-upon service-level agreements (SLAs), not all of those devices warrant the same response. I suggest you prioritize management of those devices based on which ones are most important. It makes sense to start with the most business-critical devices from your higher-profile clients.

For example, a server is typically more important than a workstation. Likewise, a CEO’s or business owner’s laptop usually matters more than a typical workstation. Make sure to prioritize the devices of your larger, high-profile customers over smaller, less strategic ones. Use this (or other criteria you define as a baseline) to figure out which devices need your attention first. But don’t make it your only guidance. If one of your customers is down, then you need to prioritize data recovery over backup success and get them back online and running.
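
One way to turn these criteria into a work queue is a simple sort key. The sketch below is only an illustration: the `Device` fields, the `client_tier` scale, and the sample devices are hypothetical, and you would swap in whatever criteria you define as your own baseline.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    is_server: bool
    vip_user: bool        # e.g. a CEO's or business owner's laptop
    client_tier: int      # hypothetical scale: 1 = high-profile client, 3 = smallest
    needs_recovery: bool = False  # customer is down and waiting on a restore

def priority_key(d):
    """Sort key: active recoveries first, then client tier, servers, VIP users."""
    # False sorts before True, so each flag is negated to float it to the top.
    return (not d.needs_recovery, d.client_tier, not d.is_server, not d.vip_user, d.name)

devices = [
    Device("ws-12", is_server=False, vip_user=False, client_tier=1),
    Device("sql-01", is_server=True, vip_user=False, client_tier=1),
    Device("ceo-laptop", is_server=False, vip_user=True, client_tier=2),
    Device("file-09", is_server=True, vip_user=False, client_tier=3, needs_recovery=True),
]
queue = [d.name for d in sorted(devices, key=priority_key)]
```

Putting `needs_recovery` first in the key reflects the point above: a customer who is down outranks every routine backup issue, regardless of tier.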

2. Maintain connectivity

Try not to get sidetracked looking at backup failures and error counts. Instead, first work on identifying the devices that are offline. If a set of devices isn’t connected to the cloud, they won’t back up. There is minimal value in diagnosing backup issues you can’t immediately address. Flag or log the backup error to review later since it’s irrelevant until you can re-establish connectivity to the device. The device could be offline because the system itself is down, the backup software is uninstalled, or services have stopped or are blocked. It’s possible the firewalls are preventing access, users are on vacation, or the network is simply down. Start by pinging the device, confirm the backup agent is installed, restart the backup services, check connectivity to the management server, and rule out things like geo restrictions, firewalls, or antivirus that could block access. Once you have the device back online and the backup agent responding, run a new backup job to protect any new data and then see if any of the prior errors are still present.
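
The checks above work best when run in a fixed order, stopping at the first one that fails. The following is a minimal sketch of that pattern, not a real agent integration: the check labels and the stubbed results are assumptions, and `tcp_reachable` uses a plain TCP connection attempt as a stand-in for a ping, since ICMP usually needs elevated privileges.

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def first_failed_check(checks):
    """Run (label, check) pairs in order; return the first label whose check fails."""
    for label, check in checks:
        if not check():
            return label
    return None

# Stubbed results for illustration. In practice each callable would run a real
# probe, for example: lambda: tcp_reachable("device.example.com", 443)
checks = [
    ("device responds on the network", lambda: True),
    ("backup agent installed", lambda: True),
    ("backup service running", lambda: False),
    ("management server reachable", lambda: True),
]
blocker = first_failed_check(checks)
```

With these stubbed results, `blocker` points at the stopped service, which is the thing to fix before looking at any older backup errors.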

3. Addressing errors

Address total failures before you address partial failures since it’s better to have some backup data than no backup data. Failures could be for the entire device or just a single data source. Large error counts are commonly indicative of permissions issues, open or locked files, offline files, insufficient VSS snapshot resources, unplanned system restarts, etc. Large error counts might seem to be more important than small error counts, but that’s not always the case. Small error counts could be just as critical. For instance, it’s a crucial error if you can’t access the entire C:\ drive or there are no selections made for a data source.
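
To make that ordering concrete, here is a small sketch that ranks jobs so total failures and empty selections come before partial failures, with raw error counts only used as a tiebreaker. The job fields (`status`, `errors`, `selected_items`) and the status names are hypothetical.

```python
# Lower rank = handle sooner. Status names are assumptions for this example.
STATUS_RANK = {"failed": 0, "partial": 1, "completed": 2}

def is_critical(job):
    """A total failure, or a data source with nothing selected, is critical."""
    return job["status"] == "failed" or job["selected_items"] == 0

def triage_key(job):
    """Critical jobs first, then by status, then by descending error count."""
    return (not is_critical(job), STATUS_RANK[job["status"]], -job["errors"])

jobs = [
    {"device": "a", "status": "completed", "errors": 250, "selected_items": 1000},
    {"device": "b", "status": "failed", "errors": 1, "selected_items": 500},
    {"device": "c", "status": "partial", "errors": 5, "selected_items": 800},
    {"device": "d", "status": "completed", "errors": 0, "selected_items": 0},
]
order = [j["device"] for j in sorted(jobs, key=triage_key)]
```

Device `d` reports zero errors yet ranks second, because a backup with no selections protects nothing. Device `a` has the largest error count but still has usable backup data, so it comes last.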

Look for error trends across multiple devices and over time. Determine if the error counts are consistent from day to day, or if they only occur on certain days or times when other maintenance windows and tasks happen to be running. See if the impacted devices are part of the same domain or behind the same external IP address. Check to see if any third-party backup or security software is also using VSS.
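
A quick way to spot such trends is to total error counts by a shared attribute. The sketch below groups hypothetical session records by external IP address and by weekday. The record fields and the sample values (which use documentation-reserved IP ranges) are assumptions for illustration.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def errors_by(records, key):
    """Total the error counts of all records that share the same key value."""
    totals = Counter()
    for r in records:
        totals[key(r)] += r["errors"]
    return totals

# Hypothetical session records.
records = [
    {"device": "dev-a", "external_ip": "203.0.113.10", "day": date(2024, 4, 28), "errors": 40},
    {"device": "dev-b", "external_ip": "203.0.113.10", "day": date(2024, 4, 28), "errors": 35},
    {"device": "dev-a", "external_ip": "203.0.113.10", "day": date(2024, 4, 29), "errors": 2},
    {"device": "dev-c", "external_ip": "198.51.100.7", "day": date(2024, 4, 29), "errors": 1},
    {"device": "dev-b", "external_ip": "203.0.113.10", "day": date(2024, 5, 5), "errors": 38},
]

by_ip = errors_by(records, lambda r: r["external_ip"])
by_weekday = errors_by(records, lambda r: WEEKDAYS[r["day"].weekday()])
```

In this sample, nearly all errors come from one external IP address and fall on Sundays, which would suggest checking what else runs at that site on that day.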

4. Selections

By this point your backup success rate should start to increase. You can now begin confirming your device and data selections. Ask yourself if you are protecting all your customers’ important data or if they’re suffering from data under-protection. Is your environment configured to automatically protect external volumes as they are added to a system? Are you monitoring for the addition of new users and systems to the network? Consider enabling device discovery, automated deployment, and some form of backup profile to save yourself the installation efforts.

Overprotection can also be detrimental to your backup success. Take Microsoft SQL and Microsoft Hyper-V as examples. You should confirm the application or another tool isn’t also performing snapshots, dumps, replication, or backups of the same data. Without proper setup, your backups could interfere with each other. Choosing to exclude data that is redundant or has zero recovery value can also help improve backup success, save bandwidth, reduce backup sizes, and potentially reduce costs. Look for and remove duplicate selections across multiple data sources. Set up exclusion filters that prevent you from backing up items such as temporary files, dump directories, media libraries, patch and AV updates, etc.
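
To show how exclusion filters behave, here is a minimal sketch using wildcard patterns from Python’s standard `fnmatch` module. The patterns and paths are made-up examples, and real backup products have their own filter syntax, so treat this as a way to reason about patterns rather than something to paste into a product.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns, written in lowercase with forward slashes.
EXCLUDE_PATTERNS = ["*.tmp", "*/temp/*", "*.dmp", "*/media library/*"]

def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches any exclusion pattern (case-insensitive)."""
    # Normalize separators and case so one pattern set covers all spellings.
    normalized = path.replace("\\", "/").lower()
    return any(fnmatchcase(normalized, pattern) for pattern in patterns)

def apply_exclusions(paths, patterns=EXCLUDE_PATTERNS):
    """Keep only the paths that no exclusion pattern matches."""
    return [p for p in paths if not is_excluded(p, patterns)]

selections = [
    r"C:\Data\report.docx",
    r"C:\Users\ann\AppData\Local\Temp\x.log",
    r"D:\Dumps\core.dmp",
    r"C:\Data\~cache.TMP",
]
kept = apply_exclusions(selections)
```

Only `report.docx` survives the filters here. Testing patterns against a sample file list like this is a cheap way to catch a filter that is broader than you intended.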

5. Synchronization and performance

Your backups may be successful, but are they completing within the desired timeframe? If you’re using a Local Speed Vault, is the data fully synchronized locally and off-site? When was the last successful off-site backup completed? Are you throttling uploads or downloads? Decreasing selections, adding exclusions, or adjusting bandwidth throttling can help you fine-tune performance. But it’s also possible you simply have too much changing data to support the desired backup schedule frequency. You may want to look at your session logs to see where the longest and largest backups happen. Consider adjusting your schedules to run less frequently throughout the day or week, and with less overlap from other backups on the same network. If that’s not an option, it may be time to consider adding more bandwidth at this site.
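
A back-of-the-envelope calculation can show whether the changed data fits the backup window at all. The sketch below assumes decimal gigabytes and a steady upload rate, and it ignores compression, deduplication, and protocol overhead, so treat the result as a rough upper bound rather than a prediction. The figures are hypothetical.

```python
def transfer_hours(changed_gb, uplink_mbps):
    """Hours needed to upload changed_gb (decimal GB) at uplink_mbps (Mbit/s)."""
    megabits = changed_gb * 8 * 1000  # 1 GB = 8,000 Mbit
    return megabits / uplink_mbps / 3600

def fits_window(changed_gb, uplink_mbps, window_hours):
    """Return True if the upload can finish inside the backup window."""
    return transfer_hours(changed_gb, uplink_mbps) <= window_hours

# Hypothetical site: 90 GB of daily change over a 20 Mbit/s uplink.
nightly = transfer_hours(90, 20)
```

In this example the upload needs about 10 hours, so it cannot fit an 8-hour overnight window. Halving the changed data through exclusions, or doubling the bandwidth, would bring it down to about 5 hours.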

6. Configuration and retention

Even the best technicians sometimes deploy backup in a hurry. When that happens, things often get missed. Even if it isn’t deployed in a hurry, things can get overlooked. If your backup settings (selections, security, retention, etc.) aren’t consistent within a client (or across your clients), then you may find yourself guessing at whether you’re able to meet your agreed-upon SLA or SLO. Data retention can be critical when it comes to ransomware recovery or compliance, and your customers may need to have archives enabled to retain data for more than a default 28 days. Data security is also important. Have you audited your devices to ensure you’ve set up the desired security measures, including proxies, remote access, and GUI passwords? Have you validated you have the correct encryption keys recorded?
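
A simple consistency audit can take the guesswork out of this. The sketch below treats the most common value of a setting within one client as the baseline and lists the devices that deviate from it. The field names (`retention_days`, `gui_password`) and the device records are hypothetical.

```python
from collections import Counter

def find_outliers(devices, setting):
    """Return (baseline, names) where baseline is the most common value of
    the setting and names lists the devices that differ from it."""
    counts = Counter(d[setting] for d in devices)
    baseline, _ = counts.most_common(1)[0]
    outliers = [d["name"] for d in devices if d[setting] != baseline]
    return baseline, outliers

# Hypothetical devices belonging to one client.
client_devices = [
    {"name": "srv-1", "retention_days": 90, "gui_password": True},
    {"name": "srv-2", "retention_days": 90, "gui_password": True},
    {"name": "ws-3", "retention_days": 28, "gui_password": False},
]

retention_baseline, retention_outliers = find_outliers(client_devices, "retention_days")
password_baseline, password_outliers = find_outliers(client_devices, "gui_password")
```

Here `ws-3` stands out on both settings. The most common value is only a starting point, since a widespread misconfiguration would become the baseline, so compare it against what the client’s SLA actually requires.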

7. Staying healthy

It may take some time initially, but once you get your backup dashboard to the point where it’s predominantly green with successes, you’ll find it easier to maintain. It’s important to remember you’re not alone. Don’t hesitate to reach out to your technical support, sales engineering, account manager, customer success team, or Head Nerd if you need assistance with error resolution.

Eric Harless is the Head Backup Nerd at SolarWinds MSP. Eric has worked with SolarWinds Backup since 2013 and has more than 25 years of data protection industry experience in sales, support, marketing, systems engineering, and product management.

You can follow Eric on Twitter at @backup_nerd

