Data Deduplication Overview

Backup technology has seen a huge number of advancements over the past couple of decades, but few have been as significant as the development of data deduplication. Data deduplication, which removes duplicates of data entries to save storage space, has been around in some form since the 1970s. At that time, redundant data was identified by clerks who went through data line by line, manually searching for duplicates.

In the years since then, as quantities of data have grown exponentially, the process has become automated. In fact, data growth is so rapid nowadays that even the newest storage solutions struggle to keep up—which is why data deduplication is more important than ever. As a managed services provider (MSP), understanding what data deduplication is and how it works can help you optimize your storage capacity, saving you significant amounts of money in the long run.

What Is Data Deduplication?

Working as an MSP, you may encounter customers who ask, “What is data deduplication and compression?” Data deduplication is the process by which redundant data is removed from a data set, typically as part of a backup. It allows a single unique instance of each piece of data to be stored, without any copies needlessly taking up space. Once the redundant copies are removed, you also have the option to compress the remaining single copies to save even more space.

It is important to keep in mind that while deduplicated data can also be compressed, the deduplication process is distinct from regular data compression. In compression, algorithms identify redundant data within individual files and encode that data more efficiently. Deduplication, on the other hand, inspects large volumes of data and identifies large sections (even entire files) that are identical. It then replaces these duplicates with references to a single shared copy.

For example, if an email system has 200 instances of the same file attachment, data deduplication will clear the redundancies in favor of one saved copy of the attachment. This results in a deduplication ratio (discussed below) of 200:1. If you imagine that each instance of the attachment was 1MB, then you will have reduced your storage requirements by 199MB.
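The arithmetic behind this example can be written out as a short sketch. The variable names are illustrative, and the calculation ignores the small amount of space taken up by the references that replace each duplicate.

```python
# Worked version of the email attachment example (sizes in MB).
copies = 200   # identical attachments before deduplication
size_mb = 1    # size of each attachment

logical_mb = copies * size_mb       # what the email system holds: 200 MB
stored_mb = 1 * size_mb             # one unique copy kept after deduplication

ratio = logical_mb / stored_mb      # 200.0, written as 200:1
saved_mb = logical_mb - stored_mb   # 199 MB no longer needed

print(f"ratio {ratio:.0f}:1, saved {saved_mb} MB")
```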

How Does Data Deduplication Work?

There are several different data deduplication processes, and the one you choose influences exactly how deduplication works. In essence, data deduplication functions by creating and comparing groups of data called “chunks.” The processes differ mainly in when and where that comparison happens.

You can either run inline deduplication or post-processing deduplication. The difference between the two is that with post-processing deduplication, duplicates are removed after the data has already been written on a disk. With an inline process, on the other hand, deduplication is run as the data is being written into the storage system. With data deduplication software, you can run both post-processing and inline data deduplication to maximize savings.
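The timing difference can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "disk" is a Python list, chunks are compared by their raw content, and the function names are hypothetical rather than taken from any product.

```python
def inline_write(incoming):
    """Deduplicate as data arrives: a duplicate is never written to disk."""
    disk = []
    seen = set()
    for chunk in incoming:
        if chunk not in seen:
            seen.add(chunk)
            disk.append(chunk)
    peak = len(disk)  # the disk never held more than the unique chunks
    return disk, peak


def post_process_write(incoming):
    """Write everything first, then remove duplicates in a later pass."""
    disk = list(incoming)             # every chunk lands on disk
    peak = len(disk)                  # duplicates occupy space until the pass runs
    disk = list(dict.fromkeys(disk))  # later pass keeps the first copy of each chunk
    return disk, peak


incoming = [b"a", b"b", b"a", b"a"]
print(inline_write(incoming))        # ([b'a', b'b'], 2)
print(post_process_write(incoming))  # ([b'a', b'b'], 4)
```

Both approaches end with the same data on disk. In this sketch the only difference is how much space is occupied before the duplicates are gone.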

No matter which you use, the basic steps of deduplication operate the same way. In order for data to be deduplicated, it is first broken down into chunks. These are typically one or more contiguous blocks of data. Every deduplication system creates chunks differently, but no matter which way the chunks are broken down, the process of comparing the chunks is largely the same.

Once the data is broken down, the analysis process begins. Each individual chunk is run through an algorithm that creates a hash, essentially a long series of numbers and letters that represents the data contained in the chunk. Because even the smallest change to the data in a chunk causes the hash to change, two chunks that produce matching hashes are treated as identical. Whenever a chunk is found to be redundant, it is replaced by a small reference pointing to the stored chunk.
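The chunk, hash, and reference steps can be sketched in a few lines. This is a minimal illustration, not how any particular product works: it assumes fixed-size 4 KB chunks, SHA-256 as the hash algorithm, and an in-memory dictionary as the store, and the ChunkStore class is hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes; real systems vary


class ChunkStore:
    """Toy in-memory store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes

    def write(self, data):
        """Store data and return the list of hashes that reference its chunks."""
        references = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep the chunk only if this hash has not been seen before.
            self.chunks.setdefault(digest, chunk)
            references.append(digest)
        return references

    def read(self, references):
        """Rebuild the original data by following the references."""
        return b"".join(self.chunks[digest] for digest in references)


store = ChunkStore()
data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
references = store.write(data)
print(len(references), len(store.chunks))  # 4 references, 2 stored chunks
```

The three identical "A" chunks share one stored copy, and the list of references is enough to reconstruct the original data exactly.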

Which Data Deduplication Method Is Right for You?

A further distinction between data deduplication methods is between target and source deduplication. The basic distinction between the two is that target deduplication occurs near the location where the data is stored, whereas source deduplication occurs near where the data is created.

In target deduplication, duplicates are removed once the data reaches the target storage device. At that point, deduplication can be done either before or after the data is written to the device. Because the chunking and comparison work occurs at the target, the server sending the backup is unaware of any deduplication efforts. This is generally the more popular method, though it does have some disadvantages compared to source deduplication.

In source deduplication, the process of removing redundant data occurs at the source instead of at the target. It typically takes place within the file system itself, where periodic scans break new files into chunks and hash them. The resulting hashes are then sent to the backup server for comparison. If the server finds that a hash is new, the chunk is unique, so it is transferred to the backup server and written to disk. But if the server finds an identical hash already in the system, the chunk is not unique and does not get transferred. This saves both storage and bandwidth.
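The exchange between source and backup server can be sketched as follows. This is an illustration under stated assumptions: the BackupServer class, its missing and upload methods, and the use of SHA-256 are all hypothetical, and a real system would send this traffic over a network rather than through method calls.

```python
import hashlib


class BackupServer:
    """Hypothetical backup server that stores chunks keyed by hash."""

    def __init__(self):
        self.stored = {}  # hash -> chunk bytes

    def missing(self, hashes):
        """Return the hashes the server does not hold yet."""
        return {h for h in hashes if h not in self.stored}

    def upload(self, digest, chunk):
        self.stored[digest] = chunk


def source_backup(chunks, server):
    """Hash at the source, ask the server what is new, and send only that.

    Returns the number of chunk bytes actually transferred.
    """
    by_hash = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    sent = 0
    for digest in server.missing(by_hash):
        server.upload(digest, by_hash[digest])
        sent += len(by_hash[digest])
    return sent


server = BackupServer()
first = source_backup([b"x" * 100, b"y" * 100, b"x" * 100], server)
second = source_backup([b"x" * 100, b"z" * 50], server)
print(first, second)  # 200 50
```

In the second backup only the 50-byte chunk is new, so only that chunk crosses the wire. The hashes themselves are small compared to the chunks they stand for.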

One common criticism of source deduplication is that it uses a lot of CPU power, more than target deduplication. However, given the significant reduction in the amount of CPU needed to transfer backups, the extra CPU used in the source deduplication process is typically offset in the long run.

The main difference that needs to be considered when determining the right data deduplication method for you is in how the deduplication processes actually play out. With the target deduplication method, you need to buy target deduplication disk appliances. These appliances need to be present everywhere you’re going to back up. While this can be costly, it offers the additional benefit of allowing for incremental deduplication. With incremental deduplication, you use the same backup software, but simply change the target. It also lets you conduct target deduplication with almost any backup software, as long as it is one that the appliance supports. That means that you don’t need to embark on a wholesale replacement of your entire backup system.

With source data deduplication, you typically do need to undergo a wholesale replacement of your entire backup system. However, unlike target deduplication, you don’t need an appliance that’s local to each device you want to back up. Since you can back up from anywhere with source deduplication, it is the ideal data deduplication method if you have a lot of remote devices like laptops and mobile devices.

Data Deduplication in the Cloud

The increased use of the cloud is opening up new possibilities for data deduplication. Some of the best data deduplication ratios can often be achieved in virtual server environments. This is because virtual environments contain a huge amount of redundant data that can easily be removed through a data deduplication process.

With more and more companies moving to virtual cloud environments for their data storage, data deduplication is also opening the door for new possibilities with stored data. In particular, it is improving data governance. By providing historical context for information, data deduplication is improving IT’s ability to understand data usage patterns. This understanding can then be used to proactively optimize data redundancies across users in distributed environments.

What Is a Deduplication Ratio?

As previously mentioned, a data deduplication ratio is the comparison between the original size of the data and its size after the redundancy is removed. It is essentially a measure of the effectiveness of the deduplication process. As the deduplication ratio increases, each further improvement yields diminishing returns, given that most of the redundancy has already been removed. For example, a 500:1 deduplication ratio is not significantly better than a 100:1 ratio: in the former case 99.8% of data is eliminated, versus 99% in the latter.
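The relationship between the ratio and the percentage of data eliminated can be checked with a short calculation. The function name is illustrative; the formula simply restates the definition above, where an N:1 ratio means 1/N of the original data remains.

```python
def percent_eliminated(ratio):
    """Percentage of the original data removed at a given N:1 ratio."""
    return (1 - 1 / ratio) * 100


for ratio in (2, 10, 100, 500):
    print(f"{ratio}:1 -> {percent_eliminated(ratio):.1f}% eliminated")
```

Going from 2:1 to 10:1 removes an additional 40% of the original data, while going from 100:1 to 500:1 removes less than 1% more.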

The factors that have the greatest influence on the deduplication ratio are:

Data retention. The longer that data has been retained, the greater the likelihood of finding redundancy.
Data type. Certain types of files are more likely to have high levels of redundancy than others.
Change rate. If your data changes frequently, you will likely have a lower deduplication ratio.
Location. The wider the scope of your data deduplication efforts, the greater the likelihood of finding duplicates. For example, global deduplication across multiple systems typically yields a higher ratio than local deduplication looking at a single device.
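The effect of scope in the last factor can be shown with a toy comparison. The two devices and their chunk contents are made up for illustration, and chunks are compared by raw content for brevity.

```python
# Two hypothetical devices that share an operating system and an application.
device_a = [b"os", b"app", b"doc1"]
device_b = [b"os", b"app", b"doc2"]
devices = [device_a, device_b]


def stored_chunks_local(devices):
    """Each device is deduplicated on its own, so shared chunks are kept per device."""
    return sum(len(set(chunks)) for chunks in devices)


def stored_chunks_global(devices):
    """All devices are deduplicated together, so shared chunks are kept once."""
    return len(set().union(*devices))


total = sum(len(chunks) for chunks in devices)
print(total / stored_chunks_local(devices))   # 1.0, a 1:1 ratio
print(total / stored_chunks_global(devices))  # 1.5, a 1.5:1 ratio
```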

Why Is Data Deduplication Important?

Data deduplication is important because it significantly reduces your storage space needs, saving you money and reducing how much bandwidth is wasted on transferring data to/from remote storage locations. In some cases, data deduplication can reduce storage requirements by up to 95%, though factors like the type of data you are attempting to deduplicate will impact your specific deduplication ratio. Even if your storage requirements are reduced by less than 95%, data deduplication can still result in huge savings and significant increases in your bandwidth availability.

There is no single right way to engage in data deduplication. Luckily, there are several variables you can adjust to find the best approach for your environment. Whether you choose inline or post-processing, target or source deduplication, each approach can result in significant decreases in your storage capacity needs. This, in turn, results in significant cost savings for your organization.
