What is data deduplication?

According to Wikipedia, data deduplication is "a specialized data compression technique for eliminating duplicate copies of repeating data". I didn’t get it at first; it sounded like a complicated technique… But after some hands-on experience, and after seeing what an incredible technology it is, let me explain it in a very simple way.

To understand what deduplication is, we should first take a look at data duplication! Imagine a mail system with 1,000 users. User A sends a mail with a 1MB attachment to the other 999 users. The attachment now exists 1,000 times (999 copies in the inboxes + 1 copy in User A’s Sent Items), so the total size of the attachments is about 1GB. Every time a backup job runs, all 1,000 identical files are saved.

This is data duplication: an individual file has multiple DUPLICATES (or instances) on disk.

Now, let’s dig a little deeper!

When backup software with deduplication support examines these files and sees ‘Hey, this file occurs 1,000 times!’, it won’t transfer every copy. It will transfer the file once, and replace the 999 other instances with a shortcut (a small reference to the stored copy)!

This is data DE-duplication: all the duplicates are removed.
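
The "store once, replace the rest with a shortcut" idea can be sketched in a few lines of Python. This is only an illustration, not how any particular backup product works: the function name `deduplicate`, the mailbox names, and the choice of SHA-256 to recognise identical files are all assumptions made for the example.

```python
import hashlib


def deduplicate(files):
    """Keep one copy of each distinct content; point every file name at it.

    `files` is a dict of file name -> bytes (a hypothetical stand-in for
    the files a backup job would read).
    """
    store = {}      # content hash -> bytes: the single stored copy
    shortcuts = {}  # file name -> content hash: the "shortcut"
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # only the first copy is kept
        shortcuts[name] = digest
    return store, shortcuts


# The mail example: one 1MB attachment in Sent Items plus 999 inboxes.
attachment = b"x" * 1_000_000
mailboxes = {"sent/user_a": attachment}
for i in range(999):
    mailboxes[f"inbox/user_{i}"] = attachment

store, shortcuts = deduplicate(mailboxes)
stored_bytes = sum(len(data) for data in store.values())
print(len(shortcuts), "files,", len(store), "stored copy,", stored_bytes, "bytes")
```

Restoring a file is then a lookup: `store[shortcuts["inbox/user_5"]]` gives back the original attachment, even though only one copy was ever stored.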

But what can data deduplication do for me?

Let us go back to our email. Imagine you don’t have software capable of deduplication. The mail gets backed up; size on disk: 1GB. 7 days later, the next full backup job runs. Again 1GB on disk.

Now imagine you upgrade to de-duplication software.

The mail gets backed up; size on disk: 1MB. 7 days later, the next full backup job runs. Extra size needed on disk: 1MB at most.

Data deduplication effectively reduces the storage used by these two backups from 2GB to 2MB! And storage space is not the only thing saved: the network carries less traffic too, because each duplicate is transferred only once.
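
For anyone who wants to check the numbers, here is the same calculation as a small Python sketch. The variable names are made up for the illustration, and it follows the simplification used in this article: 1,000 copies of a 1MB file are counted as roughly 1GB, and each deduplicated backup job stores one copy.

```python
attachment_mb = 1   # size of the attachment, in MB
copies = 1000       # 999 inboxes + 1 Sent Items
full_backups = 2    # the first job and the one 7 days later

# Without deduplication every copy is saved in every backup job.
without_dedup_mb = attachment_mb * copies * full_backups

# With deduplication one copy is saved per backup job, as described above.
with_dedup_mb = attachment_mb * full_backups

print(without_dedup_mb, "MB without deduplication")
print(with_dedup_mb, "MB with deduplication")
print("reduction factor:", without_dedup_mb // with_dedup_mb)
```

This gives 2000 MB (about 2GB) against 2 MB, a factor of 1,000, which is simply the number of duplicate copies.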

Image taken from http://www.enterprisestorageguide.com/how-data-deduplication-works